AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
52
Papers today
8h
Update frequency
7
Days of history
DySCo: Dynamic Semantic Compression for Effective Long-term Time Series Forecasting
Time Series
- DySCo addresses the paradox that longer lookback windows often fail to improve time series forecasts, by reducing noise and redundancy.
- The framework incorporates EGDS for dynamic sampling based on entropy, preserving valuable information.
- HFED allows for multi-granularity modeling by separating high-frequency anomalies from low-frequency patterns.
- CSIM enhances prediction accuracy by dynamically fusing global and local information.
Summary
The paper presents the Dynamic Semantic Compression (DySCo) framework, aimed at improving long-term time series forecasting (TSF) by addressing the challenges of noise and redundancy in extended lookback windows. Traditional methods often fail to enhance predictive accuracy when increasing the input length due to irrelevant noise accumulation. DySCo introduces an Entropy-Guided Dynamic Sampling (EGDS) mechanism that autonomously identifies and retains high-entropy segments while compressing redundant trends. Additionally, it employs a Hierarchical Frequency-Enhanced Decomposition (HFED) strategy to separate high-frequency anomalies from low-frequency patterns, ensuring critical details are preserved. The Cross-Scale Interaction Mixer (CSIM) is designed to dynamically fuse global contexts with local representations, enhancing the model's ability to capture long-term correlations. Experimental results indicate that DySCo can be integrated as a plug-and-play module into mainstream TSF models, significantly improving their performance without incurring excessive computational costs.
Methodology
The DySCo framework consists of three main components: Entropy-Guided Dynamic Sampling (EGDS) for adaptive information retention, Hierarchical Frequency-Enhanced Decomposition (HFED) for multi-granularity representation, and Cross-Scale Interaction Mixer (CSIM) for context-aware aggregation of predictions. These components work together to optimize the balance between capturing long-term dependencies and minimizing noise in time series data.
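As a rough illustration of the entropy-guided idea (not the paper's actual EGDS algorithm), the sketch below scores fixed-length segments of the lookback window by the Shannon entropy of a value histogram, keeps the highest-entropy segments at full resolution, and collapses the rest to a single mean value. Segment length, bin count, and keep ratio are all illustrative choices.

```python
import numpy as np

def segment_entropy(seg, bins=8):
    # Shannon entropy of a value histogram: a crude proxy for how much
    # information the segment carries (constant segments score 0)
    hist, _ = np.histogram(seg, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def entropy_guided_sample(x, seg_len=16, keep_ratio=0.5):
    # Split the lookback window into segments, keep the highest-entropy
    # segments verbatim, and compress the rest to their mean
    segs = [x[i:i + seg_len] for i in range(0, len(x), seg_len)]
    k = max(1, int(len(segs) * keep_ratio))
    keep = set(np.argsort([segment_entropy(s) for s in segs])[-k:])
    out = []
    for i, s in enumerate(segs):
        out.extend(s.tolist() if i in keep else [float(s.mean())])
    return np.array(out)
```

On a window that is half flat and half noisy, the flat segments collapse to one value each while the noisy ones survive intact.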
Results
The integration of DySCo into mainstream time series forecasting models resulted in significant enhancements in predictive accuracy, particularly in capturing long-term correlations, while also reducing computational costs. The experimental evaluations demonstrated that DySCo serves as an effective plug-and-play module, improving the performance of existing models.
Implications
The DySCo framework has potential applications across various domains requiring long-term time series forecasting, such as finance, meteorology, and healthcare. Its ability to effectively manage noise and redundancy could lead to more accurate predictions and better decision-making in operational contexts.
AA-SVD: Anchored and Adaptive SVD for Large Language Model Compression
NLP
Large Language Models
Efficient ML
- AA-SVD allows for rapid compression of billion-parameter models without retraining.
- The method addresses distribution shifts caused by upstream compression, improving accuracy.
- AA-SVD minimizes block-level output distortion by refining all compressed layers jointly.
- Experimental results show significant performance improvements over existing SVD-based baselines.
Summary
The paper presents AA-SVD, a novel framework for compressing large language models (LLMs) using low-rank factorization without the need for retraining. Unlike existing methods, which focus solely on either the original inputs or the shifted inputs, AA-SVD combines both to minimize output distortion across transformer blocks. By anchoring compressed layers to the original model outputs and modeling input distribution shifts, AA-SVD achieves a low-rank approximation that maintains functional equivalence with the original model. The authors demonstrate that AA-SVD consistently outperforms existing SVD-based methods, particularly at aggressive compression ratios, making it a practical solution for deploying large-scale models efficiently.
Methodology
AA-SVD employs a low-rank factorization framework that anchors compressed layers to the original model outputs while accounting for input distribution shifts. This dual consideration allows for a more accurate low-rank approximation that preserves the functional behavior of the original model. The method refines all layers within a transformer block jointly to minimize cumulative errors.
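The low-rank core that AA-SVD builds on can be sketched with a plain truncated SVD; the anchoring to original outputs and the modeling of input shifts are the paper's contributions and are not reproduced here.

```python
import numpy as np

def low_rank_factor(W, rank):
    # Truncated SVD: W ~= A @ B, with A (m x r) and B (r x n) holding far
    # fewer parameters than W when r is small
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank, :]   # singular values folded into A
```

For a 64x64 layer, rank 8 stores 1024 numbers instead of 4096; reconstruction error shrinks monotonically as the rank grows and vanishes at full rank.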
Results
Experiments conducted on large language models indicate that AA-SVD consistently outperforms traditional SVD-based compression techniques across various compression ratios. The advantages of AA-SVD become more pronounced at higher compression levels, where competing methods often degrade in performance or fail entirely.
Implications
The AA-SVD framework offers a promising approach for efficiently deploying large language models in resource-constrained environments, enabling broader accessibility and application of advanced NLP technologies. Its ability to maintain model fidelity during compression could facilitate the use of LLMs in real-time applications and on devices with limited computational resources.
Efficient and Principled Scientific Discovery through Bayesian Optimization: A Tutorial
Optimization
- Bayesian Optimization formalizes the scientific discovery process, reducing reliance on trial-and-error.
- Surrogate models and acquisition functions are crucial for guiding experimental design and decision-making.
- The tutorial provides practical coding examples and theoretical insights tailored for various audiences.
- Real-world case studies demonstrate the effectiveness of BO in diverse scientific fields.
Summary
This tutorial presents Bayesian Optimization (BO) as a structured approach to scientific discovery, addressing inefficiencies in traditional experimental design. The authors argue that scientific discovery can be framed as optimization problems, where BO serves to formalize and automate the process. The tutorial covers the foundational concepts of BO, including surrogate models, Gaussian processes, and acquisition functions, which guide experimental selection and balance exploration with exploitation. Through real-world case studies in catalysis, materials science, and organic synthesis, the efficacy of BO is validated. The tutorial also discusses technical extensions for practical applications, such as batched experimentation and integrating human input, making it accessible to a broad scientific audience. It aims to empower researchers to design experiments more efficiently, leading to accelerated scientific discovery.
Methodology
The paper outlines Bayesian Optimization as a method that employs surrogate models (like Gaussian processes) to model unknown mechanisms and acquisition functions to select experiments. It emphasizes the optimization framework for scientific discovery and provides algorithmic workflows alongside coding examples for practical implementation.
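The workflow the tutorial describes can be sketched end to end in a few dozen lines: a small Gaussian-process regressor as the surrogate, an expected-improvement acquisition, and a loop that proposes the next experiment. The RBF kernel, its length-scale, the candidate grid, and the initial design below are illustrative choices, not the tutorial's.

```python
import math
import numpy as np

def rbf(a, b, ls=0.2):
    # Squared-exponential kernel on 1-D inputs
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # GP regression posterior mean and stddev on the candidate grid Xs
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks, Kss = rbf(X, Xs), rbf(Xs, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sd, best):
    # EI for minimisation: expected amount by which we beat the incumbent
    z = (best - mu) / sd
    Phi = 0.5 * (1.0 + np.array([math.erf(v / math.sqrt(2)) for v in z]))
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (best - mu) * Phi + sd * phi

def run_bo(f, iters=10):
    # Propose the EI-maximising candidate, evaluate it, refit, repeat
    grid = np.linspace(0.0, 1.0, 101)
    X = np.array([0.0, 0.5, 1.0])
    y = f(X)
    for _ in range(iters):
        mu, sd = gp_posterior(X, y, grid)
        xn = grid[np.argmax(expected_improvement(mu, sd, y.min()))]
        X, y = np.append(X, xn), np.append(y, f(xn))
    return X, y
```

On a toy objective with minimum at 0.6, ten iterations concentrate the evaluations around the optimum.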
Results
The tutorial validates the effectiveness of Bayesian Optimization through case studies, showing improved efficiency in experimental design and discovery processes across various scientific domains. It highlights the balance between exploration and exploitation as a key factor in successful outcomes.
Implications
The findings suggest that Bayesian Optimization can significantly enhance the efficiency and effectiveness of scientific research, leading to faster and more reliable discoveries. This approach can be applied across various fields, including chemistry, materials science, and biology, potentially transforming experimental methodologies.
Soft MPCritic: Amortized Model Predictive Value Iteration
Reinforcement Learning
Robotics
Optimization
- Introduces a hybrid RL-MPC framework that operates entirely in value space.
- Utilizes model predictive path integral control (MPPI) for online control and value target generation.
- Implements an amortized warm-start strategy to enhance computational efficiency.
- Demonstrates effectiveness on classic and complex control tasks.
Summary
The paper presents 'soft MPCritic', a novel framework that integrates reinforcement learning (RL) and model predictive control (MPC) to address computational challenges in large-scale applications. By operating in the value space, soft MPCritic employs model predictive path integral control (MPPI) for online control and value target generation. The framework trains a terminal Q-function using fitted value iteration, aligning the learned value function with the planner and effectively extending the planning horizon. A key innovation is the amortized warm-start strategy, which recycles planned action sequences from online observations to compute batched MPPI-based value targets, enhancing computational efficiency while maintaining solution quality. The authors demonstrate the effectiveness of soft MPCritic through case studies on various control tasks, showcasing its ability to learn robustly via short-horizon planning, thus establishing it as a scalable solution for synthesizing MPC policies in complex environments.
Methodology
The methodology involves a hybrid approach that combines reinforcement learning and model predictive control. The framework uses MPPI for planning and fitted value iteration for training a terminal Q-function. It incorporates an amortized warm-start strategy to efficiently generate value targets by reusing planned action sequences from online observations.
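MPPI itself is compact enough to sketch: sample noisy action sequences, roll them through a model, and return their exponentially cost-weighted average. The toy 1-D double-integrator dynamics, horizon, and temperature below are illustrative; the paper's terminal Q-function and amortized warm-start are not shown.

```python
import numpy as np

def rollout_cost(x0, u_seq):
    # Quadratic cost for driving a toy double integrator to the origin (dt = 0.1)
    pos, vel, c = x0[0], x0[1], 0.0
    for u in u_seq:
        vel += 0.1 * u
        pos += 0.1 * vel
        c += pos ** 2 + 0.1 * vel ** 2 + 0.01 * u ** 2
    return c

def mppi_step(x0, horizon=15, samples=256, lam=1.0, sigma=0.5, seed=0):
    # Sample action sequences, score them by rollout cost, and return the
    # softmin-weighted (temperature lam) average sequence: the MPPI update
    rng = np.random.default_rng(seed)
    U = rng.normal(0.0, sigma, size=(samples, horizon))
    costs = np.array([rollout_cost(x0, u) for u in U])
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    return w @ U
```

The averaged sequence should beat doing nothing on this convex problem, since the weights concentrate on low-cost samples.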
Results
The results indicate that soft MPCritic effectively learns through robust, short-horizon planning, achieving high-quality control solutions across various challenging tasks. The framework demonstrates improved computational practicality while preserving the quality of solutions compared to traditional methods.
Implications
The implications of this research suggest that soft MPCritic can be applied in scenarios where traditional policy extraction and long-horizon planning methods may fail, making it a valuable tool for real-time decision-making in complex environments.
Massively Parallel Exact Inference for Hawkes Processes
Time Series
Efficient ML
Theory
- Introduces a massively parallel algorithm for maximum likelihood estimation of linear exponential Hawkes processes.
- Reduces computational complexity from O(N^2) to approximately O(N/P) with P parallel processors.
- Utilizes a parallel prefix scan for efficient computation of event intensities.
- Maintains exact likelihood computation without additional assumptions, preserving model interpretability.
Summary
This paper presents a novel algorithm for maximum likelihood estimation of multivariate Hawkes processes, which are self-exciting point processes widely used in various fields such as finance, seismology, and social media. Traditional methods for estimating the parameters of these processes scale quadratically with the number of events, making them impractical for large datasets. The authors propose a massively parallel approach that leverages the structure of the linear exponential Hawkes process to achieve linear-time complexity with respect to the number of events when using multiple processors. By expressing the intensity function as a product of sparse transition matrices, the authors utilize a parallel prefix scan algorithm to compute per-event intensities efficiently. This method not only maintains the exact likelihood without additional assumptions but also circumvents GPU memory constraints through a batching scheme. The authors demonstrate significant speed improvements on both simulated and real datasets, achieving scalability to thousands of nodes and tens of millions of events. An open-source PyTorch library implementing these optimizations is also provided.
Methodology
The authors reformulate the intensity function of the linear exponential Hawkes process using sparse transition matrices, allowing for the application of a parallel prefix scan algorithm. This enables the computation of event intensities in a parallelized manner, significantly reducing the time complexity associated with maximum likelihood estimation.
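The key identity can be checked in a few lines for the univariate case: the O(N²) history sum obeys a one-step recursion, and the same quantity can be rewritten as a prefix (cumulative) sum, which is an associative scan and therefore parallelizable. The naive cumsum form below can overflow for large β·t, one reason the paper works with per-interval decay factors instead; this is a sketch of the idea, not the paper's algorithm.

```python
import numpy as np

def intensities_sequential(times, mu, alpha, beta):
    # O(N) recursion: A_i = exp(-beta*dt_i) * (1 + A_{i-1}), lam_i = mu + alpha*A_i
    out, A = [mu], 0.0
    for i in range(1, len(times)):
        A = np.exp(-beta * (times[i] - times[i - 1])) * (1.0 + A)
        out.append(mu + alpha * A)
    return np.array(out)

def intensities_scan(times, mu, alpha, beta):
    # Same quantity via a scan: A_i = exp(-beta*t_i) * sum_{j<i} exp(beta*t_j);
    # cumsum is associative, so this form parallelizes across events
    e = np.exp(beta * times)
    S = np.concatenate([[0.0], np.cumsum(e)[:-1]])   # prefix sum over j < i
    return mu + alpha * np.exp(-beta * times) * S
```

Both forms agree on a random event stream, which is the equivalence the parallel algorithm exploits.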
Results
The proposed method achieves orders-of-magnitude speed improvements in fitting times compared to existing implementations, successfully scaling to datasets with thousands of nodes and tens of millions of events. The authors provide empirical evidence of the algorithm's efficiency on both simulated and real-world datasets.
Implications
This work has significant implications for fields that utilize Hawkes processes, enabling researchers and practitioners to analyze larger datasets more efficiently. The open-source library facilitates broader access to these advanced computational techniques, potentially enhancing research in finance, seismology, and social media analytics.
Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models
NLP
Large Language Models
Generative Models
- EC routing outperforms TC routing in DLMs, achieving better load balance and faster convergence.
- Timestep-dependent expert capacity scheduling enhances performance by allocating more resources to low-mask-ratio steps.
- Existing pretrained TC DLMs can be retrofitted to EC routing with significant improvements in accuracy and convergence speed.
- The study provides a mechanistic explanation for the efficiency gains observed with EC routing.
Summary
This paper introduces Expert-Choice (EC) routing as a superior alternative to Token-Choice (TC) routing in Diffusion Language Models (DLMs). The authors argue that TC routing, inherited from autoregressive models, leads to load imbalance and inefficient computation allocation. EC routing, in contrast, ensures deterministic load balancing and allows for timestep-dependent expert capacity, optimizing resource allocation based on the denoising step. The study demonstrates that allocating more capacity to low-mask-ratio steps significantly enhances performance due to higher learning efficiency in these contexts. Furthermore, the authors show that existing pretrained TC DLMs can be retrofitted to utilize EC routing, resulting in improved convergence speed and accuracy across various downstream tasks. Overall, the findings establish EC routing as a more effective paradigm for DLM MoE models, advocating for a shift towards adaptive computation policies in DLMs.
Methodology
The authors conducted a systematic comparison between EC and TC routing in DLMs, analyzing load balancing, throughput, and convergence rates. They introduced a timestep-dependent expert capacity mechanism and evaluated its impact on performance under matched FLOPs. The study also involved retrofitting existing TC DLMs to EC routing and assessing the improvements in downstream task performance.
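The routing difference is easy to sketch: in expert-choice routing each expert takes its top-`capacity` tokens by router score, so per-expert load is balanced by construction (a token may be picked by several experts, or by none). Making `capacity` a function of the denoising timestep is the paper's scheduling idea; the sketch below shows only a single routing step.

```python
import numpy as np

def expert_choice_route(scores, capacity):
    # scores: (tokens, experts) router logits. Each expert selects its
    # top-`capacity` tokens, in contrast to token-choice routing where
    # each token selects its top-k experts (and loads can be imbalanced).
    return {e: sorted(np.argsort(scores[:, e])[-capacity:].tolist())
            for e in range(scores.shape[1])}
```

Every expert ends up with exactly `capacity` tokens regardless of the score distribution, which is the deterministic load balance the paper highlights.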
Results
The results indicate that EC routing achieves 2.0× faster convergence than TC routing, with significant improvements in load balancing and throughput. The analysis revealed that low-mask-ratio contexts benefit from increased expert capacity, leading to enhanced learning efficiency. Retrofitting TC DLMs to EC resulted in faster convergence and improved accuracy across diverse tasks.
Implications
The findings suggest that adopting EC routing could lead to more efficient training and inference in large-scale DLMs, potentially influencing future designs of language models and their applications in various NLP tasks. This approach may also inspire new adaptive computation strategies in other areas of machine learning.
CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
Multimodal
Graph Learning
Computer Vision
- CRIT introduces a new dataset specifically designed for cross-modal multi-hop reasoning.
- The dataset is generated using a graph-based automatic data synthesis pipeline.
- State-of-the-art VLMs show significant performance improvements when trained on CRIT.
- Existing multimodal benchmarks often fail to test true cross-modal grounding.
Summary
The paper introduces CRIT, a novel dataset and benchmark aimed at enhancing cross-modal multi-hop reasoning in Vision-Language Models (VLMs). The authors identify a significant gap in existing multimodal benchmarks, which often fail to require true cross-modal grounding, leading to hallucinations and poorly grounded reasoning in VLMs. CRIT addresses this by employing a graph-based automatic data synthesis pipeline that generates complex reasoning tasks from interleaved image-text contexts. The dataset spans various domains, including natural images and videos, and includes a manually verified test set for reliable evaluation. Experimental results demonstrate that even state-of-the-art models struggle with the reasoning tasks presented in CRIT, but models trained on this dataset show marked improvements in cross-modal reasoning capabilities, outperforming existing benchmarks such as SPIQA. This work highlights the need for more sophisticated multimodal datasets that enforce the integration of information across modalities to foster better reasoning in AI systems.
Methodology
The authors developed a graph-based automatic data generation pipeline that captures entities, attributes, and relationships from interleaved image-text content. This structured representation allows for the sampling of sub-graphs that ensure complex multi-hop relationships are present. A model is then used to generate questions that require multi-hop reasoning, without relying on VLMs, thus avoiding cyclical biases in data generation.
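A minimal version of the sub-graph sampling step, assuming a simple adjacency-list graph of entities and relations (the entity and relation names below are invented): a random walk that never revisits a node yields a chain of hops that can seed one multi-hop question.

```python
import random

def sample_hop_chain(graph, start, hops, seed=0):
    # Random walk over an entity/relation graph; the visited chain of
    # (relation, entity) edges seeds one multi-hop question.
    rng = random.Random(seed)
    chain, node, seen = [start], start, {start}
    for _ in range(hops):
        nbrs = [(r, v) for r, v in graph.get(node, []) if v not in seen]
        if not nbrs:
            break  # dead end: no unvisited neighbours
        r, node = rng.choice(nbrs)
        seen.add(node)
        chain.append((r, node))
    return chain
```

A two-hop chain through an image entity and a text entity forces the question to ground in both modalities.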
Results
Experiments reveal that state-of-the-art models struggle with the reasoning tasks in the CRIT dataset. However, models trained on CRIT demonstrate significant gains in cross-modal multi-hop reasoning, achieving strong improvements on SPIQA and other standard multimodal benchmarks.
Implications
The findings suggest that enhancing datasets with complex, interleaved image-text contexts can lead to better performance in VLMs. This has implications for developing more robust AI systems capable of performing real-world reasoning tasks that require the integration of multiple modalities.
MATA-Former & SIICU: Semantic Aware Temporal Alignment for High-Fidelity ICU Risk Prediction
Time Series
Multimodal
- Introduction of MATA-Former, a transformer architecture that integrates semantic awareness into temporal attention for ICU risk prediction.
- Development of Plateau-Gaussian Soft Labeling (PSL) to transition from binary classification to continuous regression, enhancing risk modeling granularity.
- Creation of the SIICU dataset, featuring over 506,000 expert-annotated clinical events, addressing the lack of fine-grained clinical datasets.
- Demonstration of improved efficacy and generalization in risk prediction using the proposed methods on both SIICU and MIMIC-IV datasets.
Summary
This paper presents a novel framework, MATA-Former, designed to enhance ICU risk prediction by addressing the challenges of integrating structured and unstructured clinical data. Traditional methods often rely on binary supervision and physical timestamps, which can misalign with clinical realities where risk signals are embedded in complex medical semantics. MATA-Former employs a semantic-aware temporal attention mechanism that dynamically adjusts attention weights based on the causal relevance of historical events rather than their temporal proximity. Additionally, the authors introduce Plateau-Gaussian Soft Labeling (PSL), which reformulates binary classification into a continuous multi-horizon regression framework, allowing for a more nuanced understanding of risk evolution over time. The framework is evaluated on the newly constructed Semantic-Integrated Intensive Care Unit (SIICU) dataset, which includes over 506,000 expert-annotated clinical events, demonstrating superior performance in capturing risks from irregular, text-intensive clinical time series compared to existing methods.
Methodology
The authors propose MATA-Former, which utilizes a semantic-guided attention mechanism to prioritize clinically relevant historical events. This is complemented by PSL, which transforms binary labels into continuous regression outputs, allowing for a more detailed representation of risk trajectories. The SIICU dataset was constructed using a combination of large language model pre-annotation followed by expert verification to ensure high-quality annotations.
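One plausible shape for a plateau-Gaussian soft label (the paper defines its own plateau width and variance; the values here are guesses): the target is 1 within a plateau around the event time and decays as a Gaussian outside it, turning a binary event label into a continuous risk trajectory.

```python
import numpy as np

def plateau_gaussian_label(t, event_time, plateau=2.0, sigma=4.0):
    # Distance beyond the plateau edge; zero inside the plateau
    d = np.maximum(np.abs(t - event_time) - plateau, 0.0)
    # Full risk inside the plateau, Gaussian falloff outside
    return np.exp(-0.5 * (d / sigma) ** 2)
```

Evaluating on a time grid around an event shows the flat top and the smooth decay that replace the hard 0/1 boundary.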
Results
The MATA-Former framework outperformed existing methods in terms of risk prediction accuracy and generalization capabilities on both the SIICU and MIMIC-IV datasets. The introduction of PSL significantly improved the model's ability to capture the dynamics of risk evolution over time.
Implications
The findings suggest that integrating semantic awareness into predictive modeling can enhance clinical decision support systems, potentially leading to better patient outcomes in critical care settings. The SIICU dataset can serve as a valuable resource for future research in ICU risk prediction.
Graph Neural Operator Towards Edge Deployability and Portability for Sparse-to-Dense, Real-Time Virtual Sensing on Irregular Grids
Graph Learning
Efficient ML
Theory
- VIRSO offers a novel approach to virtual sensing by combining spectral and spatial analysis for accurate reconstruction.
- The method significantly reduces energy consumption and latency, making it suitable for edge deployment.
- VIRSO outperforms existing neural operators in terms of accuracy and efficiency on complex benchmarks.
- The introduction of Variable KNN (V-KNN) enhances graph construction for irregular geometries.
Summary
This paper introduces VIRSO (Virtual Irregular Real-Time Sparse Operator), a novel graph-based neural operator designed for sparse-to-dense reconstruction on irregular geometries, addressing the challenges of real-time virtual sensing in resource-constrained environments. Traditional physics-based solvers are often too slow and power-hungry for practical applications, particularly in fields like nuclear thermal-hydraulics where accurate internal state monitoring is critical. VIRSO integrates both spectral and spatial analysis to enhance reconstruction accuracy while minimizing latency and power consumption. The authors propose a variable-connectivity algorithm, Variable KNN (V-KNN), for efficient graph construction tailored to mesh-informed data. Evaluations on three nuclear thermal-hydraulic benchmarks demonstrate that VIRSO achieves mean relative L2 errors below 1% across various reconstruction ratios, outperforming existing methods with fewer parameters. The implementation on NVIDIA Jetson Orin Nano confirms its edge-deployability, achieving sub-10 W power consumption and sub-second latency, thus establishing a new paradigm for compute-aware operator learning in real-time sensing applications.
Methodology
The authors developed a graph-based neural operator, VIRSO, which employs a variable-connectivity algorithm (V-KNN) for mesh-informed graph construction. The operator is designed to efficiently map sparse boundary measurements to complete interior field distributions, optimizing for edge hardware constraints such as latency, power, and memory.
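One plausible reading of variable-connectivity KNN, assuming each node's k is scaled by its local sparsity (distance to its nearest neighbour); the paper's actual V-KNN criterion may differ, so treat this purely as a sketch of per-node variable connectivity.

```python
import numpy as np

def variable_knn(points, k_min=3, k_max=8):
    # Pairwise distances; self-distances set to inf so a node never picks itself
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nn = D.min(axis=1)                                  # local sparsity proxy
    scale = (nn - nn.min()) / (np.ptp(nn) + 1e-12)      # 0 (dense) .. 1 (sparse)
    ks = np.round(k_min + scale * (k_max - k_min)).astype(int)
    # Each node connects to its ks[i] nearest neighbours
    return {i: np.argsort(D[i])[: ks[i]].tolist() for i in range(len(points))}
```

Nodes in sparse regions of an irregular mesh get more neighbours, which keeps information flowing where the grid thins out.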
Results
VIRSO achieved mean relative L2 errors below 1% across three nuclear thermal-hydraulic benchmarks with reconstruction ratios ranging from 47:1 to 156:1. The full 10-layer configuration reduced the energy-delay product from approximately 206 J·ms to 10.1 J·ms on an NVIDIA H200. Implemented on an NVIDIA Jetson Orin Nano, all configurations maintained sub-10 W power consumption and sub-second latency.
Implications
The findings suggest that VIRSO can serve as a viable solution for real-time virtual sensing in inaccessible and resource-constrained environments, particularly in advanced nuclear energy systems. This work opens avenues for further research into compute-aware operator learning frameworks that can be applied across various fields requiring efficient and accurate sensing.
Improving Latent Generalization Using Test-time Compute
NLP
Large Language Models
Reinforcement Learning
- Introduces test-time compute as a method to improve latent generalization in LLMs.
- Demonstrates that models trained to generate chains-of-thought can generalize effectively to both in-distribution and out-of-distribution tasks.
- Finds that while thinking improves performance on many tasks, pure reversal tasks remain challenging.
- Highlights the brittleness of factual self-verification in thinking models compared to in-context learning.
Summary
This paper addresses the limitations of in-weights learning in large language models (LLMs), particularly their struggles with latent generalization, the ability to deduce knowledge that is not explicitly stated in the training data. The authors identify that while in-context learning (ICL) demonstrates robust generalization, in-weights learning often fails on tasks requiring deductive reasoning, as exemplified by the reversal curse. Previous methods for enhancing latent generalization relied on train-time data augmentation, which is task-specific and does not generalize well to out-of-distribution knowledge. To overcome these limitations, the authors propose leveraging test-time compute, or 'thinking', and employ Reinforcement Learning (RL) from correctness feedback to train models to generate long chains-of-thought (CoTs). The experiments reveal that this thinking approach significantly enhances latent generalization, allowing models to perform better on both in-distribution and out-of-distribution tasks. However, while thinking models improve on many deductive reasoning tasks, they still struggle with pure reversal tasks, indicating that factual self-verification remains a challenge. Overall, the study presents test-time thinking as a promising direction for advancing latent generalization in LLMs.
Methodology
The authors utilize Reinforcement Learning (RL) to train large language models to produce long chains-of-thought (CoTs) during test time. This approach allows the models to probe their in-weights knowledge effectively, improving their ability to generalize beyond the training data. They replicate the lack of latent generalization in LLMs on various datasets and then assess the performance of models trained with the thinking methodology against traditional train-time augmentation techniques.
Results
The experiments show that thinking models significantly outperform traditional train-time augmented models on latent generalization tasks. They effectively generate long CoTs that enhance reasoning capabilities, leading to improved performance on in-distribution tasks and some out-of-distribution tasks. However, the models still exhibit challenges with pure reversal tasks, indicating that while thinking aids generalization, it does not fully resolve all issues related to factual self-verification.
Implications
The findings suggest that incorporating test-time reasoning strategies could lead to more flexible and robust language models capable of better generalization across various tasks. This approach may have applications in areas requiring advanced reasoning and deduction, such as automated reasoning systems, educational tools, and complex decision-making processes.
Variational LSTM with Augmented Inputs: Nonlinear Response History Metamodeling with Aleatoric and Epistemic Uncertainty
Time Series
- Introduces a Variational LSTM model for nonlinear structural metamodeling.
- Augmented inputs effectively capture record-to-record variability and system uncertainties.
- Epistemic uncertainty is quantified using Monte Carlo dropout, enhancing prediction reliability.
- Validated on nonlinear systems subjected to stochastic seismic and wind loads.
Summary
This paper presents a novel approach for nonlinear structural metamodeling using a Variational Long Short-Term Memory (LSTM) model with augmented inputs. The proposed method addresses the challenges of uncertainty propagation in high-dimensional nonlinear dynamic systems, particularly in the context of performance-based design and risk assessment. The authors introduce augmented inputs that capture both aleatoric uncertainty (variability due to inherent randomness in the system) and epistemic uncertainty (uncertainty in model predictions). The epistemic uncertainty is quantified using a Monte Carlo dropout technique, which allows for effective uncertainty estimation without the computational burden of full Bayesian methods. The model is validated through case studies involving stochastic seismic and wind loads, demonstrating its ability to accurately reproduce nonlinear response time histories while providing confidence bounds that reflect epistemic uncertainty. This approach significantly enhances the predictive fidelity of machine learning models in structural engineering applications, making them more reliable for practical use.
Methodology
The authors developed a probabilistic metamodeling technique based on a Variational LSTM architecture. The model incorporates augmented inputs that include key random system parameters and excitation series to capture aleatoric uncertainty. Monte Carlo dropout is employed to approximate epistemic uncertainty, allowing for efficient uncertainty simulation without significant additional training costs.
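Monte Carlo dropout itself is a standard technique and easy to sketch: keep dropout active at prediction time, run T stochastic forward passes, and read the spread of the outputs as epistemic uncertainty. The tiny ReLU network below stands in for the paper's variational LSTM; layer sizes and the dropout rate are illustrative.

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, p=0.2, T=200, seed=0):
    # Dropout stays ON at prediction time; each pass samples a fresh mask,
    # and the stddev over T passes approximates epistemic uncertainty.
    rng = np.random.default_rng(seed)
    outs = []
    for _ in range(T):
        h = np.maximum(x @ W1, 0.0)            # hidden layer (ReLU)
        mask = rng.random(h.shape) >= p        # Bernoulli keep-mask
        outs.append((h * mask / (1.0 - p)) @ W2)   # inverted dropout scaling
    outs = np.array(outs)
    return outs.mean(axis=0), outs.std(axis=0)
```

Setting p=0 makes every pass identical and the uncertainty collapses to zero, which is a quick sanity check on the mechanism.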
Results
The proposed metamodels demonstrated high accuracy in reproducing nonlinear response time histories across various case studies. The confidence bounds generated by the model effectively indicated the associated epistemic uncertainty, showcasing the model's robustness and predictive fidelity under different stochastic loading scenarios.
Implications
This research has significant implications for structural engineering, particularly in performance-based design and risk assessment. The ability to quantify both aleatoric and epistemic uncertainties enhances the reliability of predictions made by machine learning models, facilitating better decision-making in engineering applications. The methodology can be applied to various high-dimensional dynamic systems, potentially transforming how uncertainty is managed in structural analysis.
Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation
Large Language Models
Efficient ML
NLP
- OUROBOROS introduces a Controller hypernetwork for dynamic weight modulation in recursive transformers.
- The system achieves a 43.4% reduction in training loss compared to a baseline model.
- Gated recurrence is essential for maintaining performance during deep iterations.
- The model outperforms static per-step LoRA methods, particularly at lower depths.
Read more
Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation
Summary
The paper introduces OUROBOROS, a novel system designed to enhance recursive transformers by addressing the limitation of uniform weight application across multiple depth steps. Traditional recursive transformers apply the same transformation at each step, which restricts their ability to perform distinct operations. OUROBOROS incorporates a compact Controller hypernetwork that generates input-dependent diagonal modulation vectors for each step, applied to frozen low-rank adaptation (LoRA) bases. This approach allows the model to adapt its transformations based on the current hidden state, significantly improving its performance. The system combines gated recurrence, initialized to 88% retention, and per-step LayerNorm to ensure stable deep iterations. Empirical results demonstrate that OUROBOROS reduces training loss by 43.4% compared to a baseline model while recovering 51.3% of the performance gap caused by layer removal. The model adds only 9.2 million trainable parameters and consistently outperforms static per-step LoRA across various depths and ranks. However, the Controller's improvements do not yet transfer to held-out text, indicating a limitation attributed to frozen downstream layers.
Methodology
OUROBOROS employs a Controller hypernetwork that observes the mean-pooled hidden state and generates diagonal modulation vectors for LoRA targets in the recurrent block. The LoRA bases are initialized using singular value decomposition (SVD) from the residuals of removed layers, allowing the Controller to modulate pre-computed weight directions. Gated recurrence is utilized to prevent representation drift, and per-step LayerNorm is applied for stability.
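The per-step modulation described here can be sketched as follows: a small controller reads the mean-pooled hidden state and emits one gain per LoRA rank dimension, which is applied between the frozen low-rank bases. This is a minimal illustration of the mechanism; the controller's architecture, the shapes, and the tanh nonlinearity are assumptions, not the authors' implementation.

```python
import numpy as np

def modulated_lora_delta(h, A, B, Wc):
    """Input-conditioned LoRA update (sketch): a tiny 'controller' maps the
    mean-pooled hidden state to a diagonal modulation vector m, applied
    between frozen low-rank bases B and A: delta-W = B diag(m) A."""
    pooled = h.mean(axis=0)      # mean-pool over sequence positions
    m = np.tanh(Wc @ pooled)     # controller output: one gain per rank dim
    return B @ np.diag(m) @ A

rng = np.random.default_rng(0)
d, r, T = 8, 4, 6
A = rng.normal(size=(r, d))      # frozen LoRA "down" basis
B = rng.normal(size=(d, r))      # frozen LoRA "up" basis
Wc = rng.normal(size=(r, d))     # controller weights (hypothetical)
h = rng.normal(size=(T, d))      # hidden states at one recursion step
dW = modulated_lora_delta(h, A, B, Wc)
```

Because the bases are frozen, the update stays rank-r; only the diagonal gains change with the input, which is what makes each recursion step input-dependent.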
Results
The implementation of OUROBOROS resulted in a 43.4% reduction in training loss compared to the unmodified 17-layer baseline, recovering 51.3% of the performance gap from layer removal. The model added only 9.2 million trainable parameters and outperformed static per-step LoRA by 1.44 loss points at depth 1, maintaining superior performance across all tested depths and ranks.
Implications
OUROBOROS has the potential to enhance the efficiency and performance of recursive transformers, making them more adaptable to varying inputs. This could lead to advancements in applications requiring deep learning models with reduced parameter counts, such as on-device AI and real-time processing tasks.
Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring
Time Series
Multimodal
- Introduction of a transformer model for wind-excited structural response forecasting.
- Multimodal learning from both wind features and vibration signals enhances prediction accuracy.
- Validation with real-world data from the Hardanger Bridge under variable conditions.
- Improves modal energy retention and reduces false alarms in structural health monitoring.
Read more
Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring
Summary
This paper presents a novel transformer-based model designed for forecasting wind-induced structural responses, specifically in the context of bridge health monitoring. The model integrates multimodal deep learning techniques to jointly analyze wind features and vibration signals, enhancing the accuracy of predictions. By leveraging the temporal characteristics of the system, the model is trained to forecast structural responses without relying on assumptions about wind stationarity or normal vibration behavior. The methodology is validated using real-world data from the Hardanger Bridge, demonstrating its effectiveness under varying environmental conditions. The results indicate that the transformer model significantly improves modal energy retention, reduces tail risks, and minimizes false alarms compared to traditional forecasting methods. This work highlights the potential of transformer-based digital twin components for advancing infrastructure management and adaptive monitoring throughout a structure's lifecycle.
Methodology
The proposed approach employs a transformer self-attention encoder-decoder architecture to analyze time series data from wind and structural vibrations. The model is trained on temporal characteristics to forecast responses and detect deviations from expected behavior, serving as an early-warning system for structural changes. It utilizes real-world measurements to validate its performance.
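At the core of such an encoder-decoder is scaled dot-product self-attention over past time steps. A generic NumPy sketch (not the authors' multimodal architecture):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each time step attends over all
    time steps, the basic operation of a transformer encoder-decoder."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 10, 8                      # e.g. 10 time steps of wind/vibration features
X = rng.normal(size=(T, d))
out, attn = scaled_dot_product_attention(X, X, X)  # self-attention
```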
Results
The transformer model outperformed traditional forecasting methods by accurately capturing the dynamic behavior of structures under varying conditions. It demonstrated improved modal energy retention and reduced false alarms, indicating its robustness and reliability in practical applications.
Implications
The findings suggest that transformer-based models can significantly enhance the capabilities of digital twin frameworks in structural health monitoring, leading to better predictive maintenance and real-time monitoring of infrastructure. This could facilitate more resilient infrastructure management practices and continuous learning throughout the lifecycle of structures.
PI-JEPA: Label-Free Surrogate Pretraining for Coupled Multiphysics Simulation via Operator-Split Latent Prediction
Efficient ML
- PI-JEPA enables label-free pretraining using abundant unlabeled parameter fields.
- The framework employs masked latent prediction and PDE residual regularization.
- Significant accuracy improvements over existing methods with fewer labeled runs.
- Aligns with operator-splitting methods to handle different physical processes effectively.
Read more
PI-JEPA: Label-Free Surrogate Pretraining for Coupled Multiphysics Simulation via Operator-Split Latent Prediction
Summary
The paper introduces PI-JEPA (Physics-Informed Joint Embedding Predictive Architecture), a novel framework for surrogate pretraining in multiphysics simulations that addresses the challenge of data asymmetry in reservoir simulations. Traditional neural operator surrogates require extensive labeled simulation data, which is costly to generate, while the input parameter fields are abundant and inexpensive. PI-JEPA leverages these unlabeled parameter fields by employing masked latent prediction and per-sub-operator PDE residual regularization, allowing the model to learn from the spatial heterogeneity of the subsurface without needing completed PDE solves during pretraining. The architecture aligns with the Lie-Trotter operator-splitting method, dedicating separate latent modules for different physical processes, such as pressure and saturation transport. The framework demonstrates significant improvements in prediction accuracy, achieving 1.9× lower error than the Fourier Neural Operator (FNO) and 2.4× lower error than DeepONet with only 100 labeled runs, and a 24% improvement over supervised-only training with 500 labeled runs. This approach fundamentally reduces the simulation budget required for deploying multiphysics surrogates, making it a promising advancement in the field of reservoir engineering and scientific computing.

Methodology
PI-JEPA utilizes a two-phase training approach: pretraining on unlabeled parameter fields using masked latent prediction and fine-tuning with a limited number of labeled simulation runs. The architecture incorporates a predictor bank aligned with the Lie-Trotter operator-splitting decomposition, allowing for separate latent modules for different physical processes.
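The masked latent prediction step can be illustrated generically: encode the full parameter field, hide a subset of patch latents, and train a predictor to recover the hidden latents from the visible context, scoring the loss only on masked positions. The linear predictor and shapes below are assumptions for illustration, not PI-JEPA's modules.

```python
import numpy as np

def masked_latent_prediction_loss(z_context, z_target, mask, predictor):
    """JEPA-style objective (sketch): predict the latents of masked patches
    from the visible context and score the prediction in latent space."""
    z_pred = predictor(z_context)        # predicted latents for all patches
    diff = (z_pred - z_target)[mask]     # loss only on masked positions
    return float((diff ** 2).mean())

rng = np.random.default_rng(0)
n_patch, d = 16, 8
z_target = rng.normal(size=(n_patch, d))            # latents of the full field
mask = np.zeros(n_patch, dtype=bool)
mask[::2] = True                                    # mask every other patch
z_context = np.where(mask[:, None], 0.0, z_target)  # masked-out context
Wp = rng.normal(size=(d, d)) * 0.1                  # toy linear predictor
loss = masked_latent_prediction_loss(z_context, z_target, mask, lambda z: z @ Wp)
```

An exact predictor on an unmasked input drives the loss to zero, which is the trivial sanity check for this objective.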
Results
On single-phase Darcy flow, PI-JEPA achieved 1.9× lower error compared to FNO and 2.4× lower error compared to DeepONet with just 100 labeled runs. Additionally, it showed a 24% improvement over traditional supervised training methods when using 500 labeled runs.
Implications
The findings suggest that PI-JEPA can significantly enhance the efficiency of multiphysics simulations in reservoir engineering and other fields by reducing the reliance on costly labeled data. This could lead to faster and more cost-effective simulation workflows, enabling real-time optimization and uncertainty quantification.
Bridging Deep Learning and Integer Linear Programming: A Predictive-to-Prescriptive Framework for Supply Chain Analytics
Time Series
Optimization
- The study presents a systematic comparison of deep learning models and statistical methods for demand forecasting.
- N-BEATS outperformed MSTL and N-HiTS in forecasting accuracy, making it the best-suited model for this application.
- The proposed framework effectively integrates forecasting with optimization, providing a practical solution for supply chain logistics.
- The research highlights the importance of accurate demand forecasting in reducing operational costs and improving service levels.
Read more
Bridging Deep Learning and Integer Linear Programming: A Predictive-to-Prescriptive Framework for Supply Chain Analytics
Summary
This paper addresses the challenges of demand forecasting in supply chain management, particularly the difficulties posed by seasonality, irregular spikes, and noise in retail data. The authors propose a three-step analytical framework that integrates forecasting with operational analytics. The first step involves exploratory data analysis of 180,519 transaction records to identify trends and seasonal patterns. The second step compares the forecasting performance of the N-BEATS and N-HiTS deep learning models against a statistical time series decomposition model, MSTL. The results indicate that both deep learning models significantly outperform the statistical benchmark, with N-BEATS yielding the lowest forecasting error. In the final step, the authors utilize the forecasts to inform an integer linear programming (ILP) model aimed at minimizing total delivery time while adhering to budget and capacity constraints. This integrated approach demonstrates the practical impact of accurate forecasting on logistics planning and decision-making, showcasing how precise predictions can lead to optimized operational outcomes.
Methodology
The methodology consists of three stages: (1) exploratory data analysis to identify trends and seasonal components in the dataset, (2) comparative forecasting using N-BEATS, N-HiTS, and MSTL, and (3) application of the best-performing forecasting model to an integer linear programming framework to optimize delivery logistics.
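The final ILP stage can be illustrated with a deliberately tiny brute-force solver (a real pipeline would use a dedicated MILP solver): choose integer shipment counts per route that cover the forecast demand at minimum total delivery time while staying within budget. All numbers below are hypothetical.

```python
from itertools import product

def solve_toy_ilp(time, cost, capacity, demand, budget):
    """Tiny brute-force ILP (illustrative stand-in for a real solver):
    minimize total delivery time subject to demand coverage and budget."""
    n = len(time)
    best = None
    for x in product(range(demand + 1), repeat=n):
        if sum(xi * c for xi, c in zip(x, capacity)) < demand:
            continue  # demand not covered
        if sum(xi * c for xi, c in zip(x, cost)) > budget:
            continue  # over budget
        total_time = sum(xi * t for xi, t in zip(x, time))
        if best is None or total_time < best[0]:
            best = (total_time, x)
    return best

# Hypothetical data: three routes with (time, cost, capacity) per shipment.
best = solve_toy_ilp(time=[5, 3, 8], cost=[4, 7, 2],
                     capacity=[10, 10, 10], demand=20, budget=15)
```

Here the optimum ships two units on the fast route (total time 6, cost 14), illustrating how the forecast-derived demand feeds the prescriptive step.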
Results
The results showed that both N-BEATS and N-HiTS significantly outperformed the MSTL model in forecasting accuracy, with N-BEATS achieving the lowest forecasting error. The optimized shipping plan derived from the ILP model resulted in a feasible and cost-effective delivery strategy.
Implications
The findings suggest that integrating advanced forecasting techniques with optimization models can enhance decision-making in supply chain management, leading to improved efficiency and reduced costs. This predictive-to-prescriptive framework can be applied in various logistics and operational contexts.
DDCL-INCRT: A Self-Organising Transformer with Hierarchical Prototype Structure (Theoretical Foundations)
Theory
Efficient ML
NLP
- Introduces a self-organising transformer architecture that adapts its structure during training.
- Utilizes Deep Dual Competitive Learning to replace traditional feedforward blocks with a prototype layer.
- Implements Incremental Transformer to dynamically adjust the number of attention heads based on task requirements.
- Proves that the resulting hierarchical structure is unique, minimal, and robust to pruning.
Read more
DDCL-INCRT: A Self-Organising Transformer with Hierarchical Prototype Structure (Theoretical Foundations)
Summary
This paper presents DDCL-INCRT, a novel transformer architecture that autonomously determines its structure during training, addressing the common issue of over-specification in traditional transformer models. The architecture integrates two main concepts: Deep Dual Competitive Learning (DDCL) and Incremental Transformer (INCRT). DDCL replaces the standard feedforward block with a self-organising prototype layer that learns a dictionary of prototype vectors, allowing for dynamic adjustment of the model based on the data's directional information. INCRT manages the number of attention heads, starting with one and incrementally adding heads based on a criterion that measures residual directional information. The paper demonstrates that these two components work synergistically, leading to a unique and minimal hierarchical structure of heads that efficiently captures information from fine to coarse granularity. The theoretical findings include guarantees of stability, convergence, and robustness to pruning, establishing that the architecture is derived rather than designed, ensuring optimal performance without unnecessary complexity.
Methodology
The methodology combines two innovative approaches: DDCL, which employs a competitive prototype layer that learns and organizes itself during training, and INCRT, which incrementally adds attention heads based on the uncovered directional information in the data. This allows the model to adaptively grow and optimize its architecture without pre-defined parameters.
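The prototype layer's competitive update can be sketched with classic winner-take-all learning: the prototype best aligned with the input direction is pulled toward it on the unit sphere. This is generic competitive learning for illustration, not the paper's exact DDCL rule.

```python
import numpy as np

def competitive_update(prototypes, x, lr=0.1):
    """One winner-take-all step: the prototype most aligned with the
    input direction moves toward it and is renormalized (sketch)."""
    xn = x / np.linalg.norm(x)
    sims = prototypes @ xn                          # alignment scores
    w = int(np.argmax(sims))                        # winning prototype
    prototypes[w] += lr * (xn - prototypes[w])      # pull winner toward input
    prototypes[w] /= np.linalg.norm(prototypes[w])  # keep unit norm
    return w

rng = np.random.default_rng(0)
P = rng.normal(size=(4, 8))
P /= np.linalg.norm(P, axis=1, keepdims=True)       # unit-norm prototype dictionary
x = rng.normal(size=8)
winner = competitive_update(P, x)
for _ in range(300):                                # repeated presentations converge
    competitive_update(P, x)
```

Repeated presentations of the same direction drive the winning prototype onto that direction while the others are untouched, which is the self-organising behaviour the layer relies on.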
Results
The paper provides theoretical proofs that the DDCL-INCRT architecture converges to a unique hierarchical structure that is minimal for the given task. It shows that the architecture can effectively reduce redundancy by dynamically adjusting its components based on the data, leading to improved efficiency and performance.
Implications
The implications of this research suggest that transformer architectures can be made more efficient and effective by allowing them to self-organize based on the task at hand. This could lead to advancements in various applications of transformers in NLP and other domains, reducing computational costs and improving model performance.
JetPrism: diagnosing convergence for generative simulation and inverse problems in nuclear physics
Generative Models
- JetPrism addresses the premature plateau of standard CFM loss metrics, providing a more reliable convergence diagnostic.
- The framework incorporates a multi-metric evaluation protocol that includes χ² statistics, W1 distances, and correlation matrix distances.
- Validation of JetPrism is performed using synthetic stress tests and a Jefferson Lab dataset relevant to the Electron-Ion Collider.
- The proposed methodology ensures precise statistical agreement with ground-truth data without memorizing the training set.
Read more
JetPrism: diagnosing convergence for generative simulation and inverse problems in nuclear physics
Summary
This paper introduces JetPrism, a novel framework designed to address the challenges of convergence diagnostics in generative simulations and inverse problems within nuclear physics. The authors highlight the limitations of Conditional Flow Matching (CFM) as a method for accelerating Monte Carlo simulations and detector unfolding, particularly its tendency to plateau prematurely in training loss, which can mislead researchers regarding the model's convergence and physical fidelity. JetPrism serves as a configurable CFM framework that provides a more reliable evaluation of generative models through a multi-metric protocol that includes various statistical measures. The authors validate JetPrism using synthetic stress tests and a realistic dataset from Jefferson Lab, demonstrating that physics-informed metrics can continue to improve long after standard loss metrics indicate convergence. This work emphasizes the importance of domain-specific evaluations over generic loss metrics, establishing JetPrism as a dependable tool for ensuring accurate statistical agreement with ground-truth data. The framework's applicability extends beyond nuclear physics to fields such as medical imaging, astrophysics, semiconductor discovery, and quantitative finance, where high-fidelity simulations and rigorous inversion are crucial.
Methodology
The JetPrism framework utilizes Conditional Flow Matching (CFM) to generate kinematic event data and perform conditional detector unfolding. It incorporates a multi-metric evaluation protocol to assess convergence and generative fidelity, employing statistical measures such as χ², W1 distances, and correlation matrix distances. The framework is validated through synthetic stress tests and real datasets from Jefferson Lab.
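Two of the protocol's metrics are easy to state concretely: the 1-Wasserstein (W1) distance between equal-size 1-D samples reduces to the mean absolute difference of the sorted samples, and the Pearson χ² statistic compares binned counts. The sketch below uses synthetic Gaussian "events"; it illustrates the metrics only, not JetPrism's full protocol.

```python
import numpy as np

def w1_distance(a, b):
    """1-Wasserstein distance between equal-size 1-D samples:
    mean absolute difference of the sorted samples."""
    return float(np.abs(np.sort(a) - np.sort(b)).mean())

def chi2_statistic(obs, exp):
    """Pearson chi-squared statistic over bins with nonzero expectation."""
    obs, exp = np.asarray(obs, float), np.asarray(exp, float)
    m = exp > 0
    return float((((obs - exp) ** 2)[m] / exp[m]).sum())

rng = np.random.default_rng(0)
truth = rng.normal(0.0, 1.0, size=10_000)  # "ground truth" events
gen = rng.normal(0.1, 1.0, size=10_000)    # slightly biased generated events
w1 = w1_distance(gen, truth)               # picks up the 0.1 mean shift
edges = np.linspace(-3, 3, 31)
chi2 = chi2_statistic(np.histogram(gen, edges)[0],
                      np.histogram(truth, edges)[0])
```

A small distributional shift that barely moves a training loss shows up directly in both numbers, which is the point of the physics-informed protocol.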
Results
The results demonstrate that JetPrism provides significant improvements in evaluating generative models, with physics-informed metrics showing continued enhancement long after standard loss metrics indicate convergence. The framework successfully models event generation and detector unfolding, ensuring high fidelity in statistical agreement with ground-truth data.
Implications
JetPrism's diagnostic capabilities can enhance the reliability of generative models in various scientific fields, facilitating more accurate simulations and inversions. Its applicability to diverse domains suggests potential advancements in areas requiring high-fidelity data analysis and modeling.
Residuals-based Offline Reinforcement Learning
Reinforcement Learning
Optimization
Theory
- Introduces a residuals-based offline RL framework that mitigates data coverage limitations.
- Defines a residuals-based Bellman optimality operator that incorporates estimation errors.
- Develops a residuals-based offline DQN algorithm for practical implementation.
- Demonstrates effectiveness in a stochastic CartPole environment, showing improved policy learning.
Read more
Residuals-based Offline Reinforcement Learning
Summary
This paper presents a novel framework for offline reinforcement learning (RL) that addresses key limitations associated with existing methods, particularly in high-stakes applications where interaction with the real environment is not feasible. The authors introduce a residuals-based Bellman optimality operator that incorporates estimation errors in transition dynamics into policy optimization. This operator is shown to be a contraction mapping, with conditions established for its fixed point to be asymptotically optimal and to possess finite-sample guarantees. The proposed framework allows for learning policies without the need for comprehensive data coverage across state-action spaces, as it generates unseen states through empirical residuals. The authors further develop a residuals-based offline deep Q-learning (DQN) algorithm and validate its effectiveness through experiments in a stochastic CartPole environment, demonstrating improved performance over traditional offline RL approaches.
Methodology
The authors construct an estimated transition model from static offline data using supervised learning. They compute empirical residuals to capture discrepancies between the learned model and true dynamics, generating trajectories by sampling these residuals. The framework allows for policy training using these generated trajectories, thus addressing limitations in offline RL regarding state-action coverage and distribution shift.
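The residual-resampling idea can be sketched on a toy linear system: fit a transition model to logged (s, a, s') tuples, keep the empirical residuals, and generate synthetic trajectories by adding bootstrapped residuals to the model's predictions. The linear model, policy, and shapes are illustrative assumptions.

```python
import numpy as np

def fit_linear_dynamics(S, A, S_next):
    """Least-squares transition model s' ~ [s, a] W, plus the empirical
    residuals between model predictions and the logged next states."""
    X = np.hstack([S, A])
    W, *_ = np.linalg.lstsq(X, S_next, rcond=None)
    residuals = S_next - X @ W
    return W, residuals

def rollout(s, policy, W, residuals, horizon, rng):
    """Synthetic trajectory: model prediction plus a residual resampled
    from the empirical residual pool (sketch of the paper's idea)."""
    traj = [s]
    for _ in range(horizon):
        a = policy(traj[-1])
        eps = residuals[rng.integers(len(residuals))]  # bootstrap a residual
        traj.append(np.concatenate([traj[-1], a]) @ W + eps)
    return np.stack(traj)

rng = np.random.default_rng(0)
n, ds, da = 200, 3, 1
S = rng.normal(size=(n, ds))
A = rng.normal(size=(n, da))
true_W = rng.normal(size=(ds + da, ds)) * 0.3
S_next = np.hstack([S, A]) @ true_W + 0.05 * rng.normal(size=(n, ds))
W, res = fit_linear_dynamics(S, A, S_next)
traj = rollout(S[0], lambda s: np.zeros(da), W, res, horizon=10, rng=rng)
```

Because rolled-out states combine model predictions with resampled residuals, the policy sees states absent from the static dataset, which is how the framework relaxes the coverage requirement.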
Results
The proposed residuals-based offline DQN algorithm was tested in a stochastic CartPole environment, where it demonstrated superior performance compared to existing offline RL methods. The results indicated that the algorithm effectively learned optimal policies even with limited data coverage and addressed issues related to distribution shift.
Implications
This work has significant implications for high-stakes applications in fields such as healthcare, transportation, and energy, where offline RL can be safely deployed without the risks associated with online learning. The framework can facilitate better decision-making in scenarios where data is limited or costly to collect.
SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
Reinforcement Learning
Large Language Models
Robotics
- SKILL0 is the first RL framework that explicitly formulates skill internalization as a training objective.
- In-context reinforcement learning allows for structured skill guidance during training while removing it at inference.
- Dynamic Curriculum adapts the retention of skills based on their on-policy helpfulness, enhancing the internalization process.
- SKILL0 achieves substantial performance improvements over traditional RL methods while maintaining a low token overhead.
Read more
SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
Summary
The paper introduces SKILL0, a novel reinforcement learning framework aimed at internalizing skills into model parameters, thereby enabling zero-shot autonomous behavior without runtime skill retrieval. Traditional methods rely on inference-time skill augmentation, which suffers from issues such as retrieval noise and token overhead. SKILL0 addresses these limitations by employing an in-context reinforcement learning (ICRL) approach, where skills are initially provided as guidance during training but are completely removed during inference. This transition is facilitated through a dynamic curriculum that evaluates the helpfulness of skills and retains only those that benefit the current policy. The authors demonstrate that SKILL0 significantly outperforms standard RL baselines and maintains an efficient context size, achieving improvements of +9.7% for ALFWorld and +6.6% for Search-QA while using fewer than 0.5k tokens per step. The framework represents a shift from dependence on external skill context to intrinsic competence, marking a significant advancement in the development of autonomous agents.
Methodology
The methodology involves an in-context reinforcement learning framework where skills are provided during training but removed during inference. A dynamic curriculum assesses the helpfulness of skills, retaining only those beneficial to the current policy. Skills are grouped by category and rendered with interaction history to facilitate learning.
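The dynamic curriculum can be sketched as a simple bookkeeping loop: maintain a running helpfulness estimate per skill and drop skills that stop paying off on-policy. The EMA rule, threshold, and skill names below are hypothetical, not the paper's exact scheme.

```python
def update_curriculum(skills, scores, keep_threshold=0.0, ema=0.9):
    """Helpfulness-based curriculum (sketch): blend each skill's running
    helpfulness with its latest on-policy score (e.g. reward with the skill
    minus reward without it) and retain only skills above a threshold."""
    retained = {}
    for name, helpfulness in skills.items():
        new_h = ema * helpfulness + (1 - ema) * scores.get(name, 0.0)
        if new_h >= keep_threshold:
            retained[name] = new_h  # still worth showing in-context
    return retained

# Hypothetical skills with running helpfulness and fresh on-policy scores.
skills = {"navigate": 0.2, "search": 0.05, "redundant_tip": -0.3}
scores = {"navigate": 0.5, "search": -0.4, "redundant_tip": -0.2}
skills = update_curriculum(skills, scores)
```

As training proceeds, skills the policy has internalized stop producing positive scores and fall out of the context, which is how the guidance is removed before inference.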
Results
SKILL0 demonstrated significant performance improvements over standard RL baselines, achieving +9.7% on ALFWorld and +6.6% on Search-QA. The framework maintained an efficient context size of fewer than 0.5k tokens per step, indicating reduced inference overhead without compromising task performance.
Implications
The implications of SKILL0 extend to enhancing the capabilities of autonomous agents in various applications, including robotics, interactive systems, and complex decision-making tasks, by enabling them to learn and execute skills independently without reliance on external guidance.
Learning ECG Image Representations via Dual Physiological-Aware Alignments
Multimodal
Time Series
Computer Vision
- Introduces ECG-Scan, a self-supervised framework for ECG image representation learning.
- Utilizes dual physiological-aware alignments for multimodal contrastive learning.
- Incorporates soft-lead constraints to improve signal lead inter-consistency.
- Demonstrates superior performance of the image-based model over existing baselines.
Read more
Learning ECG Image Representations via Dual Physiological-Aware Alignments
Summary
This paper presents ECG-Scan, a self-supervised framework designed to learn clinically generalized representations from ECG images, addressing the limitations of existing automated ECG analysis methods that primarily rely on raw signal recordings. The authors introduce a dual physiological-aware alignment approach that optimizes image representation learning through multimodal contrastive alignment between ECG images and gold-standard signal-text modalities. Additionally, the framework incorporates domain knowledge via soft-lead constraints to enhance the reconstruction process and ensure signal lead inter-consistency. The extensive experiments conducted across multiple datasets and downstream tasks demonstrate that the proposed image-based model significantly outperforms existing image baselines and effectively narrows the performance gap between ECG image analysis and signal analysis. This work underscores the potential of self-supervised image modeling to leverage large-scale legacy ECG data, thereby improving access to automated cardiovascular diagnostics, particularly in resource-constrained settings.
Methodology
The methodology involves generating synthetic ECG images from a large-scale ECG signal-text dataset, followed by a self-supervised learning approach that aligns ECG images with signal and text representations in a shared latent space. The dual physiological-aware alignments are achieved through Gramian-based contrastive learning and soft-lead constraints to enhance representation quality.
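The cross-modal alignment can be sketched with a standard InfoNCE-style contrastive loss between image and signal embeddings: matched pairs sit on the diagonal of the similarity matrix and are pulled together, mismatched pairs pushed apart. This illustrates generic multimodal contrastive alignment, not the paper's Gramian-based variant.

```python
import numpy as np

def contrastive_alignment_loss(Zi, Zs, tau=0.1):
    """InfoNCE-style loss (sketch): negative log-probability of the matched
    signal embedding for each image embedding, under a softmax over
    temperature-scaled cosine similarities."""
    Zi = Zi / np.linalg.norm(Zi, axis=1, keepdims=True)
    Zs = Zs / np.linalg.norm(Zs, axis=1, keepdims=True)
    logits = Zi @ Zs.T / tau                      # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_prob).mean())       # NLL of the matched pairs

rng = np.random.default_rng(0)
Zs = rng.normal(size=(8, 16))                     # signal-branch embeddings
loss_aligned = contrastive_alignment_loss(Zs.copy(), Zs)         # perfect alignment
loss_random = contrastive_alignment_loss(rng.normal(size=(8, 16)), Zs)
```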
Results
The results indicate that the ECG-Scan framework achieves superior performance across various datasets and downstream tasks compared to existing image-based methods. The model effectively narrows the performance gap between ECG image analysis and traditional signal analysis, showcasing its robustness and transferability.
Implications
The findings suggest that self-supervised learning techniques can significantly enhance the interpretation of ECG images, making automated cardiovascular diagnostics more accessible, especially in low-resource settings. This could lead to improved early detection and monitoring of cardiovascular diseases globally.
Learn by Surprise, Commit by Proof
Large Language Models
NLP
Optimization
- LSCP allows models to autonomously learn new information by verifying it against existing knowledge.
- The framework uses a self-gating mechanism to adjust learning intensity based on the model's conviction in new information.
- Experiments show that LSCP significantly reduces rote memorization compared to standard fine-tuning.
- The approach mimics biological memory processes, consolidating temporary information into long-term memory.
Read more
Learn by Surprise, Commit by Proof
Summary
This paper introduces LSCP, a self-gated post-training framework designed for autonomous knowledge acquisition in language models. LSCP enables models to learn only what they do not already know by verifying new information against existing knowledge without relying on external oracles. The framework operates by flagging passages with high per-token loss, generating a question-and-answer chain to assess the model's knowledge gaps, and adjusting the AdamW optimizer's β2 parameter based on the depth of conviction (k) achieved through self-verification. This process not only facilitates the acquisition of new knowledge but also refines existing knowledge, addressing issues of hallucination. The model's learning intensity is controlled by a single parameter, r, and the framework mimics biological memory consolidation by transferring temporary information into long-term memory. Experimental results demonstrate that while standard fine-tuning leads to rote memorization, LSCP conditions achieve semantic learning, significantly improving knowledge retention and accuracy in adjacent questions. The findings suggest that the training data format is crucial in preventing memorization, while the gating mechanism protects existing knowledge from being contaminated by incorrect information.
Methodology
The LSCP framework consists of three stages: detecting surprising passages based on high per-token loss, generating Q&A pairs for self-verification, and adjusting the AdamW optimizer's β2 parameter to facilitate learning from verified content. This self-gated approach allows the model to selectively update its weights based on the consistency of new information with existing knowledge.
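The gating idea can be sketched in two steps: flag a passage as surprising when its mean per-token loss is high, then map the self-verification conviction depth k to an effective AdamW β2. The threshold and the linear k-to-β2 mapping below are illustrative assumptions, not the paper's formula.

```python
import numpy as np

def surprise_and_beta2(token_losses, loss_threshold=4.0,
                       k=0, beta2_base=0.999, r=1.0):
    """LSCP-style gate (sketch): high per-token loss marks a passage as
    surprising; deeper conviction k lowers the effective beta2 so the
    optimizer's second moment adapts faster on verified new content."""
    surprising = float(np.mean(token_losses)) > loss_threshold
    if not surprising:
        return False, beta2_base  # nothing new: leave the optimizer alone
    beta2 = beta2_base - r * 0.01 * min(k, 5)  # illustrative mapping
    return True, beta2

flag_known, b_known = surprise_and_beta2(np.array([1.2, 0.8, 1.5]))   # familiar text
flag_new, b_new = surprise_and_beta2(np.array([6.0, 5.5, 7.1]), k=3)  # surprising text
```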
Results
Experiments conducted on the Qwen3-14B model and various other models (8B–32B) revealed that LSCP conditions achieved semantic learning rates 2.7 to 3.0 times better than standard fine-tuning, which resulted in rote memorization. The r = 1.0 condition confirmed that the training data format was the primary factor preventing memorization, while the gating mechanism maintained high accuracy on adjacent questions.
Implications
The LSCP framework has potential applications in improving the learning capabilities of language models, enabling them to adapt and incorporate new knowledge more effectively. This could enhance their performance in dynamic environments where information is constantly evolving, such as in scientific research or real-time data analysis.
LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications
Graph Learning
Multimodal
Robotics
- LEO integrates Graph Attention Networks for adaptive shape estimation in extended object tracking.
- The framework utilizes a unique parallelogram-based ground-truth formulation to represent complex geometries.
- A dual-attention mechanism enhances the robustness of sensor fusion by capturing temporal and spatial dependencies.
- LEO demonstrates real-time performance suitable for production systems in autonomous driving.
Read more
LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications
Summary
The paper presents LEO (Learned Extension of Objects), a novel spatio-temporal Graph Attention Network (GAT) designed for the fusion and tracking of dynamic objects in autonomous driving applications. LEO addresses the limitations of classical Bayesian extended-object models, which require complete a-priori and update-likelihood functions, and deep learning methods that struggle with data annotations and computational demands. By leveraging multi-modal sensor data, including LiDAR, RADAR, and cameras, LEO learns adaptive fusion weights and ensures temporal consistency while modeling complex object geometries, such as articulated trucks. The framework employs a parallelogram-based ground-truth formulation to generalize object shapes and incorporates a dual-attention mechanism to capture both intra-modal temporal dynamics and inter-modal spatial dependencies. Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate LEO's real-time computational efficiency and robustness across diverse driving scenarios, with additional validation on public datasets confirming its cross-dataset generalization capabilities.
Methodology
LEO employs a spatio-temporal architecture based on Graph Attention Network blocks to fuse multi-modal sensor tracks. It introduces a parallelogram-based ground-truth formulation for shape representation and utilizes a dual-attention mechanism to effectively capture both temporal dynamics and spatial dependencies across different sensor modalities.
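A single graph-attention step, in the spirit of Veličković et al.'s GAT formulation (a generic sketch, not LEO's dual-attention blocks), computes pairwise logits from projected node features, masks non-edges, and softmax-normalises over neighbours:

```python
import numpy as np

def gat_layer(H, adj, W, a_src, a_dst, slope=0.2):
    """One GAT-style layer (sketch): attention logits
    e_ij = LeakyReLU(a_src . z_i + a_dst . z_j), softmax over neighbours,
    then attention-weighted aggregation of projected features."""
    Z = H @ W
    e = (Z @ a_src)[:, None] + (Z @ a_dst)[None, :]  # pairwise logits
    e = np.where(e > 0, e, slope * e)                # LeakyReLU
    e = np.where(adj > 0, e, -1e9)                   # mask non-edges
    e -= e.max(axis=1, keepdims=True)                # numerical stability
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=1, keepdims=True)        # attention per node
    return alpha @ Z, alpha

rng = np.random.default_rng(0)
n, d_in, d_out = 5, 6, 4
H = rng.normal(size=(n, d_in))           # per-sensor-track features (toy)
adj = np.ones((n, n))                    # fully connected toy graph
W = rng.normal(size=(d_in, d_out))
out, alpha = gat_layer(H, adj, W, rng.normal(size=d_out), rng.normal(size=d_out))
```

In a fusion setting like LEO's, the learned attention weights play the role of adaptive per-sensor fusion weights.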
Results
The evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset showed that LEO achieves accurate shape and trajectory estimations in real-time, making it suitable for production environments. The framework also demonstrated strong performance in cross-dataset evaluations, confirming its adaptability and robustness across various driving scenarios.
Implications
LEO's advancements in extended object tracking and sensor fusion could significantly enhance the reliability and safety of autonomous driving systems, enabling better situational awareness and decision-making capabilities in complex urban environments.
FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models
NLP
Large Language Models
Efficient ML
- FourierMoE reformulates adaptation in the spectral domain to enhance multi-task fine-tuning.
- The method employs a frequency-adaptive router to allocate tasks to experts based on frequency specialization.
- FourierMoE achieves superior performance across 28 benchmarks with fewer trainable parameters compared to traditional methods.
- The integration of complex coefficients allows for a complete representation of spectral information, improving adaptation efficiency.
Read more
FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models
Summary
The paper introduces FourierMoE, a novel approach for parameter-efficient fine-tuning (PEFT) of large language models (LLMs) that addresses the challenges of multi-task learning. Traditional PEFT methods often face issues such as task interference and representational limitations, particularly in multi-task settings. FourierMoE innovatively reformulates the adaptation process in the spectral domain, leveraging insights from spectral analysis that reveal distinct frequency energy distributions across tasks and heterogeneous frequency sensitivities in LLM layers. The proposed method integrates a mixture-of-experts (MoE) architecture with the inverse discrete Fourier transform (IDFT), allowing for frequency-aware adaptation. This involves a frequency-adaptive router that directs tokens to experts specialized in different frequency bands, where each expert learns complex coefficients that maintain phase and amplitude information. The extensive evaluation across 28 benchmarks demonstrates that FourierMoE consistently outperforms existing methods in both single-task and multi-task scenarios while utilizing significantly fewer trainable parameters, showcasing the potential of spectral-domain adaptation for efficient LLM fine-tuning.
Methodology
FourierMoE utilizes a mixture-of-experts architecture combined with the inverse discrete Fourier transform (IDFT) to facilitate frequency-aware adaptation. It incorporates a frequency-adaptive router that directs input tokens to experts trained on specific frequency bands. Each expert learns conjugate-symmetric complex coefficients, ensuring that both phase and amplitude information are preserved during the adaptation process.
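A frequency-adaptive router of this flavor can be sketched by assigning each token to the expert whose frequency band holds most of the token's spectral energy. The band partitioning and routing rule below are illustrative assumptions, not the paper's exact router:

```python
import cmath

def dominant_band(signal, n_bands):
    """Route a token by its dominant frequency band: compute the DFT of its
    (real-valued) embedding, split the positive non-DC frequencies into
    n_bands contiguous bands, and return the band with the most energy."""
    n = len(signal)
    energy = []
    for k in range(1, n // 2 + 1):  # skip DC; positive frequencies only
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        energy.append(abs(coeff) ** 2)
    band_size = len(energy) / n_bands
    best = max(range(len(energy)), key=lambda i: energy[i])
    return min(int(best / band_size), n_bands - 1)
```

A slowly varying embedding routes to a low-frequency expert, a rapidly oscillating one to a high-frequency expert; the complex coefficients the experts themselves learn (phase and amplitude) are not shown here.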
Results
The results indicate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings across 28 diverse benchmarks. The method demonstrates significant improvements in performance while requiring fewer trainable parameters, validating its effectiveness as a parameter-efficient adaptation strategy for large language models.
Implications
The findings suggest that adapting large language models through spectral-domain methods can lead to more efficient and effective multi-task learning. This approach may have broader applications in natural language processing tasks, particularly in scenarios where computational resources are limited.
ZEUS: Accelerating Diffusion Models with Only Second-Order Predictor
Generative Models
Efficient ML
- ZEUS uses a second-order predictor to effectively reduce denoiser evaluations, simplifying the acceleration process.
- The method avoids the complexities of higher-order predictors that can degrade output quality under aggressive speedups.
- ZEUS maintains compatibility with various model architectures and requires minimal code changes for integration.
- The approach achieves significant speed improvements while preserving perceptual fidelity in generated outputs.
ZEUS: Accelerating Diffusion Models with Only Second-Order Predictor
Summary
The paper introduces ZEUS, a novel method aimed at accelerating diffusion models by utilizing a second-order predictor to reduce the number of denoiser evaluations required during sampling. Traditional denoising generative models face significant latency issues due to the iterative nature of denoiser calls, particularly in large models. Existing training-free acceleration methods often complicate the architecture or require extensive modifications, which can hinder deployment. ZEUS addresses these challenges by proposing a simpler, more efficient approach that leverages only second-order predictions, avoiding the pitfalls of higher-order methods that tend to amplify errors under aggressive speedups. The interleaved scheme employed by ZEUS stabilizes the prediction process, allowing for substantial speed improvements without compromising the quality of generated outputs. The method is compatible with various model architectures and requires minimal integration effort, making it a practical solution for enhancing the performance of generative models across different modalities, including image and video generation.
Methodology
ZEUS employs a second-order numerical predictor that extrapolates the next denoiser output based on the most recent full-evaluation step. It utilizes an interleaved caching scheme to maintain stability and precision during long-range approximations, avoiding the amplification of errors that can occur with consecutive extrapolations.
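The core arithmetic of a second-order predictor is simple: fit a quadratic through the three most recent cached outputs and evaluate it one step ahead. A minimal element-wise sketch (the paper's interleaving of full evaluations and predictions is not shown):

```python
def second_order_extrapolate(y0, y1, y2):
    """Predict the next denoiser output from the three most recent ones,
    assuming equally spaced steps. A quadratic through (0, y0), (1, y1),
    (2, y2) evaluated at t = 3 gives y3 = y0 - 3*y1 + 3*y2."""
    return [a - 3.0 * b + 3.0 * c for a, b, c in zip(y0, y1, y2)]
```

The prediction is exact for any quadratic trajectory, which is why a second-order scheme can skip evaluations without the error amplification that plagues higher-order extrapolation under aggressive speedups.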
Results
ZEUS demonstrates up to a 3.2× speedup in end-to-end generation times for both image and video models while maintaining high perceptual quality. The method outperforms existing training-free acceleration techniques, achieving better speed-fidelity trade-offs across various generative tasks.
Implications
The findings suggest that simpler acceleration methods can be more effective than complex architectures, potentially influencing future research in generative modeling and real-time applications. ZEUS's compatibility with various architectures and minimal integration requirements make it a valuable tool for practitioners in the field.
The Rank and Gradient Lost in Non-stationarity: Sample Weight Decay for Mitigating Plasticity Loss in Reinforcement Learning
Reinforcement Learning
Theory
Optimization
- Introduces a theoretical framework for understanding plasticity loss in deep RL.
- Identifies two mechanisms contributing to plasticity loss: NTK rank collapse and gradient decay.
- Proposes Sample Weight Decay (SWD) as a solution to restore gradient magnitude.
- Demonstrates SWD's effectiveness across multiple RL algorithms and environments.
The Rank and Gradient Lost in Non-stationarity: Sample Weight Decay for Mitigating Plasticity Loss in Reinforcement Learning
Summary
This paper addresses the issue of plasticity loss in deep reinforcement learning (RL), which hampers the ability of neural networks to adapt to new data over time. The authors explore the theoretical underpinnings of plasticity loss, identifying two primary mechanisms: the rank collapse of the Neural Tangent Kernel (NTK) Gram matrix and the decay of gradient magnitude during the online learning process. They propose a novel method called Sample Weight Decay (SWD) to counteract the gradient attenuation, which is orthogonal to existing empirical remedies such as network reset and noise injection. SWD employs a linearly weighted sampling strategy that decreases the probability of selecting older samples, thereby maintaining a stable gradient magnitude. The efficacy of SWD is validated through experiments on various RL algorithms, including TD3, Double DQN, and SAC, across multiple environments like MuJoCo and the DeepMind Control Suite. The results indicate that SWD significantly mitigates plasticity loss and enhances learning performance, achieving state-of-the-art results in challenging tasks.
Methodology
The authors develop a theoretical framework to analyze plasticity loss in deep RL, focusing on the optimization dynamics of RL agents. They propose SWD, a sampling strategy that adjusts the probability of selecting samples based on their age, thereby addressing the gradient decay issue. The method is evaluated on established RL benchmarks using various algorithms.
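The age-based sampling can be sketched as follows; the exact decay schedule in the paper may differ, so treat the linear weights here as an assumption:

```python
import random

def swd_weights(buffer_len):
    """Linear sample weights: index 0 is the oldest transition, so its
    selection probability is lowest; the newest transition gets the highest."""
    total = buffer_len * (buffer_len + 1) / 2.0
    return [(i + 1) / total for i in range(buffer_len)]

def swd_sample(buffer, batch_size, rng=random):
    """Draw a minibatch with age-decayed probabilities instead of the
    uniform replay sampling that lets stale data attenuate gradients."""
    return rng.choices(buffer, weights=swd_weights(len(buffer)), k=batch_size)
```

Because older samples are progressively down-weighted rather than discarded, the replay distribution tracks the current data while the buffer itself stays intact.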
Results
The experiments show that SWD effectively alleviates plasticity loss, leading to improved learning performance across different configurations of RL algorithms. SWD consistently outperforms existing methods, achieving state-of-the-art results on challenging tasks in the DeepMind Control Suite and other environments.
Implications
The findings suggest that addressing plasticity loss through theoretical insights can lead to more robust and efficient reinforcement learning algorithms. SWD could be integrated into existing RL frameworks to enhance their adaptability and performance in dynamic environments.
Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training
Reinforcement Learning
Large Language Models
Interpretability
- Introduction of Influence-Guided PPO (I-PPO) to filter out unfaithful episodes in RL training.
- Demonstrated significant improvements in training efficiency and model performance compared to traditional PPO and SFT.
- I-PPO acts as an intrinsic early stopping mechanism, dynamically reducing the rollout buffer volume.
- The method provides a fine-grained analysis revealing its effectiveness in detecting unfaithful reasoning episodes.
Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training
Summary
This paper addresses the inefficiencies in traditional Reinforcement Learning (RL) methods, particularly Proximal Policy Optimization (PPO), which often trains on entire rollout buffers that may contain noisy or unfaithful reasoning episodes. The authors propose a novel framework called Influence-Guided PPO (I-PPO) that integrates data attribution principles into the RL post-training loop. By calculating an influence score for each episode using a gradient-based approximation, I-PPO identifies and filters out episodes that negatively impact model performance. The experiments demonstrate that I-PPO consistently outperforms supervised fine-tuning (SFT) and traditional PPO baselines across various reasoning domains. The filtering process not only improves training efficiency by reducing the volume of the rollout buffer but also serves as an intrinsic early stopping mechanism, leading to improved model performance and reduced unfaithful reasoning. The findings suggest that I-PPO can effectively enhance the reasoning capabilities of Large Language Models (LLMs) during post-training.
Methodology
The authors developed I-PPO by incorporating data attribution techniques to compute influence scores for each episode in the rollout buffer. This score is derived from the gradient alignment between the episodes and a validation set, allowing for the identification of episodes that detract from the model's learning objectives. Negative influence episodes are filtered out before policy updates, enhancing the overall training process.
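At its core the filtering step is a dot product between each episode's (approximated) gradient and a validation gradient; episodes with negative influence are dropped before the policy update. A flattened-vector sketch, not the paper's full gradient approximation:

```python
def influence_score(episode_grad, val_grad):
    """Gradient alignment: a positive dot product means updating on this
    episode also moves the model in the validation objective's direction."""
    return sum(g * v for g, v in zip(episode_grad, val_grad))

def filter_rollouts(episodes, episode_grads, val_grad):
    """Keep only episodes with non-negative influence on the validation set."""
    return [ep for ep, g in zip(episodes, episode_grads)
            if influence_score(g, val_grad) >= 0.0]
```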
Results
The experiments showed that I-PPO outperformed both SFT and traditional PPO methods across mathematical, physical, and social reasoning tasks. The filtering mechanism significantly accelerated training and improved model performance by reducing the noise from unfaithful reasoning episodes.
Implications
The findings suggest that integrating data attribution into RL post-training can lead to more efficient training processes for LLMs, potentially enhancing their reasoning capabilities and applicability in real-world scenarios. This approach could be beneficial in various domains where model interpretability and performance are critical.
Enhancing the Reliability of Medical AI through Expert-guided Uncertainty Modeling
Computer Vision
Interpretability
- Expert evaluations significantly improve the quality of uncertainty estimates in medical AI.
- The proposed method separates uncertainty into epistemic and aleatoric components using expert-generated soft labels.
- A two-ensemble approach effectively estimates both types of uncertainty, outperforming existing methods.
- The method shows substantial improvements across multiple medical tasks, enhancing AI reliability.
Enhancing the Reliability of Medical AI through Expert-guided Uncertainty Modeling
Summary
This paper addresses the critical challenge of uncertainty in medical AI systems, which can lead to severe consequences if errors occur. The authors propose a novel approach that integrates expert knowledge into uncertainty estimation, particularly focusing on aleatoric uncertainty, which is often difficult to quantify. By leveraging disagreement among expert responses, the authors generate 'soft' labels that help train machine learning models to estimate both epistemic and aleatoric uncertainty separately. This is achieved through a two-ensemble method, where one ensemble predicts hard labels and the other is fine-tuned on expert evaluations. The proposed method was validated across various medical tasks, demonstrating significant improvements in uncertainty estimation quality. The results indicate that incorporating expert insights can enhance the reliability of AI systems in healthcare, making them more effective in high-stakes environments.
Methodology
The authors developed a framework that utilizes expert knowledge to enhance uncertainty estimation in medical AI. They employed a two-ensemble approach where one ensemble predicts hard labels, and the other is fine-tuned on soft labels derived from expert evaluations. This method allows for the separate estimation of epistemic and aleatoric uncertainty using the law of total variance. Additionally, a lightweight one-ensemble variant was introduced for practical applications.
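The law-of-total-variance split can be made concrete for a binary prediction: aleatoric uncertainty is the average per-model Bernoulli variance, epistemic uncertainty is the variance of the models' predicted probabilities. A minimal sketch of the decomposition, not the paper's full two-ensemble pipeline:

```python
def decompose_uncertainty(ensemble_probs):
    """ensemble_probs[m]: model m's predicted probability of the positive
    class. Law of total variance for a Bernoulli target:
        aleatoric = E_m[p_m * (1 - p_m)]   (average within-model noise)
        epistemic = Var_m[p_m]             (disagreement across models)
    """
    n = len(ensemble_probs)
    mean_p = sum(ensemble_probs) / n
    aleatoric = sum(p * (1.0 - p) for p in ensemble_probs) / n
    epistemic = sum((p - mean_p) ** 2 for p in ensemble_probs) / n
    return aleatoric, epistemic
```

Models that agree the case is genuinely ambiguous produce high aleatoric and zero epistemic uncertainty; confident models that contradict each other produce the reverse.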
Results
The proposed approach achieved a 9% improvement in multiple-choice question answering on the PubMedQA dataset, a 50% improvement in image classification on the BloodyWell dataset, a 7% improvement in binary image segmentation on the LIDC-IDRI dataset, and a 49% improvement in multiclass image segmentation on the RIGA dataset compared to the second-best solution. The method demonstrated significant enhancements in uncertainty estimation quality across various medical tasks.
Implications
The findings suggest that integrating expert knowledge into AI systems can lead to more reliable and risk-aware applications in healthcare. This approach not only improves diagnostic accuracy but also helps medical professionals focus on high-risk cases, ultimately enhancing patient safety and care quality.
Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference
Efficient ML
NLP
Large Language Models
- Introduction of Head-Calibrated Clipped-Linear Softmax (HCCS) as a surrogate for softmax in quantized multi-head attention.
- HCCS preserves the ordering of logits and generates stable probability distributions without explicit exponentiation.
- Lightweight per-head calibration method enhances the accuracy of approximations across diverse attention heads.
- First int8-optimized softmax implementation for AMD Versal AI Engine, achieving higher throughput than existing BF16 implementations.
Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference
Summary
This paper addresses the computational bottleneck posed by the softmax function in the Multi-Head Attention (MHA) block of Transformer models, particularly in low-precision inference scenarios. The authors propose a novel approach called Head-Calibrated Clipped-Linear Softmax (HCCS), which serves as a bounded, monotonic surrogate to the traditional softmax function. HCCS utilizes a clipped linear mapping of max-centered attention logits, ensuring stable probability distributions while maintaining the ordering of the original logits. A key innovation of HCCS is its lightweight calibration method, which optimizes parameters for each attention head based on representative datasets, thus preserving the statistical properties of individual heads. The implementation is specifically designed for high-throughput scenarios on AMD Versal AI Engines, leveraging the int8 multiply-accumulate (MAC) units for efficient computation. The results demonstrate that HCCS significantly outperforms existing softmax implementations in terms of speed while maintaining competitive accuracy in small or heavily quantized MHA workloads after quantization-aware retraining.
Methodology
The authors developed HCCS as a softmax surrogate that avoids the computational overhead of exponentiation by using a clipped linear mapping. They implemented a calibration method that optimizes parameters for each attention head based on a representative dataset. The implementation was tailored for AMD Versal AI Engines, utilizing the architecture's native int8 MAC capabilities to enhance throughput.
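The surrogate itself reduces to a few integer-friendly operations: max-center the logits, apply a clipped linear map, normalize. A floating-point sketch with made-up constants standing in for the per-head calibrated parameters:

```python
def hccs_softmax(logits, alpha=1.0, clip=4.0):
    """Clipped-linear softmax surrogate: no exponentiation, the ordering of
    the logits is preserved, and the output still sums to 1. `alpha` and
    `clip` are placeholders for the per-head calibrated parameters."""
    m = max(logits)  # max-centering, as in a numerically stable softmax
    scores = [max(0.0, clip + alpha * (x - m)) for x in logits]
    total = sum(scores)  # > 0: the max logit always scores exactly `clip`
    return [s / total for s in scores]
```

Since every operation is a shift, multiply, clip, or divide, the mapping is straightforward to express with int8 multiply-accumulate units; the float version here only illustrates the shape of the computation.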
Results
HCCS demonstrated significantly higher throughput compared to AMD's BF16 reference softmax implementation while maintaining competitive task accuracy in small and quantization-stressed MHA workloads. The approach effectively reduced the computational burden associated with softmax in low-precision settings.
Implications
The proposed HCCS method has significant implications for deploying Transformer models in edge computing environments, where efficiency and speed are critical. It opens avenues for further research into hardware-optimized machine learning algorithms, particularly in low-precision contexts.
SECURE: Stable Early Collision Understanding via Robust Embeddings in Autonomous Driving
Computer Vision
- SECURE framework enhances robustness in accident anticipation models.
- Identifies significant instability in existing models like CRASH under perturbations.
- Introduces a multi-objective loss function for fine-tuning model parameters.
- Achieves state-of-the-art results on DAD and CCD datasets.
SECURE: Stable Early Collision Understanding via Robust Embeddings in Autonomous Driving
Summary
The paper introduces SECURE, a framework designed to enhance the robustness of accident anticipation models in autonomous driving, particularly addressing the limitations of existing models like CRASH. Despite achieving high performance, CRASH exhibits significant instability in predictions when subjected to minor input perturbations, which poses reliability risks in real-world applications. SECURE is built on four key attributes that define model robustness: consistency and stability in both prediction and latent feature spaces. The authors propose a multi-objective loss function that fine-tunes a baseline model to minimize divergence from a reference model while penalizing sensitivity to adversarial perturbations. Experiments conducted on the DAD and CCD datasets demonstrate that SECURE significantly improves robustness against various perturbations and enhances performance on clean data, achieving new state-of-the-art results. This work highlights the importance of robustness in intelligent transportation systems and provides a systematic approach to evaluate and improve the reliability of accident anticipation frameworks.
Methodology
The authors conducted a comprehensive robustness analysis of the CRASH model, revealing its vulnerabilities to input perturbations. They defined a rigorous framework for SECURE that emphasizes consistency and stability in predictions. A multi-objective loss function was developed to fine-tune the model, minimizing divergence from a reference model and penalizing sensitivity to adversarial inputs.
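The multi-objective loss can be sketched as a weighted sum of three terms. The scalar stand-ins and default weights below are assumptions for illustration, not the paper's exact formulation:

```python
def secure_loss(pred, target, pred_reference, pred_perturbed,
                lam_consistency=1.0, lam_stability=1.0):
    """Composite fine-tuning objective (illustrative):
      task        -- ordinary prediction error
      consistency -- divergence from a frozen reference model's prediction
      stability   -- sensitivity to an adversarially perturbed input
    """
    task = (pred - target) ** 2
    consistency = (pred - pred_reference) ** 2
    stability = (pred - pred_perturbed) ** 2
    return task + lam_consistency * consistency + lam_stability * stability
```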
Results
The experiments showed that SECURE significantly enhances the robustness of accident anticipation models against various perturbations while also improving performance on clean datasets. The results indicate that SECURE achieves new state-of-the-art performance, demonstrating its effectiveness in addressing the reliability issues of existing models.
Implications
The findings suggest that enhancing model robustness is crucial for the deployment of autonomous driving systems in real-world environments. SECURE can serve as a foundational framework for developing more reliable accident anticipation models, ultimately contributing to safer intelligent transportation systems.
UQ-SHRED: uncertainty quantification of shallow recurrent decoder networks for sparse sensing via engression
Time Series
Theory
Efficient ML
- UQ-SHRED provides valid uncertainty quantification for sparse sensing problems.
- The framework utilizes noise injection and energy score minimization for efficient distributional learning.
- The method is validated across multiple scientific datasets, showcasing its versatility.
- Theoretical guarantees are established for the learned conditional distribution.
UQ-SHRED: uncertainty quantification of shallow recurrent decoder networks for sparse sensing via engression
Summary
The paper introduces UQ-SHRED, a novel framework for uncertainty quantification in the context of sparse sensing using shallow recurrent decoder networks. The SHRED architecture has shown promise in reconstructing high-dimensional spatiotemporal fields from limited sensor measurements, but it lacks robust uncertainty estimation, which is crucial for applications in complex and stochastic systems. UQ-SHRED addresses this limitation by employing a distributional learning approach that learns the predictive distribution of spatial states conditioned on sensor history. This is achieved through a method called engression, which involves injecting stochastic noise into sensor inputs and training the model with an energy score loss. This approach allows UQ-SHRED to produce well-calibrated predictive distributions without the need for retraining or additional network structures. The framework is validated on various datasets, including turbulent flow and atmospheric dynamics, demonstrating its effectiveness in providing uncertainty quantification across diverse scientific applications. The paper also includes ablation studies to analyze the impact of different model settings on performance, confirming the robustness of UQ-SHRED in uncertainty-aware analysis.
Methodology
UQ-SHRED employs a distributional learning framework that integrates noise injection into the input of the SHRED architecture. It uses an energy score loss for training, optimizing a proper scoring rule for distributional predictions. This allows the model to generate predictive distributions by sampling from the conditional input noise distribution and propagating these samples through the network.
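The energy score that drives training is a proper scoring rule with a simple Monte Carlo form. A small sketch over sampled reconstructions (Euclidean distance and equal sample weights assumed):

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def energy_score(samples, observation):
    """ES = mean_i ||x_i - y|| - 0.5 * mean_{i,j} ||x_i - x_j||.

    Lower is better; minimizing it pushes the sampled ensemble toward the
    true conditional distribution rather than a single point estimate."""
    n = len(samples)
    fit = sum(euclidean(s, observation) for s in samples) / n
    spread = sum(euclidean(a, b) for a in samples for b in samples) / (n * n)
    return fit - 0.5 * spread
```

The spread term is what rewards honest dispersion: an ensemble that collapses to a point is penalized whenever its point misses the observation, which is how the noise-injected model learns calibrated predictive distributions.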
Results
UQ-SHRED demonstrated effective uncertainty quantification across five complex real-world datasets, including sea-surface temperature, turbulent flows, neural activity, solar activity, and propulsion physics. The framework produced well-calibrated confidence intervals and maintained performance across diverse applications, with ablation studies indicating the influence of various model parameters on uncertainty estimation.
Implications
The UQ-SHRED framework has significant implications for scientific fields requiring reliable uncertainty quantification in sparse sensing scenarios, such as fluid dynamics, neuroscience, and environmental monitoring. Its ability to provide valid uncertainty estimates can enhance decision-making processes in safety-critical applications.
Model-Based Reinforcement Learning for Control under Time-Varying Dynamics
Reinforcement Learning
Robotics
Theory
- Introduces a framework for MBRL that accommodates time-varying dynamics.
- Develops two algorithms (R-OMBRL and SW-OMBRL) that utilize adaptive data buffers.
- Establishes theoretical guarantees for dynamic regret in the context of non-stationarity.
- Demonstrates improved performance on continuous control benchmarks.
Model-Based Reinforcement Learning for Control under Time-Varying Dynamics
Summary
This paper addresses the limitations of traditional model-based reinforcement learning (MBRL) methods that assume stationary dynamics, which often do not hold in real-world applications due to factors like drift and changing conditions. The authors propose a continual MBRL framework that adapts to time-varying dynamics by utilizing Gaussian process dynamics models and implementing a variation-budget approach. They introduce two algorithms, R-OMBRL and SW-OMBRL, which incorporate adaptive data buffer mechanisms to manage the influence of outdated data. The theoretical analysis reveals that limiting the use of stale data is crucial for maintaining calibrated uncertainty and achieving meaningful dynamic regret guarantees. The proposed methods demonstrate improved performance on continuous control tasks with non-stationary dynamics compared to existing MBRL baselines, showcasing their effectiveness in real-world scenarios where system dynamics are not constant.
Methodology
The authors employ Gaussian process dynamics models to represent the system's dynamics and utilize Bayesian approaches to manage uncertainty. They propose two algorithms that adaptively select data for training: R-OMBRL uses periodic resets of the data buffer, while SW-OMBRL employs a sliding window approach to retain only recent data. The theoretical framework includes deriving dynamic regret bounds based on the retained data horizon and the variation in dynamics.
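The sliding-window mechanism of SW-OMBRL amounts to a bounded buffer; a `deque` with `maxlen` gives the behavior directly (the window size is a hypothetical choice here):

```python
from collections import deque

class SlidingWindowBuffer:
    """Keep only the W most recent transitions so that data collected under
    old dynamics cannot dominate the dynamics-model fit."""

    def __init__(self, window):
        self._data = deque(maxlen=window)

    def add(self, transition):
        self._data.append(transition)  # oldest entry is evicted automatically

    def training_set(self):
        return list(self._data)
```

R-OMBRL's alternative, a periodic full reset, corresponds to emptying this buffer on a schedule rather than evicting one stale transition at a time.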
Results
The proposed algorithms, R-OMBRL and SW-OMBRL, show significant improvements in performance on continuous control tasks with non-stationary dynamics compared to traditional MBRL methods. The theoretical analysis supports the effectiveness of limiting the influence of stale data, leading to better calibrated uncertainty and dynamic regret guarantees.
Implications
This work has significant implications for real-world applications in robotics and control systems where dynamics are often non-stationary. The proposed methods can enhance the robustness and adaptability of learning-based control systems, making them more effective in dynamic environments.
Matching Accuracy, Different Geometry: Evolution Strategies vs GRPO in LLM Post-Training
Large Language Models
Reinforcement Learning
Optimization
- ES can match or exceed GRPO in task accuracy across various settings.
- The geometric properties of updates differ significantly between ES and GRPO.
- ES exhibits a random-walk-like behavior in high-dimensional parameter spaces.
- The study provides a theoretical framework for understanding ES's performance.
Matching Accuracy, Different Geometry: Evolution Strategies vs GRPO in LLM Post-Training
Summary
This paper investigates the performance and geometric properties of Evolution Strategies (ES) compared to Group Relative Policy Optimization (GRPO) in the context of fine-tuning large language models (LLMs). The authors conduct experiments across four tasks in both single-task and sequential continual-learning settings, finding that ES matches or exceeds GRPO in task accuracy while exhibiting distinct model update behaviors. ES produces larger, more diffuse updates that induce broader off-task KL drift, whereas GRPO makes smaller, localized updates. Despite these differences, the solutions from both methods are linearly connected without a loss barrier, indicating that they can achieve similar task performance through different geometric pathways. The authors develop a theoretical framework explaining how ES can navigate high-dimensional parameter spaces, accumulating significant off-task movement while still making task progress. This work highlights the implications of using gradient-free versus gradient-based fine-tuning methods for knowledge preservation and forgetting in LLMs.
Methodology
The authors compare ES and GRPO by fine-tuning LLMs across four tasks, analyzing the geometric characteristics of the solutions found by each method. They derive a theoretical account of ES's behavior in high-dimensional spaces, focusing on the nature of weight updates and their implications for task performance and knowledge retention.
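For reference, a vanilla antithetic evolution-strategies step looks like the following. This is the generic textbook estimator, not necessarily the exact variant benchmarked in the paper:

```python
import random

def es_update(theta, reward_fn, sigma=0.1, pop=8, lr=0.05, rng=random):
    """One antithetic ES step: perturb the parameters with Gaussian noise,
    score each mirrored pair, and move along the reward-weighted noise
    directions. Entirely gradient-free -- only reward_fn evaluations."""
    n = len(theta)
    grad = [0.0] * n
    for _ in range(pop):
        eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
        r_plus = reward_fn([t + sigma * e for t, e in zip(theta, eps)])
        r_minus = reward_fn([t - sigma * e for t, e in zip(theta, eps)])
        scale = (r_plus - r_minus) / (2.0 * sigma * pop)
        for k in range(n):
            grad[k] += scale * eps[k]
    return [t + lr * g for t, g in zip(theta, grad)]
```

Because every coordinate is perturbed on every step, the update touches the whole parameter vector at once, which is consistent with the diffuse, random-walk-like movement the paper observes in high dimensions.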
Results
The results show that ES achieves comparable or superior accuracy to GRPO while making larger and more diffuse updates. The analysis reveals that ES navigates a broader, low-curvature plateau in the loss landscape, leading to substantial off-task movement without a loss barrier between the solutions of the two methods. Empirical observations align with the theoretical predictions regarding the scaling of weight changes and the behavior of updates in parameter space.
Implications
The findings suggest that different optimization strategies can lead to similar performance outcomes in LLM fine-tuning, but with distinct implications for knowledge retention and forgetting. This has important consequences for the design of fine-tuning protocols in continual learning scenarios, particularly in applications where preserving prior knowledge is critical.
CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift
Time Series
- CANDI addresses the critical issue of distribution shifts in MTSAD, which can lead to significant false positives.
- The framework employs False Positive Mining to curate informative samples for adaptation.
- CANDI uses a lightweight Spatiotemporally-Aware Normality Adaptation module to update the model without overwriting pre-trained knowledge.
- The proposed method shows substantial performance improvements over baseline methods, with a 14% increase in AUROC.
CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift
Summary
The paper addresses the challenge of multivariate time-series anomaly detection (MTSAD) under distribution shifts, which often lead to increased false positives in pre-trained models. The authors propose CANDI, a novel test-time adaptation (TTA) framework that selectively identifies and adapts to potential false positives while preserving the knowledge of the pre-trained model. CANDI incorporates a False Positive Mining (FPM) strategy to curate adaptation samples based on anomaly scores and latent similarity, and a Spatiotemporally-Aware Normality Adaptation (SANA) module for informed model updates. The framework is built on a reconstruction-based anomaly detector and aims to enhance robustness and accuracy in real-world applications where distribution shifts are common. Extensive experiments demonstrate that CANDI significantly improves MTSAD performance, achieving up to a 14% increase in AUROC while utilizing less than 2% of the total test data for adaptation.
Methodology
CANDI employs a two-pronged approach: it utilizes False Positive Mining (FPM) to identify potentially misclassified normal samples based on their anomaly scores and latent space proximity, and it incorporates a Spatiotemporally-Aware Normality Adaptation (SANA) module that allows for targeted updates to the model while keeping the backbone frozen. This method ensures that the model adapts to new patterns without losing the robustness of the pre-trained knowledge.
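The mining criterion can be sketched as a two-condition filter: the anomaly score must exceed the detection threshold, yet the window's latent representation must remain close to a normal prototype. The thresholds and cosine-similarity criterion below are illustrative assumptions:

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def false_positive_mining(scores, latents, normal_prototype,
                          score_threshold, similarity_floor):
    """Indices of windows flagged as anomalous (score above threshold) whose
    latent representation is nonetheless close to the normal prototype --
    likely false positives worth adapting on."""
    return [i for i, (s, z) in enumerate(zip(scores, latents))
            if s > score_threshold
            and cosine(z, normal_prototype) > similarity_floor]
```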
Results
The experiments conducted show that CANDI significantly outperforms existing MTSAD baselines, achieving an improvement of up to 14% in AUROC while requiring less than 2% of the total test data for adaptation. This indicates that CANDI is not only effective but also efficient in adapting to distribution shifts.
Implications
The findings suggest that CANDI can be effectively applied in high-stakes environments such as industrial maintenance and healthcare monitoring, where accurate anomaly detection is critical. The ability to adapt to changing data distributions in real-time could enhance the reliability and safety of systems that rely on continuous monitoring.
Robust Graph Representation Learning via Adaptive Spectral Contrast
Graph Learning
Theory
Optimization
- Identifies a spectral dilemma in graph representation learning where high-frequency signals are crucial but sensitive to noise.
- Proposes ASPECT, a framework that uses a reliability-aware spectral gating mechanism to enhance robustness.
- Demonstrates that existing global spectral fusion strategies are suboptimal for mixed graphs.
- Achieves state-of-the-art performance on 8 out of 9 benchmarks, particularly on heterophilic graphs.
Robust Graph Representation Learning via Adaptive Spectral Contrast
Summary
This paper addresses the challenges in spectral graph contrastive learning, particularly the trade-off between utilizing high-frequency signals for encoding heterophilic structures and the increased sensitivity of these signals to noise. The authors introduce a novel framework called ASPECT (Adaptive Spectral Contrast for Targeted robustness) that employs a reliability-aware spectral gating mechanism to dynamically adjust the reliance on frequency channels based on their stability against perturbations. The theoretical analysis reveals that existing global spectral fusion strategies are suboptimal, leading to non-vanishing regret in mixed graphs with varying node-wise frequency preferences. ASPECT is formulated as a minimax game, optimizing a node-wise gate against a spectral adversary to enhance robustness in learned representations. Empirical evaluations demonstrate that ASPECT achieves state-of-the-art performance on 8 out of 9 benchmarks, effectively distinguishing meaningful structural heterophily from incidental noise.
Methodology
The authors developed ASPECT, which formulates a minimax game to optimize a node-wise gate that adjusts the reliance on frequency channels based on their stability against adversarial perturbations. The framework incorporates a Rayleigh quotient penalty to target spectral energy distributions, enhancing the robustness of the learned representations.
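The node-wise gating idea can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the paper's implementation: the function names (`gated_spectral_mix`, `rayleigh_quotient`) and the specific low/high-pass filters are assumptions; the paper's gate is trained adversarially rather than set directly.

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalized Laplacian of an adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def gated_spectral_mix(X, L, gate_logits):
    """Node-wise gate blending a low-pass and a high-pass view of features."""
    low = (np.eye(len(L)) - 0.5 * L) @ X     # smooth (homophilic) component
    high = (0.5 * L) @ X                     # sharp (heterophilic) component
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # per-node reliance on high frequency
    return (1.0 - g)[:, None] * low + g[:, None] * high

def rayleigh_quotient(Z, L):
    """Spectral energy of a representation; a penalty on this quantity can
    target where energy concentrates on the spectrum."""
    return np.trace(Z.T @ L @ Z) / (np.trace(Z.T @ Z) + 1e-12)
```

For a normalized Laplacian the Rayleigh quotient lies in [0, 2], with larger values indicating more high-frequency (heterophilic) energy in the representation.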
Results
ASPECT outperformed existing methods on 8 out of 9 benchmarks, particularly excelling in scenarios involving heterophilic graphs. The analysis of the learned gate values showed a strong correlation with ground-truth local homophily, indicating effective disentanglement of structural signals from noise.
Implications
The findings suggest that robustness in spectral graph learning is essential for generalizing representations across mixed structures. This work could influence future research in graph learning methodologies, particularly in applications where noise and structural variability are prevalent.
Sven: Singular Value Descent as a Computationally Efficient Natural Gradient Method
Optimization
Efficient ML
Theory
- Sven optimizes neural networks by treating each data point's residual as a separate condition.
- It approximates the Moore-Penrose pseudoinverse using truncated SVD, allowing for efficient computation.
- Sven outperforms traditional optimization methods like Adam in terms of convergence speed and final loss.
- The method is particularly suited for over-parameterized models and can be applied to scientific computing tasks.
Read more
Sven: Singular Value Descent as a Computationally Efficient Natural Gradient Method
Summary
The paper introduces Sven (Singular Value dEsceNt), a novel optimization algorithm for neural networks that leverages the natural decomposition of loss functions into individual data point contributions. Unlike traditional methods that reduce the loss to a single scalar, Sven treats each data point's residual as a separate condition to be satisfied simultaneously. It employs the Moore-Penrose pseudoinverse of the loss Jacobian to compute the minimum-norm parameter update that addresses all conditions at once. To make this approach computationally feasible, Sven approximates the pseudoinverse using a truncated singular value decomposition (SVD), retaining only the k most significant directions, resulting in a computational overhead proportional to k, compared to the quadratic scaling of traditional natural gradient methods. The authors demonstrate that Sven outperforms standard first-order methods like Adam in regression tasks, achieving faster convergence and lower final loss while being competitive with LBFGS at a fraction of the computational cost. The paper also discusses challenges related to memory overhead and proposes strategies for mitigation, indicating potential applications in scientific computing where custom loss functions can be decomposed into several conditions.
Methodology
Sven computes parameter updates by utilizing the Moore-Penrose pseudoinverse of the loss Jacobian, approximated through truncated SVD. This approach allows for simultaneous consideration of individual data point conditions rather than aggregating them into a single loss value, enhancing the optimization process in neural networks.
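The core update can be sketched directly from the description above: for a residual vector r and loss Jacobian J, the minimum-norm step solving J·Δθ ≈ −r is Δθ = −J⁺r, approximated by keeping only the k largest singular directions. The function name `sven_step` is an assumption; this is a sketch of the linear-algebra core, not the full training loop.

```python
import numpy as np

def sven_step(J, r, k):
    """Minimum-norm update solving J @ dtheta ~= -r with a rank-k
    truncated-SVD approximation of the Moore-Penrose pseudoinverse."""
    U, S, Vt = np.linalg.svd(J, full_matrices=False)
    # keep only the k most significant singular directions
    return -Vt[:k].T @ ((U[:, :k].T @ r) / S[:k])
```

On a linear model with full rank retained (k equal to the parameter count), a single step lands exactly on the least-squares solution; in a neural network J is the local linearization, so the step is Gauss-Newton-like rather than exact.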
Results
In experiments on regression tasks, Sven demonstrated significantly faster convergence and lower final loss compared to standard first-order methods, including Adam. It also maintained competitive performance with LBFGS while incurring lower computational costs.
Implications
Sven's methodology could revolutionize optimization in neural networks by providing a more efficient way to handle loss functions, particularly in scenarios involving complex or custom loss functions. Its potential applications extend to scientific computing, where it can be used to solve problems with multiple conditions effectively.
Pseudo-Quantized Actor-Critic Algorithm for Robustness to Noisy Temporal Difference Error
Reinforcement Learning
Robotics
Theory
- Introduces PQAC, a novel algorithm for robust learning in RL against noisy TD errors.
- Critiques existing heuristics for TD error stabilization, highlighting their computational inefficiencies.
- Utilizes a sigmoid function and divergence measures to derive a robust learning rule.
- Demonstrates stable learning performance in simulations, even with noisy rewards.
Read more
Pseudo-Quantized Actor-Critic Algorithm for Robustness to Noisy Temporal Difference Error
Summary
This paper introduces the Pseudo-Quantized Actor-Critic (PQAC) algorithm, designed to enhance the robustness of reinforcement learning (RL) against noisy temporal difference (TD) errors. Traditional TD learning methods, while effective, often suffer from instability due to the noise inherent in TD error calculations, which can destabilize learning processes. The author critiques existing heuristics, such as target networks and ensemble models, which, although they stabilize learning, introduce additional computational costs and inefficiencies. The PQAC algorithm is derived from a control as inference framework, utilizing a sigmoid function to model the distribution of optimality as a binary random variable. This approach leads to a robust learning rule that mitigates the impact of large TD errors by causing gradients to vanish when noise is present. The paper further explores the decomposition of optimality into multiple levels to achieve pseudo-quantization of TD errors, enhancing noise reduction. A Jensen-Shannon divergence-based method is also proposed to combine the benefits of different divergence measures. The effectiveness of PQAC is validated through simulations on RL benchmarks, demonstrating its ability to maintain stable learning even in the presence of noisy rewards or when traditional heuristics are less effective.
Methodology
The PQAC algorithm is developed based on control as inference, employing a sigmoid function to represent the distribution of optimality. It derives a robust learning rule that mitigates the effects of noise in TD errors by allowing gradients to vanish under certain conditions. The algorithm also incorporates Jensen-Shannon divergence to leverage the characteristics of different divergence measures for improved learning stability.
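The vanishing-gradient behavior described above can be illustrated with a toy weighting rule. This is not the paper's exact derivation (which comes from the control-as-inference formulation); it is a hedged sketch showing the qualitative property: the usual TD gradient, proportional to the error δ, is damped by a sigmoid-derived weight that goes to zero for large |δ|.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def robust_td_grad(delta, beta=1.0):
    """Illustrative robust rule: the plain TD gradient (proportional to delta)
    is damped by a sigmoid-derived weight that peaks at delta = 0 and vanishes
    for large |delta|, so noisy outlier errors barely update the critic."""
    w = 4.0 * sigmoid(beta * delta) * sigmoid(-beta * delta)  # in (0, 1]
    return w * delta
```

Small TD errors produce near-standard updates, while an outlier error of 10 or 30 contributes almost nothing, which is the robustness property PQAC targets.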
Results
The PQAC algorithm was tested on various RL benchmarks, showing superior stability and efficiency compared to baseline methods. It successfully learned robustly even when traditional heuristics were insufficient or when noise was introduced into the reward signals.
Implications
The findings suggest that the PQAC algorithm could be applied in real-world RL scenarios where noise is prevalent, such as robotics and autonomous systems, enhancing the reliability and efficiency of learning processes in uncertain environments.
Care-Conditioned Neuromodulation for Autonomy-Preserving Supportive Dialogue Agents
NLP
Large Language Models
- Introduces Care-Conditioned Neuromodulation (CCN) to enhance autonomy in supportive dialogue agents.
- Defines a utility function that balances helpfulness with the risks of dependency and coercion.
- Constructs a benchmark for evaluating relational failure modes in multi-turn dialogues.
- Demonstrates significant improvements in autonomy-preserving utility over existing alignment methods.
Read more
Care-Conditioned Neuromodulation for Autonomy-Preserving Supportive Dialogue Agents
Summary
This paper addresses the challenge of deploying large language models (LLMs) in supportive roles while ensuring user autonomy is preserved. Traditional alignment methods focus on helpfulness and harmlessness but fail to account for relational risks such as dependency and coercion. The authors propose Care-Conditioned Neuromodulation (CCN), a framework that uses a learned scalar signal based on user state and dialogue context to condition response generation and candidate selection. They formalize this as an autonomy-preserving alignment problem, introducing a utility function that rewards helpfulness and autonomy support while penalizing dependency and coercion. The authors also create a benchmark for relational failure modes in multi-turn dialogues, identifying issues like reassurance dependence and manipulative care. Empirical results show that CCN improves autonomy-preserving utility by +0.25 over supervised fine-tuning and +0.07 over preference optimization, while maintaining similar levels of supportiveness. Pilot evaluations indicate alignment with human assessments, suggesting that the proposed approach effectively balances support and user agency.
Methodology
The authors develop a state-dependent control framework (CCN) that conditions dialogue generation on structured user states and relational contexts. They define a utility function for autonomy-preserving alignment and create a benchmark for relational failure modes in dialogues. Empirical evaluations involve care-conditioned candidate generation and utility-based reranking.
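The utility-based reranking step can be sketched as follows. The score keys and weights are hypothetical placeholders (the paper defines the actual utility terms); the point is the shape of the computation: reward helpfulness and autonomy support, penalize dependency and coercion, pick the best candidate.

```python
def autonomy_utility(s, w=(1.0, 1.0, 1.0, 1.0)):
    """Hypothetical utility: reward helpfulness and autonomy support,
    penalize dependency and coercion risk (weights are illustrative)."""
    return (w[0] * s["helpfulness"] + w[1] * s["autonomy_support"]
            - w[2] * s["dependency_risk"] - w[3] * s["coercion_risk"])

def rerank(candidates):
    """Pick the candidate response with the highest autonomy-preserving utility."""
    return max(candidates, key=lambda c: autonomy_utility(c["scores"]))
```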
Results
The CCN approach improves autonomy-preserving utility by +0.25 compared to supervised fine-tuning and +0.07 compared to preference optimization, while maintaining comparable supportiveness. Pilot human evaluations and zero-shot transfer to real emotional-support conversations show alignment with automated metrics.
Implications
The findings suggest that integrating state-dependent control and utility-based selection can enhance the design of dialogue agents in sensitive contexts, such as mental health support, by preserving user autonomy and reducing relational risks.
An Online Machine Learning Multi-resolution Optimization Framework for Energy System Design Limit of Performance Analysis
Optimization
- Introduces a multi-resolution optimization framework that integrates machine learning for energy system design.
- Addresses the performance gap caused by model mismatches across different fidelity levels.
- Demonstrates a reduction in architecture-to-operation performance gap by up to 42% compared to traditional rule-based controllers.
- Achieves a 34% reduction in high-fidelity model evaluations through ML guidance.
Read more
An Online Machine Learning Multi-resolution Optimization Framework for Energy System Design Limit of Performance Analysis
Summary
This paper presents an innovative online machine learning (ML) multi-resolution optimization framework aimed at enhancing the design and performance analysis of integrated energy systems. The authors address the challenge of model mismatch across different fidelity levels, which complicates the quantification of performance gaps between architecture-level designs and high-fidelity operational models. The proposed framework estimates an architecture-specific upper bound on achievable performance while minimizing the need for costly high-fidelity model evaluations. The methodology consists of a multi-objective architecture optimization to determine system configurations and component capacities, followed by a machine learning-accelerated receding-horizon optimal control strategy. This strategy effectively narrows the performance gap by adaptively scheduling optimization resolutions based on predictive uncertainties and leveraging low-fidelity solutions to warm-start high-fidelity evaluations. The framework is validated through a pilot study involving a 1 MW industrial heat load energy system, demonstrating significant improvements in performance and efficiency.
Methodology
The methodology involves a two-step approach: first, a multi-objective architecture optimization is performed to select the optimal system configuration and component capacities. Second, a machine learning-accelerated multi-resolution, receding-horizon optimal control strategy is developed to optimize operational performance, utilizing predictive uncertainty to adaptively schedule optimization resolutions and warm-start high-fidelity evaluations with elite low-fidelity solutions.
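The uncertainty-driven scheduling in the second step can be sketched with a toy loop. Everything here is illustrative (the function names, the scalar threshold `tau`, the warm-start hand-off); the paper's receding-horizon controller is far richer, but the control flow is the same: trust the surrogate where it is confident, and spend a high-fidelity evaluation, seeded by the elite low-fidelity solution, only where it is not.

```python
def multires_evaluate(candidates, surrogate, high_eval, tau=0.1):
    """Illustrative scheduling: trust the surrogate prediction where its
    uncertainty is low; spend a high-fidelity evaluation (warm-started from
    the elite surrogate solution) only where uncertainty exceeds tau."""
    elite = min(candidates, key=lambda c: surrogate(c)[0])  # warm-start seed
    results, hf_calls = {}, 0
    for c in candidates:
        mean, std = surrogate(c)
        if std > tau:
            results[c] = high_eval(c, warm_start=elite)
            hf_calls += 1
        else:
            results[c] = mean
    return results, hf_calls
```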
Results
The pilot case study results indicate that the proposed multi-resolution optimization strategy significantly reduces the performance gap between architecture and operation by up to 42% compared to a rule-based controller. Additionally, it reduces the number of required high-fidelity model evaluations by 34% compared to a similar multi-fidelity approach without ML guidance, enhancing the efficiency and reliability of design verification.
Implications
The findings suggest that the proposed framework can facilitate faster and more reliable design verification processes for integrated energy systems, making high-fidelity performance assessments more tractable and enabling better-informed decisions prior to deployment in industrial settings.
Cognitive Energy Modeling for Neuroadaptive Human-Machine Systems using EEG and WGAN-GP
Generative Models
Time Series
Optimization
- Introduces a framework for modeling cognitive energy dynamics using EEG and SBP.
- Validates the use of WGAN-generated synthetic EEG for cognitive state transition analysis.
- Demonstrates strong agreement in transition energies between real and synthetic EEG.
- Proposes a neuroadaptive system that adjusts behavior based on cognitive effort in real-time.
Read more
Cognitive Energy Modeling for Neuroadaptive Human-Machine Systems using EEG and WGAN-GP
Summary
This paper addresses the challenge of modeling cognitive energy dynamics in real-time using electroencephalography (EEG) data. The authors propose a novel approach that utilizes the Schrödinger Bridge Problem (SBP) to quantify the energy cost of transitions between cognitive states. They investigate whether synthetic EEG generated by Wasserstein Generative Adversarial Networks (WGAN) retains the necessary dynamical structure for effective cognitive state transition modeling. By comparing transition energies derived from real and synthetic EEG data collected during Stroop tasks, the authors demonstrate a strong agreement in the transition structures, indicating that synthetic EEG can be effectively used in neuroadaptive systems. The findings support the use of SBP-derived cognitive energy as a control signal for adaptive human-machine systems, allowing for real-time adjustments based on user cognitive and affective states, thereby enhancing interactive environments such as gameplay and interfaces.
Methodology
The authors employed the Schrödinger Bridge Problem (SBP) to measure the transport cost of cognitive state transitions, comparing real and synthetic EEG data generated by WGAN during Stroop tasks. They evaluated the distributional geometry of the EEG signals to assess the effectiveness of synthetic data in preserving the necessary dynamical structure for cognitive modeling.
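The static Schrödinger bridge with a quadratic cost is equivalent to entropy-regularized optimal transport, so a Sinkhorn iteration gives a compact proxy for the transition-energy comparison described above. This is a sketch under that assumption, not the paper's pipeline: the function name and parameter values are illustrative.

```python
import numpy as np

def sinkhorn_transport_cost(X, Y, eps=0.5, n_iter=300):
    """Entropic optimal-transport cost between two empirical state clouds,
    a standard proxy for the (static) Schrodinger bridge transport cost."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared distances
    K = np.exp(-C / eps)
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    v = np.ones(len(Y))
    for _ in range(n_iter):                              # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                      # transport plan
    return float((P * C).sum())
```

Comparing such costs computed on real versus WGAN-generated state clouds is the kind of agreement check the paper performs: nearby distributions should give small transport costs, distant ones large.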
Results
The study found that transition energies derived from both real and synthetic EEG exhibited strong agreement, validating that WGAN-generated EEG retains the essential structure for modeling cognitive state transitions. This suggests that synthetic EEG can be reliably used in neuroadaptive systems, facilitating real-time adjustments based on cognitive load.
Implications
The findings have significant implications for the development of adaptive human-machine systems that can respond to users' cognitive and emotional states in real-time. This could enhance user experience in various applications, including gaming, interactive interfaces, and assistive technologies, by dynamically adjusting system behavior based on cognitive effort.
Crystalite: A Lightweight Transformer for Efficient Crystal Modeling
Generative Models
Graph Learning
Efficient ML
- Introduction of Crystalite, a lightweight diffusion Transformer for crystal modeling.
- Utilization of Subatomic Tokenization for efficient atom representation.
- Development of the Geometry Enhancement Module (GEM) to inject geometric biases into attention mechanisms.
- Achievement of state-of-the-art results in crystal structure prediction and generation.
Read more
Crystalite: A Lightweight Transformer for Efficient Crystal Modeling
Summary
The paper introduces Crystalite, a lightweight diffusion Transformer designed for efficient modeling of crystalline materials. Traditional generative models for crystals often utilize equivariant graph neural networks (GNNs), which, while effective, are computationally intensive and slow. Crystalite addresses these challenges by incorporating two innovative components: Subatomic Tokenization, which replaces high-dimensional one-hot atom representations with a more compact and chemically structured form, and the Geometry Enhancement Module (GEM), which directly integrates periodic geometric information into the attention mechanism of the Transformer. This approach maintains the simplicity and efficiency of standard Transformers while enhancing their capability to model the geometric structures inherent in crystalline materials. The authors demonstrate that Crystalite achieves state-of-the-art performance on crystal structure prediction benchmarks and excels in de novo generation tasks, outperforming existing geometry-heavy methods in sampling speed. The study also explores the trade-offs between novelty, validity, and stability in generated crystal structures, providing insights into model selection based on MLIP-based stability estimates.
Methodology
Crystalite employs a diffusion Transformer architecture augmented with the Geometry Enhancement Module (GEM) to incorporate periodic geometric information directly into the attention mechanism. It replaces traditional one-hot atom representations with a compact, chemically informed tokenization, enhancing the model's efficiency and effectiveness in handling crystal structures.
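The geometric-bias idea in GEM can be sketched as attention with an additive distance term on the logits. The bias form (a linear penalty scaled by `gamma`) and the function names are assumptions for illustration; the essential pieces are the minimum-image periodic distance and its injection into the attention scores.

```python
import numpy as np

def periodic_distances(frac, lattice):
    """Minimum-image pairwise distances for fractional coordinates
    under periodic boundary conditions."""
    diff = frac[:, None, :] - frac[None, :, :]
    diff -= np.round(diff)                    # wrap to the nearest periodic image
    return np.linalg.norm(diff @ lattice, axis=-1)

def gem_attention(Q, K, V, dist, gamma=1.0):
    """GEM-style sketch: an additive bias from periodic distances is injected
    into the attention logits, so nearby atoms attend more strongly."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) - gamma * dist
    scores -= scores.max(axis=-1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V
```

Because the bias enters the logits additively, the rest of the block stays a plain Transformer, which is how the paper keeps the efficiency of standard attention.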
Results
Crystalite achieves state-of-the-art performance on crystal structure prediction benchmarks and de novo generation tasks, attaining the highest S.U.N. discovery score among evaluated models while significantly improving sampling speed compared to geometry-heavy alternatives.
Implications
The development of Crystalite has significant implications for materials science, particularly in the discovery and design of novel crystalline materials with targeted properties. Its efficient modeling capabilities could facilitate faster exploration of the vast compositional and structural space of crystalline materials, potentially accelerating advancements in various applications such as electronics, catalysis, and pharmaceuticals.
Test-Time Scaling Makes Overtraining Compute-Optimal
NLP
Large Language Models
Optimization
- Introduces Train-to-Test (T2) scaling laws that optimize model size, training tokens, and inference samples under a fixed compute budget.
- Demonstrates that optimal pretraining strategies shift towards overtraining when considering inference costs.
- Validates the T2 scaling approach by showing improved performance of overtrained models compared to traditional scaling methods.
- Findings persist even after post-training, indicating the relevance of T2 scaling in practical deployments.
Read more
Test-Time Scaling Makes Overtraining Compute-Optimal
Summary
This paper addresses the optimization of model training and inference costs in large language models (LLMs) by introducing Train-to-Test (T2) scaling laws. Traditional pretraining scaling laws, such as Chinchilla, do not consider the inference costs associated with test-time scaling, which can lead to suboptimal model performance. The authors propose a unified framework that jointly optimizes model size, training tokens, and the number of inference samples under fixed compute budgets. Through extensive evaluations across eight downstream tasks, the study finds that optimal pretraining decisions shift towards overtraining when accounting for inference costs. The results demonstrate that models trained in the overtrained regime outperform those trained according to standard scaling laws. Additionally, the findings remain valid even after post-training, indicating the robustness of the T2 scaling approach in real-world applications.
Methodology
The authors develop T2 scaling laws that model performance as a function of model size, dataset size, and the number of inference samples. They evaluate both loss-based and accuracy-based formulations to incorporate inference costs and validate their approach through training over 100 models across various compute levels.
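The joint optimization can be sketched as a grid search under a fixed compute budget. The loss form A/N^α + B/D^β + C/k^γ and every coefficient below are hypothetical stand-ins (the paper fits its own laws); the cost model uses the conventional 6·N·D training FLOPs plus 2·N per inferred token across k samples.

```python
def t2_allocate(budget, a_coefs=(400.0, 0.34), b_coefs=(400.0, 0.28),
                c_coefs=(2.0, 0.5), q_tokens=1e6):
    """Grid search over (params N, tokens D, samples k) under a fixed budget,
    using a hypothetical loss A/N^a + B/D^b + C/k^c and the usual cost model
    6*N*D train FLOPs + 2*N*k*q inference FLOPs. All coefficients illustrative."""
    A, alpha = a_coefs
    B, beta = b_coefs
    C, gamma = c_coefs
    best, best_loss = None, float("inf")
    for N in (1e7, 3e7, 1e8, 3e8):
        for D in (1e9, 3e9, 1e10, 3e10):
            for k in (1, 4, 16, 64):
                if 6 * N * D + 2 * N * k * q_tokens > budget:
                    continue
                loss = A / N**alpha + B / D**beta + C / k**gamma
                if loss < best_loss:
                    best, best_loss = (N, D, k), loss
    return best, best_loss
```

Even in this toy setup the qualitative finding emerges: once inference samples improve the objective, the optimum shifts away from the k = 1 (pure Chinchilla) allocation.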
Results
The study finds that optimal pretraining decisions favor smaller, overtrained models when accounting for inference costs. Models trained in the predicted overtrained regime consistently outperform those trained according to Chinchilla scaling laws across multiple tasks. The results also indicate that the compute-optimal trade-offs derived from the T2 scaling approach persist after post-training.
Implications
The findings suggest that LLMs should be trained with an awareness of their intended test-time usage, potentially leading to more efficient and effective model deployments. This could influence future research and practices in model training and scaling, particularly in resource-constrained environments.
Coupled Query-Key Dynamics for Attention
NLP
Large Language Models
- Introduces Coupled QK Dynamics, enhancing attention mechanisms by evolving queries and keys jointly.
- Achieves significant reductions in perplexity on language modeling tasks with minimal additional parameters.
- Demonstrates that coupling is crucial for performance, independent of the integrator type used.
- Identifies corpus dependency of the method's effectiveness, with varying results across different datasets.
Read more
Coupled Query-Key Dynamics for Attention
Summary
This paper introduces a novel framework for attention mechanisms in language modeling, termed Coupled Query-Key (QK) Dynamics. Unlike standard attention, which computes scores from independent projections of queries and keys, the proposed method evolves these components jointly through shared learned dynamics before scoring. This coupling enhances the representational capacity of the model, leading to improved language modeling performance and training stability. The authors demonstrate that on the WikiText-103 dataset, the coupled dynamics achieve a perplexity of 22.55–22.62 at 60M parameters, outperforming the standard attention's perplexity of 24.22 by 6.6–6.9%, while only adding 0.11% more parameters. Through structural ablation studies, they confirm that the coupling is the key factor for performance improvement, as both Hamiltonian and non-symplectic (Euler) integrators yield similar results, and the number of integration steps does not significantly affect performance. The benefits of coupled dynamics are shown to be corpus-dependent, with improvements on domain-coherent texts but degradation on heterogeneous datasets. The findings provide insights into when coupling is beneficial and offer practical guidelines for its application.
Methodology
The authors propose a framework that evolves queries (Q) and keys (K) through shared learned dynamics before scoring. They utilize two types of integrators: a Hamiltonian (symplectic leapfrog) and a non-symplectic (Euler) integrator. The evolution process involves coupled updates where Q influences K and vice versa, enriching their representations. Structural ablation studies are conducted to isolate the effects of coupling from other factors.
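The Euler variant of the coupled evolution can be sketched in a few lines. The weight matrices and the `tanh` nonlinearity are illustrative assumptions; what matters, per the ablations, is that each stream's update is driven by the other stream.

```python
import numpy as np

def coupled_qk(Q, K, W_qk, W_kq, steps=2, dt=0.1):
    """Euler integration of coupled dynamics: each stream's update is driven
    by the *other* stream, which is the coupling the ablations isolate."""
    for _ in range(steps):
        dQ = np.tanh(K @ W_kq)    # K influences Q
        dK = np.tanh(Q @ W_qk)    # Q influences K
        Q, K = Q + dt * dQ, K + dt * dK
    return Q, K
```

The evolved Q and K then feed the usual scaled dot-product scoring; a symplectic leapfrog integrator would replace the two Euler updates but, per the paper, yields similar results.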
Results
The coupled QK dynamics achieve a perplexity of 22.55–22.62 on the WikiText-103 dataset at 60M parameters, significantly better than the standard attention's 24.22. The method shows a 6.6–6.9% improvement with only a 0.11% increase in parameters. Coupling is confirmed as the key factor for performance enhancement, while the choice of integrator and the number of steps do not significantly impact the results. The benefits are corpus-dependent, with improvements on coherent texts and degradation on heterogeneous datasets.
Implications
The findings suggest that coupling queries and keys can be a powerful mechanism to enhance the performance of attention-based models, particularly in language modeling tasks. This approach could lead to more efficient training and better generalization in various NLP applications. The insights into corpus dependency also guide practitioners in selecting appropriate datasets for training.
Towards Intrinsically Calibrated Uncertainty Quantification in Industrial Data-Driven Models via Diffusion Sampler
Theory
Optimization
- Introduces a diffusion-based framework for uncertainty quantification in industrial models.
- Eliminates the need for post-hoc calibration by providing intrinsically calibrated predictive uncertainty.
- Demonstrates significant improvements in uncertainty calibration and predictive accuracy over existing methods.
- Highlights the importance of reliable uncertainty measures for safety-critical industrial applications.
Read more
Towards Intrinsically Calibrated Uncertainty Quantification in Industrial Data-Driven Models via Diffusion Sampler
Summary
This paper addresses the critical challenge of uncertainty quantification (UQ) in industrial data-driven models, which are essential for real-time monitoring of key performance indicators. The authors propose a novel diffusion-based posterior sampling framework that inherently produces well-calibrated predictive uncertainty, eliminating the need for post-hoc calibration. The method is evaluated through extensive experiments on synthetic distributions, a Raman-based phenylacetic acid soft sensor benchmark, and a real ammonia synthesis case study. Results demonstrate that the proposed approach significantly improves both uncertainty calibration and predictive accuracy compared to existing UQ techniques. The findings suggest that diffusion samplers can serve as a principled and scalable solution for enhancing uncertainty-aware modeling in industrial applications, thereby fostering trust and reliability in data-driven decision-making processes.
Methodology
The authors developed a diffusion-based posterior sampling framework that leverages Bayesian inference principles to produce well-calibrated predictive distributions. The method focuses on faithful posterior sampling to ensure that uncertainty estimates are reliable by construction, avoiding the common pitfalls of requiring additional calibration or ground-truth data.
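Calibration of a posterior sampler is typically checked by empirical coverage of its predictive intervals, which can be sketched as follows. The helper name is an assumption; the check itself is standard.

```python
import numpy as np

def empirical_coverage(samples, y_true, level=0.9):
    """Coverage of central predictive intervals built from posterior samples;
    an intrinsically calibrated sampler should give coverage close to `level`
    without any post-hoc recalibration step."""
    lo = np.quantile(samples, (1 - level) / 2, axis=0)
    hi = np.quantile(samples, 1 - (1 - level) / 2, axis=0)
    return float(np.mean((y_true >= lo) & (y_true <= hi)))
```

A well-calibrated 90% interval should cover roughly 90% of held-out targets; systematic over- or under-coverage is exactly the miscalibration that post-hoc methods try to repair and that this framework aims to avoid by construction.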
Results
The proposed method was tested on various datasets, including synthetic distributions and real-world industrial scenarios. The results showed that the diffusion sampler achieved better uncertainty calibration and predictive accuracy than traditional UQ techniques, indicating its effectiveness in providing reliable uncertainty estimates for industrial applications.
Implications
The findings of this research have significant implications for the deployment of data-driven models in safety-critical industrial settings. By providing intrinsically calibrated uncertainty quantification, the proposed method enhances the reliability of predictions, thereby supporting better decision-making and risk management in industrial processes.
LI-DSN: A Layer-wise Interactive Dual-Stream Network for EEG Decoding
Time Series
- LI-DSN introduces a layer-wise interactive mechanism for EEG decoding, overcoming limitations of late-fusion strategies.
- The Temporal-Spatial Integration Attention (TSIA) mechanism enables dynamic integration of spatial and temporal features.
- Extensive experiments show that LI-DSN outperforms 13 state-of-the-art models across various EEG tasks.
- The model addresses the 'information silo' problem by facilitating early-layer cross-stream communication.
Read more
LI-DSN: A Layer-wise Interactive Dual-Stream Network for EEG Decoding
Summary
This paper presents LI-DSN, a novel Layer-wise Interactive Dual-Stream Network designed for effective EEG decoding. Traditional dual-stream networks process temporal and spatial features independently, leading to an 'information silo' problem that limits the integration of these features until a late fusion stage. LI-DSN addresses this issue by enabling progressive cross-stream communication at each layer of the network. The key innovation is the Temporal-Spatial Integration Attention (TSIA) mechanism, which constructs two matrices: the Spatial Affinity Correlation Matrix (SACM) to capture spatial relationships among electrodes, and the Temporal Channel Aggregation Matrix (TCAM) to integrate temporal dynamics with spatial guidance. This approach allows for adaptive fusion of features through learnable channel weights. The authors conducted extensive experiments on eight diverse EEG datasets, including tasks such as motor imagery classification, emotion recognition, and steady-state visual evoked potentials (SSVEP). The results demonstrate that LI-DSN significantly outperforms 13 state-of-the-art baseline models, showcasing enhanced robustness and decoding performance, thus highlighting its potential for advancing brain-computer interface applications.
Methodology
The LI-DSN employs a dual-stream architecture with a novel TSIA mechanism that allows for layer-wise interaction between temporal and spatial features. It constructs SACM and TCAM to capture spatial relationships and temporal dynamics, respectively, and utilizes an adaptive fusion strategy with learnable weights for optimal feature integration.
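Two of the ingredients above admit simple sketches: an electrode-affinity matrix in the spirit of SACM, and a softmax-weighted adaptive fusion. Both functions are illustrative reconstructions (the paper's SACM and fusion weights are learned inside the network, not computed from raw correlations).

```python
import numpy as np

def spatial_affinity(X):
    """SACM-style sketch: affinity between electrodes as the correlation
    of their time series (rows = channels, columns = time samples)."""
    return np.corrcoef(X)

def adaptive_fuse(temporal_feat, spatial_feat, w_logits):
    """Adaptive fusion with learnable weights (softmax-normalized here)."""
    e = np.exp(w_logits - np.max(w_logits))
    w = e / e.sum()
    return w[0] * temporal_feat + w[1] * spatial_feat
```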
Results
LI-DSN demonstrated significant improvements in decoding performance across eight EEG datasets, outperforming 13 state-of-the-art models in tasks such as motor imagery classification, emotion recognition, and SSVEP, indicating its robustness and effectiveness in EEG decoding.
Implications
The advancements presented in LI-DSN could enhance the performance of brain-computer interfaces, leading to better applications in medical rehabilitation, assistive technologies, and cognitive assessment by providing more accurate and reliable EEG decoding.
Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
Reinforcement Learning
Large Language Models
Optimization
- SRPO combines the strengths of GRPO and SDPO to improve reinforcement learning efficiency.
- The framework routes samples based on their correctness, enhancing optimization focus.
- An entropy-aware mechanism helps stabilize training by emphasizing reliable signals.
- SRPO achieves significant performance gains over existing methods on multiple benchmarks.
Read more
Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
Summary
This paper introduces Sample-Routed Policy Optimization (SRPO), a novel framework that integrates Group Relative Policy Optimization (GRPO) and Self-Distillation Policy Optimization (SDPO) to enhance reinforcement learning with verifiable rewards (RLVR) for large language models. The authors identify limitations in GRPO's coarse credit assignment, which fails to address specific deviations in failed rollouts, and SDPO's tendency to collapse during prolonged training due to optimization ambiguity and degrading signal reliability. SRPO addresses these issues by routing correct samples to GRPO for reward-aligned reinforcement and failed samples to SDPO for targeted logit-level correction. Additionally, an entropy-aware dynamic weighting mechanism is incorporated to prioritize reliable distillation targets. Evaluated across five benchmarks, SRPO demonstrates both rapid early improvement and long-term stability, outperforming both GRPO and SDPO in terms of peak performance and computational efficiency.
Methodology
The authors propose SRPO as a unified on-policy framework that routes correct samples to GRPO for reward-based updates and failed samples to SDPO for logit-level corrections. The framework includes an entropy-aware dynamic weighting mechanism to manage the reliability of distillation targets throughout training.
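The routing logic can be sketched directly. The rollout fields and the exponential entropy weighting are illustrative assumptions; the paper's entropy-aware mechanism is dynamic over training, but the per-batch split is as shown.

```python
import math

def srpo_route(rollouts, tau=1.0):
    """Route correct rollouts to GRPO (reward-aligned updates) and failed
    ones to SDPO (logit-level correction), weighting each distillation
    target by exp(-entropy / tau) so unreliable targets count for less."""
    grpo_batch, sdpo_batch = [], []
    for r in rollouts:
        if r["correct"]:
            grpo_batch.append(r)
        else:
            sdpo_batch.append((r, math.exp(-r["entropy"] / tau)))
    return grpo_batch, sdpo_batch
```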
Results
SRPO achieved a five-benchmark average performance of 77.4% on the Qwen3-8B model, surpassing GRPO by 3.4% and SDPO by 6.3%. For the Qwen3-4B model, it reached 74.2%, outperforming GRPO by 4.5% and SDPO by 7.5%. Additionally, SRPO reduced per-step compute costs by up to 17.2% while maintaining moderate response lengths.
Implications
The findings suggest that SRPO can significantly enhance the training efficiency and performance of large language models in reinforcement learning contexts, potentially leading to better reasoning and problem-solving capabilities in AI applications.
go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices
Theory
Efficient ML
Large Language Models
- Introduces go-mHC, a novel parameterization method for doubly stochastic matrices.
- Achieves a balance between expressivity and computational efficiency with a complexity of O(d³).
- Demonstrates significant improvements in training stability and convergence speed.
- Validates the approach on a large-scale GPT-style language model.
Summary
The paper addresses the challenge of efficiently parameterizing doubly stochastic matrices, which are crucial for learned mixing in machine learning models. Existing methods either scale factorially with the number of streams or are limited in expressivity. The authors propose a novel parameterization method based on generalized orthostochastic matrices, termed go-mHC, which operates with a complexity of O(d³) and introduces a hyperparameter that allows for interpolation between efficiency and expressivity. This method is integrated into the Manifold-Constrained Hyper-Connections (mHC) framework, enhancing the dynamic connectivity of layers in neural networks. The authors demonstrate that go-mHC significantly improves coverage of the Birkhoff polytope while maintaining computational efficiency. Experimental results show that go-mHC achieves the minimum theoretical loss on synthetic tasks and converges up to 10 times faster than existing methods. Additionally, validation on a 30M parameter GPT-style language model indicates that go-mHC provides a practical approach for scaling model capacity through increased residual streams.
Methodology
The authors develop a parameterization pipeline that maps learned skew-symmetric parameters to orthogonal matrices using the Cayley Transform, followed by a projection to obtain a doubly stochastic matrix. This method is grounded in the algebraic theory of generalized orthostochastic matrices and integrates seamlessly with existing Kronecker-factorized methods.
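The pipeline described above can be sketched with plain NumPy: free parameters build a skew-symmetric matrix, the Cayley transform maps it to an orthogonal matrix, and squaring elementwise is the textbook way to obtain an orthostochastic (hence doubly stochastic) matrix. The paper's generalized construction and its interpolation hyperparameter are not reproduced here; this is a minimal sketch of the base map only.

```python
import numpy as np

def cayley_orthostochastic(theta, d):
    """Free parameters -> skew-symmetric A -> orthogonal Q (Cayley
    transform) -> elementwise square -> orthostochastic matrix,
    which is doubly stochastic by construction."""
    A = np.zeros((d, d))
    A[np.triu_indices(d, k=1)] = theta   # d*(d-1)/2 free parameters
    A = A - A.T                          # skew-symmetric: A^T = -A
    I = np.eye(d)
    Q = np.linalg.solve(I + A, I - A)    # orthogonal whenever A is skew
    return Q * Q                         # rows/cols of Q have unit norm
```

Because every row and column of an orthogonal matrix has unit norm, the row and column sums of the squared matrix are exactly 1, with no iterative projection (e.g. Sinkhorn normalization) required.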
Results
go-mHC achieves the minimum theoretical loss on synthetic stream-mixing tasks and converges up to 10 times faster than traditional methods. Spectral analysis shows that go-mHC fills the Birkhoff polytope more completely than Kronecker-factorized approaches, indicating enhanced expressivity.
Implications
The proposed method has significant implications for improving the performance and scalability of deep learning models, particularly in applications requiring dynamic layer connectivity and efficient parameterization of complex matrix structures.
Forecasting Supply Chain Disruptions with Foresight Learning
NLP
Large Language Models
Time Series
- Introduces a novel forecasting task linking real-time news to future supply chain disruptions.
- Develops an end-to-end modeling approach that trains LLMs directly on raw news inputs.
- Achieves superior predictive performance compared to pretrained models and baselines.
- Induces structured and decision-relevant reasoning behavior in the model.
Summary
This paper addresses the challenge of forecasting supply chain disruptions by introducing an end-to-end framework that leverages large language models (LLMs) to produce calibrated probabilistic forecasts. The authors highlight the difficulties in predicting infrequent, high-impact events from noisy and unstructured inputs, which traditional models struggle to handle without specific adaptations. The proposed framework trains LLMs using realized disruption outcomes as supervision, allowing the model to directly generate probabilistic forecasts from raw news inputs. This approach not only improves predictive performance but also enhances the model's ability to reason about uncertainty and prioritize relevant signals without additional prompting. The study demonstrates that the new forecasting task, which links real-time news to future disruptions, significantly outperforms existing models, including GPT-5, in terms of accuracy, calibration, and precision. The authors also provide an open-source evaluation dataset to support transparency and further research.
Methodology
The authors employ an end-to-end training framework based on Foresight Learning, which formulates forecasting as supervised learning using temporally aligned text and future outcomes. The model is trained to produce probabilistic forecasts directly from timestamped news articles and disruption indexes, focusing on one-month-ahead predictions of significant disruption increases.
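The evaluation metrics mentioned here, Brier score and calibration error, are standard and easy to state precisely. A minimal NumPy version, using equal-width confidence bins for the calibration term, might look like:

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared gap between forecast probabilities and the
    binary outcomes eventually realized (lower is better)."""
    p, y = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((p - y) ** 2))

def calibration_error(probs, outcomes, n_bins=10):
    """Expected calibration error: bin forecasts by confidence and
    average the |mean forecast - observed frequency| gap per bin,
    weighted by bin size."""
    p, y = np.asarray(probs, float), np.asarray(outcomes, float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    err, n = 0.0, len(p)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.sum() / n * abs(p[mask].mean() - y[mask].mean())
    return err
```

A forecaster that always outputs 0.5 scores a Brier of 0.25 on balanced outcomes, which is why lower Brier scores alongside lower calibration error indicate forecasts that are both sharp and honest.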
Results
The proposed model demonstrates substantial improvements over strong baselines, including lower Brier scores, reduced calibration error, and higher precision. The training process enhances the model's probabilistic reasoning capabilities, allowing it to handle uncertainty more effectively and prioritize relevant signals.
Implications
The findings suggest a pathway for developing domain-specific forecasting models that can provide decision-ready signals for supply chain management. This approach could be applied to various industries facing similar forecasting challenges, potentially improving resilience against disruptions.
Feature Weighting Improves Pool-Based Sequential Active Learning for Regression
Theory
Efficient ML
Optimization
- Introduces feature weighting in distance computation for active learning in regression.
- Proposes five new active learning approaches that incorporate feature weights.
- Demonstrates consistent performance improvements over existing unweighted methods.
- Validates the effectiveness of feature weighting across both single-task and multi-task regression problems.
Summary
This paper addresses the challenge of pool-based sequential active learning for regression (ALR), which aims to select a small number of samples from a large pool of unlabeled data to improve regression model accuracy within a limited labeling budget. The author identifies that existing ALR methods often neglect the importance of feature weighting in distance computations, leading to sub-optimal sample selection. To remedy this, the paper proposes three feature-weighted single-task ALR approaches (FW-RD, FW-GSx, FW-iGS) and two multi-task approaches (FW-MT-GSx, FW-MT-iGS). These methods utilize ridge regression coefficients derived from previously labeled samples to weight features during distance calculations. Extensive experiments demonstrate that these feature-weighted approaches consistently outperform their unweighted counterparts across various regression tasks, indicating that incorporating feature importance can significantly enhance active learning performance. The findings suggest that this feature weighting strategy is not only effective for regression but can also be extended to stream-based active learning and classification tasks.
Methodology
The paper develops three feature-weighted single-task ALR approaches and two multi-task ALR approaches. It employs ridge regression to estimate feature weights based on a small set of labeled samples, which are then used to compute distances among samples for improved sample selection. The proposed methods are tested against existing ALR techniques to evaluate their performance.
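To make the idea concrete, the sketch below implements a feature-weighted variant of greedy sampling in input space (GSx-style): ridge coefficients learned on the labeled set scale each feature before distances are computed, and the pool point farthest from every already-labeled point is selected next. The closed-form ridge solver and the specific weight normalization are illustrative choices, not necessarily the paper's exact setup.

```python
import numpy as np

def ridge_coef(X, y, alpha=1.0):
    """Closed-form ridge regression coefficients (features centered)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    d = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ yc)

def fw_gsx_select(X_labeled, y_labeled, X_pool, n_select):
    """Feature-weighted greedy sampling (FW-GSx-style sketch):
    weight each feature by the magnitude of its ridge coefficient,
    then greedily pick the pool point farthest (in weighted distance)
    from everything already labeled or selected."""
    w = np.abs(ridge_coef(X_labeled, y_labeled))
    w = w / (w.sum() + 1e-12)
    sw = np.sqrt(w)                    # weighted L2 == plain L2 after scaling
    ref = X_labeled * sw
    pool = X_pool * sw
    chosen = []
    for _ in range(n_select):
        d = np.linalg.norm(pool[:, None, :] - ref[None, :, :], axis=2)
        score = d.min(axis=1)          # distance to nearest reference point
        score[chosen] = -np.inf        # never re-select a point
        i = int(np.argmax(score))
        chosen.append(i)
        ref = np.vstack([ref, pool[i]])
    return chosen
```

With an uninformative second feature, the weighting steers selection toward points that vary along the feature the regression actually uses, which an unweighted distance would ignore.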
Results
The experimental results indicate that all five proposed feature-weighted ALR approaches outperform their unweighted versions in terms of regression accuracy. The improvements are consistent across different regression models, both linear and nonlinear, showcasing the robustness of the feature weighting strategy.
Implications
The findings suggest that incorporating feature weighting can significantly enhance the efficiency of active learning in regression tasks, making it a valuable approach for scenarios where labeling is costly or time-consuming. This methodology could be applied to various fields, including affective computing and oil and gas production, where obtaining labeled data is challenging.
World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry
Reinforcement Learning
Robotics
Efficient ML
- WAV enables world models to self-improve by verifying their own prediction errors.
- The framework decomposes state prediction into state plausibility and action reachability.
- WAV utilizes action-free data from videos to enhance verification processes.
- Empirical results show a 2× increase in sample efficiency and an 18% boost in policy performance.
Summary
The paper introduces the World Action Verifier (WAV), a novel framework designed to enhance the robustness of world models in reinforcement learning by enabling them to self-improve through an asymmetric forward-inverse cycle. Traditional world models struggle to predict future states accurately across a wide range of suboptimal actions due to limited action-labeled data. WAV addresses this by decomposing the action-conditioned state prediction into two components: state plausibility and action reachability. The authors leverage the availability of action-free data from video corpora to verify state plausibility and utilize a sparse inverse model to infer actions from a subset of state features for action reachability. This dual verification approach allows for improved sample efficiency and policy performance. The framework is empirically validated across nine tasks, demonstrating a 2× increase in sample efficiency and an 18% improvement in downstream policy performance compared to existing methods. The results suggest that exploiting the asymmetries between forward and inverse dynamics can significantly enhance the capabilities of self-improving world models.
Methodology
The methodology involves a framework that integrates a diverse subgoal generator from video data and a sparse inverse model to infer actions. The verification process is structured around cycle consistency among generated subgoals, inferred actions, and forward rollouts, allowing the model to identify and correct its own prediction errors effectively.
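The cycle-consistency check at the heart of this verification scheme can be illustrated with a toy sketch: accept a forward prediction only if the predicted state is plausible and the inverse model recovers the same action. The function names, the exact-match action test, and the 1D toy world are simplifications; the actual framework operates over learned models and sparse state features.

```python
def wav_verify(s, a, forward_model, inverse_model, plausible):
    """Accept a forward prediction only if (1) the predicted state is
    plausible at all -- a check action-free video data can support --
    and (2) the inverse model recovers the action that was supposedly
    taken (cycle consistency)."""
    s_next = forward_model(s, a)
    if not plausible(s_next):
        return False, s_next          # fails state plausibility
    return inverse_model(s, s_next) == a, s_next

# Toy 1D world: the correct forward model moves by `a`;
# the buggy one moves by `2 * a` and is caught by the verifier.
forward_ok = lambda s, a: s + a
forward_buggy = lambda s, a: s + 2 * a
inverse = lambda s, s_next: s_next - s
plausible = lambda s: 0 <= s <= 10
```

The point of the asymmetry is that each check is cheaper than the forward problem itself: plausibility needs no action labels, and the inverse model only needs to look at the state features an action can influence.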
Results
WAV was tested across nine tasks in environments like MiniGrid, RoboMimic, and ManiSkill, achieving a 2× increase in sample efficiency and improving downstream policy performance by over 18% compared to existing methods.
Implications
The findings suggest that self-improving world models can be developed more effectively by leveraging available data asymmetries, which could lead to advancements in robotic learning and other applications requiring robust predictive models.
PAC-Bayesian Reward-Certified Outcome Weighted Learning
Theory
- Introduces PROWL, a framework that incorporates reward uncertainty into ITR estimation.
- Establishes a certified reduction that transforms policy learning into a cost-sensitive classification problem.
- Derives a nonasymptotic PAC-Bayes lower bound for randomized ITRs.
- Proposes an automated calibration procedure for learning-rate selection.
Summary
This paper introduces PAC-Bayesian Reward-Certified Outcome Weighted Learning (PROWL), a novel framework aimed at improving the estimation of individualized treatment rules (ITRs) in the presence of reward uncertainty. Traditional outcome weighted learning (OWL) methods often overlook the noise and optimism in observed rewards, leading to inflated performance estimates. PROWL addresses this by incorporating a one-sided uncertainty certificate to construct conservative rewards and policy-dependent lower bounds on true expected values. The authors prove a certified reduction that reformulates robust policy learning into a cost-sensitive classification task, allowing for the derivation of a nonasymptotic PAC-Bayes lower bound for randomized ITRs. The optimal posterior for this bound is characterized by a general Bayes update. To tackle the learning-rate selection issue in Bayesian inference, the authors propose an automated calibration procedure and a Fisher-consistent certified hinge surrogate for optimization. Experimental results demonstrate that PROWL significantly enhances the estimation of robust, high-value treatment regimes under severe reward uncertainty compared to standard ITR estimation methods.
Methodology
The methodology involves the development of PROWL, which utilizes a PAC-Bayesian framework to handle reward uncertainty. It constructs conservative rewards and policy-dependent lower bounds, reformulates policy learning into a classification task, and employs a calibration procedure for optimizing learning rates.
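A stylized version of the two steps described above: shrink each observed reward by a one-sided uncertainty margin (clipping at zero), then use the resulting conservative rewards as misclassification costs. The Gaussian-style `z * standard error` margin is an illustrative stand-in for the paper's PAC-Bayesian certificate, and the 0-1 cost is a stand-in for its certified hinge surrogate.

```python
import numpy as np

def conservative_rewards(rewards, reward_se, z=1.645):
    """One-sided lower-confidence rewards: shrink each observed
    reward by a z * standard-error margin and clip at zero, so only
    confidently positive outcomes carry weight."""
    r_lo = np.asarray(rewards, float) - z * np.asarray(reward_se, float)
    return np.clip(r_lo, 0.0, None)

def cost_sensitive_loss(predicted, observed, weights):
    """Certified reduction in miniature: disagreeing with the
    observed treatment costs that sample's conservative reward."""
    mismatch = (np.asarray(predicted) != np.asarray(observed)).astype(float)
    return float(np.sum(np.asarray(weights, float) * mismatch))
```

A noisy but nominally positive reward can be driven to zero weight, which is exactly how the conservative construction prevents optimistic outcomes from inflating the learned rule's apparent value.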
Results
The experiments show that PROWL outperforms standard methods for estimating individualized treatment rules, particularly in scenarios with significant reward uncertainty, leading to more accurate and robust treatment recommendations.
Implications
The findings suggest that PROWL can enhance personalized medicine by providing more reliable treatment recommendations, potentially improving clinical outcomes through better-informed decision-making in the presence of uncertain rewards.
MiCA Learns More Knowledge Than LoRA and Full Fine-Tuning
NLP
Large Language Models
Efficient ML
- MiCA focuses on adapting underutilized subspaces of model representations, unlike conventional methods that target dominant subspaces.
- The method uses Singular Value Decomposition to identify minor singular vectors for parameter updates.
- MiCA achieves up to a 5.9x improvement in knowledge acquisition while using only 6-60% as many trainable parameters as LoRA.
- The approach minimizes catastrophic forgetting and enhances knowledge retention in large language models.
Summary
The paper introduces Minor Component Adaptation (MiCA), a novel parameter-efficient fine-tuning method for large language models (LLMs) that focuses on adapting underutilized subspaces of model representations. Unlike traditional methods like Low-Rank Adaptation (LoRA), which target dominant subspaces, MiCA employs Singular Value Decomposition (SVD) to identify and constrain updates to minor singular vectors associated with the least significant singular values. This approach results in a significant improvement in knowledge acquisition, achieving up to 5.9 times better performance under optimized training hyperparameters while using only 6-60% as many trainable parameters as LoRA. The authors argue that this method offers a more efficient and stable mechanism for integrating new knowledge into pre-trained language models, minimizing the risk of catastrophic forgetting and maximizing knowledge retention. The paper also discusses the experimental setup, datasets, and comparative methods used to validate MiCA's effectiveness, highlighting its performance across various tasks and benchmarks.
Methodology
MiCA employs Singular Value Decomposition to identify minor singular directions in the weight matrix of pre-trained language models. It constrains the parameter updates during fine-tuning to these minor components, allowing for efficient adaptation while preserving the majority of the model's pre-trained weights.
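The constraint described here can be sketched in a few lines of NumPy: compute the SVD of a pretrained weight matrix and project a raw update onto the span of its k smallest singular directions, leaving the dominant subspace untouched. The rank choice and the two-sided projection are illustrative assumptions about how the constraint is applied.

```python
import numpy as np

def mica_update(W, dW, k):
    """Constrain a raw fine-tuning update dW to the span of the k
    minor (smallest-singular-value) singular directions of the
    pretrained weight W, leaving the dominant subspace untouched."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_m, Vt_m = U[:, -k:], Vt[-k:, :]   # minor directions (S sorted descending)
    return U_m @ (U_m.T @ dW @ Vt_m.T) @ Vt_m
```

Because the projected update is orthogonal to the top singular directions, the components of `W` that encode most of the pretrained behavior are untouched, which is the intuition behind reduced catastrophic forgetting.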
Results
The experimental results indicate that MiCA significantly outperforms LoRA in terms of knowledge acquisition, achieving improvements of up to 5.9 times while maintaining a lower parameter footprint. This demonstrates the effectiveness of constraining adaptation to minor singular directions for integrating new knowledge.
Implications
MiCA's approach could lead to more efficient fine-tuning methods for large language models, making it easier to adapt these models to specific tasks without incurring high computational costs. This could have applications in various domains requiring rapid model adaptation and knowledge retention.
Dual-Attention Based 3D Channel Estimation
Theory
Optimization
Efficient ML
- Introduction of a dual-attention mechanism for channel estimation in MIMO systems.
- Theoretical foundation for optimal 3D channel estimation derived from 5G-NR systems.
- 3DCENet outperforms traditional and existing deep learning-based channel estimation methods.
- Significant reduction in mean-square-error (MSE) by leveraging channel correlations across multiple domains.
Summary
This paper addresses the challenge of optimal channel estimation (CE) in multiple-input multiple-output (MIMO) systems, particularly under the constraints of high computational complexity associated with three-dimensional (3D) filtering. The authors propose a novel deep learning-based approach, termed 3DCENet, which utilizes a dual-attention mechanism to effectively capture channel correlations across time, frequency, and spatial domains. The paper begins with a theoretical derivation of the optimal 3DCE based on the demodulation reference signal (DMRS) in 5G-NR systems, followed by an analysis of optimal noise-power allocation across different domains. The proposed 3DCENet integrates least-square estimates from various antenna ports and resource elements, enhancing the filtering process through spatial attention (SA) and time-frequency attention (TFA). The results demonstrate that 3DCENet significantly outperforms traditional CE methods, achieving up to a 4 dB improvement in mean-square error (MSE) compared to conventional 2D and (2D+1D)-based estimators, as well as existing deep learning models like SRCNN and EDSR.
Methodology
The authors derive the optimal 3D channel estimation using linear minimum mean square error (LMMSE) principles and propose a deep learning architecture (3DCENet) that employs dual attention mechanisms to enhance channel estimation accuracy. The model processes input features from multiple antenna ports and resource elements, applying spatial and time-frequency attention to improve filtering operations.
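The LMMSE principle the derivation builds on has a compact closed form. As a minimal single-domain sketch (the paper's contribution is extending such filtering jointly across time, frequency, and space):

```python
import numpy as np

def lmmse_estimate(y_ls, R_h, noise_var):
    """Classical LMMSE filtering of a least-squares channel estimate:
    h_hat = R_h (R_h + sigma^2 I)^{-1} y_ls, where R_h is the channel
    covariance at the pilot positions and y_ls the LS estimate."""
    d = R_h.shape[0]
    W = R_h @ np.linalg.inv(R_h + noise_var * np.eye(d))
    return W @ y_ls
```

At zero noise the filter is the identity and returns the LS estimate unchanged; as noise grows, the estimate is shrunk according to the channel correlations in `R_h`, which is the behavior the learned attention weights are meant to approximate adaptively.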
Results
Numerical simulations indicate that 3DCENet achieves a mean-square error (MSE) reduction of up to 4 dB compared to conventional genie-aided LLNet and 2D estimators. It also shows superior performance over SRCNN- and EDSR-based methods, confirming the effectiveness of the proposed dual-attention approach in channel estimation.
Implications
The proposed method has significant implications for the design of next-generation wireless communication systems, particularly in enhancing the performance of MIMO systems in 6G networks. The dual-attention mechanism could be adapted for other applications requiring high-dimensional data processing and correlation analysis.