AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 46 · Update frequency: 8h · Days of history: 7
Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models
Theory
Large Language Models
Multimodal
- Foundation models face unique OOD challenges that differ from classical OOD assumptions.
- A stage-aware formalization of OOD is necessary to account for multi-stage training distributions.
- Model-centric methods have intrinsic limitations, leading to a 'parameter coverage ceiling' for certain inputs.
- Agentic systems extend the capabilities of FMs by integrating perception and external strategies.
Summary
This paper argues that out-of-distribution (OOD) generalization in foundation models (FMs) presents unique challenges that cannot be adequately addressed by traditional model-centric approaches. The authors propose that agentic systems, which incorporate perception, strategy selection, external action, and closed-loop verification, represent a necessary paradigm shift for tackling these challenges. The paper outlines a stage-aware formalization of OOD that acknowledges the complexities of multi-stage training distributions and the limitations of parameter-based representations. The authors demonstrate that there are inputs that model-centric methods cannot handle due to intrinsic limitations, termed the 'parameter coverage ceiling.' They characterize agentic OOD systems and argue for their complementarity to existing methods, suggesting a new research agenda that recognizes the importance of agentic paradigms in FM-OOD contexts.
Methodology
The authors present a formalization of OOD that incorporates multiple training stages and partially observed distributions. They prove the existence of a parameter coverage ceiling, demonstrating that certain inputs cannot be addressed by model-centric methods. The characteristics of agentic OOD systems are defined and discussed, along with responses to counterarguments regarding the proposed paradigm shift.
Results
The paper establishes that traditional model-centric methods are insufficient for handling OOD scenarios encountered by foundation models. It proves that there are relevant inputs that cannot be effectively managed within the constraints of parameter-based representations. The characterization of agentic systems shows that they can overcome these limitations, suggesting that a dual approach combining both paradigms is essential for future advancements.
Implications
The findings suggest that integrating agentic systems into the development of foundation models could significantly enhance their performance in open-world applications. This paradigm shift could lead to improved OOD generalization, making FMs more robust in real-world scenarios, such as medical AI and autonomous systems.
No Triangulation Without Representation: Generalization in Topological Deep Learning
Graph Learning
Theory
- Extension of the MANTRA dataset to include a wider variety of manifold triangulations.
- Demonstration that GNNs and HOMP methods can saturate the benchmark with the right representations.
- Introduction of a novel evaluation protocol focusing on representational diversity and triangulation refinement.
- Findings indicate that existing models fail to generalize beyond combinatorial structures, highlighting a gap in TDL.
Summary
This paper addresses the lack of consensus on evaluating topological deep learning (TDL) models, particularly those targeting higher-order datasets. The authors extend the MANTRA benchmark dataset, which consists of manifold triangulations, to include a broader range of manifolds with diverse homeomorphism types. They demonstrate that both graph neural networks (GNNs) and higher-order message passing (HOMP) methods can achieve high performance on this benchmark, but only when provided with the appropriate representation and feature assignment. The authors introduce a new evaluation protocol that emphasizes representational diversity and triangulation refinement. Their findings reveal that existing models struggle to generalize beyond the combinatorial structure of the data, indicating a significant gap in developing models that can understand topological structures independently of scale. The work lays the groundwork for future evaluations and the development of topology-aware inductive biases in machine learning.
Methodology
The authors extended the MANTRA dataset by incorporating a fully-characterized set of triangulations in dimensions 2 and 3. They utilized Pachner moves as data augmentation and evaluation tools. The impact of different representations and encodings on predictive performance was systematically investigated, and refinement schemes for triangulations were introduced as a generalization stress test.
Results
The study found that while GNNs and HOMP methods can achieve high performance on the extended MANTRA benchmark, they do so only under specific representations. All models tested failed to generalize across triangulation refinements, suggesting that they do not learn genuine topological structures but rather exploit combinatorial artifacts.
Implications
The findings suggest a need for new models that can effectively learn and generalize topological structures, independent of their combinatorial representations. This could lead to advancements in topological data analysis and applications in fields such as applied mathematics and physics.
Structure-Preserving Gaussian Processes Via Discrete Euler-Lagrange Equations
Robotics
Time Series
Theory
- Introduction of Lagrangian Gaussian Processes (LGPs) for learning dynamics models.
- Preservation of the geometric structure of the Lagrange-d’Alembert principle.
- Ability to learn from discrete position data without requiring velocity or momentum measurements.
- Demonstrated data efficiency and generalization in synthetic and real-world scenarios.
Summary
This paper introduces Lagrangian Gaussian Processes (LGPs) aimed at learning dynamics in a probabilistic and data-efficient manner through discrete forced Euler-Lagrange equations. The authors emphasize the importance of preserving the geometric structure of the Lagrange-d’Alembert principle, which governs the motion of dynamical systems, particularly in scenarios without external forces. This preservation allows for the development of physically consistent models that mitigate energy drift, thus ensuring stable long-term predictions. The methodology centers around linear operators for Gaussian process conditioning derived from discrete forced Euler-Lagrange equations, enabling the learning of dynamics solely from discrete position snapshots, which is crucial for applications where only positional data is available, such as motion capture and visual servoing. The authors validate their approach through various synthetic and real-world experiments, including a soft robot exhibiting hysteresis, demonstrating that LGPs can learn physically consistent dynamics with uncertainty quantification from sparse positional data, leading to reliable long-term predictions.
Methodology
The authors develop two schemes for LGPs: a discrete version and a continuous version, both leveraging discrete forced Euler-Lagrange linear operators. These methods allow for the learning of dynamics models characterized by the system's Lagrangian and external force functions, using only position data. The approach incorporates a normalization condition to avoid restrictions on kernel choices and enables the integration of additional energy model knowledge when available.
Results
The experimental results indicate that the LGPs effectively learn physically consistent dynamics and provide uncertainty quantification from sparse positional data. The validation includes successful applications in synthetic environments and a real-world soft robot, showcasing the models' ability to maintain energy conservation and deliver stable long-term predictions.
Implications
The proposed LGP framework has significant implications for fields requiring robust modeling of dynamical systems, such as robotics, control systems, and motion analysis. By enabling learning from limited data, it opens avenues for more efficient and reliable applications in soft robotics and other areas where only positional data is available.
WARP: A Benchmark for Primal-Dual Warm-Starting of Interior-Point Solvers
Optimization
- The evaluation baseline for warm-start methods has been corrected, revealing that previous claims of iteration reductions were misleading.
- Primal prediction accuracy is anticorrelated with convergence speed in interior-point methods.
- Providing complete primal-dual information significantly reduces solver iterations compared to primal-only methods.
- The authors release a benchmark suite and a new model (WARP) that effectively predicts the full interior-point state.
Summary
This paper addresses the challenges in solving the AC Optimal Power Flow (AC-OPF) problem using interior-point methods (IPMs) like IPOPT, which are crucial for electricity market operations. Previous works have reported significant reductions in solver iterations (30-46%) by using machine learning to predict primal warm-start iterates. However, the authors argue that these results are based on an inappropriate evaluation baseline, specifically the flat start (Vm = 1, Va = 0), rather than the actual default initialization at the variable-bound midpoint, which is near-optimal. The authors demonstrate that no primal-only warm-start method reduces iterations when evaluated against this corrected baseline. They identify a geometric property of IPMs where primal prediction accuracy is negatively correlated with convergence speed, leading to divergence when the ground-truth optimal solution is provided without dual variables. Through oracle experiments, they show that providing the complete primal-dual-barrier state can reduce IPOPT iterations by 85%. To facilitate rigorous evaluation of warm-start methods, the authors introduce a benchmark suite with dual-labeled AC-OPF datasets and a new evaluation protocol. They present WARP, a topology-conditioned encode-process-decode interaction network that predicts the full interior-point state, achieving a 76% reduction in IPOPT iterations while accommodating topology variations without retraining.
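As a rough illustration of the baseline correction described above, the sketch below (with made-up bounds, not the paper's data or code) contrasts a flat start with the variable-bound midpoint initialization for an AC-OPF-style variable vector.

```python
# A minimal sketch contrasting the two initializations discussed above:
# the "flat start" (Vm = 1, Va = 0) versus the variable-bound midpoint
# used as the corrected baseline. Bounds are illustrative only.
import numpy as np

n_bus = 4
vm_lb, vm_ub = np.full(n_bus, 0.9), np.full(n_bus, 1.1)   # voltage magnitude bounds (p.u.)
va_lb, va_ub = np.full(n_bus, -0.5), np.full(n_bus, 0.5)  # voltage angle bounds (rad)

# Flat start: the baseline the authors argue is inappropriate.
x_flat = np.concatenate([np.ones(n_bus), np.zeros(n_bus)])

# Variable-bound midpoint: the corrected, near-optimal default baseline.
x_mid = np.concatenate([(vm_lb + vm_ub) / 2, (va_lb + va_ub) / 2])

print("flat start:", x_flat)
print("midpoint  :", x_mid)
```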
Methodology
The authors conducted systematic experiments to diagnose the failure of primal-only warm-start methods, including ablation studies on prediction techniques and the necessity of dual information. They developed WARP, an interaction network that predicts the full interior-point state, leveraging topology-aware features.
Results
Against a corrected evaluation baseline, no primal-only warm-start method reduced solver iterations. In contrast, the complete primal-dual state reduced IPOPT iterations from 23 to 3 (an 85% reduction). WARP achieved a 76% reduction in iterations while accommodating topology variations without retraining.
Implications
The findings suggest that future research on warm-start methods for interior-point solvers should incorporate dual information for effective performance. The benchmark suite and WARP model can serve as valuable tools for evaluating and improving warm-start strategies in power system optimization.
Crafting Reversible SFT Behaviors in Large Language Models
NLP
Large Language Models
Interpretability
- Introduces the concept of sparse behavioral carriers for controlling SFT-induced behaviors in LLMs.
- Proposes Loss-Constrained Dual Descent (LCDD) for constructing these carriers through joint optimization.
- Demonstrates the effectiveness of SFT-Eraser in reversing SFT behaviors without weight modification.
- Provides evidence that the sparse structure is essential for causal control over behaviors.
Summary
This paper addresses the challenge of controlling behaviors induced by supervised fine-tuning (SFT) in large language models (LLMs). Traditional methods for interpreting behaviors in LLMs often identify correlations between model components and behaviors post-hoc, which does not guarantee causal relationships. The authors propose a novel approach to construct sparse, mechanistically necessary subnetworks, termed 'carriers', that encapsulate SFT-induced behaviors and can be controlled at inference time without modifying model weights. They introduce two key methodologies: Loss-Constrained Dual Descent (LCDD), which optimizes routing masks and model weights under a utility budget to create these carriers, and SFT-Eraser, a soft prompt designed to reverse SFT behaviors through activation matching. The results demonstrate that the carriers effectively preserve target behaviors while allowing for strong reversibility when triggered by SFT-Eraser. The findings indicate that the sparse structure of the carriers is crucial for the reversal of behaviors, providing evidence that these carriers are causally necessary for the behaviors they encapsulate. This work opens new avenues for the systematic localization and selective suppression of SFT-induced behaviors in deployed models.
Methodology
The authors developed Loss-Constrained Dual Descent (LCDD) to create sparse carriers by optimizing routing masks and model weights under a utility constraint. They also introduced SFT-Eraser, which uses activation matching to reverse SFT-induced behaviors, confirming the necessity of the sparse structure for effective control.
Results
The experiments showed that LCDD successfully constructs sparse carriers that maintain target behaviors while enabling effective reversibility through SFT-Eraser across various behavior types and model families. Ablation studies confirmed that the sparse structure is critical for achieving the desired behavioral control.
Implications
This research has significant implications for enhancing the interpretability and controllability of large language models, allowing for more modular post-training adjustments. It could lead to improved safety and alignment in AI systems by enabling precise behavioral control without the need for retraining.
Hypothesis generation and updating in large language models
Large Language Models
Theory
Interpretability
- LLMs generate and update hypotheses based on sparse numerical examples, revealing their inductive biases.
- LLMs often exhibit a two-parameter Bayesian fit but show systematic biases towards narrower hypotheses.
- There is a significant gap between hypothesis evaluation and generation performance in LLMs.
- LLMs generalize poorly to unobserved parts of the hypothesis domain, indicating limitations in their reasoning capabilities.
Summary
This paper investigates how large language models (LLMs) generate and update hypotheses based on limited numerical examples, using a controlled setting known as the number game. The study aims to understand the inference capabilities of LLMs and how closely they align with optimal Bayesian reasoning. The author employs three complementary probes—posterior prediction, hypothesis evaluation, and hypothesis generation—to measure the posterior over hypotheses. The findings reveal that LLMs often exhibit a two-parameter Bayesian fit but with systematic biases, such as a strong-sampling assumption that favors narrower hypotheses. Additionally, there is a notable evaluation-generation gap where LLMs perform better during hypothesis evaluation than in generating hypotheses. The results indicate that LLMs struggle with generalization beyond the observed examples, highlighting limitations in their ability to serve as general problem solvers, particularly in scientific inference contexts.
Methodology
The study utilizes the number game, a controlled experimental setting where learners infer hypotheses from a few positive integer examples. It employs three measurement probes to assess LLM behavior: posterior prediction, hypothesis evaluation, and hypothesis generation, comparing results with an optimal Bayesian model and human behavior.
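For readers unfamiliar with the number game, the sketch below implements the kind of Bayesian reference model the LLM probes are compared against, using a tiny made-up hypothesis space; under strong sampling a consistent hypothesis gets likelihood (1/|h|)^n, which is the size-principle preference for narrower hypotheses discussed above.

```python
# A minimal sketch (tiny, made-up hypothesis space) of the Bayesian number-game
# model: strong sampling gives consistent hypotheses likelihood (1/|h|)^n,
# so narrower hypotheses are favoured as more examples arrive.
import numpy as np

domain = range(1, 101)
hypotheses = {
    "even":           {x for x in domain if x % 2 == 0},
    "multiples_of_4": {x for x in domain if x % 4 == 0},
    "powers_of_two":  {1, 2, 4, 8, 16, 32, 64},
    "interval_1_20":  set(range(1, 21)),
}
prior = {name: 1.0 / len(hypotheses) for name in hypotheses}

def posterior(data):
    scores = {}
    for name, h in hypotheses.items():
        if all(x in h for x in data):                              # consistency check
            scores[name] = prior[name] * (1.0 / len(h)) ** len(data)
        else:
            scores[name] = 0.0
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}

def predictive(y, data):
    # P(y belongs to the true concept | data), summing over consistent hypotheses
    return sum(p for name, p in posterior(data).items() if y in hypotheses[name])

data = [2, 8, 16]
print(posterior(data))       # mass concentrates on "powers_of_two"
print(predictive(32, data))  # generalization to an unseen number
```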
Results
The results indicate that LLM predictions align closely with a simple Bayesian model, but with biases that lead to a preference for narrower hypotheses. The evaluation probe shows a stronger inclination towards hypotheses that include all observed examples, while the generation probe produces simpler, rule-like hypotheses. Furthermore, LLMs demonstrate poor generalization when queried about a broader domain than what they have observed.
Implications
These findings have significant implications for the development of LLMs as general problem solvers, particularly in scientific contexts where hypotheses must extend beyond the data. Understanding these limitations can guide future research in enhancing LLM inference capabilities and aligning them more closely with Bayesian reasoning.
AeroJEPA: Learning Semantic Latent Representations for Scalable 3D Aerodynamic Field Modeling
Optimization
Theory
Generative Models
- AeroJEPA introduces a novel predictive latent architecture for aerodynamic modeling.
- The architecture separates the prediction of latent representations from the resolution of output fields.
- It demonstrates competitive performance on high-fidelity aerodynamic datasets.
- The learned latent space enables effective interpolation and design optimization.
Summary
The paper introduces AeroJEPA, a Joint-Embedding Predictive Architecture designed for scalable 3D aerodynamic field modeling. Traditional aerodynamic surrogate models face challenges in scaling to large fields and providing useful latent representations for analysis and design. AeroJEPA addresses these issues by predicting a target latent representation of the flow from a context latent representation of the geometry and operating conditions, rather than directly predicting the full flow field. This approach allows for decoupling the prediction from field resolution and encourages a semantically organized latent space. The authors evaluate AeroJEPA on two datasets: HiLiftAeroML, which tests high-fidelity performance with large boundary-layer fields, and SuperWing, which assesses generalization and latent-space optimization across various transonic wing designs. The results demonstrate that AeroJEPA is competitive as a continuous surrogate for aerodynamic fields, scales effectively to high-resolution outputs, and learns context and predicted latents that encapsulate geometry and aerodynamic quantities. The latent space supports controlled interpolation, linear probing, and design optimization, indicating that predictive latent learning is a promising avenue for aerodynamic surrogate modeling.
Methodology
AeroJEPA employs a Joint-Embedding Predictive Architecture that predicts target latent representations of flow fields from context latent representations of geometry and operating conditions. It includes a continuous implicit decoder for reconstructing fields at arbitrary locations, allowing for flexible and scalable modeling.
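The sketch below is a generic, minimal JEPA-style setup (not the AeroJEPA architecture or code, and with made-up dimensions): a context encoder embeds geometry and operating conditions, a target encoder embeds the flow field, and a predictor is trained to match target latents in latent space, which is what decouples prediction from field resolution.

```python
# A minimal, generic JEPA-style sketch: the loss is computed between predicted
# and encoded latents rather than on the full output field.
import torch
import torch.nn as nn

d_geom, d_field, d_latent = 32, 1024, 64

context_enc = nn.Sequential(nn.Linear(d_geom, 128), nn.ReLU(), nn.Linear(128, d_latent))
target_enc  = nn.Sequential(nn.Linear(d_field, 128), nn.ReLU(), nn.Linear(128, d_latent))
predictor   = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(), nn.Linear(128, d_latent))

geom  = torch.randn(8, d_geom)    # geometry + operating conditions (toy)
field = torch.randn(8, d_field)   # flattened flow field (toy)

z_ctx = context_enc(geom)
with torch.no_grad():             # target encoder held fixed for this step in the sketch
    z_tgt = target_enc(field)
loss = nn.functional.mse_loss(predictor(z_ctx), z_tgt)
loss.backward()
```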
Results
AeroJEPA was evaluated on two datasets, showing competitive performance as a continuous surrogate for aerodynamic fields. It effectively captures geometry and aerodynamic information, supports smooth latent space interpolation, and facilitates constrained design-space searches.
Implications
The findings suggest that AeroJEPA can significantly enhance the efficiency of aerodynamic design processes by providing scalable and meaningful latent representations, which can be leveraged for optimization and analysis without the need for extensive computational resources.
Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters
Time Series
- Synthetic data augmentation is architecture-dependent, benefiting channel-mixing models while degrading performance in channel-independent models.
- In low-resource settings, synthetic data can significantly enhance model performance, particularly with TimesNet.
- The Seasonal-Trend generator is the most effective synthetic data method across tested benchmarks.
- Hard curriculum switching is detrimental, leading to increased mean squared error (MSE).
Summary
This paper investigates the role of synthetic data in time series forecasting, an area where its impact is not well understood compared to its success in language models. The authors conduct a large-scale empirical study involving 4,218 runs across nine experimental groups, five forecasting architectures, four types of synthetic signals, and seven datasets. The findings reveal that the effectiveness of synthetic data augmentation is highly dependent on the architecture used. Channel-mixing models, such as TimesNet and iTransformer, generally benefit from synthetic data, while channel-independent models like DLinear and PatchTST tend to perform worse when synthetic data is introduced. In low-resource scenarios, TimesNet trained with synthetic data on only 10% of the Weather dataset outperforms the baseline trained on the full dataset in several cases. However, across all architectures, synthetic augmentation negatively impacts performance in 67% of trials. The study also identifies the Seasonal-Trend generator as the most reliable method for improving performance and warns against the use of hard curriculum switching, which can lead to significant degradation in model performance. The paper concludes with actionable guidelines for effectively utilizing synthetic data in time series forecasting, emphasizing the importance of architecture and dataset characteristics.
Methodology
The authors developed a controllable synthetic time series generator organized around a bundle system that defines the temporal characteristics of each channel. They employed difficulty conditioning and cross-channel correlation controls to modulate complexity and inter-variable structure. The study systematically evaluated the impact of synthetic data across various architectures and datasets through extensive experimentation.
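A minimal sketch of what a Seasonal-Trend style synthetic channel could look like (illustrative only, not the authors' bundle system): a linear trend plus sinusoidal seasonal components plus noise, with the noise level acting as a simple difficulty knob.

```python
# A toy Seasonal-Trend style generator: trend + seasonal sinusoids + noise.
import numpy as np

def seasonal_trend_channel(length=500, trend_slope=0.01, periods=(24, 168),
                           amplitudes=(1.0, 0.5), noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    t = np.arange(length)
    series = trend_slope * t
    for period, amp in zip(periods, amplitudes):
        phase = rng.uniform(0, 2 * np.pi)
        series = series + amp * np.sin(2 * np.pi * t / period + phase)
    return series + rng.normal(0, noise_std, size=length)  # noise_std as a difficulty knob

synthetic = np.stack([seasonal_trend_channel(seed=s) for s in range(4)], axis=1)
print(synthetic.shape)  # (500, 4): a toy multivariate series for augmentation
```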
Results
The results indicate that synthetic data augmentation is beneficial for channel-mixing architectures but harmful for channel-independent models. In low-resource scenarios, TimesNet with synthetic data outperformed the full-data baseline in 4 out of 16 cases. Overall, augmentation negatively affected performance in 67% of trials, highlighting the importance of careful selection of synthetic data methods.
Implications
The findings suggest that synthetic data can be a powerful tool for enhancing time series forecasting, particularly in low-resource environments. However, practitioners must consider the architecture and dataset characteristics to avoid potential performance degradation. The guidelines provided can help in optimizing the use of synthetic data in real-world applications.
RVPO: Risk-Sensitive Alignment via Variance Regularization
Reinforcement Learning
Large Language Models
Optimization
- RVPO addresses constraint neglect in multi-objective RLHF by penalizing inter-reward variance.
- The LogSumExp operator is shown to effectively act as a smooth variance penalty.
- RVPO improves adherence to critical constraints while avoiding late-stage training degradation.
- The framework is validated across two distinct multi-objective paradigms with significant performance improvements.
Summary
The paper introduces Reward-Variance Policy Optimization (RVPO), a novel framework designed to address the issue of constraint neglect in multi-objective reinforcement learning from human feedback (RLHF) methods. Traditional critic-less RLHF approaches, such as Group Relative Policy Optimization (GRPO) and Group Decoupled Policy Optimization (GDPO), aggregate rewards using arithmetic means, which can mask critical failures in low-magnitude constraints by offsetting them with high-magnitude successes in other objectives. RVPO shifts the focus from merely maximizing the sum of rewards to maximizing consistency by penalizing inter-reward variance during advantage aggregation. The authors demonstrate that the LogSumExp (SoftMin) operator can serve as a smooth variance penalty, effectively improving adherence to bottleneck constraints. The framework is evaluated on two paradigms: rubric-based medical and scientific reasoning with multiple LLM-judged reward signals and rule-based tool-calling with specific constraints. Results indicate that RVPO significantly enhances overall performance on HealthBench while maintaining accuracy on GPQA-Diamond, thus mitigating the risk of constraint neglect without sacrificing general capabilities.
Methodology
The authors propose RVPO, which incorporates a variance penalty into the reward aggregation process. This is achieved through the use of the LogSumExp (SoftMin) operator, which allows for a smooth interpolation between mean and min aggregation. The framework is empirically validated using multiple concurrent reward signals in two different settings: rubric-based evaluations and rule-based tool-calling.
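The following numerical sketch (with made-up reward values and temperatures, not the paper's settings) shows how the LogSumExp/SoftMin operator interpolates between the arithmetic mean and the minimum, so a single failed constraint is no longer averaged away.

```python
# SoftMin via LogSumExp: large tau behaves like the mean, small tau like the min.
import numpy as np

def softmin(rewards, tau):
    r = np.asarray(rewards, dtype=float)
    return -tau * np.log(np.mean(np.exp(-r / tau)))

rewards = [0.95, 0.90, 0.05]   # two objectives succeed, one constraint fails
print("mean    :", np.mean(rewards))         # ~0.63: the failure is averaged away
print("tau=5.0 :", softmin(rewards, 5.0))    # ~0.62: close to the mean
print("tau=0.01:", softmin(rewards, 0.01))   # ~0.06: close to the min
print("min     :", np.min(rewards))
```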
Results
RVPO outperformed GDPO on HealthBench, achieving a score of 0.261 compared to GDPO's 0.215 (p < 0.001) at the 14B model scale. It also maintained competitive accuracy on GPQA-Diamond without the late-stage degradation observed in other multi-reward methods, demonstrating its effectiveness in improving constraint adherence and overall model performance.
Implications
The findings suggest that RVPO can be a powerful tool for enhancing the alignment of large language models (LLMs) in multi-objective settings, particularly in applications where strict adherence to certain constraints is critical, such as in medical or scientific reasoning tasks. This approach could lead to more reliable and robust AI systems capable of balancing competing objectives effectively.
Nationwide EHR-Based Chronic Rhinosinusitis Prediction Using Demographic-Stratified Models
Interpretability
- Utilized nationwide EHR data to enhance CRS prediction accuracy.
- Developed a hybrid feature-selection method to condense clinical codes.
- Implemented demographic-stratified models to capture variations in disease presentation.
- Achieved an AUC of 0.8461, improving prediction discrimination.
Summary
This paper addresses the challenge of predicting Chronic Rhinosinusitis (CRS) using nationwide electronic health record (EHR) data from the All of Us Research Program. CRS is a prevalent inflammatory disorder that is often misdiagnosed due to symptom overlap with other conditions. The authors developed a hybrid feature-selection pipeline that reduced over 110,000 clinical codes to 100 interpretable features, enhancing the model's performance. They implemented demographic-stratified models tailored to six adult sex and life-stage subgroups, recognizing the significant variations in CRS presentation across different demographics. The study achieved an overall AUC of 0.8461, indicating improved discrimination in predicting CRS compared to baseline models. This research highlights the potential of EHR data in supporting early diagnosis and triage in clinical settings, especially when imaging is not accessible.
Methodology
The authors leveraged longitudinal EHR data from the All of Us Research Program, employing a hybrid feature-selection pipeline that combined statistical prevalence-based screening with model-based importance ranking. They trained demographic-stratified models across six adult sex and life-stage subgroups, allowing for tailored hyperparameter tuning and improved predictive performance.
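The sketch below illustrates the two-stage idea on synthetic data (not the All of Us cohort, and not the authors' pipeline): a prevalence screen over sparse clinical-code indicators followed by a model-based importance ranking that keeps the top features.

```python
# A toy two-stage hybrid feature-selection sketch on synthetic code indicators.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n_patients, n_codes = 2000, 500
X = (rng.random((n_patients, n_codes)) < 0.02).astype(float)   # sparse code indicators
y = (X[:, :5].sum(axis=1) + rng.normal(0, 0.5, n_patients) > 0.5).astype(int)

# Stage 1: prevalence screening -- drop codes seen in too few patients.
prevalence = X.mean(axis=0)
screened = np.where(prevalence >= 0.01)[0]

# Stage 2: model-based importance ranking on the screened codes, keep top-k.
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X[:, screened], y)
top_k = 100
ranked = screened[np.argsort(model.feature_importances_)[::-1][:top_k]]
print(f"kept {len(ranked)} of {n_codes} codes")
```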
Results
The proposed framework achieved an overall AUC of 0.8461, which is an improvement of 0.0168 over the best baseline model. The demographic-specific modeling approach effectively captured the heterogeneity in CRS risk factors, leading to better discrimination in predictions.
Implications
This research suggests that routinely collected EHR data can be effectively utilized for population-representative CRS risk stratification, potentially facilitating earlier diagnosis and prioritization of referrals in primary care settings, particularly when diagnostic imaging is not available.
Two-Stage Learned Decomposition for Scalable Routing on Multigraphs
Optimization
Reinforcement Learning
Graph Learning
- Introduces Node-Edge Policy Factorization (NEPF) for scalable routing on multigraphs.
- Utilizes a pre-encoding edge aggregation scheme to reduce memory and computational costs.
- Employs a non-autoregressive architecture for efficient edge selection.
- Demonstrates superior performance in solution quality and speed compared to existing methods.
Summary
This paper addresses the limitations of existing neural methods for Vehicle Routing Problems (VRPs) that primarily focus on Euclidean settings or simple graphs. The authors propose a novel approach for routing on multigraphs, where multiple edges represent different travel options with varying attributes. To tackle scalability issues associated with existing methods, they introduce a Node-Edge Policy Factorization (NEPF) that separates the routing policy into two stages: a node permutation stage and an edge selection stage. This decomposition is facilitated by a pre-encoding edge aggregation scheme that compresses parallel edges into a latent representation, significantly reducing memory usage. Additionally, a lightweight non-autoregressive architecture is employed for the edge selection stage, and both stages are trained jointly using hierarchical reinforcement learning. Experimental results across six VRP variants demonstrate that NEPF not only matches but often outperforms state-of-the-art methods in solution quality while being significantly faster in training and inference. This work represents a significant step towards scalable and efficient routing solutions in complex transportation networks.
Methodology
The authors propose a two-stage learned decomposition approach where the routing solution is factorized into a node permutation and edge selection. A pre-encoding edge aggregation mechanism is introduced to summarize parallel edges, avoiding the need for a dense multigraph representation during encoding. A non-autoregressive architecture is designed for the edge selection stage, and both stages are trained jointly using hierarchical reinforcement learning.
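On a toy multigraph, the decomposition looks roughly like the sketch below; the hand-written node order and greedy edge choice stand in for the learned node-permutation and edge-selection policies described above.

```python
# A toy two-stage decomposition: stage 1 fixes a node order, stage 2 picks one
# of the parallel edges (e.g. different travel options) for each consecutive pair.
# parallel_edges[(u, v)] -> list of (edge_name, cost)
parallel_edges = {
    (0, 1): [("highway", 4.0), ("local", 6.0)],
    (1, 2): [("highway", 3.0), ("local", 2.5)],
    (2, 0): [("ferry", 5.0), ("local", 7.0)],
}

def stage1_node_order():
    return [0, 1, 2]                      # stand-in for the node-permutation policy

def stage2_edge_selection(order):
    tour, total = [], 0.0
    for u, v in zip(order, order[1:] + order[:1]):
        name, cost = min(parallel_edges[(u, v)], key=lambda e: e[1])  # stand-in for the edge policy
        tour.append((u, v, name))
        total += cost
    return tour, total

order = stage1_node_order()
tour, cost = stage2_edge_selection(order)
print(tour, cost)   # cheapest parallel edge on each leg; total cost 11.5
```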
Results
The NEPF approach matches or outperforms state-of-the-art methods across six VRP variants, achieving significant improvements in both training and inference speed. The proposed framework demonstrates strong compatibility with various node-based backbones, indicating its potential for broader applications in routing models.
Implications
This work has significant implications for real-world transportation networks, where multigraphs are common. The NEPF framework can facilitate more efficient routing solutions, potentially leading to cost reductions and improved logistics in various applications, including delivery services and public transportation systems.
AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning
Reinforcement Learning
Theory
Robotics
- AdaGamma introduces a practical implementation of state-dependent discounting in deep RL.
- The method includes a return-consistency objective to stabilize learning and prevent TD-error collapse.
- Empirical results show consistent improvements in performance on continuous-control tasks.
- AdaGamma was successfully validated in a real-world online A/B test on the JD Logistics platform.
Summary
The paper introduces AdaGamma, a novel deep actor-critic method that implements state-dependent discounting in reinforcement learning (RL). Traditional RL methods typically use a fixed discount factor, which can be limiting in environments with varying temporal structures. AdaGamma addresses this by learning a state-dependent discount function while incorporating a return-consistency objective to stabilize learning and prevent issues like TD-error collapse. The authors analyze the Bellman operator associated with state-dependent discounting, demonstrating its well-posedness under certain conditions. AdaGamma is integrated into both Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO) frameworks, showing consistent performance improvements on continuous-control benchmarks. Additionally, the method was validated in a real-world application on the JD Logistics platform, where it achieved statistically significant enhancements over standard SAC. The findings suggest that adaptive discounting can effectively enhance decision-making in RL by tailoring temporal propagation to the specific context of each state.
Methodology
AdaGamma employs a deep actor-critic architecture that learns a state-dependent discount function. It integrates a return-consistency objective to regularize the learning process, ensuring that the discounting mechanism does not lead to unstable learning or manipulation of TD targets. The method is instantiated within existing frameworks like SAC and PPO, adapting their return estimation and optimization processes accordingly.
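A minimal sketch of the state-dependent discounting idea (networks, data, and the 0.99 cap are made up; the return-consistency regularizer and the training of the discount network itself are omitted): the TD target uses a per-state discount predicted by a small network instead of a fixed constant.

```python
# TD target with a state-dependent discount gamma(s') instead of a fixed gamma.
import torch
import torch.nn as nn

obs_dim = 8
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
gamma_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

s = torch.randn(32, obs_dim)       # states
s_next = torch.randn(32, obs_dim)  # next states
r = torch.randn(32, 1)             # rewards
done = torch.zeros(32, 1)          # episode-termination flags

with torch.no_grad():
    gamma = 0.99 * gamma_net(s_next)                     # per-state discount in (0, 0.99)
    target = r + (1.0 - done) * gamma * critic(s_next)   # adaptive TD target
td_loss = nn.functional.mse_loss(critic(s), target)
td_loss.backward()
```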
Results
The implementation of AdaGamma in both SAC and PPO resulted in statistically significant performance improvements on standard continuous-control benchmarks. In a four-week online A/B test on the JD Logistics platform, AdaGamma outperformed the standard SAC baseline, demonstrating its effectiveness in real-world applications.
Implications
The findings suggest that state-dependent discounting can enhance reinforcement learning algorithms, particularly in environments with heterogeneous temporal structures. This approach could lead to more robust and efficient decision-making in various applications, including robotics, recommendation systems, and other domains requiring adaptive learning strategies.
MOSAIC: Module Discovery via Sparse Additive Identifiable Causal Learning for Scientific Time Series
Time Series
Interpretability
Generative Models
- MOSAIC integrates identifiable causal learning with support recovery for enhanced interpretability in scientific time series.
- The framework employs a sparse temporal VAE with an additive decoder to clarify the influence of latent variables.
- Theoretical guarantees for the recovery of identifiable supports are provided, ensuring robustness in various applications.
- Empirical results demonstrate successful recovery of interpretable latent mechanisms across multiple scientific domains.
Summary
The paper introduces MOSAIC, a novel framework for causal representation learning (CRL) tailored for scientific time series data. Traditional CRL methods often achieve identifiability of latent variables but fail to ensure interpretability, particularly in scientific contexts where the underlying mechanisms are unknown. MOSAIC addresses this gap by integrating identifiable causal learning with support recovery over observed variables, allowing for the identification of latent variables through regime-conditioned temporal variation. The framework employs a sparse temporal variational autoencoder (VAE) with an additive decoder that delineates the influence of each latent variable on a sparse set of associated observations, thereby enhancing interpretability. The authors provide theoretical guarantees for the population recovery of identifiable supports and demonstrate the effectiveness of MOSAIC through empirical evaluations across various scientific domains, including RNA molecular dynamics and climate data. The results indicate that MOSAIC successfully recovers interpretable latent mechanisms, thus facilitating a deeper understanding of complex scientific phenomena.
Methodology
MOSAIC utilizes a sparse temporal variational autoencoder (VAE) that combines causal representation learning with an additive decoder. This architecture allows for the identification of latent variables based on regime-conditioned temporal variations and recovers a sparse set of associated observations, enhancing the interpretability of the latent mechanisms. The framework also incorporates a parallel transition prior to improve scalability and efficiency in high-dimensional data contexts.
Results
MOSAIC successfully identifies and recovers interpretable latent mechanisms in various scientific time series datasets, including RNA molecular dynamics, solar wind data, and climate indices. The empirical evaluations show that the framework can recover domain-consistent variable groups and provides finite-sample recovery guarantees, demonstrating its effectiveness in translating identifiable latent variables into meaningful scientific interpretations.
Implications
The development of MOSAIC has significant implications for scientific research, particularly in fields where understanding the underlying mechanisms of observed phenomena is crucial. By providing interpretable latent representations, MOSAIC can enhance data-driven discoveries and facilitate the analysis of complex systems in biology, climate science, and other domains.
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on ℓ1-norm Lower Bounds
Optimization
Theory
Efficient ML
- SignSGD achieves superior convergence rates compared to SGD under specific conditions, particularly in the presence of sparse noise.
- The paper establishes tight ℓ1-norm lower bounds for SignSGD, providing a clear characterization of its performance.
- The theoretical framework is extended to matrix optimization, showing that the advantages of sign-based methods persist in higher dimensions.
- Empirical validation demonstrates the practical benefits of SignSGD, aligning theoretical predictions with observed performance in large-scale model training.
Summary
This paper investigates the theoretical foundations of sign-based optimization algorithms, specifically SignSGD, and their advantages over traditional stochastic gradient descent (SGD). The authors identify that while SGD is known to be minimax optimal under standard smoothness and variance conditions, it does not account for the unique characteristics of sign-based methods. By shifting the focus from ℓ2-norms to ℓ1-norm stationarity and ℓ∞-smoothness, the authors derive matched upper and lower bounds for SignSGD, demonstrating that it can outperform SGD in scenarios with sparse noise. They also extend their analysis to matrix optimization, establishing similar bounds for the Muon optimizer. The theoretical insights are validated through empirical results showing faster convergence of SignSGD during the pretraining of a 124M parameter GPT-2 model, suggesting that the theoretical framework aligns well with practical outcomes.
Methodology
The authors analyze sign-based optimizers by employing ℓ1-norm stationarity, ℓ∞-smoothness, and a separable noise model. They derive upper and lower bounds for SignSGD and SGD under these new conditions, focusing on the geometry of the problem space that aligns with the characteristics of sign-based updates.
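The sketch below simply spells out the two update rules being compared, on a toy quadratic with a sparse-noise gradient oracle; it is illustrative only and makes no claim about which method wins in general.

```python
# SGD steps along the stochastic gradient; SignSGD along its elementwise sign.
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
x_sgd = rng.normal(size=d)
x_sign = x_sgd.copy()
lr = 0.01

def stochastic_grad(x):
    noise = np.zeros(d)
    idx = rng.choice(d, size=d // 100, replace=False)   # sparse noise, echoing the noise model above
    noise[idx] = rng.normal(scale=5.0, size=idx.size)
    return x + noise                                    # gradient of f(x) = 0.5 * ||x||^2, plus noise

for _ in range(500):
    x_sgd -= lr * stochastic_grad(x_sgd)                # SGD update
    x_sign -= lr * np.sign(stochastic_grad(x_sign))     # SignSGD update

print("final ||x||_1, SGD    :", np.abs(x_sgd).sum())
print("final ||x||_1, SignSGD:", np.abs(x_sign).sum())
```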
Results
The study finds that SignSGD can reduce the complexity of optimization by a factor of d (the problem dimension) under sparse noise conditions. The theoretical bounds for SignSGD are shown to match its empirical performance, particularly in the context of training a large GPT-2 model, where it converges faster than SGD.
Implications
The findings suggest that sign-based optimization methods like SignSGD can be more effective in training large models, particularly in environments with high-dimensional data and sparse noise. This could lead to more efficient training protocols in machine learning applications, especially in deep learning and large-scale model pretraining.
Accelerating LMO-Based Optimization via Implicit Gradient Transport
Optimization
Theory
Efficient ML
- Introduction of LMO-IGT, a new class of stochastic LMO-based optimization methods.
- Development of a unified framework for analyzing stochastic LMO-based methods.
- Introduction of the regularized support function (RSF) as a new stationarity measure.
- Theoretical improvement in iteration complexity for LMO-IGT compared to existing methods.
Summary
This paper introduces LMO-IGT, a novel class of stochastic optimization methods that leverage implicit gradient transport (IGT) to enhance the performance of linear minimization oracle (LMO)-based optimizers. The authors address the limitations of existing LMO-based methods, which often require additional gradient evaluations for variance reduction, leading to increased computational costs. Within a unified framework for stochastic LMO-based optimization, they introduce a new stationarity measure called the regularized support function (RSF) that integrates gradient-norm and Frank–Wolfe-gap concepts. The LMO-IGT method accelerates convergence while maintaining the efficiency of standard stochastic LMO methods by evaluating stochastic gradients at transported points. The theoretical analysis shows that LMO-IGT achieves an iteration complexity of O(ε^-3.5) with only a single gradient evaluation per iteration, improving on traditional stochastic LMO methods (O(ε^-4)) while avoiding the extra gradient evaluations needed by variance-reduced LMO methods (O(ε^-3)). Empirical results demonstrate that LMO-IGT, particularly its instantiation Muon-IGT, consistently outperforms its counterparts with minimal overhead, showcasing its effectiveness in large-scale optimization scenarios.
Methodology
The authors propose LMO-IGT, which utilizes implicit gradient transport to construct momentum from gradients evaluated at lookahead points. They develop a unified framework that encompasses stochastic LMO, variance-reduced LMO, and LMO-IGT, allowing for a comprehensive analysis of convergence properties across different optimization settings.
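A simplified sketch of the lookahead idea (the transport coefficient and the exact update differ from the paper's): momentum is built from gradients evaluated at a transported point rather than at the current iterate, and the step follows the LMO direction, which for the ℓ∞ ball is the elementwise sign.

```python
# Simplified implicit-gradient-transport momentum with a sign (l-inf ball) LMO step.
import numpy as np

rng = np.random.default_rng(0)
d, lr, beta = 100, 0.01, 0.9
x = rng.normal(size=d)
x_prev = x.copy()
m = np.zeros(d)

def stochastic_grad(z):
    return z + rng.normal(scale=0.5, size=d)   # toy quadratic with noise

for _ in range(300):
    lookahead = x + (beta / (1.0 - beta)) * (x - x_prev)   # transported evaluation point
    g = stochastic_grad(lookahead)
    m = beta * m + (1.0 - beta) * g                        # momentum on transported gradients
    x_prev = x.copy()
    x = x - lr * np.sign(m)                                # LMO step for the l-inf ball
print("final ||x||_inf:", np.abs(x).max())
```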
Results
The theoretical analysis establishes that LMO-IGT achieves an iteration complexity of O(ε^-3.5) with only one gradient evaluation per iteration, while traditional stochastic LMO achieves O(ε^-4) and variance-reduced LMO achieves O(ε^-3) at the cost of additional evaluations. Empirical results indicate that LMO-IGT consistently outperforms its stochastic LMO counterparts, with Muon-IGT demonstrating the best performance across various settings.
Implications
The findings suggest that LMO-IGT can significantly enhance the efficiency of optimization algorithms used in deep learning and other high-dimensional non-convex problems, making it a valuable tool for training large-scale neural networks without incurring high computational costs.
Enabling Federated Inference via Unsupervised Consensus Embedding
Federated Learning
Computer Vision
Time Series
- CE-FI enables federated inference without sharing raw inputs or model parameters.
- The framework consists of a Consensus Embedding layer and a Cooperative Output layer.
- CE-FI outperforms solo inference and matches conventional methods under non-IID conditions.
- The approach is applicable beyond image classification, including text and time-series tasks.
Summary
The paper introduces Consensus Embedding-based Federated Inference (CE-FI), a novel framework designed to facilitate cooperative inference among independently deployed machine learning models without the need for sharing input data or model parameters. This approach is particularly relevant in privacy-sensitive environments where data confidentiality is paramount. CE-FI comprises two main components: a Consensus Embedding (CE) layer that aligns heterogeneous intermediate representations into a common embedding space, and a Cooperative Output (CO) layer that generates predictions based on these embeddings. Both layers are trained using shared unlabeled data, eliminating the requirement for additional labeled datasets during the cooperative inference stage. The authors conducted experiments on image classification tasks using CIFAR-10 and CIFAR-100 datasets under various non-IID conditions, demonstrating that CE-FI consistently outperforms solo inference and achieves performance comparable to conventional methods that necessitate stronger sharing assumptions. The framework's versatility is further supported by evaluations on text and time-series tasks, although performance varies based on the ensemble strategy employed. The study identifies representation alignment as a critical bottleneck in the cooperative inference process.
Methodology
The CE-FI framework employs a Consensus Embedding layer to map heterogeneous intermediate representations into a shared embedding space, allowing for knowledge exchange among different models. The Cooperative Output layer then reconstructs predictions from these consensus embeddings. Both layers are trained using shared unlabeled data, facilitating cooperation without the need for additional labeled data.
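The sketch below shows the flavor of the alignment step on synthetic embeddings (a plain least-squares map, not the CE-FI training procedure): representations from two independently trained models are aligned using only shared unlabeled examples, with no exchange of raw inputs or model weights.

```python
# Align one client's intermediate representations to another's using only
# shared unlabeled data (least squares stands in for the learned CE layer).
import numpy as np

rng = np.random.default_rng(0)
n_shared, d_a, d_b = 1000, 64, 48

# Stand-ins for intermediate representations from two independently trained models.
z_shared = rng.normal(size=(n_shared, 32))   # latent "content" of the shared examples
emb_a = z_shared @ rng.normal(size=(32, d_a)) + 0.05 * rng.normal(size=(n_shared, d_a))
emb_b = z_shared @ rng.normal(size=(32, d_b)) + 0.05 * rng.normal(size=(n_shared, d_b))

# Fit a linear consensus map from client B's space into client A's space.
W, *_ = np.linalg.lstsq(emb_b, emb_a, rcond=None)
aligned_b = emb_b @ W

err = np.linalg.norm(aligned_b - emb_a) / np.linalg.norm(emb_a)
print("relative alignment error on shared data:", round(float(err), 3))
```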
Results
Experimental results indicate that CE-FI consistently outperforms solo inference methods and performs comparably to traditional federated inference techniques that require stronger assumptions about data sharing. The framework demonstrated effectiveness across various tasks, including image classification, text, and time-series analysis, although performance was influenced by the ensemble strategy used.
Implications
The CE-FI framework has significant implications for privacy-sensitive applications, such as healthcare and finance, where sharing raw data or model parameters is restricted. It allows organizations to leverage existing models collaboratively while maintaining data confidentiality, potentially enhancing inference performance across diverse domains.
Contrastive Identification and Generation in the Limit
Theory
- Introduces contrastive identification and generation in the limit, focusing on relational data.
- Presents an exact characterization of contrastive identifiable classes and a new combinatorial dimension.
- Demonstrates a reversal under finite adversarial corruption, highlighting robustness in contrastive learning.
- Establishes a common crossing graph to analyze learning challenges in contrastive settings.
Summary
This paper introduces the concepts of contrastive identification and generation in the limit, expanding upon classical models of learning from positive examples. The authors address the limitations of existing models that rely solely on positive or fully labeled data by proposing a framework where learners observe unordered pairs of examples that disagree under an unknown target hypothesis. The study presents three main results in a noiseless setting: an exact characterization of contrastive identifiable classes, the introduction of a combinatorial dimension called contrastive closure dimension, and a characterization of uniform contrastive generation with tight sample complexity. The authors also demonstrate a significant reversal under finite adversarial corruption, showing that certain classes are identifiable from contrastive pairs even with corruption, while they may not be identifiable from positive examples. The paper emphasizes the importance of relational data in learning and introduces a common crossing graph as a unifying technical object to analyze pairwise ambiguity and generation obstructions.
Methodology
The authors develop a theoretical framework to analyze contrastive identification and generation by defining a contrastive presentation of data as unordered pairs of examples that disagree under an unknown target hypothesis. They utilize combinatorial techniques and geometric interpretations to derive conditions for learnability and to introduce the common crossing graph as a central analytical tool.
Results
The paper provides an exact combinatorial condition for contrastive identification, establishing that a hypothesis class must be text-identifiable and have overlapping covers for its hypotheses. It also introduces a closure-style combinatorial dimension for uniform contrastive generation, characterizing which classes can generate novel positives from contrastive data. Additionally, it reveals that certain classes can be identified from contrastive pairs under adversarial corruption, which is not possible with positive examples alone.
Implications
The findings suggest that contrastive learning frameworks can be more robust to noise and corruption, making them suitable for applications involving relational data such as A/B testing and comparative analysis. This work could influence future research in learning theory and practical applications in machine learning where relational data is prevalent.
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
Reinforcement Learning
Large Language Models
NLP
- Reframes reasoning RL as internalizing outcome supervision into process supervision.
- Introduces the IOP framework for automatic generation of process-level signals.
- Demonstrates improved policy optimization through failure repair during RL.
- Achieves 4.9-6.9% accuracy improvement and 2.3x sample efficiency over existing methods.
Summary
This paper addresses a significant challenge in reasoning reinforcement learning (RL), which is the conversion of sparse outcome feedback into detailed learning signals for intermediate reasoning steps. The authors propose a novel framework called IOP (Internalized Outcome Process), which allows models to automatically generate process-level learning signals by identifying and correcting failed reasoning trajectories. This approach enables finer-grained policy optimization based solely on outcome supervision, overcoming the limitations of existing methods that rely on externally constructed process supervision. The paper formalizes this concept into a training paradigm where the model continuously refines its internal process supervision during RL, enhancing credit assignment for intermediate reasoning steps. The IOP framework includes mechanisms such as audit gating, minimum-edit repair, and verification-based adaptive truncation, leading to improved performance in reasoning tasks. Experiments demonstrate that IOP significantly outperforms traditional methods, achieving higher accuracy and sample efficiency without the need for external annotations.
Methodology
The authors propose the IOP framework, which includes mechanisms for audit gating, minimum-edit repair, and verification-based adaptive truncation. This framework allows the model to convert sequence-level rewards into token-level gating signals, enabling it to generate and refine internal process supervision based on outcome feedback alone.
Results
The IOP framework, particularly its instantiation IOP-GSPO, consistently outperformed existing methods, achieving an average accuracy improvement of 4.9-6.9% and approximately 2.3 times the sample efficiency across three reasoning benchmarks.
Implications
The proposed paradigm could significantly enhance the efficiency and effectiveness of reinforcement learning in reasoning tasks, making it applicable in areas such as natural language processing, automated reasoning systems, and complex decision-making scenarios where intermediate steps are crucial.
When Can Voting Help, Hurt, or Change Course? Exact Structure of Binary Test-Time Aggregation
Theory
- Majority voting can exhibit nonmonotonic behavior under heterogeneous latent correctness distributions.
- The voting curve is equivalent to a signed voting signature that captures the latent correctness distribution.
- Different voting behaviors can arise from simple latent mixtures, challenging the notion of 'more votes always help'.
- The study separates two estimation regimes: direct access to per-example success probabilities versus finite repeat-depth grouped labels.
Summary
This paper investigates the dynamics of majority voting as a method for enhancing the performance of fixed stochastic predictors in machine learning. It challenges the classical fixed-competence theory, which posits that more votes monotonically improve outcomes above a majority threshold and worsen them below it. The author introduces a nuanced perspective using the de Finetti representation for exchangeable correctness, revealing that voting behavior is influenced by a latent distribution of correctness probabilities. The study demonstrates that even simple mixtures of latent distributions can lead to diverse voting curves, including nonmonotonic behaviors and infinitely many trend changes. The paper establishes that the voting curve is equivalent to a signed voting signature, which captures the excess latent mass above the majority threshold. This insight allows for a deeper understanding of various phenomena related to voting curves, including their shapes and the implications of different estimation regimes. The findings suggest that majority voting can yield unexpected results, emphasizing the importance of understanding the underlying correctness distributions in practical applications.
Methodology
The author employs a theoretical framework based on the de Finetti representation for exchangeable correctness, analyzing the voting behavior of binary classifiers under different latent distributions of correctness probabilities. The study utilizes mathematical constructs such as signed Hausdorff moments to establish the relationship between voting curves and latent distributions.
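As a quick numerical illustration of how a simple latent mixture already breaks the fixed-competence intuition, the sketch below (with made-up mixture weights and probabilities) computes majority-vote accuracy for increasing odd vote budgets; in this example more votes steadily hurt.

```python
# Majority-vote accuracy under a two-component latent correctness mixture.
import numpy as np
from scipy.stats import binom

weights = np.array([0.4, 0.6])   # latent mixture over per-example success probabilities
probs = np.array([0.9, 0.3])

def majority_accuracy(k):
    # P(strict majority of k i.i.d. votes is correct), averaged over the mixture
    return float(sum(w * binom.sf(k // 2, k, p) for w, p in zip(weights, probs)))

for k in [1, 3, 5, 11, 51]:
    print(k, round(majority_accuracy(k), 3))   # 0.54 at k=1, decreasing toward 0.4
```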
Results
The main results demonstrate that majority voting can produce a variety of voting curves, including constant, fast-decreasing, and infinitely reversing trends, depending on the underlying correctness distributions. The paper proves that the complete odd-budget voting curve is equivalent to a signed voting signature, which uniquely captures the latent correctness information while ignoring branch-symmetric nuisance components.
Implications
The findings have significant implications for the design and evaluation of machine learning models, particularly in scenarios where repeated access to a model's predictions is feasible. Understanding the complexities of voting behavior can inform better strategies for performance extraction in stochastic prediction systems, leading to improved accuracy without the need for model modifications.
Structural Correspondence and Universal Approximation in Diagonal plus Low-Rank Neural Networks
Theory
Efficient ML
- Purely low-rank neural networks suffer from 'orthogonal blindness,' limiting their expressivity for function approximation.
- The introduction of a minimal sparse diagonal component in DLoR structures enables universal approximation.
- The Structural Correspondence framework allows for effective decomposition of full-rank transformations into low-rank components.
- DLoR networks can achieve better parameter-to-expressivity scaling through multiplicative depth compared to additive width.
Summary
This paper addresses the limitations of low-rank neural networks, particularly their inability to perform well in multi-dimensional function approximations due to a phenomenon termed 'orthogonal blindness.' The authors introduce a novel framework called Structural Correspondence, which combines low-rank layers with a minimal sparse diagonal component, resulting in a Diagonal plus Low-Rank (DLoR) structure. This approach allows for universal approximation capabilities without the need for dense matrices or specific activation functions. The paper demonstrates that DLoR networks can reconstruct full-rank transformations through either additive or multiplicative decompositions, effectively expanding network width or depth. The authors provide theoretical proofs that validate the DLoR structure's ability to overcome the limitations of purely low-rank networks, thereby restoring the Universal Approximation Theorem for general activation functions. Their findings suggest that the parameter efficiency of neural networks can be significantly improved while maintaining expressivity, which is crucial for deploying large models in resource-constrained environments.
Methodology
The authors analyze the limitations of strictly low-rank neural networks and propose a DLoR structure that combines low-rank components with a sparse diagonal. They provide theoretical proofs to establish the structural correspondence between network width and depth, demonstrating how DLoR networks can achieve universal approximation without relying on dense priors or specific activation functions.
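The following is a minimal PyTorch sketch of what a diagonal-plus-low-rank layer could look like: a learnable diagonal added to a rank-r factorization, applied without materializing the dense matrix. The class name, initialization, and rank are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DLoRLinear(nn.Module):
    """Applies (D + U V) x, where D is diagonal and U V has rank `rank`."""
    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.diag = nn.Parameter(torch.ones(dim))              # sparse diagonal component
        self.U = nn.Parameter(torch.randn(dim, rank) * 0.02)   # low-rank factors
        self.V = nn.Parameter(torch.randn(rank, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Diagonal path plus low-rank path, never forming the dense dim x dim matrix.
        return x * self.diag + (x @ self.V.t()) @ self.U.t()

layer = DLoRLinear(dim=512, rank=8)
y = layer(torch.randn(4, 512))
print(y.shape)  # torch.Size([4, 512])
```

The parameter count is dim + 2·dim·rank, which is where the claimed efficiency over dense layers comes from when rank is small.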
Results
The paper proves that DLoR networks can reconstruct any full-rank transformation and fully restore the Universal Approximation Theorem for general activation functions. It also shows that the multiplicative depth of DLoR structures offers superior scaling in terms of parameter efficiency compared to additive width.
Implications
The findings have significant implications for the design of efficient neural network architectures, particularly in scenarios where computational resources are limited. The DLoR framework can enhance the performance of large models in practical applications, making them more accessible for deployment in various domains.
Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions
NLP
Large Language Models
Theory
- Joint training of steering factors and directions eliminates the need for post-hoc factor selection.
- Prompt-Only Steering Vectors (PrOSV) outperform traditional full-sequence steering vectors (FSSVs).
- PrOSV achieves a better balance between model utility and adversarial robustness.
- Optimal initialization sizes and learning rates are crucial for effective joint training.
Read more
Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions
Summary
This paper addresses the limitations of current steering vector (SV) approaches for controlling large language models (LLMs). Traditional fine-tuned SVs require careful selection of steering factors, which can be brittle and lead to quality degradation during inference. The authors propose a joint training method for steering factors and directions, eliminating the need for post-hoc factor selection. They introduce the Prompt-Only Steering Vector (PrOSV), which intervenes only on a few prompt tokens rather than the entire sequence, thus preserving model utility while achieving effective steering. The paper leverages neural network scaling theory to establish optimal initialization sizes and learning rates for the joint training process. Empirical results demonstrate that PrOSV outperforms traditional full-sequence SVs (FSSVs) in terms of steering effectiveness and robustness against adversarial attacks, while maintaining better overall model utility. This work provides a more principled framework for SV training, enhancing the practicality of SVs in real-world applications.
Methodology
The authors propose a joint training approach for steering factors and directions, utilizing neural network scaling theory to determine optimal initialization sizes and learning rates. They introduce the Prompt-Only Steering Vector (PrOSV), which focuses on intervening at the prompt stage rather than the entire sequence, thereby minimizing the impact on model generation quality.
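A hedged sketch of the prompt-only intervention idea follows: a forward hook adds factor·direction to the hidden states of the prompt positions only, leaving generated tokens untouched. The hook mechanics and the commented module path are assumptions about a HuggingFace-style decoder; the paper's joint training of factor and direction is not reproduced here.

```python
import torch

def make_prompt_only_hook(direction: torch.Tensor, factor: torch.Tensor, prompt_len: int):
    """Return a forward hook that steers only the first `prompt_len` positions."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] < prompt_len:      # incremental decoding step: leave untouched
            return output
        steered = hidden.clone()
        # Intervene on prompt tokens only; generated tokens are left unchanged.
        steered[:, :prompt_len, :] += factor * direction
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on a HuggingFace-style decoder (module path is an assumption):
# direction = torch.nn.Parameter(torch.randn(hidden_size) * 0.01)
# factor = torch.nn.Parameter(torch.tensor(1.0))
# handle = model.model.layers[12].register_forward_hook(
#     make_prompt_only_hook(direction, factor, prompt_len=inputs["input_ids"].shape[1]))
```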
Results
Empirical evaluations show that PrOSV significantly outperforms traditional FSSVs on the AXBENCH benchmark, achieving effective concept-based steering without sacrificing model utility. Additionally, PrOSV demonstrates improved robustness to concept suppression attacks and maintains effectiveness over extended contexts.
Implications
The findings suggest that steering vectors can be a more effective tool for controlling LLM behaviors, with potential applications in areas requiring reliable and interpretable model interventions. This work could lead to advancements in the deployment of LLMs in sensitive applications where model reliability and control are paramount.
Hitting Time Isomorphism for Multi-Stage Planning with Foundation Policies
Reinforcement Learning
Theory
Robotics
- Introduces a new framework for offline reinforcement learning based on hitting time observations.
- Proves the existence and uniqueness of a Hilbert-space displacement geometry for controlled Markov processes.
- Develops Isomorphic Embedding Learning (IEL) as a goal-agnostic foundation policy learning algorithm.
- Demonstrates that IEL improves upon existing methods in offline maze locomotion tasks.
Read more
Hitting Time Isomorphism for Multi-Stage Planning with Foundation Policies
Summary
This paper introduces a novel operator-theoretic framework for offline reinforcement learning that leverages hitting time observations to recover the directed temporal geometry of controlled Markov processes. Unlike existing methods that produce symmetric distances or violate the triangle inequality, the proposed framework learns a Hilbert-space displacement geometry where expected hitting times function as linear functionals of latent displacements. The authors demonstrate that this representation is uniquely identifiable under certain conditions and provide finite-sample guarantees that account for various sources of error. They present Isomorphic Embedding Learning (IEL), a goal-agnostic foundation policy learning algorithm that integrates hitting-time regression with a consistency objective, enabling robust multi-stage planning for long-horizon navigation tasks. Experimental results indicate that IEL outperforms state-of-the-art methods in learning foundation policies from offline maze locomotion data, showcasing its effectiveness in real-world applications.
Methodology
The authors utilize an operator-theoretic approach to define a Hilbert-space representation of hitting times, establishing a relationship between expected hitting times and latent displacements. They prove theoretical results regarding the uniqueness and existence of this representation and develop the IEL algorithm that incorporates hitting-time regression to ensure consistency in learned policies.
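To make the "hitting times as linear functionals of latent displacements" idea concrete, here is a small illustrative model (my own construction, not IEL itself): an encoder maps states to a latent space and a learned vector reads expected hitting times off the displacement between goal and state embeddings.

```python
import torch
import torch.nn as nn

class HittingTimeModel(nn.Module):
    def __init__(self, state_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.w = nn.Parameter(torch.zeros(latent_dim))  # linear functional on displacements

    def predicted_hitting_time(self, s: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # Expected hitting time as a linear functional of the latent displacement.
        return (self.encoder(g) - self.encoder(s)) @ self.w

model = HittingTimeModel(state_dim=4, latent_dim=16)
s, g = torch.randn(32, 4), torch.randn(32, 4)
observed = torch.rand(32) * 50                      # placeholder hitting-time observations
loss = ((model.predicted_hitting_time(s, g) - observed) ** 2).mean()
loss.backward()
```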
Results
The framework allows for the successful recovery of the directed temporal geometry of controlled Markov processes, with IEL achieving superior performance in offline maze locomotion tasks compared to existing methods. The theoretical guarantees provided ensure that the learned representations are robust and applicable to multi-stage planning.
Implications
The proposed framework and IEL algorithm have significant implications for reinforcement learning, particularly in scenarios where offline data is abundant. It enables the development of policies that can generalize across various tasks without explicit goal conditioning, making it suitable for real-world applications in robotics and navigation.
Physics-Informed Neural Networks with Learnable Loss Balancing and Transfer Learning
Theory
Efficient ML
- Introduction of a self-supervised mechanism for adaptive loss balancing in PINNs.
- Integration of transfer learning to enhance efficiency in scientific machine learning tasks.
- Validation on a challenging heat transfer problem with limited data, achieving significant performance improvements.
- Framework provides a general recipe for embedding physics adaptively into neural networks.
Read more
Physics-Informed Neural Networks with Learnable Loss Balancing and Transfer Learning
Summary
This paper presents a novel framework for physics-informed neural networks (PINNs) that incorporates a self-supervised mechanism for adaptive loss balancing between physics-based and data-driven supervision. Unlike traditional PINNs, which often rely on fixed or heuristic weighting, this approach introduces a learnable blending neuron that dynamically adjusts the contributions of physics residuals and data losses based on their uncertainties. This innovation enhances training stability and generalization capabilities without the need for manual tuning. Additionally, the framework integrates transfer learning to leverage representations from related domains, facilitating the adaptation to new physical systems with limited data. The proposed method is validated through the prediction of heat transfer in sodium-cooled miniature heat sinks, utilizing only 87 computational fluid dynamics (CFD) data points. The adaptive PINN demonstrates an error rate of less than 8%, outperforming conventional shallow neural networks, kernel methods, and physics-only baselines. This work offers a robust and reproducible approach for addressing data-scarce problems across various scientific fields, including fluid dynamics and material modeling.
Methodology
The methodology involves developing a self-supervised PINN framework with a learnable blending neuron for dynamic loss balancing, alongside a transfer learning strategy that reuses hidden-layer representations from related domains. The approach is validated through CFD simulations to generate a dataset for training and testing.
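One common way to realize such learnable balancing is uncertainty-based weighting with per-term log-variances; the sketch below shows that variant purely as an assumption, since the paper's "blending neuron" may be parameterized differently.

```python
import torch
import torch.nn as nn

class AdaptiveLossBalancer(nn.Module):
    """Learns how strongly to weight physics vs. data supervision."""
    def __init__(self):
        super().__init__()
        self.log_var_physics = nn.Parameter(torch.zeros(()))
        self.log_var_data = nn.Parameter(torch.zeros(()))

    def forward(self, physics_loss: torch.Tensor, data_loss: torch.Tensor) -> torch.Tensor:
        # Each term is weighted by its learned inverse variance; the additive log-variance
        # terms keep the learned variances from growing without bound.
        return (torch.exp(-self.log_var_physics) * physics_loss + self.log_var_physics
                + torch.exp(-self.log_var_data) * data_loss + self.log_var_data)

balancer = AdaptiveLossBalancer()
total = balancer(physics_loss=torch.tensor(0.8), data_loss=torch.tensor(0.1))
total.backward()
```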
Results
The adaptive PINN framework achieved an error rate of less than 8% in predicting heat transfer in sodium-cooled miniature heat sinks, significantly outperforming traditional models such as shallow neural networks and kernel methods.
Implications
The proposed framework has broad implications for scientific machine learning, particularly in fields where data is scarce. It provides a robust method for integrating physical principles into neural networks, which can be applied to various domains including fluid dynamics, materials science, and aerospace engineering.
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
NLP
Large Language Models
Optimization
- Introduction of DAPRO, a dynamic budget allocation framework for multi-turn LLM evaluation.
- Theoretical guarantees for budget constraints and coverage without requiring conditional independence assumptions.
- Demonstrated lower variance and improved coverage rates compared to static budget allocation methods.
- Applicability of the framework to various safety and utility evaluation tasks in LLMs.
Read more
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
Summary
This paper addresses the challenge of evaluating large language models (LLMs) in multi-turn conversational settings, particularly focusing on the computational expense associated with predicting rare events like jailbreaks or task completions. The authors introduce Dynamic Allocation via PRojected Optimization (DAPRO), a novel framework that dynamically allocates computational budgets for evaluating LLMs, overcoming the limitations of static budget allocation methods. DAPRO is theoretically validated to satisfy budget constraints while providing distribution-free, finite-sample coverage guarantees. The framework allows for adaptive censoring times during conversations, which enhances the efficiency of resource utilization and minimizes variance in performance estimates. The authors demonstrate the effectiveness of DAPRO through comprehensive experiments across various tasks, showing that it consistently achieves better coverage rates and lower variance compared to static methods, thereby improving the reliability of safety evaluations in LLMs.
Methodology
The authors developed DAPRO, which treats budget allocation as a sequential decision-making process. It dynamically adjusts censoring times during multi-turn interactions while ensuring compliance with a global budget constraint. The framework leverages a novel coverage bound that scales with the mean censoring weight, providing tighter guarantees than previous methods.
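Purely as an illustration of allocating censoring times under a global budget (not the paper's algorithm), the snippet below scales proposed per-conversation budgets so their total respects the constraint while keeping a minimum number of turns per conversation.

```python
import numpy as np

def project_onto_budget(proposed: np.ndarray, total_budget: int, min_turns: int = 1) -> np.ndarray:
    """Scale proposed censoring times so they sum to at most total_budget, above a floor."""
    proposed = np.maximum(proposed, min_turns)
    if proposed.sum() <= total_budget:
        return proposed
    scale = (total_budget - min_turns * len(proposed)) / (proposed - min_turns).sum()
    return min_turns + (proposed - min_turns) * max(scale, 0.0)

proposed = np.array([12.0, 3.0, 7.0, 20.0])   # e.g. proportional to estimated uncertainty
print(project_onto_budget(proposed, total_budget=30))
```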
Results
Experiments conducted using LLMs such as Llama 3.1 and Qwen 2.5 showed that DAPRO achieved coverage rates closer to the nominal level with significantly lower variance compared to static allocation baselines. This indicates that DAPRO is more effective in resource utilization and provides more reliable estimates of performance metrics like the jailbreak rate.
Implications
The findings suggest that DAPRO can be a valuable tool for evaluating the safety and utility of LLMs in real-world applications, particularly in scenarios where computational resources are limited. By improving the reliability of performance evaluations, DAPRO could enhance the deployment of LLMs in sensitive domains such as healthcare and education.
SAT: Sequential Agent Tuning for Coordinator-Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
Large Language Models
Reinforcement Learning
Efficient ML
- Introduction of Sequential Agent Tuning (SAT) for decentralized multi-LLM training.
- Theoretical guarantees for monotonic improvement and plug-and-play invariance.
- Empirical results show SAT-trained teams outperform larger models on benchmarks.
- Demonstration of effective agent upgrades without retraining the entire team.
Read more
SAT: Sequential Agent Tuning for Coordinator-Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
Summary
The paper introduces Sequential Agent Tuning (SAT), a novel training paradigm for multi-large language models (LLMs) that operates without a central coordinator. SAT aims to address the challenges of training multiple smaller LLMs, which can collectively outperform a single large model, by employing a factorized policy representation and block-coordinate updates. This decentralized approach allows for scalable training while ensuring stability and monotonic improvement during the training process. The authors develop a sequence-aware, on-policy advantage estimator that adapts to the evolving team policy and incorporates per-agent KL trust regions to manage occupancy drift. The theoretical framework guarantees monotonic improvement and plug-and-play invariance, allowing for the seamless upgrade of individual agents without retraining the entire team. Empirical results demonstrate that a team of three 4B agents trained with SAT outperforms a larger 32B model on benchmark tasks, and further improvements are observed when swapping in stronger agents. The paper provides a comprehensive theoretical analysis and empirical validation of the proposed method, contributing significantly to the field of multi-agent LLM systems.
Methodology
The methodology involves a coordinator-free training paradigm utilizing a factorized policy representation and block-coordinate updates. A sequence-aware, on-policy advantage estimator is employed, along with per-agent KL trust regions to mitigate occupancy drift. The theoretical framework is established through rigorous analysis of sequential agent training, ensuring monotonic improvement and plug-and-play capabilities.
Results
The empirical evaluation shows that a team of three 4B agents trained with SAT surpasses the performance of a single 32B model by an average of 3.9% on the AIME24/25 benchmarks. Additionally, swapping in two 8B agents resulted in a 10.4% improvement in the composite score, validating the plug-and-play theory.
Implications
The findings suggest that teams of smaller, efficient LLMs can be effectively trained to match or exceed the performance of larger models, making them more deployable in resource-limited scenarios. The plug-and-play capability allows for flexible upgrades and improvements in multi-agent systems, which could have significant applications in various domains requiring adaptive and scalable AI solutions.
Retain-Neutral Surrogates for Min-Max Unlearning
Optimization
Theory
Efficient ML
- Introduction of Retain-Orthogonal Surrogate Unlearning (ROSU) for effective min-max unlearning.
- ROSU constrains the inner perturbation to maximize forget gain while maintaining retain neutrality.
- Theoretical analysis shows improved performance under positive alignment of gradients.
- Empirical results demonstrate ROSU's advantages across multiple datasets, especially in high-coupling scenarios.
Read more
Retain-Neutral Surrogates for Min-Max Unlearning
Summary
This paper addresses the challenge of machine unlearning, which aims to remove the influence of specific training data while maintaining performance on the remaining data. The authors introduce Retain-Orthogonal Surrogate Unlearning (ROSU), a method that optimizes the inner perturbation used in min-max unlearning by constraining it to be retain-neutral. This approach maximizes the first-order forget gain while ensuring that the retain objective remains unchanged under a fixed perturbation budget. The paper establishes a theoretical framework that includes a curvature-controlled second-order bound on retain damage and demonstrates that ROSU can lead to significant improvements in scenarios where the forget and retain gradients are closely aligned. Empirical evaluations on various vision and language benchmarks reveal that ROSU outperforms standard methods, particularly in high-coupling regimes, while remaining competitive in other settings.
Methodology
The authors propose a constrained optimization approach to construct the inner perturbation for unlearning, ensuring that it is retain-neutral. This involves maximizing the linearized forget gain while keeping the retain objective stationary. The method yields a closed-form solution for the perturbation and includes a relaxed transported outer update with a deviation bound.
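The closed-form construction described above can be sketched as follows: remove the component of the forget gradient that lies along the retain gradient, then rescale to the perturbation budget. Variable names are mine, and the exact ROSU formulation may include additional terms.

```python
import torch

def retain_orthogonal_perturbation(g_forget: torch.Tensor,
                                   g_retain: torch.Tensor,
                                   budget: float,
                                   eps: float = 1e-12) -> torch.Tensor:
    # Component of the forget gradient orthogonal to the retain gradient.
    proj = (g_forget @ g_retain) / (g_retain @ g_retain + eps) * g_retain
    orth = g_forget - proj
    # Rescale to the fixed perturbation budget; within the retain-neutral subspace this
    # maximizes the linearized forget gain.
    return budget * orth / (orth.norm() + eps)

g_f, g_r = torch.randn(10_000), torch.randn(10_000)
delta = retain_orthogonal_perturbation(g_f, g_r, budget=0.05)
print(float(delta @ g_r))   # ~0: the retain objective is unchanged to first order
```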
Results
ROSU shows significant improvements in retain loss reduction compared to standard min-max perturbations across various benchmarks, including CIFAR-10, CIFAR-100, Tiny-ImageNet, TOFU, and WMDP. The method is particularly effective in high-coupling regimes where the forget and retain gradients are closely aligned.
Implications
The findings suggest that ROSU can enhance the efficiency of machine unlearning processes, making it a valuable tool for applications requiring data privacy and compliance, such as in federated learning and other machine learning scenarios where data removal is necessary.
OpenG2G: A Simulation Platform for AI Datacenter-Grid Runtime Coordination
Optimization
Theory
Efficient ML
- OpenG2G is an open-source library for simulating AI datacenter-grid runtime coordination.
- The platform supports various control strategies, allowing for standardized evaluation of their impacts on coordination outcomes.
- OpenG2G captures metrics from both AI datacenters and power systems, facilitating comprehensive analysis.
- The simulation reveals trade-offs between AI operational metrics and grid performance, aiding in design decision-making.
Read more
OpenG2G: A Simulation Platform for AI Datacenter-Grid Runtime Coordination
Summary
The paper introduces OpenG2G, a simulation platform designed to address the challenges posed by the increasing energy demands of AI datacenters on the electricity grid. As AI workloads grow, they create significant capacity and reliability issues for the grid, leading to delays in datacenter interconnections and hindering AI progress. OpenG2G allows users to simulate AI datacenter-grid runtime coordination by implementing and comparing various control paradigms, including classical, optimization, and learning-based controllers. The platform features a modular architecture that integrates a datacenter backend based on real measurements, a grid backend utilizing high-fidelity simulators, and a flexible controller interface. The authors demonstrate OpenG2G's capabilities through realistic scenarios, showcasing how different AI model and deployment choices impact datacenter flexibility and coordination outcomes. The platform aims to bridge the gap between the systems, machine learning, and grid engineering communities, providing a unified framework for exploring the coordination problem and informing design decisions for AI datacenter projects.
Methodology
The authors developed OpenG2G by creating a simulation loop that integrates three pluggable components: a datacenter backend driven by real AI service measurements, a grid backend based on traditional grid simulators, and a generic controller interface. This architecture allows for the implementation of various control paradigms and the exploration of different AI workloads and grid configurations.
Results
The results demonstrate that OpenG2G can effectively simulate the coordination between AI workloads and grid operations. The authors implemented and compared classical feedback controllers and learning-based controllers, revealing how different designs affect coordination outcomes. Additionally, they quantified the impact of AI model and deployment choices on the datacenter's feasible power range, highlighting the potential for runtime coordination.
Implications
OpenG2G has significant implications for the design and operation of AI datacenters, enabling stakeholders to understand and optimize the interaction between AI workloads and grid operations. It can inform actionable design decisions, potentially leading to more efficient energy use and reduced delays in datacenter buildouts.
Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)
Theory
- Weak-to-strong generalization can occur in linear logistic regression without requiring a mismatch in model capacity.
- The phenomenon is nearly inevitable under mild distributional assumptions, suggesting broad applicability.
- The study challenges existing theoretical beliefs about the necessity of model capacity differences for generalization.
- Empirical observations align with theoretical findings, reinforcing the robustness of the phenomenon.
Read more
Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)
Summary
This paper investigates the phenomenon of weak-to-strong generalization in machine learning, particularly in the context of linear logistic regression. The authors demonstrate that a strong student model can improve its performance beyond that of a weaker teacher model, even when the student model is not more expressive or capable than the teacher. This challenges the conventional belief that a mismatch in model capacity is necessary for such generalization to occur. The authors provide theoretical evidence that weak-to-strong generalization is almost inevitable under mild distributional assumptions, suggesting that this phenomenon is not limited to complex models but can also be observed in simpler settings. Their findings extend to various data distributions, indicating a broader applicability of weak-to-strong generalization in machine learning.
Methodology
The authors analyze the weak-to-strong generalization phenomenon through theoretical proofs in the context of logistic regression. They establish conditions under which this generalization occurs, focusing on the relationship between the student and teacher models and the data distribution. The methodology includes defining supervised finetuning processes and deriving results based on the properties of the models involved.
Results
The authors show that weak-to-strong generalization is almost inevitable when the student and teacher models are sampled from a natural ensemble, provided certain conditions are met. Specifically, they demonstrate that a student model can consistently improve its performance using feedback from a weaker teacher, even when the teacher's guidance is suboptimal.
Implications
The findings suggest that weaker models can effectively guide the training of stronger models, which could lead to more efficient training methodologies in machine learning. This has potential applications in scenarios where high-capacity models are impractical or costly to train, allowing for the use of simpler models to enhance performance.
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
Multimodal
Large Language Models
Efficient ML
- MACS introduces a novel Entropy-Weighted Load mechanism to address information heterogeneity in visual tokens.
- The Dynamic Modality-Adaptive Capacity mechanism allows real-time allocation of expert resources based on input composition.
- The proposed methods significantly improve inference efficiency in MoE MLLMs compared to existing approaches.
- The paper systematically analyzes the straggler effect in multimodal contexts, highlighting the unique challenges it poses.
Read more
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
Summary
The paper addresses the efficiency bottleneck in Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) during Expert Parallelism (EP) inference, primarily caused by the straggler effect. This issue is exacerbated in multimodal contexts due to two main challenges: Information Heterogeneity, where redundant visual tokens are treated equally to critical ones, and Modality Dynamics, where varying visual-to-text ratios lead to resource misallocation. To tackle these challenges, the authors propose MACS (Modality-Aware Capacity Scaling), a training-free inference framework that introduces an Entropy-Weighted Load mechanism to quantify the semantic value of visual tokens and a Dynamic Modality-Adaptive Capacity mechanism to allocate expert resources based on the real-time modal composition of inputs. Extensive experiments demonstrate that MACS significantly outperforms existing methods on various multimodal benchmarks, providing a robust solution for efficient MoE MLLM deployment in EP inference.
Methodology
The methodology involves a training-free inference framework that employs two main mechanisms: an Entropy-Weighted Load mechanism to assess the semantic value of visual tokens and a Dynamic Modality-Adaptive Capacity mechanism that adjusts expert resource allocation based on the input's modal composition. Additionally, a two-phase overflow handling mechanism is designed to minimize information loss during capacity overflows.
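As one possible reading of the Entropy-Weighted Load mechanism (an assumption, not the authors' implementation), the sketch below weights each token's contribution to expert load by the normalized entropy of its routing distribution instead of counting tokens uniformly.

```python
import torch

def entropy_weighted_load(router_probs: torch.Tensor, topk_idx: torch.Tensor, n_experts: int):
    """router_probs: [tokens, experts]; topk_idx: [tokens, k] selected experts per token."""
    entropy = -(router_probs * router_probs.clamp_min(1e-9).log()).sum(-1)
    weight = entropy / torch.log(torch.tensor(float(router_probs.shape[-1])))  # in [0, 1]
    load = torch.zeros(n_experts)
    for e in range(n_experts):
        mask = (topk_idx == e).any(dim=-1).float()
        load[e] = (mask * weight).sum()        # weighted, not raw, token counts
    return load

probs = torch.softmax(torch.randn(128, 8), dim=-1)
topk = probs.topk(2, dim=-1).indices
print(entropy_weighted_load(probs, topk, n_experts=8))
```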
Results
The results indicate that MACS significantly outperforms existing methods in terms of efficiency and performance on various multimodal benchmarks, effectively addressing the straggler effect and improving load balancing in MoE MLLMs.
Implications
The findings suggest that MACS can enhance the efficiency of multimodal large language models, making them more viable for real-time applications in diverse fields such as computer vision, natural language processing, and multimodal data analysis.
Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching
Reinforcement Learning
Theory
Optimization
- Q-MMR provides a dimension-free finite-sample guarantee for off-policy evaluation.
- The framework learns weights inductively through a moment matching objective.
- It establishes connections to existing methods like importance sampling and linear FQE.
- The paper offers new insights into the coverage concept in offline reinforcement learning.
Read more
Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching
Summary
The paper introduces Q-MMR, a novel theoretical framework for off-policy evaluation (OPE) in finite-horizon Markov Decision Processes (MDPs). The Q-MMR framework aims to learn a set of scalar weights for each data point, allowing the reweighted rewards to approximate the expected return under a target policy. This is achieved through an inductive top-down approach using a moment matching objective against a value-function discriminator class. A significant contribution of this work is the establishment of a data-dependent finite-sample guarantee that is dimension-free, meaning it does not rely on the statistical complexity of the function class. The authors also connect their method to existing approaches like importance sampling and linear Fitted-Q Evaluation (FQE), while providing new insights into the concept of coverage in offline reinforcement learning (RL). The paper addresses gaps in the current understanding of OPE, particularly the dimension dependence in existing analyses and the inconsistencies between general and linear analyses. The Q-MMR framework not only recovers key properties of linear analyses under weaker conditions but also generalizes recent findings in the linear setting to general function approximation.
Methodology
The Q-MMR framework employs a moment matching objective to learn a set of scalar weights for data points in a top-down manner. This approach allows for the reweighted rewards to approximate the expected return under the target policy, while ensuring that the resulting guarantees are dimension-free.
Results
The main result is a data-dependent finite-sample bound that does not depend on the complexity of the function class. This is a significant advancement over existing methods that typically exhibit dimension dependence. The Q-MMR framework also aligns with linear FQE in linear settings, demonstrating its robustness and applicability.
Implications
The findings of this paper have potential implications for improving off-policy evaluation methods in reinforcement learning, particularly in scenarios where function approximation is involved. The dimension-free guarantees could lead to more efficient and reliable evaluations of target policies based on historical data.
On Semantic Loss Fine-Tuning Approach for Preventing Model Collapse in Causal Reasoning
Theory
Large Language Models
Graph Learning
- Identification of catastrophic model collapse during fine-tuning of causal reasoning tasks.
- Introduction of a semantic loss function with graph-based logical constraints to prevent collapse.
- Demonstrated significant performance improvements over traditional fine-tuning methods.
- Comprehensive evaluation across 200,000+ samples validates the effectiveness of the proposed approach.
Read more
On Semantic Loss Fine-Tuning Approach for Preventing Model Collapse in Causal Reasoning
Summary
This paper addresses a critical issue in fine-tuning transformer models for causal reasoning tasks, specifically the phenomenon of catastrophic model collapse. The authors demonstrate that standard fine-tuning methods lead to models that produce trivial outputs, such as always predicting 'Yes' or 'No', regardless of the input structure. They identify that fine-tuning the Gemma 270M model on transitivity and d-separation tasks without incorporating a semantic loss function results in a 100% collapse rate. To counter this, the authors propose a novel semantic loss function that integrates graph-based logical constraints and employs dynamic lambda scheduling. Their approach significantly improves model performance, achieving 70.4% accuracy on transitivity tasks and 68.6% on d-separation tasks, representing a 42.7% improvement over collapsed baselines. Additionally, adversarial evaluations confirm that models utilizing semantic loss maintain stable predictions, while collapsed models perform poorly. The findings underscore the necessity of semantic loss for reliable causal reasoning in transformer models.
Methodology
The authors propose a semantic loss function that incorporates graph-based logical constraints and utilizes dynamic lambda scheduling to enhance the stability of predictions during fine-tuning. They conduct extensive experiments on the Gemma 270M model, focusing on transitivity and d-separation tasks, and benchmark their results against collapsed baselines using over 200,000 evaluation samples.
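A minimal, hypothetical sketch of the overall loss shape is given below: cross-entropy plus a constraint penalty whose weight follows a dynamic lambda schedule. The warm-up schedule and the constraint term are placeholders; the paper's graph-based logical constraints are not reproduced here.

```python
import torch

def dynamic_lambda(step: int, total_steps: int, lam_max: float = 1.0) -> float:
    """Ramp the constraint weight up over the first 10% of training (one possible schedule)."""
    return lam_max * min(1.0, step / max(1, int(0.1 * total_steps)))

def semantic_loss(ce_loss: torch.Tensor, constraint_violation: torch.Tensor,
                  step: int, total_steps: int) -> torch.Tensor:
    # Total loss = cross-entropy + lambda(step) * graph-based constraint penalty.
    return ce_loss + dynamic_lambda(step, total_steps) * constraint_violation

loss = semantic_loss(torch.tensor(1.2), torch.tensor(0.3), step=500, total_steps=10_000)
print(float(loss))
```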
Results
The proposed semantic loss function leads to a marked improvement in model accuracy, achieving 70.4% on transitivity tasks and 68.6% on d-separation tasks, a 42.7% improvement over collapsed baselines. Adversarial evaluations reveal that semantic models achieve 67-70% accuracy, while collapsed models perform inconsistently, with accuracy ranging from 43% to 71%. This highlights the effectiveness of the semantic loss in maintaining stable causal reasoning.
Implications
The findings suggest that incorporating semantic loss is crucial for developing robust AI systems capable of causal reasoning. This approach could be applied to enhance the performance of transformer models in various domains requiring causal inference, potentially leading to more reliable AI applications in fields such as healthcare, finance, and autonomous systems.
Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization
NLP
Large Language Models
Optimization
- Weight decay is shown to be essential for meeting Villani's differential growth conditions in Transformer loss landscapes.
- The paper introduces empirical diagnostics to visualize the effects of weight decay on model curvature.
- Explicit convergence rates for Langevin-based optimizers are derived, linking theoretical insights with practical training efficiency.
- The authors provide a reproducible experimental suite for evaluating functional-analytic properties in large Transformer models.
Read more
Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization
Summary
This paper investigates the role of weight decay as a regularizer in Transformer models, providing a rigorous functional-analytic characterization of the standard Transformer objective, which combines cross-entropy loss with L2 regularization. The authors prove that this regularized loss satisfies Villani's criteria for coercive energy functions, demonstrating that the loss is infinitely differentiable, grows quadratically, has Gaussian-integrable tails, and meets specific differential growth conditions. They derive explicit log-Sobolev and Poincaré constants that link the regularization strength to convergence guarantees for noisy stochastic gradient descent and PAC-Bayesian generalization bounds. To validate their theoretical findings, the authors introduce a scalable diagnostic tool and conduct experiments on the GPT-Neo-125M model across various datasets, confirming the predicted behaviors of the loss landscape. The results indicate that weight decay not only enhances empirical generalization but also establishes the necessary mathematical conditions for efficient optimization in deep learning, particularly in the context of Langevin dynamics.
Methodology
The authors utilize functional-analytic methods to characterize the Transformer loss landscape, proving its compliance with Villani's criteria. They introduce a diagnostic tool to estimate the divergence of a specific function related to the loss landscape and conduct empirical experiments on large-scale models to validate their theoretical claims.
Results
The experiments confirm the quadratic growth of the diagnostic function, spectral inflation of the Hessian, and exponential convergence behavior consistent with the log-Sobolev analysis. These findings validate the theoretical framework established in the paper and demonstrate the effectiveness of weight decay in improving generalization and optimization in Transformers.
Implications
The findings provide a deeper theoretical understanding of weight decay's role in Transformer optimization, potentially guiding future research on regularization techniques and optimization strategies in deep learning. The established mathematical conditions may also enhance the design of training algorithms for large language models.
Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning
Federated Learning
Optimization
NLP
- Introduction of AS-LoRA, an adaptive framework for LoRA in federated learning.
- Layer-wise and round-wise adaptivity enhances optimization by selecting components based on training dynamics.
- Curvature-aware scoring function accelerates convergence and biases solutions towards flatter minima.
- AS-LoRA shows significant performance improvements over existing methods under strict differential privacy budgets.
Read more
Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning
Summary
This paper addresses the challenges of applying Low-Rank Adaptation (LoRA) in differentially private federated learning (FL), particularly the aggregation errors caused by LoRA's multiplicative structure, which are exacerbated by differential privacy noise. The authors propose AS-LoRA, an adaptive framework that allows for layer-wise and round-wise selection of LoRA components, thus overcoming the limitations of fixed update schedules. AS-LoRA incorporates a curvature-aware scoring function to determine the importance of each component dynamically, ensuring that the optimization process is more responsive to the training dynamics. The theoretical contributions of AS-LoRA include eliminating the reconstruction-error floor associated with fixed schedules, accelerating convergence, and maintaining privacy without incurring additional costs. Empirical results demonstrate significant improvements in performance across various datasets, including GLUE and MNLI, compared to existing federated LoRA methods, while also achieving lower aggregation costs.
Methodology
The authors developed AS-LoRA, which features three main axes: (i) layer-wise freedom for independent component selection, (ii) round-wise adaptivity for updating selections based on training progress, and (iii) a curvature-aware scoring function derived from a second-order approximation of the loss. This approach allows for dynamic adjustments to the optimization strategy, improving the overall effectiveness of federated learning with LoRA.
Results
AS-LoRA outperformed federated LoRA baselines by up to +7.5 percentage points on GLUE and +12.5 percentage points on MNLI-mm, while also matching or exceeding the performance of SVD-based aggregation methods at significantly lower aggregation costs (33–180 times lower) and with negligible communication overhead.
Implications
The findings suggest that adaptive selection strategies can significantly enhance the performance of federated learning systems, particularly in privacy-sensitive applications. This work opens avenues for more efficient and effective fine-tuning of large models in decentralized environments, potentially benefiting various applications in natural language processing and beyond.
Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
Theory
Optimization
Efficient ML
- The Slingshot Mechanism is primarily driven by finite-precision arithmetic in cross-entropy loss computation.
- Numerical Feature Inflation (NFI) is identified as a feedback loop causing abnormal growth in parameters and logits.
- The paper provides theoretical insights into the dynamics of loss spikes and proposes practical interventions to stabilize training.
- NFI dynamics can lead to rapid parameter norm growth, challenging classical gradient-flow analyses.
Read more
Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
Summary
This paper investigates the phenomenon of periodic loss spikes in deep neural networks during long-term training, known as the Slingshot Mechanism. While previous research attributed these spikes to intrinsic optimization dynamics, the authors argue that they are primarily caused by limitations in floating-point arithmetic precision. As training progresses into a high-confidence stage, the difference between the logits for the correct class and incorrect classes can exceed the absorption-error threshold, leading to a situation where the gradient for the correct class is rounded to zero during backpropagation. This breaks the zero-sum constraint of gradients across classes, resulting in a systematic drift in the classifier layer's parameters. The authors introduce the concept of Numerical Feature Inflation (NFI), which describes a feedback loop between the global classifier mean and the global feature mean, causing exponential growth in both. This mechanism explains the rapid norm growth preceding a Slingshot spike and the subsequent reappearance of gradients, leading to loss spikes. The paper also highlights that NFI can cause abnormal parameter growth in practical tasks, even when spikes are not visible. The findings provide a new perspective on Slingshot dynamics as a numerical issue in finite-precision training and suggest practical interventions to mitigate these effects.
Methodology
The authors conducted theoretical analysis to demonstrate how finite-precision effects in loss computation lead to the Slingshot Mechanism. They proved the existence of NFI through mathematical modeling of the interactions between gradients and logits during backpropagation. Empirical validation was performed by comparing training dynamics under different precision settings.
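The finite-precision mechanism can be reproduced in a few lines: once the correct-class logit is far enough ahead, the half-precision softmax absorbs the residual probability mass, the correct-class gradient p - 1 rounds to exactly zero, and the per-class gradients stop summing to zero. This toy demonstration is mine, not the authors' experimental setup.

```python
import torch

def ce_logit_gradients(logits: torch.Tensor, target: int, dtype=torch.float16) -> torch.Tensor:
    # Compute the softmax in float32, then store it in low precision (as activations
    # typically are); the gradient p - one_hot(target) is formed from the rounded values.
    probs = torch.softmax(logits.float(), dim=-1).to(dtype)
    grad = probs.clone()
    grad[target] -= 1.0
    return grad

for margin in (5.0, 12.0, 20.0):
    grad = ce_logit_gradients(torch.tensor([margin, 0.0, 0.0]), target=0)
    print(f"margin={margin:5.1f}  correct-class grad={float(grad[0]):.2e}  "
          f"zero-sum residual={float(grad.sum()):.2e}")
```

At a margin of about 12 the correct-class probability rounds to 1.0 in float16, so its gradient vanishes while the incorrect classes still carry small positive gradients, breaking the zero-sum constraint exactly as described.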
Results
The study found that loss spikes are linked to the rounding of gradients due to finite precision, leading to a breakdown of the zero-sum constraint and subsequent exponential growth in parameter norms. The introduction of NFI was shown to explain both the rapid growth of the classifier mean and the feature mean, resulting in significant implications for training stability.
Implications
The findings suggest that addressing numerical precision issues in loss computation is vital for improving the stability of deep learning models, especially as low-precision training becomes more prevalent. The proposed interventions could help mitigate loss spikes and abnormal parameter growth, enhancing the reliability of neural network training.
E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology
Computer Vision
NLP
Theory
- E ≥ 0.5 guarantees zero dead experts, eliminating the need for auxiliary losses.
- Dead experts can revive, contradicting the traditional view of permanent expert death.
- Task complexity affects the critical threshold for E, indicating a need for adaptive strategies.
- Ecological structures in MoE are temperature-invariant, suggesting stable diagnostics.
Read more
E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology
Summary
This paper introduces a novel dimensionless control parameter, E = T · H/(O + B), which predicts the health of expert ecologies in Mixture-of-Experts (MoE) models. By combining four hyperparameters—routing temperature (T), routing entropy weight (H), oracle weight (O), and balance weight (B)—the parameter E serves as a single metric to assess whether MoE models will maintain a healthy distribution of active experts or succumb to 'dead experts' that receive no gradient signal. Through 12 controlled experiments across vision and language datasets, the study demonstrates that maintaining E ≥ 0.5 is sufficient to prevent dead experts without the need for auxiliary load-balancing losses. The findings reveal that dead experts can be revived, challenge the universality of ortho toxicity, and indicate that task complexity influences the critical E threshold. Additionally, the research shows that ecological structures are stable across a wide range of temperatures, and overfitting is independent of expert ecological health. The proposed E parameter is likened to the Reynolds number in fluid dynamics, providing a compact diagnostic tool for MoE training.
Methodology
The study employs a Hierarchical Mixture-of-Experts architecture and conducts 12 controlled experiments across various datasets, including CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103. The experiments assess the impact of the E parameter on expert performance and ecological health.
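The control parameter itself is a direct ratio of the four hyperparameters, as in the small helper below; the particular values and the 0.5 threshold check are illustrative.

```python
def moe_ecology_parameter(temperature: float, entropy_weight: float,
                          oracle_weight: float, balance_weight: float) -> float:
    """E = T * H / (O + B)."""
    return temperature * entropy_weight / (oracle_weight + balance_weight)

E = moe_ecology_parameter(temperature=1.0, entropy_weight=0.6,
                          oracle_weight=0.5, balance_weight=0.5)
print(E, "healthy" if E >= 0.5 else "risk of dead experts")
```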
Results
The results confirm that maintaining E ≥ 0.5 prevents dead experts across multiple datasets. The research also documents the revival of dead experts, identifies task complexity as a significant factor in determining the critical E threshold, and establishes that ecological structures remain stable across varying temperatures. Additionally, it shows that model overfitting can occur independently of expert ecological health.
Implications
The findings suggest that the E parameter can be used as a diagnostic tool for optimizing MoE training, potentially leading to more efficient and effective model designs. This could have significant implications for large-scale deep learning applications in both vision and language processing.
Matrix-Decoupled Concentration for Autoregressive Sequences: Dimension-Free Guarantees for Sparse Long-Context Rewards
Large Language Models
Reinforcement Learning
Theory
- Introduction of the Matrix-Decoupled Concentration (MDC) framework to address concentration bounds for autoregressive sequences.
- Resolution of scalar collapse and causal structure mismatch issues in existing concentration frameworks.
- Establishment of a McDiarmid-type inequality that guarantees dimension-free O(1) variance proxies for sparse rewards.
- Demonstration of optimal transport constants recovery for homogeneous Markov chains and order-optimal bounds for causal trees.
Read more
Matrix-Decoupled Concentration for Autoregressive Sequences: Dimension-Free Guarantees for Sparse Long-Context Rewards
Summary
This paper addresses the challenges of establishing concentration bounds for autoregressive sequences in Large Language Models (LLMs), particularly in the context of sparse long-context rewards. The author identifies two main bottlenecks in existing frameworks: the issue of scalar collapse, which inflates variance bounds to O(N), and the mismatch of causal structures in autoregressive settings. To overcome these challenges, the paper introduces the Matrix-Decoupled Concentration (MDC) framework, which utilizes a causal interdependence matrix to maintain the structural integrity of dependencies and sensitivities. This framework allows for the derivation of a McDiarmid-type inequality that provides dimension-free O(1) variance proxies, thereby ensuring stability in long-context reasoning without the inflation seen in classical methods. The MDC framework is shown to recover optimal constants for Markov chains and establish order-optimal bounds for causal trees, making it a significant advancement in the theoretical foundation of sequence-level evaluations in autoregressive models.
Methodology
The paper develops the MDC framework by encoding the conditional dependency structure into a transition variance matrix. It establishes a McDiarmid-type inequality that uses matrix-vector multiplication to derive variance proxies, thus preserving the coordinate-wise sparsity of rewards and preventing scalar collapse. The methodology involves rigorous mathematical formulations and proofs to validate the proposed framework.
Results
The MDC framework successfully provides dimension-free O(1) variance proxies for sparse long-context rewards, overcoming the O(N) inflation seen in classical methods. It also recovers optimal constants for Markov chains and establishes order-optimal bounds for causal trees, demonstrating its effectiveness in autoregressive settings.
Implications
The findings have significant implications for the theoretical understanding and practical applications of autoregressive models in machine learning, particularly in enhancing the stability and reliability of sequence-level evaluations in LLMs and other non-Markovian sequential generation paradigms.
SMolLM: Small Language Models Learn Small Molecular Grammar
NLP
Generative Models
Interpretability
- SMolLM achieves 95% validity in generating SMILES with only 53K parameters.
- The model outperforms a larger GPT model (527K parameters) in terms of validity.
- A weight-shared transformer architecture allows for mechanistic interpretability of the generation process.
- The model resolves SMILES constraints in a structured manner across multiple passes.
Read more
SMolLM: Small Language Models Learn Small Molecular Grammar
Summary
The paper introduces SMolLM, a small language model with 53K parameters designed to generate valid SMILES strings for molecular design. Despite its compact size, SMolLM achieves a 95% validity rate on the ZINC-250K drug-like molecule benchmark, outperforming a standard GPT model with ten times more parameters. The authors explore how SMolLM learns chemical grammar through a weight-shared transformer architecture, which allows the same block to be reused across multiple passes. This mechanism enables the model to resolve SMILES constraints in a systematic order: brackets first, followed by rings, and finally valence. The authors employ error classification, linear probing, and sparse autoencoders to analyze the model's performance and interpretability. The findings suggest that the model's ability to generate valid molecules is rooted in its iterative computation process, making it a valuable tool for studying formal languages and molecular generation.
Methodology
The authors trained SMolLM, a weight-shared transformer model, on SMILES strings from the ZINC-250K dataset. They conducted a series of analyses, including error classification, linear probing, and causal ablation, to understand how the model learns and resolves chemical grammar constraints during the generation process.
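A minimal sketch of the weight-sharing idea follows: a single transformer block is applied for several passes over the sequence before the language-model head. Dimensions, pass count, and the use of a generic PyTorch encoder layer with a causal mask are my assumptions, not the paper's 53K-parameter configuration.

```python
import torch
import torch.nn as nn

class WeightSharedDecoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 64, n_passes: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=128,
                                                batch_first=True)
        self.n_passes = n_passes
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        seq_len = tokens.shape[1]
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.embed(tokens)
        for _ in range(self.n_passes):        # the same block is reused on every pass
            h = self.block(h, src_mask=causal)
        return self.lm_head(h)

model = WeightSharedDecoder(vocab_size=64)
logits = model(torch.randint(0, 64, (2, 16)))
print(logits.shape)   # torch.Size([2, 16, 64])
```

Reusing one block across passes is what makes the pass-by-pass analysis possible: each pass can be probed for which SMILES constraints have been resolved so far.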
Results
SMolLM demonstrated a 95% validity rate in generating SMILES strings, significantly outperforming a larger GPT model (87.6% validity) while using an order of magnitude fewer parameters. The model's performance improved systematically across passes, with brackets resolved by the second pass, rings by the fourth, and valence by the eighth.
Implications
The results suggest that smaller, weight-shared models can effectively learn complex grammar in molecular generation tasks, potentially leading to more efficient and interpretable models in drug discovery and materials science. This approach may also be applicable to other formal languages with hierarchical constraints.
MinMax Recurrent Neural Cascades
Theory
Efficient ML
NLP
- MinMax RNCs can express all regular languages and are capable of parallel evaluation with logarithmic complexity.
- The architecture maintains bounded states and outputs, preventing issues of vanishing or exploding gradients.
- Empirical results show superior performance on synthetic tasks compared to existing RNN architectures.
- A MinMax RNC with 127M parameters achieves competitive performance in next-token prediction tasks.
Read more
MinMax Recurrent Neural Cascades
Summary
This paper introduces MinMax Recurrent Neural Cascades (RNCs), a novel architecture that leverages MinMax algebra to achieve recurrence in neural networks. The authors demonstrate that this approach provides significant advantages over traditional recurrent neural networks (RNNs), particularly in terms of expressivity, stability, and gradient behavior. MinMax RNCs can represent all regular languages and can be evaluated in parallel with runtime logarithmic in the input length. The architecture maintains bounded states and activations, ensuring that gradients neither vanish nor explode, a common issue in conventional RNNs. Empirical evaluations show that MinMax RNCs outperform state-of-the-art models on synthetic tasks and demonstrate competitive performance in next-token prediction, indicating their potential for real-world applications.
Methodology
The authors propose a MinMax recurrence relation that replaces standard addition with max and multiplication with min. This formulation allows for the construction of recurrent networks that exhibit desirable theoretical properties. The paper includes proofs of expressivity, complexity, and stability, alongside empirical evaluations on various tasks.
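To make the substitution concrete, the sketch below implements one MinMax recurrence step in which the usual matrix-vector product becomes a max over element-wise minima; how inputs are injected and how the cascade is stacked are assumptions on my part.

```python
import torch

def minmax_matvec(A: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """(A (x) h)[i] = max_j min(A[i, j], h[j]): multiplication -> min, addition -> max."""
    return torch.minimum(A, h.unsqueeze(0)).max(dim=-1).values

def minmax_step(A: torch.Tensor, B: torch.Tensor, h: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # Combine recurrent and input contributions with the same max-of-min algebra.
    return torch.maximum(minmax_matvec(A, h), minmax_matvec(B, x))

d, k = 4, 3
A, B = torch.rand(d, d), torch.rand(d, k)
h, x = torch.rand(d), torch.rand(k)
h_next = minmax_step(A, B, h, x)
print(h_next)   # states stay within [0, 1]
```

Because min and max never leave the range of their arguments, the state stays bounded by construction, which is the property the stability and gradient claims rest on.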
Results
MinMax RNCs were shown to perfectly solve star-free tasks and generalize effectively up to a sequence length of 1 million. They outperformed several state-of-the-art models, including Mamba and LSTMs, in synthetic evaluations. Additionally, a trained MinMax RNC for next-token prediction on OpenWebText achieved a loss comparable to GPT-2, demonstrating its effectiveness in practical applications.
Implications
The findings suggest that MinMax RNCs could be a powerful alternative to traditional RNNs and Transformers, particularly in applications requiring sequence processing without the common gradient issues. Their ability to maintain expressivity while being computationally efficient opens avenues for further research and application in various domains.
Can Attribution Predict Risk? From Multi-View Attribution to Planning Risk Signals in End-to-End Autonomous Driving
Computer Vision
Robotics
Interpretability
- Introduces a hierarchical attribution framework for analyzing decision-level risks in autonomous driving.
- Develops a coarse-to-fine attribution method that integrates multi-view camera inputs for trajectory planning.
- Derives three statistics to quantify reliance on visual evidence, serving as predictive signals for planning risk.
- Demonstrates strong correlation between attribution statistics and planning risks through extensive experiments.
Read more
Can Attribution Predict Risk? From Multi-View Attribution to Planning Risk Signals in End-to-End Autonomous Driving
Summary
This paper addresses the challenge of understanding decision-making in end-to-end autonomous driving systems, particularly focusing on the risks associated with trajectory planning. Traditional methods for risk assessment often rely on auxiliary models or textual explanations, which do not effectively link visual evidence to planning decisions. The authors propose a hierarchical attribution framework that utilizes a coarse-to-fine region attribution strategy to analyze the influence of multi-view camera inputs on trajectory outputs. By employing L2 consistency with the original trajectory as the objective, the framework generates attribution maps that reveal critical visual evidence behind planning decisions. The authors derive three key attribution statistics—attribution entropy, within-camera spatial variance, and cross-camera Gini coefficient—that serve as predictive signals for planning risk. These statistics are evaluated through experiments on the nuScenes dataset using three autonomous driving models: BridgeAD, UniAD, and GenAD. The results indicate a significant correlation between the derived statistics and planning risks, demonstrating the potential of attribution as a tool for risk prediction in autonomous driving.
Methodology
The authors propose a hierarchical attribution framework that utilizes a coarse-to-fine region attribution strategy. This method treats the six camera views as a unified space and employs L2 trajectory consistency as the objective for attribution. The framework generates attribution maps that connect visual evidence to trajectory outputs, and three statistics are derived to quantify reliance on visual evidence for risk prediction.
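The three statistics can be computed directly from per-camera attribution maps, as in the hedged sketch below; the normalization and the joint-distribution treatment of pixels are my own choices rather than the paper's exact definitions.

```python
import numpy as np

def attribution_statistics(maps: np.ndarray):
    """maps: [n_cameras, H, W] non-negative attribution values."""
    flat = maps.reshape(len(maps), -1)
    p = flat / (flat.sum() + 1e-12)                       # joint distribution over all pixels
    entropy = float(-(p * np.log(p + 1e-12)).sum())       # attribution entropy

    within_var = float(np.mean([m.var() for m in flat]))  # within-camera spatial variance

    per_cam = np.sort(flat.sum(axis=1))                   # cross-camera mass, ascending
    n = len(per_cam)
    gini = float((2 * np.arange(1, n + 1) - n - 1) @ per_cam / (n * per_cam.sum() + 1e-12))
    return entropy, within_var, gini

maps = np.abs(np.random.randn(6, 28, 50))   # six camera views at a toy resolution
print(attribution_statistics(maps))
```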
Results
Experiments conducted on the nuScenes dataset reveal that the proposed attribution statistics correlate significantly with planning risks, achieving Spearman correlations of 0.30 ± 0.07 with trajectory error and an AUROC of 0.77 ± 0.04 for collision detection. The method demonstrates generalization to held-out scenes with negligible performance degradation.
Implications
The findings suggest that attribution can be effectively utilized not only for post-hoc explanations but also as a predictive tool for assessing risks in autonomous driving systems. This could enhance the interpretability and safety of autonomous vehicles, leading to more reliable deployment in real-world scenarios.
In-Context Black-Box Optimization with Unreliable Feedback
Optimization
- Introduction of FICBO, a framework for optimizing black-box functions with unreliable feedback.
- Utilization of a structured feedback prior to model the reliability of auxiliary feedback sources.
- Empirical results show FICBO's superiority over classical and amortized optimization baselines.
- The model's ability to adaptively infer feedback reliability enhances query selection.
Read more
In-Context Black-Box Optimization with Unreliable Feedback
Summary
This paper introduces Feedback-informed In-Context Black-Box Optimization (FICBO), a novel framework designed to enhance black-box optimization in scenarios where auxiliary feedback is unreliable. Traditional black-box optimization methods often struggle with biased or misleading feedback from experts or heuristics, which can hinder performance. FICBO addresses this challenge by conditioning a pretrained optimizer on both the historical optimization data and auxiliary feedback during the optimization process. The authors propose a structured feedback prior that models the varying reliability of feedback sources, allowing the optimizer to adaptively infer the reliability of feedback at test time. This approach leverages a transformer model that learns to improve query selection based on the observed objective values and auxiliary signals. The empirical evaluation demonstrates that FICBO outperforms existing baselines on both synthetic and real-world tasks, showcasing its robustness to unreliable feedback and its ability to exploit informative signals effectively. Additionally, the study provides insights into the model's interpretability and decision-making processes, highlighting how it perceives and utilizes test-time feedback sources.
Methodology
FICBO employs a feedback-aware amortized Bayesian optimization framework, where a transformer model is pretrained on a distribution of tasks with varying feedback reliability. At test time, the model conditions its predictions on both historical optimization data and auxiliary feedback, allowing it to estimate the reliability of different feedback sources dynamically. The training involves optimizing a reward function based on the performance across sampled tasks, enabling the model to learn effective query selection strategies.
Results
The empirical evaluation of FICBO on synthetic and real-world tasks demonstrates significant improvements in optimization performance compared to classical Bayesian optimization methods and other state-of-the-art amortized optimizers. The model effectively exploits informative feedback while maintaining robustness against weak or misleading sources, leading to better query selection and overall optimization outcomes.
Implications
FICBO has potential applications in various fields requiring optimization under uncertainty, such as engineering design, scientific experimentation, and machine learning hyperparameter tuning. Its ability to handle unreliable feedback can significantly enhance the efficiency and effectiveness of optimization processes in complex environments.
Towards Metric-Faithful Neural Graph Matching
Graph Learning
Theory
- Introduces a geometric framework linking encoder geometry to GED estimation quality.
- Demonstrates that bi-Lipschitz encoders improve the stability and accuracy of GED surrogates.
- Establishes a theoretical basis for the influence of encoder distortion on downstream estimation performance.
- Implements FSW-GNN as a drop-in replacement in neural GED architectures, resulting in significant performance improvements.
Read more
Towards Metric-Faithful Neural Graph Matching
Summary
This paper addresses the challenge of estimating Graph Edit Distance (GED), a crucial metric for structural graph similarity that is NP-hard to compute. The authors propose a theoretical framework that connects the geometry of graph encoders used in neural graph matching to the quality of GED estimation. They categorize neural GED estimators into two classes: graph similarity predictors and matching-based methods, and demonstrate that the encoder's geometric properties significantly influence the performance of these estimators. Specifically, they show that bi-Lipschitz encoders can yield more stable and accurate GED surrogates. The authors implement their findings using FSW-GNN, a bi-Lipschitz encoder, as a replacement in existing neural GED architectures, leading to improved prediction and ranking metrics across various datasets. Their analysis indicates that better representation geometry enhances the conditioning of the surrogate quantities used for GED estimation, thus establishing a strong link between encoder design and estimation fidelity.
Methodology
The authors develop a theoretical framework to analyze the relationship between encoder geometry and GED estimation quality. They categorize neural GED methods into graph similarity predictors and matching-based estimators, focusing on the impact of bi-Lipschitz geometry. They empirically validate their theoretical claims by integrating the FSW-GNN encoder into existing architectures and evaluating performance across benchmark datasets.
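One simple way to probe the bi-Lipschitz behavior the theory concerns is to compare pairwise embedding distances against a reference graph distance and track the spread of their ratios, as in the sketch below; the reference distances, encoder outputs, and thresholds are placeholders, not the paper's analysis.

```python
import itertools
import numpy as np

def empirical_bilipschitz_constants(graph_dists, embeddings, eps=1e-9):
    """Estimate empirical lower/upper Lipschitz constants of an encoder from the
    ratios of embedding distance to a reference graph distance (e.g., exact GED
    on small graphs). Illustrative only."""
    ratios = []
    n = len(embeddings)
    for i, j in itertools.combinations(range(n), 2):
        d_graph = graph_dists[i, j]
        if d_graph < eps:
            continue
        d_embed = np.linalg.norm(embeddings[i] - embeddings[j])
        ratios.append(d_embed / d_graph)
    ratios = np.array(ratios)
    # A bi-Lipschitz encoder keeps these ratios within [c, C] for some 0 < c <= C.
    return ratios.min(), ratios.max()

# Random placeholders standing in for GED values and encoder embeddings.
rng = np.random.default_rng(0)
ged = rng.random((10, 10))
ged = (ged + ged.T) / 2
np.fill_diagonal(ged, 0.0)
emb = rng.standard_normal((10, 32))
c_low, c_high = empirical_bilipschitz_constants(ged, emb)
```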
Results
The integration of bi-Lipschitz encoders into neural GED architectures led to significant improvements in GED prediction and ranking metrics across various model families. The theoretical framework provided insights into how encoder geometry affects the quality of GED surrogates, with empirical results supporting the hypothesis that improved representation geometry enhances estimation fidelity.
Implications
The findings suggest that careful design of graph encoders can lead to more accurate and reliable graph matching systems, which are essential in applications such as molecular retrieval, program analysis, and structured search. The work also opens avenues for further research into geometry-aware neural architectures in graph learning.
Differentiable Parameter Optimization for DAEs with State-Dependent Events
Optimization
Theory
- Formulates parameter learning for hybrid DAEs as a constrained least-squares problem.
- Develops two gradient-computation strategies: automatic differentiation through simulation and explicit discrete adjoint method.
- Compares the two methods regarding their handling of gradients, event times, and implementation complexity.
- Both methods provide gradients that are local to the event path selected by the forward simulation.
Read more
Differentiable Parameter Optimization for DAEs with State-Dependent Events
Summary
This paper addresses the challenges of differentiable parameter optimization in differential-algebraic equations (DAEs) that include state-dependent events. DAEs are prevalent in various engineering and physical systems where continuous dynamics are interrupted by discrete events, such as switches and impacts. The authors propose a formulation of the parameter learning problem as a constrained least-squares problem that incorporates DAE dynamics, algebraic constraints, guard equations, and reset maps. They develop two gradient-computation strategies: an automatic-differentiation-through-simulation method and an explicit discrete-adjoint method. The first method utilizes automatic differentiation to differentiate through the algebraic solve and manage events via segmented integration. The second method constructs a discrete residual system that treats residuals as equality constraints, allowing for the computation of gradients through Lagrange multipliers. The paper compares these methods in terms of gradient interpretation, event-time handling, implementation complexity, and local validity. Both approaches yield gradients for the event path determined by the forward simulation, under specific assumptions about event crossings and parameter perturbations.
Methodology
The authors propose two main methodologies for gradient computation: 1) An automatic-differentiation-through-simulation method that differentiates through the algebraic solve using the implicit function theorem and manages events with segmented integration. 2) An explicit discrete-adjoint method that constructs an event-split residual system, treating residuals as equality constraints and using Lagrange multipliers to compute gradients.
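The toy example below illustrates the first strategy on an ODE with a single guard: the event time is located by differentiable linear interpolation inside the crossing step, so automatic differentiation returns its sensitivity to the parameter, which can then drive a least-squares fit. It omits algebraic constraints and reset maps, and all names and tolerances are assumptions.

```python
import torch

def event_time(p, x0=1.0, guard=0.5, dt=0.01, t_max=5.0):
    """Forward-Euler simulation of dx/dt = -p*x until the guard x = guard is
    crossed. The crossing time is found by differentiable linear interpolation
    inside the crossing step, so autograd yields dt*/dp. A toy ODE sketch of
    'differentiate through the simulation', not the paper's DAE setting."""
    x = torch.tensor(x0)
    t = 0.0
    while t < t_max:
        x_new = x + dt * (-p * x)              # one explicit step; graph retained
        if float(x_new) <= guard:              # transversal guard crossing detected
            theta = (x - guard) / (x - x_new)  # fraction of the step, differentiable
            return t + theta * dt              # event time t* carries gradients w.r.t. p
        x, t = x_new, t + dt
    raise RuntimeError("guard never crossed before t_max")

# Least-squares fit of p so that the event occurs at an observed time of 1.0.
p = torch.tensor(0.5, requires_grad=True)
loss = (event_time(p) - 1.0) ** 2
loss.backward()
print(p.grad)
```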
Results
Both gradient-computation strategies successfully compute gradients for parameter optimization in DAEs with state-dependent events. The paper demonstrates that both methods are valid under fixed event ordering and transversal guard crossings, providing insights into their respective advantages and limitations.
Implications
The findings have significant implications for parameter estimation and optimization in hybrid systems, particularly in engineering applications where DAEs are common. The methodologies can enhance the accuracy and efficiency of simulations in systems with complex dynamics and event-driven behaviors.
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
Reinforcement Learning
Large Language Models
Optimization
- AOPD addresses high-variance updates and vanishing gradients in standard on-policy distillation.
- The framework utilizes localized divergence minimization to improve learning in non-positive advantage regions.
- AOPD shows significant performance improvements on mathematical reasoning benchmarks compared to traditional methods.
- The method enhances capability retention during sequential tool-use adaptation.
Read more
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
Summary
This paper introduces Asymmetric On-Policy Distillation (AOPD), a framework designed to make on-policy distillation in reinforcement learning more effective by addressing the limitations of standard advantage-weighted policy gradients. The authors identify three key weaknesses in traditional on-policy distillation: high-variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks caused by insufficient corrective signals. AOPD mitigates these issues by replacing ineffective negative reinforcement with localized divergence minimization toward the teacher in non-positive-advantage regions, while retaining standard reinforcement of positive-advantage tokens. The framework enables better exploitation of successful trajectories and provides stronger corrective signals when the model encounters optimization bottlenecks. The authors evaluate AOPD on mathematical reasoning benchmarks, showing that it consistently outperforms standard on-policy distillation, particularly under weak initialization, and that it maintains higher policy entropy and better capability retention during sequential tool-use adaptation.
Methodology
The authors propose AOPD, which combines principles from reinforcement learning and supervised learning. It employs two learning modes: exploitation, where the model reinforces its own successful trajectories, and imitation, where it receives guidance from the teacher model on tokens with non-positive advantage. This lets the student adaptively switch between learning from its own trajectories and receiving direct feedback from the teacher, strengthening the overall learning signal.
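A hedged sketch of what such a per-token switching loss could look like is given below: positive-advantage tokens receive an advantage-weighted policy-gradient term, while non-positive-advantage tokens receive a forward KL toward the teacher distribution. The exact divergence, weighting, and masking used by the authors may differ.

```python
import torch
import torch.nn.functional as F

def aopd_token_loss(student_logits, teacher_logits, actions, advantages):
    """Per-token loss that switches modes: advantage-weighted policy gradient on
    positive-advantage tokens (exploitation), forward KL toward the teacher on
    non-positive-advantage tokens (imitation). A sketch, not the exact objective."""
    logp_student = F.log_softmax(student_logits, dim=-1)
    logp_teacher = F.log_softmax(teacher_logits.detach(), dim=-1)
    logp_action = logp_student.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    pg_loss = -(advantages * logp_action)                                   # exploitation
    kl = (logp_teacher.exp() * (logp_teacher - logp_student)).sum(-1)       # imitation

    positive = (advantages > 0).float()
    return (positive * pg_loss + (1.0 - positive) * kl).mean()

# Example: a batch of 4 tokens over a vocabulary of 10.
student = torch.randn(4, 10, requires_grad=True)
loss = aopd_token_loss(student, torch.randn(4, 10),
                       torch.randint(0, 10, (4,)), torch.randn(4))
loss.backward()
```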
Results
Experiments reveal that AOPD consistently outperforms standard on-policy distillation methods, achieving average performance gains of 4.09 and 8.34 under strong and weak initialization conditions, respectively. AOPD also demonstrates improved capability retention during sequential tool-use adaptation and maintains higher policy entropy throughout training.
Implications
The findings suggest that AOPD could be applied to enhance the training of smaller models in various applications, particularly in scenarios requiring robust performance in mathematical reasoning and other complex tasks. The framework's ability to balance exploitation and imitation may also lead to advancements in other areas of reinforcement learning and knowledge distillation.
Graph Normalization: Fast Binarizing Dynamics for Differentiable MWIS
Graph Learning
Optimization
Theory
- Graph Normalization (GN) provides a differentiable solution to the NP-hard MWIS problem.
- GN guarantees convergence to a valid binary output without the need for external annealing schedules.
- The methodology utilizes a quasi-Newton descent approach through Majorization-Minimization.
- GN demonstrates superior performance on large graphs, achieving near-optimal solutions rapidly.
Read more
Graph Normalization: Fast Binarizing Dynamics for Differentiable MWIS
Summary
This paper introduces Graph Normalization (GN), a novel dynamical system designed to provide a differentiable approximation for the NP-hard Maximum Weight Independent Set (MWIS) problem. MWIS is crucial in various combinatorial optimization tasks, including optimal assignment and scheduling. GN guarantees convergence to a binary indicator of a Maximum Independent Set (MIS), unlike traditional methods such as Belief Propagation. The methodology employs a fast quasi-Newton descent through Majorization-Minimization, improving the MWIS relaxed primal objective systematically. The authors establish a connection between GN and the Replicator Dynamics of a nonlinear evolutionary game, demonstrating that the average fitness aligns with the MWIS primal objective and increases over time. GN also extends the Motzkin-Straus theorem, linking MISes to local minima of a quadratic form. The paper showcases GN's effectiveness as a binarization engine for the Bregman-Sinkhorn relaxed MWIS solver, achieving solutions within a 1% gap of the best-known results on large graphs (up to 1M edges) in seconds. The framework opens new avenues for deep learning architectures that require differentiable, hard decisions under constraints, with applications in structured sparse attention, dynamic network pruning, and Mixture-of-Experts.
Methodology
The authors developed Graph Normalization (GN) as a dynamical system that employs a quasi-Newton descent method through Majorization-Minimization to improve the MWIS relaxed primal objective. GN is shown to converge to a binary indicator of a Maximum Independent Set (MIS) and is linked to the Replicator Dynamics of a nonlinear evolutionary game.
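For intuition about the underlying dynamics, the snippet below runs the classical discrete replicator dynamics on the Motzkin-Straus quadratic form of the complement graph (with Bomze's +1/2 diagonal regularization), whose strict local maximizers are indicators of maximal independent sets. This is a textbook baseline related to GN, not GN's quasi-Newton update, and the unweighted setting is a simplification.

```python
import numpy as np

def replicator_mis(adj, num_iters=2000, seed=0):
    """Discrete replicator dynamics on the regularized Motzkin-Straus quadratic
    form of the complement graph; the support of the limit point approximates
    the indicator of a maximal independent set. A classical baseline, not GN."""
    n = adj.shape[0]
    comp = 1.0 - adj - np.eye(n)               # adjacency of the complement graph
    B = comp + 0.5 * np.eye(n)                 # Bomze's +1/2 diagonal regularization
    rng = np.random.default_rng(seed)
    x = rng.random(n)
    x /= x.sum()                               # start in the interior of the simplex
    for _ in range(num_iters):
        fitness = B @ x
        x = x * fitness / (x @ fitness)        # multiplicative update stays on the simplex
    return x > 1e-4                            # support ~ independent-set indicator

# Path graph on 5 nodes: the dynamics settle on a maximal independent set.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1
print(np.flatnonzero(replicator_mis(A)))       # e.g. {0, 2, 4}
```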
Results
GN was tested on real-world benchmarks with graphs containing up to 1 million edges, achieving solutions within a 1% gap of the best-known results in a matter of seconds on standard laptop CPUs. This performance highlights GN's efficiency and scalability compared to existing methods.
Implications
The introduction of GN opens new possibilities for integrating combinatorial optimization tasks into deep learning frameworks, enabling end-to-end learning of constrained tasks across various domains such as computer vision, computational biology, and resource allocation.
Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management
NLP
Large Language Models
Efficient ML
- PBKV predicts future agent invocations to optimize KV-cache management in dynamic workflows.
- The system employs hierarchical eviction and conservative prefetching to enhance cache reuse.
- PBKV demonstrates significant performance improvements over existing cache management techniques.
- The predictor's design is robust to errors, ensuring stable performance across varying conditions.
Read more
Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management
Summary
This paper addresses the challenge of managing Key/Value (KV) caches in dynamic workflows involving large language model (LLM)-based multi-agent systems. Existing cache management techniques either focus on individual agents or assume static workflows, which limits their effectiveness in real-world scenarios where agent invocation sequences can change based on task context. The authors propose a novel system called PBKV (Prediction-Based KV-Cache Management) that predicts future agent invocations by integrating historical workflow data and the current task context. PBKV conservatively manages cache entries based on these predictions, retaining high-potential entries in GPU memory while minimizing the impact of prediction errors. The methodology includes hierarchical eviction of cache entries and conservative prefetching strategies to enhance cache reuse without incurring significant costs from incorrect predictions. Experimental results demonstrate that PBKV achieves significant performance improvements, including up to 1.85× speedup over the Least Recently Used (LRU) policy in dynamic workflows and 1.26× speedup over the state-of-the-art KVFlow in static workflows. The findings highlight the importance of effective KV-cache management in optimizing the performance of multi-agent systems.
Methodology
The authors developed PBKV, which utilizes multi-step predictions based on historical workflows and current task context. The system incorporates hierarchical eviction strategies to prioritize cache entries with higher reuse potential and employs conservative prefetching to manage GPU memory efficiently. The predictor is designed to be robust against errors, ensuring that performance does not degrade significantly even with imperfect predictions.
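The toy class below sketches the flavor of prediction-weighted cache management: entries with the lowest predicted reuse probability are evicted first, and prefetching is triggered only above a confidence threshold. The data structures, thresholds, and tie-breaking rules are assumptions, not PBKV's actual policy.

```python
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    agent_id: str
    size_bytes: int
    last_used: float

@dataclass
class PredictionBasedCache:
    """Toy prediction-weighted KV-cache manager: evict entries with the lowest
    predicted reuse probability, prefetch only above a confidence threshold.
    A sketch of the idea behind PBKV, not its implementation."""
    capacity_bytes: int
    prefetch_threshold: float = 0.8
    entries: dict = field(default_factory=dict)

    def used_bytes(self):
        return sum(e.size_bytes for e in self.entries.values())

    def admit(self, entry, reuse_prob):
        # Evict the least-likely-to-be-reused entries first (ties broken by recency)
        # until the new entry fits.
        while self.used_bytes() + entry.size_bytes > self.capacity_bytes and self.entries:
            victim = min(self.entries, key=lambda a: (reuse_prob.get(a, 0.0),
                                                      self.entries[a].last_used))
            del self.entries[victim]
        self.entries[entry.agent_id] = entry

    def prefetch_candidates(self, reuse_prob):
        # Conservative prefetching: only agents the predictor is confident about.
        return [a for a, p in reuse_prob.items()
                if p >= self.prefetch_threshold and a not in self.entries]

cache = PredictionBasedCache(capacity_bytes=2_000_000)
cache.admit(CacheEntry("planner", 1_500_000, last_used=10.0), {"planner": 0.9})
cache.admit(CacheEntry("coder", 1_000_000, last_used=11.0), {"planner": 0.9, "coder": 0.6})
print(cache.prefetch_candidates({"critic": 0.85, "coder": 0.4}))
```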
Results
PBKV achieved up to 1.85× speedup over LRU in dynamic workflows and up to 1.26× speedup over KVFlow in static workflows. Additionally, it improved the KV-cache hit rate by up to 2.55× compared to LRU, demonstrating its effectiveness in optimizing cache management.
Implications
The findings suggest that effective KV-cache management can significantly enhance the performance of LLM-based multi-agent systems, making it applicable in various domains requiring complex reasoning and automation. The robust design of PBKV also indicates potential for integration into existing inference engines to improve their efficiency.
Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL
Reinforcement Learning
Theory
Optimization
- Introduction of Approximate Next Policy Sampling (ANPS) as an alternative to conservative policy updates.
- Development of Stable Value Approximate Policy Iteration (SV-API) to implement ANPS effectively.
- Demonstration of improved performance on RL benchmarks with larger policy updates.
- Establishment of theoretical bounds that highlight the importance of aligning training distribution with the next policy's state visitation.
Read more
Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL
Summary
This paper addresses the challenges associated with conservative policy updates in reinforcement learning (RL), which limit policy changes to ensure the value function is accurate on the state-visitation distribution of the updated policy. The authors propose a novel approach called Approximate Next Policy Sampling (ANPS), which shifts the training data to align with the future policy's state distribution rather than constraining the policy update. This method aims to enhance the value function estimate at states critical for policy improvement. The authors introduce Stable Value Approximate Policy Iteration (SV-API), a modification of standard approximate policy iteration algorithms that maintains a fixed target policy while an updated behavioral policy collects relevant experience. The policy update is only executed once the value estimates stabilize. The empirical results demonstrate that SV-API, when applied to existing algorithms like Proximal Policy Optimization (PPO), achieves comparable or superior performance on high-dimensional discrete (Atari) and continuous control tasks while allowing for larger policy updates. This work presents ANPS as a viable alternative to traditional conservative updates, addressing a fundamental issue in RL.
Methodology
The authors propose ANPS, which modifies the training distribution to approximate the next policy's state visitation. They introduce SV-API, which keeps the target policy fixed while an updated behavioral policy gathers experience. The algorithm commits to a policy update only after the value estimates have stabilized, resolving the circular dependency between choosing the next policy and collecting the data needed to evaluate it.
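As a loose, tabular caricature of this control flow (not the deep-RL algorithm), the sketch below proposes a candidate next policy, re-fits the value estimate under that candidate, and commits the target-policy update only once successive value estimates agree; the exact dynamics and greedy improvement step are simplifying assumptions for illustration.

```python
import numpy as np

def sv_api_tabular(P, R, gamma=0.9, stability_tol=1e-4, max_inner=50):
    """Tabular caricature of Stable Value API. P: transitions (S, A, S); R: rewards (S, A).
    The candidate next policy is evaluated before the target policy is replaced."""
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)
    V = np.zeros(S)
    for _ in range(100):
        Q = R + gamma * np.einsum("sap,p->sa", P, V)
        candidate = Q.argmax(axis=1)                     # candidate next policy

        # Inner loop: evaluate under the candidate policy's own dynamics,
        # stopping once successive value estimates agree.
        V_new = V.copy()
        for _ in range(max_inner):
            V_next = R[np.arange(S), candidate] + gamma * np.einsum(
                "sp,p->s", P[np.arange(S), candidate], V_new)
            if np.max(np.abs(V_next - V_new)) < stability_tol:
                break
            V_new = V_next

        if np.array_equal(candidate, policy) and np.max(np.abs(V_new - V)) < stability_tol:
            break
        policy, V = candidate, V_new                     # commit the target-policy update
    return policy, V

# Tiny 2-state, 2-action MDP for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.8, 0.2], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(sv_api_tabular(P, R))
```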
Results
SV-API, when applied to PPO, results in Stable Value PPO (SV-PPO), which matches or exceeds the performance of existing methods on high-dimensional discrete (Atari) and continuous control tasks. The approach allows for significantly larger target policy updates while maintaining safety in policy improvement.
Implications
The findings suggest that ANPS can provide a more effective framework for policy updates in RL, potentially leading to more efficient learning and better performance in complex environments. This approach could influence the design of future RL algorithms by reducing the constraints on policy updates.