AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 48
Update frequency: 8h
Days of history: 7
Survival Reinforcement Learning: Toward Scalable Self-Supervised RL
Reinforcement Learning
Robotics
- Introduction of Survival Reinforcement Learning (SRL) as a scalable self-supervised RL method.
- SRL maximizes dwell time at goals, addressing limitations of existing contrastive methods.
- Demonstrated superior performance of SRL on long-horizon locomotion tasks compared to state-of-the-art CRL.
- Empirical evidence supports the effectiveness of classification-based objectives in scaling RL.
Summary
This paper introduces Survival Reinforcement Learning (SRL), a novel online classification-based approach that enhances the survival value learning framework by maximizing the agent's dwell time at target goals. While previous self-supervised Contrastive Reinforcement Learning (CRL) has demonstrated impressive scaling capabilities, it struggles with long-horizon goal-conditioned planning due to the uniformity-tolerance dilemma associated with contrastive losses. SRL circumvents the structural limitations of CRL and addresses the undesirable 'bang-bang' control solutions typical of survival frameworks. Through extensive evaluations on various robotic benchmarks, SRL achieves competitive performance on manipulation tasks and significantly outperforms CRL by 2x to 8x on stable, long-horizon locomotion tasks. The findings suggest that classification-based methods could be pivotal in advancing scalable reinforcement learning.
Methodology
The authors extend the survival learning framework to develop SRL, which focuses on maximizing the agent's time spent at goal states. The methodology involves classifying state-action pairs based on their trajectories towards goals and employing a dwell time at goal formulation to stabilize the agent's position after reaching the goal. The architecture is built upon previous work that emphasizes depth-scaling behavior, and the performance is evaluated across various robotic environments.
Results
SRL achieves competitive results on challenging goal-reaching tasks, particularly excelling in AntMaze environments. It matches the performance of scaled CRL on manipulation tasks and significantly outperforms it on long-horizon locomotion tasks, demonstrating a 2x to 8x improvement. These results highlight the potential of classification-based methods in enhancing the scalability of reinforcement learning.
Implications
The development of SRL suggests new pathways for scalable self-supervised reinforcement learning, potentially impacting various applications in robotics and autonomous systems. The findings advocate for the integration of classification-based objectives in RL frameworks, which could lead to more robust and efficient learning algorithms.
Latent Diffusion Pretraining for Crystal Property Prediction
Generative Models
Graph Learning
Efficient ML
- Introduction of CrysLDNet, a latent diffusion-based pretraining framework for crystal property prediction.
- Integration of a Variational Autoencoder with a latent diffusion model to effectively learn from unlabeled data.
- Significant performance improvements over existing models, particularly in low-data regimes.
- Backbone-agnostic design allows for future model enhancements without retraining.
Summary
This paper addresses the challenge of predicting crystal properties efficiently and accurately, which is crucial for materials design. The authors propose a novel framework called CrysLDNet that utilizes latent diffusion pretraining to overcome the scarcity of labeled data in crystal property prediction. By integrating a Variational Autoencoder (VAE) with a latent diffusion model, the framework maps 3D crystal structures into a smooth latent space, allowing for effective learning from large-scale unlabeled data. The pretrained model can then be fine-tuned for specific property prediction tasks. The authors demonstrate that CrysLDNet significantly outperforms both training-from-scratch and existing pretrained models on popular datasets, achieving notable improvements in prediction accuracy, particularly in low-data scenarios. The framework is also designed to be backbone-agnostic, allowing for future enhancements without retraining the entire model. This work highlights the potential of latent diffusion methods in advancing crystal property prediction and materials science.
Methodology
The methodology involves a two-component framework: a Variational Autoencoder (VAE) that encodes 3D crystal structures into a smooth latent space, and a Latent Diffusion Model (LDM) that learns the distribution of these latent representations through a process of progressive noising and denoising. The model is pretrained on large amounts of unlabeled data and then fine-tuned on small labeled datasets for specific property predictions.
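The "progressive noising" half of the latent diffusion step can be illustrated with a small sketch. It assumes a standard DDPM-style forward process with a linear beta schedule; the summary does not state which schedule or parameterization CrysLDNet uses, and `make_alpha_bars`, `noise_latent`, and the three-dimensional latent are hypothetical names and values chosen for illustration.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear noise schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        frac = t / max(num_steps - 1, 1)
        beta = beta_start + (beta_end - beta_start) * frac
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_latent(z0, alpha_bar, rng):
    """Draw z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    scale, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    z_t = [scale * z + sigma * e for z, e in zip(z0, eps)]
    return z_t, eps  # a denoiser would be trained to predict eps from (z_t, t)


rng = random.Random(0)
alpha_bars = make_alpha_bars(num_steps=1000)
z0 = [0.5, -1.2, 0.3]  # stand-in for a VAE-encoded crystal latent
z_mid, eps_mid = noise_latent(z0, alpha_bars[499], rng)
```

In a setup like this, pretraining needs no property labels: the only supervision is the noise that was added, which is why unlabeled structures can be used.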
Results
CrysLDNet achieved significant relative error reductions of 4.26% to 19.34% compared to strong diffusion- and GNN-based baselines across various properties. The model demonstrated particularly strong performance in low-data conditions, where labeled samples are limited. Additionally, replacing the VAE encoder with more advanced architectures led to further performance improvements of 10.46% on the JARVIS dataset and 12.39% on the MP dataset.
Implications
The findings suggest that latent diffusion pretraining can effectively address data scarcity in materials science, enabling more accurate predictions of crystal properties. The framework's flexibility and potential for integration with future models could significantly enhance the efficiency of materials design processes.
DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning
Reinforcement Learning
Large Language Models
Efficient ML
- Identifies intra-prompt long tails as a significant source of inefficiency in RL for LLMs.
- Introduces DARTS, a novel framework for active distribution shaping to improve rollout efficiency.
- Employs a dual-end length sampling strategy and adaptive redundancy allocation to optimize trajectory selection.
- Demonstrates significant acceleration in RL training processes without degrading model performance.
Summary
This paper addresses the inefficiencies in Reinforcement Learning (RL) for Large Language Models (LLMs) caused by long-tail response length distributions. While previous approaches have focused on scheduling to mitigate the impact of long tails, this work identifies the root cause of inefficiency as the distribution itself. The authors characterize the long-tail distribution at a finer granularity, revealing intra-prompt long tails that often consist of verbose and ineffective responses. To tackle this issue, they propose DARTS (Distribution-Aware Active Rollout Trajectory Shaping), a novel paradigm that actively shapes the rollout distribution towards conciseness and certainty. DARTS employs a distribution-aware trajectory sampling mechanism that selects trajectories from a redundant exploration space and an adaptive redundancy allocation scheme to optimize both shaping effectiveness and system efficiency. The proposed method significantly accelerates the RL training process without compromising model performance, achieving up to 1.77× acceleration over state-of-the-art systems.
Methodology
The authors developed DARTS, which includes a distribution-aware trajectory sampling mechanism that selects optimal trajectories from a redundant exploration space. They also implemented a variance-based adaptive redundancy allocation scheme to balance shaping effectiveness with system efficiency. Additionally, system-level optimizations such as variance-guided tail pruning and a token-level streaming pipeline were introduced to enhance performance.
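One way to picture variance-based redundancy allocation is a budget split: prompts whose response lengths vary most receive the most extra rollouts. The sketch below is an assumption-laden illustration, not the DARTS algorithm. The proportional rule, the function name `allocate_redundancy`, and the example lengths are all hypothetical, and the trajectory selection rule itself is not shown because the summary does not specify it.

```python
from statistics import pvariance


def allocate_redundancy(lengths_per_prompt, total_extra):
    """Split `total_extra` extra rollouts across prompts in proportion to the
    variance of each prompt's observed response lengths (hypothetical rule)."""
    n = len(lengths_per_prompt)
    if n == 0:
        return []
    variances = [pvariance(ls) if len(ls) > 1 else 0.0 for ls in lengths_per_prompt]
    total_var = sum(variances)
    if total_var == 0:
        shares = [total_extra / n] * n  # no signal: spread the budget evenly
    else:
        shares = [total_extra * v / total_var for v in variances]
    extras = [int(s) for s in shares]  # round down first
    leftover = total_extra - sum(extras)
    # Hand the remaining rollouts to the largest fractional parts.
    by_fraction = sorted(range(n), key=lambda i: shares[i] - extras[i], reverse=True)
    for i in by_fraction[:leftover]:
        extras[i] += 1
    return extras


# Three prompts with previously observed response lengths (made-up numbers).
observed = [[120, 125, 118], [90, 400, 2100], [300, 310, 900]]
extras = allocate_redundancy(observed, total_extra=8)
```

The prompt with the widest length spread receives most of the budget, which matches the intuition that redundancy is only worth paying for where the tail is long.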
Results
Experiments showed that DARTS can accelerate the RL training process by up to 1.77× compared to existing state-of-the-art systems, while maintaining model performance. The distribution shaping effectively reduced the overhead caused by long-tail distributions, leading to improved computational resource utilization.
Implications
The findings suggest that addressing the distribution characteristics of rollout trajectories can lead to more efficient RL training processes, which is crucial for the development of advanced LLMs. This approach could be applied to various RL tasks, enhancing the performance of models in complex reasoning and decision-making scenarios.
CoMem: Context Management with A Decoupled Long-Context Model
NLP
Large Language Models
Efficient ML
- CoMem decouples memory management from reasoning, so a specialized model can handle history compression efficiently.
- The k-step-off asynchronous pipeline significantly reduces decoding overhead by overlapping memory summarization with agent execution.
- A reward-driven training method aligns the memory model with the agent's decision-making needs.
- CoMem achieves a 1.4× latency reduction over traditional long-context solutions while preserving performance.
Summary
The paper presents CoMem, a framework for context management in agentic models, particularly for long-horizon tasks. Traditional methods often incur significant decoding overhead because they must summarize extensive interaction histories, which hurts response latency. CoMem addresses this by decoupling memory management from the primary agent workflow, allowing the two processes to run in parallel. The authors propose a k-step-off asynchronous pipeline that overlaps memory summarization with agent inference, effectively masking the latency of context processing. To ensure the memory model captures the statistics needed for decision-making, a reward-driven training strategy is introduced. Theoretical analysis indicates that CoMem achieves a better efficiency-effectiveness trade-off than coupled architectures. Experiments on SWE-Bench-Verified show that CoMem reduces latency by 1.4× compared to standard long-context solutions while maintaining competitive performance. The framework also scales favorably with increased system throughput and allows agent reasoning and memory compression to be optimized independently.
Methodology
The authors developed CoMem by creating an asynchronous pipeline that allows memory summarization to occur in parallel with agent inference. They employed a reward-driven training strategy to align the memory model with the agent's decision-making needs, ensuring that the compressed memory captures sufficient statistics for effective reasoning.
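A k-step-off pipeline can be sketched with a background worker. In this sketch, the agent at step t reads the summary that was launched k steps earlier, so summarization never blocks the current step. This is a minimal sketch under assumptions: `summarize` and `agent_step` are hypothetical stand-ins for the memory model and the agent, and the paper's actual scheduling may differ.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def summarize(history):
    """Hypothetical stand-in for the memory model: compress the history."""
    return f"summary of {len(history)} steps"


def agent_step(summary, recent, t):
    """Hypothetical stand-in for the agent: act on stale summary + raw recent steps."""
    return f"action{t}"


def run_k_step_off(num_steps, k):
    history, trace = [], []
    pending = deque()  # summaries still being computed in the background
    summary = ""       # latest summary the agent is allowed to read
    with ThreadPoolExecutor(max_workers=1) as pool:
        for t in range(num_steps):
            # Launch summarization of everything seen so far (a snapshot).
            pending.append(pool.submit(summarize, list(history)))
            # Only wait on the job launched k steps ago; newer jobs keep running.
            if len(pending) > k:
                summary = pending.popleft().result()
            recent = history[max(0, len(history) - k):]  # raw steps not yet summarized
            history.append(agent_step(summary, recent, t))
            trace.append((t, summary, len(recent)))
    return trace


trace = run_k_step_off(num_steps=5, k=2)
```

Setting `k=0` recovers a coupled pipeline in which every step waits for a fresh summary. Larger `k` hides more summarization latency at the cost of a staler summary, with the last `k` raw steps passed alongside it to cover the gap.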
Results
Extensive experiments on SWE-Bench-Verified showed that CoMem provides a 1.4× reduction in latency compared to vanilla long-context models while maintaining competitive performance levels. The results indicate that the framework scales well with increased system throughput.
Implications
CoMem's approach to decoupling memory management from reasoning can significantly enhance the efficiency of agentic systems, particularly in applications requiring long-context processing. This framework could lead to improved user experiences in real-time systems and facilitate the development of more sophisticated autonomous agents capable of handling complex tasks.
DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation
Multimodal
Generative Models
Efficient ML
- DREAM-S integrates a NAS framework to optimize draft model configurations for speedup.
- The framework employs dynamic selection of intermediate features from the target model to enhance draft model accuracy.
- DREAM-S achieves up to a 3.85× speedup over conventional decoding methods.
- The approach significantly outperforms existing speculative decoding techniques in various multimodal tasks.
Summary
The paper introduces DREAM-S, a novel framework for speculative decoding (SD) tailored for vision-language models (VLMs). While SD has been effective in accelerating autoregressive generation in large language models (LLMs), its application in VLMs has been limited. DREAM-S employs a neural architecture search (NAS) to optimize the interaction between draft and target models, identifying the best draft model architecture and interaction strategy for specific hardware platforms. It also incorporates adaptive intermediate feature distillation guided by attention entropy, enhancing draft training efficiency. Experimental results demonstrate that DREAM-S achieves up to a 3.85× speedup compared to standard decoding methods while significantly outperforming existing SD baselines across various multimodal tasks. This work highlights the potential of integrating NAS and feature distillation in improving the performance and efficiency of VLMs, paving the way for more advanced multimodal applications.
Methodology
DREAM-S utilizes a neural architecture search (NAS) framework to identify optimal draft model configurations, input pruning ratios, and interaction strategies with the target model. It employs adaptive intermediate feature distillation based on attention entropy to improve draft training efficiency and accuracy.
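For readers new to speculative decoding, the draft-then-verify loop that DREAM-S builds on can be sketched with toy models. This shows only the generic greedy variant. It does not include the NAS search, input pruning, or feature distillation described above. `target_next` and `draft_next` are hypothetical toy functions, not real model calls.

```python
def target_next(tokens):
    """Toy deterministic 'target model': next token id from the current context."""
    return (tokens[-1] * 3 + len(tokens)) % 7


def draft_next(tokens):
    """Toy 'draft model': agrees with the target except at every fourth position."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 7


def speculative_decode(prompt, num_new, gamma):
    tokens, target_passes = list(prompt), 0
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # 1) The cheap draft proposes gamma tokens autoregressively.
        proposal = []
        for _ in range(gamma):
            proposal.append(draft_next(tokens + proposal))
        # 2) The target checks every proposed position; a real model would
        #    score all of them in one batched forward pass.
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # fix the first mismatch, drop the rest
                break
            accepted.append(tok)
        else:
            accepted.append(target_next(tokens + accepted))  # bonus token
        tokens.extend(accepted)
    return tokens[:goal], target_passes


def greedy_decode(prompt, num_new):
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


spec_tokens, passes = speculative_decode([1, 2], num_new=12, gamma=4)
```

The output matches plain greedy decoding with the target model, but the target is invoked once per accepted run rather than once per token. The speedup therefore depends on how often the draft agrees with the target, which is the quantity a better-trained draft model improves.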
Results
DREAM-S demonstrates a speedup of up to 3.85× compared to traditional decoding methods across multiple vision-language models and tasks, while maintaining high acceptance rates for generated tokens.
Implications
The advancements presented in DREAM-S could lead to more efficient and effective multimodal generation systems, enabling broader applications in areas such as image captioning, visual question answering, and content-based search.
Density-Guided Robust Counterfactual Explanations on Tabular Data under Model Multiplicity
Generative Models
Interpretability
Optimization
- DensityFlow provides a novel approach to generating robust counterfactual explanations by focusing on high-density data regions.
- The framework utilizes Neural ODEs and a density score learned via Noise Contrastive Estimation to guide counterfactual generation.
- A local proxy distillation mechanism enhances efficiency in black-box settings by minimizing redundant queries.
- Experimental results show significant improvements in robustness and validity compared to traditional ensemble methods.
Summary
This paper addresses the challenge of generating reliable counterfactual explanations (CEs) in machine learning models, particularly in low-density regions where classifiers exhibit high variance. The authors introduce DensityFlow, a generative framework that constructs robust CEs by focusing on high-confidence data manifolds. The framework employs a continuous-time dynamics model parameterized by Neural ODE, guided by a differentiable density score learned through Noise Contrastive Estimation. This approach effectively avoids uncertain low-density areas during counterfactual generation. Additionally, for black-box models, a local proxy distillation mechanism is proposed to align a lightweight surrogate model with the target model, optimizing the generation process with minimal queries. Experimental results demonstrate that DensityFlow outperforms existing ensemble-based methods in terms of validity and query efficiency, confirming its effectiveness in generating robust counterfactuals under model multiplicity.
Methodology
The authors propose DensityFlow, a generative framework that models counterfactual generation as continuous-time dynamics using Neural ODEs. A differentiable density score is learned through Noise Contrastive Estimation, which helps navigate the high-density regions of the data manifold. For black-box models, a local proxy distillation strategy is employed to align a lightweight surrogate model with the target model during the counterfactual generation process.
Results
The experiments conducted on synthetic and real-world datasets indicate that DensityFlow achieves state-of-the-art performance in terms of robustness and validity of counterfactual explanations while significantly reducing the number of queries required compared to existing ensemble-based approaches.
Implications
The findings suggest that DensityFlow can enhance the interpretability and reliability of machine learning models, particularly in high-stakes decision-making scenarios. Its ability to generate robust counterfactuals efficiently could be beneficial in fields such as healthcare, finance, and any domain requiring transparent algorithmic recourse.
Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment
NLP
Large Language Models
Efficient ML
- AdvCL repurposes adversarial perturbations for stable continual learning.
- The framework includes three modules: Intra-Smooth, Proto-Clip, and Inter-Align.
- Experiments show improvements in performance, robustness, and reduced forgetting.
- The modules can be integrated into various continual learning paradigms.
Summary
This paper introduces AdvCL, a novel framework that repurposes adversarial perturbations to enhance continual learning (CL) in dynamic environments. The authors address common challenges in CL, such as forgetting, limited transfer, and vulnerability to adversarial inputs, by leveraging adversarial perturbations as a geometric control signal. AdvCL comprises three modules: Intra-Smooth, which promotes local smoothness through small perturbations; Proto-Clip, which prevents excessive alignment to the current task prototype; and Inter-Align, which facilitates directional alignment towards previous task prototypes to minimize representational gaps. The experiments demonstrate that AdvCL consistently improves both standard performance and robustness, leading to reduced forgetting and enhanced transfer capabilities. The authors also analyze the sensitivity of the modules to perturbation settings and their effects on task similarity and geometric distance, highlighting the complementary benefits of combining these modules. Overall, AdvCL offers a new perspective on utilizing adversarial perturbations to stabilize continual learning processes.
Methodology
The AdvCL framework integrates three distinct modules: Intra-Smooth for local smoothing through small perturbations, Proto-Clip for similarity clipping to avoid over-alignment with current task prototypes, and Inter-Align for directional alignment towards previous task prototypes. The framework maintains a prototype for each task and updates it using an exponential moving average, facilitating better adaptation across tasks.
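The exponential-moving-average prototype update mentioned above can be written in a few lines. This is a sketch under assumptions: the momentum value of 0.9, the use of the batch mean as the new observation, and the function name `update_prototype` are illustrative choices that the summary does not specify.

```python
def update_prototype(prototype, batch_features, momentum=0.9):
    """EMA update: new = momentum * old + (1 - momentum) * mean(batch features)."""
    dim = len(prototype)
    count = len(batch_features)
    batch_mean = [sum(f[d] for f in batch_features) / count for d in range(dim)]
    return [momentum * p + (1.0 - momentum) * m for p, m in zip(prototype, batch_mean)]


# One prototype per task; here a 2-D prototype and a batch of two feature vectors.
prototype = [0.0, 0.0]
batch = [[1.0, 2.0], [3.0, 4.0]]
prototype = update_prototype(prototype, batch, momentum=0.5)
```

A higher momentum keeps the prototype stable across noisy batches, while a lower one lets it track the current task's representation more quickly.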
Results
The experiments conducted reveal that AdvCL significantly enhances both standard performance and robustness in continual learning scenarios. It achieves lower forgetting rates and stronger knowledge transfer across tasks, demonstrating the effectiveness of the proposed modules when used individually or in combination.
Implications
The findings suggest that adversarial perturbations can be effectively utilized beyond their traditional defensive role, offering a new approach to stabilize continual learning. This could lead to more robust models in dynamic environments, with potential applications in various fields requiring continual adaptation, such as robotics, natural language processing, and real-time data analysis.
Drift Q-Learning
Reinforcement Learning
Generative Models
Efficient ML
- DriftQL combines a drift-based behavioral regularizer with critic-driven policy improvement.
- The method generates actions in a single forward pass, avoiding the need for iterative denoising.
- DriftQL outperforms existing diffusion and flow-based methods on standard benchmarks.
- It maintains performance under degraded data quality, unlike many baseline methods.
Summary
Drift Q-Learning (DriftQL) is a novel approach to offline reinforcement learning (RL) that addresses the challenge of improving policies from fixed datasets while avoiding unreliable out-of-distribution (OOD) actions. Traditional methods often rely on diffusion and flow policies, which require complex iterative denoising processes and additional networks for efficient inference. DriftQL introduces a drift-based behavioral regularizer combined with critic-driven policy improvement, allowing for a single network implementation that generates actions in one forward pass. The method utilizes a drift field that attracts actions towards observed data while preventing collapse into a single mode, thus maintaining diversity in high-value regions. Evaluations on D4RL and OGBench benchmarks demonstrate that DriftQL consistently outperforms existing diffusion and flow methods, particularly under degraded data quality, showcasing its robustness and efficiency. This positions DriftQL as a promising alternative to more complex generative models in offline RL.
Methodology
DriftQL employs a drift field that guides policy improvement by attracting actions towards observed dataset actions while repelling nearby generated actions. This mechanism preserves diversity in high-value regions and allows for efficient action generation in a single forward pass, avoiding the complexities associated with traditional diffusion and flow methods.
Results
DriftQL demonstrated superior performance on D4RL and OGBench datasets compared to diffusion and flow-based baselines. It maintained close performance to clean-data results even under degraded data quality, highlighting its robustness and efficiency.
Implications
The development of DriftQL could lead to more efficient and simpler implementations of offline reinforcement learning, making it accessible for various applications where data quality may be compromised. Its robustness under degraded conditions suggests potential use in real-world scenarios where data may not always be pristine.
Scalable Inference-Time Annealing with Surrogate Likelihood Estimators
Generative Models
Efficient ML
- Introduction of SITA, a scalable method for inference-time annealing in molecular sampling.
- Utilization of surrogate likelihood estimators to bypass expensive divergence calculations.
- Demonstration of state-of-the-art performance on alanine dipeptide and alanine tripeptide.
- Integration of a BoltzNCE-style surrogate into a temperature annealing framework.
Summary
This paper addresses the challenge of efficiently sampling the Boltzmann distribution of molecular configurations, a fundamental task in computational chemistry and biophysics. Traditional methods like Markov Chain Monte Carlo (MCMC) and molecular dynamics (MD) simulations are often computationally expensive and struggle with high-dimensional systems. The authors propose Scalable Inference-Time Annealing (SITA), a novel approach that retrains flow-based models to generate samples at progressively lower temperatures using an energy-based model for fast surrogate likelihoods. This method eliminates the need for costly divergence computations typically required in existing importance sampling techniques. SITA is empirically validated on benchmark systems, specifically alanine dipeptide and alanine tripeptide, demonstrating state-of-the-art performance while avoiding the computational overhead associated with traditional methods. The authors provide their implementation in a publicly available code repository.
Methodology
SITA combines flow-based generative models with surrogate likelihood estimators to facilitate efficient inference-time annealing. The method involves generating proposals from a high-temperature Boltzmann distribution and using these to train an energy-based model. Importance-weighted resampling with learned surrogate likelihoods allows for sampling at lower temperatures without the need for complex Jacobian computations.
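The importance-weighted resampling step can be illustrated on a toy problem. The sketch assumes self-normalized weights of the form exp(-beta_target * E(x)) / q(x), where q is the proposal density. Here q is known exactly, up to a constant, for a one-dimensional harmonic well. In SITA it would instead come from the learned surrogate likelihood. All names and numbers are illustrative.

```python
import math
import random


def resample_to_lower_temperature(samples, energy, log_q, beta_target, rng):
    """Self-normalized importance resampling toward exp(-beta_target * E(x))."""
    log_w = [-beta_target * energy(x) - log_q(x) for x in samples]
    shift = max(log_w)  # subtract the max for numerical stability
    weights = [math.exp(lw - shift) for lw in log_w]
    return rng.choices(samples, weights=weights, k=len(samples))


def energy(x):
    return 0.5 * x * x  # toy 1-D harmonic well, not a molecular force field


beta_high, beta_low = 1.0, 4.0  # inverse temperatures: larger beta means colder


def log_q(x):
    # Unnormalized log-density of the hot proposal; stands in for a learned surrogate.
    return -beta_high * energy(x)


rng = random.Random(0)
hot = [rng.gauss(0.0, 1.0 / math.sqrt(beta_high)) for _ in range(5000)]
cold = resample_to_lower_temperature(hot, energy, log_q, beta_low, rng)
cold_second_moment = sum(x * x for x in cold) / len(cold)  # target value: 1 / beta_low
```

Because the weights need only log q(x), replacing an exact flow likelihood with a cheap surrogate removes the divergence or Jacobian evaluation from this step. The resampled set can then serve as training data for the next, colder model.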
Results
SITA achieves state-of-the-art results in sampling efficiency and accuracy on alanine dipeptide and alanine tripeptide, outperforming existing methods while avoiding the computational burdens associated with divergence evaluations in traditional importance sampling techniques.
Implications
The proposed method has significant implications for computational chemistry and biophysics, enabling faster and more efficient sampling of molecular configurations. This could enhance the ability to analyze complex molecular systems and facilitate high-throughput studies in drug discovery and materials science.
Learning to Construct Practical Agentic Systems
Large Language Models
Optimization
Efficient ML
- Introduction of a modular framework for designing agentic systems using pseudo-tools.
- Demonstration that hand-constructed fixed workflows are faster and more accurate than dynamically-planned workflows.
- Development of novel learning methods that outperform traditional hand-engineered agents.
- Application of multi-objective optimization to jointly enhance cost efficiency and response quality.
Summary
This paper addresses the design and optimization of agentic systems based on large language models (LLMs), emphasizing the need for practicality in production environments. The authors propose a framework that enforces modularity through the use of 'pseudo-tools,' which allow LLMs to be called recursively with restricted contexts. The study involves hand-engineering agents for 19 diverse tasks across various domains, demonstrating that fixed workflows outperform dynamically-planned ones in terms of speed and accuracy. Furthermore, the authors introduce novel learning methods for optimizing the components of these systems, including pseudo-tools and workflows, which generally surpass hand-engineered agents. The modularity of the framework also enables the application of multi-objective optimization techniques to balance cost and response quality effectively. Overall, the paper presents a structured approach to creating efficient and effective agentic systems suitable for real-world applications.
Methodology
The authors developed a framework that utilizes pseudo-tools to modularize reasoning in agentic systems. They hand-engineered agents for various tasks and compared the performance of fixed workflows against dynamically-planned workflows. Additionally, they introduced learning methods for optimizing the components of the system and applied multi-objective optimization techniques.
Results
The study found that hand-constructed fixed workflows were generally cheaper and more accurate than dynamically-planned workflows. The learning methods proposed for optimizing pseudo-tools and workflows generally outperformed hand-engineered agents, demonstrating the effectiveness of the modular framework.
Implications
The findings suggest that adopting a modular approach to agentic system design can lead to more efficient and effective solutions in real-world applications. This could have significant implications for industries relying on automated systems, such as finance, healthcare, and planning.
Spatio-temporal stochastic graph-based learning for infectious disease forecasting
Graph Learning
Time Series
- Introduces a spatio-temporal stochastic graph-based model for infectious disease forecasting.
- Addresses the limitations of traditional models by incorporating stochastic processes.
- Demonstrates improved forecasting accuracy using real-world datasets for COVID-19 and chickenpox.
- Shows the model's adaptability to various geographical scales and population sizes.
Summary
This paper presents a novel spatio-temporal stochastic graph-based architecture aimed at improving the forecasting of infectious disease cases, specifically COVID-19 and chickenpox. The authors highlight the limitations of existing spatio-temporal models, which often overlook stochastic processes and fail to account for the variability inherent in real-world disease spread across large geographical networks. The proposed model integrates a stochastic formulation and uncertainty approximation, allowing it to adapt to both large and small population networks. The authors validate their approach using real-world datasets, demonstrating enhanced predictive performance for COVID-19 in the US and chickenpox in Hungary. The results indicate that the model can effectively capture epidemic progression, although it exhibits a one-step delay in predictions and reduced sensitivity to high-frequency variability. This work emphasizes the importance of incorporating stochastic elements into epidemic forecasting models to better reflect the complexities of disease transmission.
Methodology
The authors developed a spatio-temporal stochastic graph-based learning model that organizes temporal epidemic data as features of graph nodes across geographical networks. The model incorporates stochastic outcomes and uses ensemble methods to estimate uncertainty, simulating multiple potential prediction trajectories.
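Ensemble-based uncertainty estimation can be sketched generically: run a stochastic model many times and summarize the spread of the resulting trajectories. The growth model below is a hypothetical single-node toy with made-up parameters. It is not the paper's graph architecture, and the 5th to 95th percentile band is an assumed choice of interval.

```python
import random
from statistics import mean


def stochastic_step(cases, rng):
    """Hypothetical one-step model: noisy multiplicative growth of case counts."""
    growth = rng.gauss(1.05, 0.1)
    return max(0.0, cases * growth)


def ensemble_forecast(initial_cases, horizon, num_members, seed=0):
    """Simulate `num_members` trajectories and summarize the final-step spread."""
    rng = random.Random(seed)
    finals = []
    for _ in range(num_members):
        cases = initial_cases
        for _ in range(horizon):
            cases = stochastic_step(cases, rng)
        finals.append(cases)
    finals.sort()
    lower = finals[int(0.05 * (num_members - 1))]  # approximate 5th percentile
    upper = finals[int(0.95 * (num_members - 1))]  # approximate 95th percentile
    return mean(finals), lower, upper


point, lower, upper = ensemble_forecast(initial_cases=100.0, horizon=4, num_members=200)
```

The ensemble mean acts as the point forecast and the percentile band as an uncertainty estimate. In a graph model the same procedure would be applied per node, for example per county.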
Results
The proposed model outperformed four benchmark spatio-temporal graph-based models, achieving competitive weekly forecasting performance for all 3,218 US counties and 20 Hungarian counties. The model effectively represented epidemic progression relative to baselines, albeit with a one-step prediction delay.
Implications
This research has significant implications for public health planning and response, as it provides a more accurate tool for forecasting infectious disease spread. By integrating stochastic elements, the model can better inform decision-making processes related to epidemic management and resource allocation.
Inner Product Aware Quantization: Provably Fast, Accurate, and Adaptive Algorithms
Optimization
Efficient ML
Theory
- Introduction of inner product aware quantization objectives (MDV and ADV).
- Development of adaptive and unbiased quantization methods that outperform traditional approaches.
- The algorithms are provably fast and return solutions within a (1 + ε) factor of the optimal cost.
- Empirical results show 2-10x speed improvements over state-of-the-art methods.
Summary
This paper addresses quantization in machine learning, focusing on inner product aware schemes that preserve inner products with unseen vectors. Traditional quantization methods often minimize mean-squared error (MSE), which may not suit applications that rely on inner products. The authors propose two new objectives, Maximum Directional Variance (MDV) and Average Directional Variance (ADV), that aim to preserve inner products more effectively. They develop adaptive and unbiased quantization methods that optimize these objectives, leading to algorithms that are both fast and accurate. The theoretical analysis reveals a strong connection to Adaptive Stochastic Quantization (ASQ), and the proposed algorithms run 2-10 times faster than existing methods while maintaining high quality. The paper emphasizes the practical implications of these advancements in applications including dataset compression and neural network weight quantization.
Methodology
The authors formulate two new objectives for quantization: Maximum Directional Variance (MDV) and Average Directional Variance (ADV). They analyze these objectives theoretically and develop both exact and approximate algorithms that optimize them efficiently. The paper also includes empirical evaluations comparing the proposed algorithms against existing state-of-the-art methods.
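As a rough illustration of why unbiasedness matters for inner products, the sketch below applies plain stochastic rounding to a fixed grid of levels. It is not the paper's MDV or ADV algorithm: the function name `stochastic_quantize`, the uniform grid, and the example vectors are assumptions made for illustration only.

```python
import random

def stochastic_quantize(x, levels, rng):
    """Round each coordinate to a neighbouring level so that E[q] == x (unbiased)."""
    out = []
    for v in x:
        lo = max(l for l in levels if l <= v)   # nearest level at or below v
        hi = min(l for l in levels if l >= v)   # nearest level at or above v
        if hi == lo:
            out.append(lo)
            continue
        p_up = (v - lo) / (hi - lo)             # probability of rounding up
        out.append(hi if rng.random() < p_up else lo)
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

rng = random.Random(0)
levels = [-1.0, -0.5, 0.0, 0.5, 1.0]   # hypothetical uniform grid
x = [0.3, -0.7, 0.9, 0.1]              # vector to compress
y = [1.0, 2.0, -1.0, 0.5]              # stands in for an "unseen" query vector

trials = 20000
estimates = [dot(stochastic_quantize(x, levels, rng), y) for _ in range(trials)]
mean_est = sum(estimates) / trials
var_est = sum((e - mean_est) ** 2 for e in estimates) / trials
print(f"exact <x, y> = {dot(x, y):.3f}, mean estimate = {mean_est:.3f}, variance = {var_est:.3f}")
```

The mean of the quantized inner product tracks the exact value, while its variance depends on where the levels sit relative to the data. Choosing levels to keep that variance small along query directions is, per the summary, the kind of goal the MDV and ADV objectives formalize.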
Results
The proposed algorithms demonstrate significant speed improvements, achieving 2-10 times faster runtimes compared to previous methods while maintaining high quality in quantization. The algorithms effectively minimize the variance of inner products, outperforming traditional quantization techniques in practical applications.
Implications
The advancements in inner product aware quantization can lead to more efficient data compression techniques, improved neural network performance, and faster computations in machine learning applications that rely on inner products. This work has the potential to enhance various downstream tasks in machine learning, particularly those involving large datasets and complex models.
A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL
Reinforcement Learning
Large Language Models
Theory
- Cross-domain degradation is driven by sparse RL edits interacting along shared computation routes, not just by global gradient conflict.
- A local perturbation model reveals that degradation is concentrated in a low-dimensional shared conflict subspace.
- Short domain refresh can selectively recover performance in earlier domains with limited impact on others.
- The study provides empirical validation through task-level recovery and a training-free rollback method.
Summary
This paper investigates the phenomenon of cross-domain interference in multi-domain reinforcement learning (RL), particularly in the context of large language models (LLMs). The authors argue that traditional explanations, such as catastrophic forgetting and global gradient conflict, do not fully account for the selective degradation observed when training on multiple domains sequentially. They introduce a local perturbation theory that focuses on the interactions of sparse parameter edits along shared active computation routes. The study reveals that cross-domain degradation primarily arises from second-order damage in a low-dimensional shared conflict subspace, rather than from direct overlap among edited neurons. The authors demonstrate that a brief refresh of a previously trained domain can effectively restore performance while minimizing collateral damage to other domains. Experimental results show that a short refresh after sequential training can recover performance significantly, exemplified by an increase in Math performance from 57.66 to 66.04, while maintaining stability in other domains. Additionally, a training-free rollback method is proposed to validate the localized damage hypothesis, providing insights into the mechanisms of interference and recovery in multi-domain RL.
Methodology
The authors employed a local perturbation model to analyze the effects of sequential training on multiple domains. They conducted experiments to observe the performance changes across domains and validated their theoretical claims through a short domain refresh and a training-free rollback on proxy conflict coordinates.
Results
The results indicate that a brief refresh on the Math domain after training on Code, QA, and CW led to a recovery in Math performance from 57.66 to 66.04, achieving the best average score of 66.39 across all domains. The rollback method provided additional evidence for the localized nature of interference.
Implications
The findings suggest that understanding the localized mechanisms of interference can lead to more effective training strategies in multi-domain RL, potentially improving the performance of large language models across various tasks. This could have applications in enhancing model robustness and adaptability in real-world scenarios.
Learning Multi-Agent Coordination via Sheaf-ADMM
Optimization
Graph Learning
Robotics
- Introduces Sheaf-ADMM for multi-agent coordination with limited local views.
- Utilizes cellular sheaf theory to define inter-agent constraints for heterogeneous consensus.
- Demonstrates improved performance on tasks like maze pathfinding, image classification, and Sudoku.
- Enhances robustness to distribution shifts in MNIST classification compared to standard CNNs.
Summary
This paper introduces Sheaf-ADMM, a differentiable optimization framework designed for multi-agent coordination in scenarios where agents have limited local views of input data. The framework decomposes input into overlapping local views, allowing each agent to solve a convex subproblem using a neural encoder. Coordination among agents is achieved through the Alternating Direction Method of Multipliers (ADMM), with inter-agent constraints defined by a cellular sheaf. This sheaf structure allows agents to agree on specific aspects of their solutions, facilitating heterogeneous global consensus. The authors demonstrate the effectiveness of Sheaf-ADMM on various tasks, including maze pathfinding, image classification, and Sudoku, showing that agents can learn to coordinate effectively even with insufficient local information. Notably, the method improves robustness to distribution shifts in MNIST classification compared to standard CNNs and achieves higher solve rates in Sudoku compared to matched MPNN baselines. The ADMM structure also enables distinct analysis of primal, consensus, and dual state variables, offering insights into coordination dynamics not available in traditional message-passing architectures.
Methodology
The methodology involves formulating coordination as a constrained optimization problem solved using ADMM. Each agent independently solves local subproblems parameterized by a neural network encoder, followed by a consensus step that projects their proposals towards global consistency. The entire process is differentiable, allowing for backpropagation through the optimization trajectory.
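The local-solve, consensus, and dual-update pattern described above can be sketched with textbook consensus ADMM. This is a deliberately simplified stand-in, not Sheaf-ADMM itself: each agent minimizes a scalar quadratic instead of a neural-encoder subproblem, and the sheaf restriction maps are replaced by the identity, so all agents must agree on a single number. The function name and the `targets` input are hypothetical.

```python
def consensus_admm(targets, rho=1.0, iters=100):
    """Consensus ADMM for f_i(x) = 0.5 * (x - a_i)^2 with the constraint x_i = z."""
    n = len(targets)
    x = [0.0] * n   # primal variables: each agent's local proposal
    u = [0.0] * n   # scaled dual variables: accumulated disagreement
    z = 0.0         # consensus variable shared by all agents
    for _ in range(iters):
        # Local step: closed-form minimizer of f_i(x) + (rho / 2) * (x - z + u_i)^2.
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(targets, u)]
        # Consensus step: project the proposals onto the agreement set (their average).
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: penalize each agent's remaining gap to consensus.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return x, z, u

x, z, u = consensus_admm([1.0, 4.0, 7.0])
print(f"consensus value: {z:.4f}")
```

With identity constraints the agents converge to the mean of their targets. Keeping `x`, `z`, and `u` as separate states is what allows primal, consensus, and dual variables to be inspected individually, as the summary notes.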
Results
The evaluation of Sheaf-ADMM on tasks such as maze pathfinding, image classification (MNIST), and Sudoku shows that agents can effectively coordinate to produce correct global outputs despite limited local views. The method outperforms standard CNNs in robustness to distribution shifts and achieves significantly higher solve rates in Sudoku compared to parameter-matched MPNN baselines.
Implications
The findings suggest that Sheaf-ADMM can be applied to various multi-agent systems where coordination is essential, particularly in environments with limited local information. The framework's interpretability and distinct state variable structure may also facilitate further research into multi-agent dynamics and optimization.
Convergence of Steepest Descent and Adam under Non-Uniform Smoothness
Optimization
Theory
- Generalizes non-uniform smoothness assumptions for better modeling of loss landscapes.
- Establishes convergence rates for steepest descent and adaptive methods like Adam and RMSProp.
- Demonstrates that Sign GD converges faster than traditional gradient descent for logistic regression.
- Shows that RMSProp and Adam can achieve linear convergence rates for certain neural networks.
Summary
This paper investigates the convergence properties of first-order optimization methods, specifically steepest descent and adaptive methods like Adam and RMSProp, under a generalized non-uniform smoothness (NS) assumption. The authors extend the NS assumption to include objectives where the curvature is an affine function of the objective value, applicable to various machine learning problems such as logistic regression and certain neural networks. They establish convergence rates for steepest descent and diagonal variants of RMSProp and Adam, demonstrating that under their assumptions, these methods can achieve linear convergence rates without requiring convexity or bounded gradient conditions. The results indicate that Sign GD can outperform traditional gradient descent in specific scenarios, and that RMSProp and Adam can converge linearly with constant step sizes for two-layer neural networks. The paper also presents a lower bound showing that these methods are faster than other adaptive methods like AdaGrad and AMSGrad, highlighting their efficiency in practical applications.
Methodology
The authors derive convergence guarantees based on the (H0, H1)-NS and non-uniform Łojasiewicz (NL) assumptions. They analyze the structural properties of functions satisfying these assumptions and apply them to derive convergence rates for various optimization methods, including steepest descent and its normalized variants. The resulting rates do not depend on the problem dimension, making the results broadly applicable.
Results
The paper establishes that steepest descent methods can achieve dimension-free linear convergence rates under the proposed assumptions. For logistic regression and softmax policy gradient objectives, Sign GD is shown to converge faster than traditional GD. Additionally, RMSProp and Adam are proven to converge linearly with constant step sizes for a class of two-layer neural networks, outperforming other adaptive methods like AdaGrad and AMSGrad.
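To make the Sign GD versus GD comparison concrete, the toy experiment below runs both updates on a small separable logistic regression problem. The data, step size, and step count are made up for illustration; the sketch shows the qualitative gap on one instance and does not reproduce the paper's rates or experiments.

```python
import math

# Toy separable data (made up for illustration): features and labels in {-1, +1}.
DATA = [((1.0, 0.5), 1), ((-1.0, -0.3), -1), ((0.8, 0.2), 1), ((-0.6, -0.1), -1)]

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def loss(w):
    return sum(softplus(-y * (w[0] * x[0] + w[1] * x[1])) for x, y in DATA) / len(DATA)

def grad(w):
    g = [0.0, 0.0]
    for x, y in DATA:
        margin = y * (w[0] * x[0] + w[1] * x[1])
        s = 1.0 / (1.0 + math.exp(margin))   # sigmoid(-margin)
        for j in range(2):
            g[j] -= s * y * x[j] / len(DATA)
    return g

def sign(v):
    return (v > 0) - (v < 0)

def run(update, steps=200, lr=0.1):
    """Iterate w <- w - lr * update(g) coordinate-wise and return the final loss."""
    w = [0.0, 0.0]
    for _ in range(steps):
        w = [wj - lr * update(gj) for wj, gj in zip(w, grad(w))]
    return loss(w)

gd_loss = run(lambda g: g)   # plain gradient descent
sign_loss = run(sign)        # Sign GD: steepest descent in the l-infinity norm
print(f"GD loss: {gd_loss:.2e}, Sign GD loss: {sign_loss:.2e}")
```

On separable data the gradient shrinks as the margins grow, so plain GD slows down, whereas Sign GD keeps a constant step length per coordinate and drives the loss down much faster on this instance.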
Implications
The findings suggest that adopting the generalized non-uniform smoothness assumptions can lead to more efficient optimization strategies in machine learning tasks, particularly in scenarios involving logistic regression and neural networks. The results may influence the design of optimization algorithms in practice, promoting the use of adaptive methods that leverage the properties identified in this research.
Auditing Near-Optimal Policies Can Be Exponentially Hard: Conditional Query Lower Bounds via Occupancy Rashomon Capacity
Reinforcement Learning
Theory
Interpretability
- Introduces occupancy Rashomon capacity to quantify the complexity of auditing near-optimal RL policies.
- Establishes conditional lower bounds for exact local-query auditing, indicating potential exponential complexity.
- Demonstrates the significance of occupancy-class level auditing in distinguishing between behaviorally distinct policies.
- Provides a finite discounted hidden-branch MDP to illustrate theoretical findings and prove the exact Bayes success law.
Summary
This paper addresses the challenges of auditing near-optimal reinforcement learning (RL) policies, which can exhibit behavioral multiplicity while achieving similar task rewards. The authors introduce a formalization of this phenomenon through the concept of occupancy Rashomon capacity, which quantifies the complexity of distinguishing between behaviorally distinct but return-equivalent policies. The study establishes that auditing at the occupancy-class level is crucial, particularly when distinguishing between exact local-query oracles and noisy sample-query oracles. The authors derive conditional lower bounds on the number of queries required for exact local-query auditing, demonstrating that under certain conditions, the complexity can be exponential. They provide a finite discounted hidden-branch Markov Decision Process (MDP) to illustrate their findings and prove the exact Bayes success law. Additionally, they explore the implications of noisy hidden-trigger testing and establish a mixture lower bound related to the per-sample Kullback-Leibler (KL) signal. The paper also discusses the separation of verification from cover generation and introduces a canonical occupancy regularizer that can collapse audited capacity when a trusted reference occupancy is available. Controlled benchmarks are presented to differentiate between instances with positive sparse signatures and high-capacity negative controls, mapping the findings to continuous-control and visual RL auditing regimes.
Methodology
The authors utilize theoretical analysis to derive lower bounds on query complexity for auditing near-optimal policies. They employ concepts from metric entropy and occupancy measures, and apply rigorous arguments involving KL divergence and information theory to establish their results. The study includes the construction of a specific MDP to exemplify the theoretical bounds and validate the findings through controlled benchmarks.
Results
The paper presents several key results, including a conditional lower bound of Ω(M/b) queries for exact local-query auditing when certain conditions are met, and a mixture lower bound of order M/β for noisy hidden-trigger testing. The findings indicate that exponential hardness in auditing arises when high Rashomon capacity is coupled with sparse or hidden local signatures. The authors also demonstrate that the canonical occupancy regularizer can collapse audited capacity when a trusted reference occupancy is available.
Implications
The findings have significant implications for the design and auditing of reinforcement learning systems, particularly in environments where multiple near-optimal policies exist. Understanding the conditions under which auditing becomes intractable can guide the development of more robust auditing frameworks and inform the design of RL algorithms that minimize behavioral multiplicity, enhancing safety and interpretability.
What changes after deployment? A survey on On-device Learning in TinyML
Efficient ML
- ODL enables machine learning models to adapt to distribution changes post-deployment directly on devices.
- The survey categorizes distribution changes into three regimes: single-change, concept drift, and continual learning.
- There is a significant gap between theoretical benchmarks and real-world applications in ODL.
- Understanding the nature of distribution changes is crucial for developing effective ODL solutions.
Summary
This paper presents a comprehensive survey of On-device Learning (ODL) in the context of Tiny Machine Learning (TinyML), focusing on the challenges posed by distribution changes after deployment. Traditional machine learning models, once deployed on microcontroller-class devices, often fail to perform effectively due to shifts in data distribution that occur in real-world scenarios. ODL aims to address this issue by enabling learning processes to occur directly on the device, allowing models to adapt to new data distributions. The authors categorize the existing literature into three distinct distribution change regimes: single-change, concept drift, and continual learning. Each regime presents unique challenges and requirements for the learning algorithms and hardware used. The survey analyzes approximately 70 ODL works, highlighting the persistent gap between methodological benchmarks and practical deployment scenarios. By emphasizing the importance of understanding distribution changes, the paper provides a structured framework for evaluating and comparing ODL solutions, ultimately contributing to the advancement of adaptive TinyML systems.
Methodology
The authors conducted a systematic survey of the existing literature on ODL in TinyML, categorizing works based on the type of distribution change they address. They analyzed approximately 70 studies, focusing on how different change types influence applications, hardware, and solution structures.
Results
The survey identified three main distribution change regimes and highlighted the varying demands each regime places on applications and learning algorithms. It also revealed a gap between the ideal performance of ODL methods in controlled settings and their effectiveness in real-world deployments.
Implications
The findings suggest that future research in TinyML should prioritize understanding and addressing distribution changes to enhance the adaptability and performance of on-device learning systems. This could lead to more robust applications in areas such as wearables, industrial sensors, and other embedded systems.
Automating Formal Verification with Reinforcement Learning and Recursive Inference
Reinforcement Learning
Large Language Models
Theory
- Introduces RLVR to improve LLM generation of verified programs and proofs.
- Achieves significant increases in verified rewards and pass rates through structured training.
- Identifies and addresses issues of specification hacking in model training.
- Develops a verifier-guided inference scaffold that enhances proof generation.
Summary
This thesis addresses the challenges of automating formal verification with large language models (LLMs), particularly the generation of verified programs and proofs. The author proposes an approach that combines reinforcement learning from verifiable rewards (RLVR) with verifier-guided inference-time search. The study begins by training open-source models in Dafny using RLVR, which improves the verified reward from 2.2% to 58.1%. However, the author identifies specification hacking, where models exploit weak formal specifications. To mitigate this, the author filters out underspecified tasks and employs multi-turn RLVR, improving the verified pass rate from 9.7% to 31.1%. Additionally, a verifier-guided inference scaffold in Lean treats proof generation as a structured search over subgoals, raising the pass rate from 46.2% to 69.2% on a pilot set. The study also introduces Dalek-Bench, a benchmark derived from the Rust curve25519-dalek verification project, although preliminary results indicate the need for stronger evaluation methods. Overall, the findings suggest that formal verifiers can enhance LLM performance when used as sources of reward and feedback, and they emphasize the importance of clean data and robust specifications.
Methodology
The methodology involves training models using reinforcement learning techniques, specifically RLVR, to optimize the generation of verified programs. The author employs Group Relative Policy Optimization (GRPO) and filters tasks to eliminate vulnerabilities. Additionally, a verifier-guided inference scaffold is created to facilitate structured proof generation.
Results
The initial experiments showed an increase in verified rewards from 2.2% to 58.1%, and after refining the task set, the verified pass rate improved from 9.7% to 31.1%. The verifier-guided scaffold improved the pass rate from 46.2% to 69.2% on a pilot set, and the new benchmark Dalek-Bench was established, although results indicated room for improvement.
Implications
The findings suggest that integrating formal verification processes with LLMs can significantly enhance their ability to generate correct and verified outputs. This has potential applications in fields requiring high assurance in software correctness, such as cybersecurity and critical systems development.
RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning
Reinforcement Learning
Large Language Models
Optimization
- Introduces Group Prioritized Off-Policy Optimization (POPO) to enhance RLVR for LLMs.
- Addresses the issue of ineffective training samples that lead to zero-variance rewards.
- Combines prioritized group replay and decoupled off-policy optimization for efficient learning.
- Empirical results show substantial improvements in reasoning tasks with fewer rollouts.
Summary
This paper addresses the challenges faced in Reinforcement Learning with Verifiable Rewards (RLVR) when training large language models (LLMs). A significant issue is the prevalence of ineffective training samples, which leads to zero-variance rewards and limits learning signals. The authors propose a novel framework called Group Prioritized Off-Policy Optimization (POPO) that enhances data efficiency without incurring additional computational overhead from extensive rollouts. POPO consists of two main components: prioritized group replay, which replaces ineffective on-policy groups with effective off-policy groups based on recency and sample quality, and decoupled off-policy optimization, which employs importance sampling to correct off-policy bias while ensuring stable policy updates. Empirical evaluations demonstrate that POPO accelerates RL finetuning and achieves strong reasoning performance across various tasks, including mathematics and planning, with significantly fewer rollouts compared to existing methods.
Methodology
The methodology involves a two-component framework: prioritized group replay, which replaces ineffective on-policy response groups with effective off-policy groups based on recency and sample quality, and decoupled off-policy optimization, which uses importance sampling to mitigate off-policy bias while maintaining stable updates through consistent trust-region constraints.
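The replay step can be sketched as follows, under loud assumptions: a group counts as ineffective when all of its rewards are identical (so its group-relative advantages are all zero), and buffered groups are ranked by recency first and reward variance second. The priority rule, the dictionary layout, and every function name here are hypothetical, and the importance-sampling correction is omitted entirely.

```python
from statistics import pvariance

def is_effective(rewards):
    """A group carries learning signal only if its rewards are not all equal."""
    return pvariance(rewards) > 0.0

def group_advantages(rewards):
    """Group-relative advantages: rewards standardized within the group."""
    mean = sum(rewards) / len(rewards)
    std = pvariance(rewards) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def fill_batch(on_policy_groups, replay_buffer, step):
    """Keep effective on-policy groups; swap ineffective ones for buffered groups."""
    # Hypothetical priority: most recent first, then highest reward variance.
    ranked = sorted(replay_buffer,
                    key=lambda g: (step - g["step"], -pvariance(g["rewards"])))
    batch = []
    for group in on_policy_groups:
        if is_effective(group["rewards"]):
            batch.append({**group, "off_policy": False})
        elif ranked:
            batch.append({**ranked.pop(0), "off_policy": True})
        # If the buffer is empty, the ineffective group is simply dropped.
    return batch

on_policy = [
    {"prompt": "p1", "rewards": [1.0, 1.0, 1.0, 1.0], "step": 10},  # zero variance
    {"prompt": "p2", "rewards": [1.0, 0.0, 0.0, 1.0], "step": 10},
]
buffer = [
    {"prompt": "p3", "rewards": [0.0, 1.0, 0.0, 0.0], "step": 7},
    {"prompt": "p4", "rewards": [1.0, 1.0, 0.0, 0.0], "step": 9},
]
batch = fill_batch(on_policy, buffer, step=10)
print([(g["prompt"], g["off_policy"]) for g in batch])
```

Groups flagged `off_policy` would then need the importance-sampling correction described above before contributing to the policy update.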
Results
The empirical evaluations across diverse reasoning tasks indicate that POPO significantly accelerates the RL finetuning process and achieves performance comparable to state-of-the-art methods while requiring substantially fewer rollouts. The framework consistently improves reasoning performance across different response group sizes and base RLVR algorithms.
Implications
The proposed POPO framework has the potential to improve the efficiency and effectiveness of training large language models in various reasoning tasks, making it a valuable tool for researchers and practitioners in the field of reinforcement learning and natural language processing.
Quantized Reasoning Models Think They Need to Think Longer, but They Do Not
NLP
Large Language Models
Efficient ML
- PTQ exacerbates overthinking in reasoning models: in up to 52% of failure cases, the quantized model reaches the correct answer mid-reasoning and then abandons it.
- High KL divergence tokens, particularly hesitation and branching markers, are identified as key contributors to overthinking errors.
- A training-free logit penalty on overthinking markers reduces CoT length by 12-23% while maintaining or improving accuracy.
- Controlled ablations confirm that targeting overthinking markers yields the best efficiency-performance balance.
Summary
This paper investigates the impact of post-training quantization (PTQ) on reasoning models, particularly in the context of large language models (LLMs). The authors find that aggressive quantization not only reduces accuracy but also increases the length of chain-of-thought (CoT) reasoning. Surprisingly, in up to 52% of cases where quantized models fail, they actually arrive at the correct answer during intermediate reasoning steps but fail to output it as the final answer due to overthinking. The study employs token-level KL divergence analysis to identify that positions with high divergence correlate with high next-token entropy, leading to an increased sampling of overthinking markers like 'wait' and 'but'. To mitigate this issue, the authors propose a training-free logit penalty on overthinking markers, which effectively reduces CoT length by 12-23% while preserving or improving accuracy across various models and benchmarks. This intervention highlights a favorable trade-off between accuracy and reasoning cost, demonstrating that quantized models do not fail due to a lack of reasoning capability, but rather due to excessive deliberation.
Methodology
The authors conducted a KL divergence analysis between the output distributions of quantized and full-precision models to identify tokens that contribute to overthinking. They then applied a logit penalty on a curated set of overthinking markers during decoding to reduce CoT length without additional computational overhead.
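A decoding-time logit penalty of this kind can be sketched in a few lines. The toy vocabulary, the logit values, and the penalty size are assumptions for illustration; only the markers 'wait' and 'but' come from the summary, and the paper's curated marker set and penalty strength are not reproduced here.

```python
import math

def penalize_markers(logits, marker_ids, penalty):
    """Subtract a fixed penalty from the logits of the selected token ids."""
    return [l - penalty if i in marker_ids else l for i, l in enumerate(logits)]

def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "wait", "but", "so", "answer"]   # toy vocabulary
marker_ids = {1, 2}                              # indices of 'wait' and 'but'
logits = [1.0, 2.0, 1.5, 0.5, 1.0]               # made-up next-token logits

before = softmax(logits)
after = softmax(penalize_markers(logits, marker_ids, penalty=2.0))
mass_before = sum(before[i] for i in marker_ids)
mass_after = sum(after[i] for i in marker_ids)
print(f"marker probability mass: {mass_before:.3f} -> {mass_after:.3f}")
```

Because the penalty only shifts a few logits before sampling, it needs no retraining and adds negligible cost per decoding step.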
Results
The introduction of the logit penalty consistently reduced CoT length by 12-23% across five models and various quantization methods, while preserving or improving accuracy. The study also found that overthinking errors were reduced by up to 58% in quantized models.
Implications
The findings suggest that improving the efficiency of reasoning in quantized models can lead to better deployment of large language models in practical applications. The proposed logit penalty could be a valuable tool for enhancing model performance without requiring retraining.
idSCD: Identifying Training Datasets through Semantic Correlation Descriptors
NLP
Theory
Interpretability
- Introduces a semantic approach to dataset-level membership inference, moving beyond behavioral evidence.
- Develops Semantic Correlation Descriptors (SCDs) to capture and compare semantic correlation structures across datasets.
- Proposes a practical membership score that does not require leave-one-dataset-out models.
- Achieves superior performance compared to existing black-box and white-box methods in various experimental settings.
Summary
The paper introduces a novel approach for identifying training datasets based on the semantic correlation structures that models learn during training. The authors propose Semantic Correlation Descriptors (SCDs) as a method to capture these structures, which can reveal dataset-specific traces in a model's behavior. Unlike traditional methods that rely on behavioral evidence such as confidence scores or prediction margins, the SCD approach focuses on the internal semantic associations learned by the model. The authors demonstrate that SCDs can effectively distinguish between matching and non-matching dataset pairs in a controlled setting. They also propose a practical membership score that utilizes SCDs to determine if a target dataset was part of the training mixture of a model, without the need for leave-one-dataset-out models. The effectiveness of this approach is validated across three diverse experimental settings: natural language inference, emotion classification, and medical text classification, showing significant improvements over existing methods.
Methodology
The authors developed Semantic Correlation Descriptors (SCDs) to summarize the semantic correlation structures learned by models. They conducted a controlled leave-one-dataset-out diagnostic to validate the effectiveness of SCDs in recovering dataset-specific changes. A practical membership score was then proposed, which only requires the model's SCD and the standalone SCD of the target dataset to assess membership.
Results
The idSCD classifier, based on the proposed membership score, achieved the highest average performance and lowest standard deviation across three experimental settings, outperforming black-box baselines (RMIA, Attack-P, LiRA) and the white-box SIF baseline. The largest relative gain in ROC-AUC exceeded 60% when dataset groups exhibited distinct semantic characteristics.
Implications
This work has significant implications for model auditing, privacy, and accountability in machine learning. By enabling the identification of training datasets, it can help mitigate issues related to benchmark contamination and enhance reproducibility in research.
Parallel Tempering Initial Sampling in Inference-Time Reward Alignment
Generative Models
- PATHS improves initialization for inference-time reward alignment in generative models.
- The method utilizes parallel tempering to explore complex reward landscapes effectively.
- Periodic Metropolis swaps between chains enhance the sampling of high-reward states.
- Experiments show consistent performance gains over existing SMC-based methods.
Summary
This paper introduces PATHS (PArallel Tempering for High-complexity reward Sampling), a novel initialization method for inference-time reward alignment in generative models. The authors identify limitations in existing Sequential Monte Carlo (SMC) methods, which often initialize particles from standard priors, leading to poor performance in complex reward landscapes characterized by rare high-reward regions and multi-modal distributions. PATHS addresses these issues by employing parallel tempering, which maintains multiple sampling chains at different temperatures. This approach allows for efficient exploration of the reward landscape through periodic Metropolis swaps, enabling the transfer of high-reward states from exploratory chains to more stable chains. The authors demonstrate that PATHS significantly improves the sampling of rare, high-reward regions, enhancing the alignment quality in tasks such as layout-to-image generation and quantity-aware generation. Experimental results show that PATHS consistently outperforms prior methods, highlighting the importance of robust initialization and cross-mode exploration in complex reward settings.
Methodology
The proposed PATHS method leverages parallel tempering to run multiple sampling chains at varying temperatures. Higher-temperature chains explore the reward landscape more freely, while lower-temperature chains focus on stable reward-aware posteriors. Metropolis swaps are periodically performed to exchange high-reward states between chains, facilitating better exploration and initialization.
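The mechanism can be sketched with generic parallel tempering on a one-dimensional double-well energy, where low energy plays the role of high reward. This is a textbook illustration rather than PATHS itself: the energy function, the temperature ladder, and the random-walk proposals are assumptions, and no generative model is involved.

```python
import math
import random

def energy(x):
    # Double well with minima at x = -2 and x = +2, separated by a barrier at x = 0.
    return (x * x - 4.0) ** 2 / 2.0

def parallel_tempering(temps, steps, swap_every, rng, step_size=0.5):
    xs = [-2.0] * len(temps)   # every chain starts in the left-hand mode
    cold = []                  # samples from the lowest-temperature chain
    for t in range(1, steps + 1):
        # Within-chain move: random-walk Metropolis targeting exp(-energy / temp).
        for i, temp in enumerate(temps):
            proposal = xs[i] + rng.gauss(0.0, step_size)
            log_acc = (energy(xs[i]) - energy(proposal)) / temp
            if rng.random() < math.exp(min(0.0, log_acc)):
                xs[i] = proposal
        # Periodic Metropolis swaps between neighbouring temperatures.
        if t % swap_every == 0:
            for i in range(len(temps) - 1):
                log_acc = ((1.0 / temps[i] - 1.0 / temps[i + 1])
                           * (energy(xs[i]) - energy(xs[i + 1])))
                if rng.random() < math.exp(min(0.0, log_acc)):
                    xs[i], xs[i + 1] = xs[i + 1], xs[i]
        cold.append(xs[0])
    return cold

rng = random.Random(0)
samples = parallel_tempering(temps=[1.0, 3.0, 9.0], steps=5000, swap_every=5, rng=rng)
right = sum(1 for x in samples if x > 0) / len(samples)
print(f"fraction of cold-chain samples in the right-hand mode: {right:.2f}")
```

The hot chain crosses the barrier easily, and swaps pass its discoveries down the ladder, so the cold chain reaches a mode it would rarely find on its own.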
Results
PATHS was evaluated on layout-to-image and quantity-aware generation tasks, demonstrating significant improvements in alignment quality compared to existing methods like TDS, DAS, and Ψ-Sampler. The results indicate that PATHS effectively addresses the challenges posed by rare and multi-modal reward landscapes.
Implications
The findings suggest that robust initialization and exploration strategies are crucial for effective inference-time reward alignment in generative models. This work could lead to advancements in various applications requiring high-quality generative outputs aligned with user-specified rewards.
Chem-PerturBridge: a harmonized compendium of small molecule perturbation transcriptomic effects
Theory
- Chem-PerturBridge integrates a vast amount of transcriptomic data from diverse sources, providing a unified resource for small-molecule perturbation studies.
- The study reveals that while fine-grained logFC agreement across datasets is weak, the direction of logFC is more consistent.
- Embeddings pretrained on Chem-PerturBridge significantly improve performance in compound representation learning compared to existing methods.
- The resource supports both diagnostic evaluations of cross-dataset agreement and model-oriented reuse of heterogeneous data.
Summary
The paper introduces Chem-PerturBridge, a comprehensive and harmonized resource designed to facilitate the training and evaluation of small-molecule transcriptomic perturbation models. It integrates over 37,000 compounds, 136 cellular contexts, and 1.25 million transcriptomic samples across various assay types, including bulk RNA-seq and single-cell data. The authors address the fragmentation of existing transcriptomic resources, which differ in technologies, metadata conventions, and preprocessing methods, making it challenging to compare and utilize these datasets effectively. Chem-PerturBridge standardizes compound identifiers, cellular contexts, doses, and other metadata, allowing for a more coherent analysis of perturbation effects. The study evaluates the agreement of matched conditions across datasets and finds that while fine-grained log fold change (logFC) rankings show weak agreement, the direction of logFC is more stable. Additionally, the resource is tested for its utility in pretraining models for compound representation learning, demonstrating that embeddings trained on Chem-PerturBridge outperform those trained on other datasets. This work not only provides a valuable resource for researchers but also highlights the importance of harmonization in transcriptomic data for improving predictive modeling.
Methodology
The authors constructed Chem-PerturBridge by harmonizing multiple datasets, standardizing metadata, and performing differential gene expression analysis to create condition-level perturbation effects. They evaluated dataset agreement through matched-condition benchmarks and tested the utility of the resource for pretraining models in compound representation learning.
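The contrast between rank-level and direction-level agreement can be illustrated with a small sketch. It is not the authors' benchmark: the two logFC vectors are made-up toy values for one hypothetical matched condition, and Spearman correlation and sign agreement are assumed here as stand-ins for the paper's agreement measures.

```python
import math

def ranks(values):
    """Return 1-based ranks (assumes no ties, which holds for the toy data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def sign_agreement(x, y):
    """Fraction of genes whose logFC has the same sign in both datasets."""
    return sum((a > 0) == (b > 0) for a, b in zip(x, y)) / len(x)

# Toy logFC values for the same compound and cell context in two datasets
# (hypothetical numbers, chosen so that signs match but magnitudes do not).
logfc_a = [2.0, 1.5, 1.0, 0.5, -0.5, -1.0, -1.5, -2.0]
logfc_b = [0.2, 0.4, 1.1, 1.9, -2.2, -1.8, -0.6, -0.3]

rank_agreement = spearman(logfc_a, logfc_b)
direction_agreement = sign_agreement(logfc_a, logfc_b)
```

In this toy case every gene moves in the same direction in both datasets, yet the rank correlation is well below 1, which mirrors the qualitative pattern the summary describes.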
Results
The analysis showed that matched same-compound conditions exhibited weak agreement in logFC rankings across datasets, while the direction of logFC was more stable. Models trained on Chem-PerturBridge outperformed or matched those trained on other datasets in various evaluations, indicating the resource's effectiveness for improving predictive modeling.
Implications
Chem-PerturBridge provides a critical tool for researchers in pharmacology and systems biology, enabling better integration and analysis of transcriptomic data across different assays. It can enhance the development of predictive models for drug response and toxicity, ultimately aiding in therapeutic discovery.
Balanced LoRA: Removing Parameter Invariance to Accelerate Convergence
NLP
Large Language Models
Efficient ML
- BaLoRA improves convergence rates by enforcing balanced low-rank adapters during optimization.
- Theoretical analysis shows that balanced minimizers have optimal conditioning, leading to faster convergence.
- Empirical results demonstrate that BaLoRA outperforms standard LoRA and matches or exceeds state-of-the-art LoRA variants.
- The method is computationally efficient and compatible with existing fine-tuning frameworks.
Summary
This paper introduces Balanced Low-Rank Adaptation (BaLoRA), an extension of the widely used Low-Rank Adaptation (LoRA) method for fine-tuning large language models. The authors observe that LoRA is inherently overparameterized: many pairs of low-rank factors yield the same adapted weight matrix but have different condition numbers. This variation in conditioning affects the convergence rate of optimization. BaLoRA addresses this by projecting the low-rank adapters onto a balanced manifold during training, which improves the conditioning of the loss landscape while leaving the adapted matrix unchanged. The authors provide both theoretical and empirical evidence that BaLoRA converges faster than standard LoRA and achieves superior performance across various fine-tuning tasks. The method is computationally lightweight and integrates into existing fine-tuning pipelines, making it a practical choice for researchers and practitioners.
Methodology
The authors analyze the asymptotic behavior of LoRA's training dynamics and establish bounds on the convergence rate by examining the conditioning of the loss landscape. They introduce BaLoRA, which incorporates a projection step onto a balanced manifold after each optimization step, ensuring that the low-rank adapters maintain optimal conditioning throughout training.
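The invariance that BaLoRA removes can be seen in the simplest case, a rank-1 adapter. The sketch below is an assumption-laden illustration rather than the paper's projection: it treats "balanced" as "both factors have equal norm", which is the rank-1 form of the usual balancing condition, and the helper names are hypothetical.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def outer(b, a):
    """Rank-1 weight update: outer product of column b and row a."""
    return [[bi * aj for aj in a] for bi in b]

def balance_rank1(b, a):
    """Rescale (b, a) -> (c * b, a / c) so that both factors have equal norm.

    The product outer(b, a) is unchanged because the scalings cancel.
    """
    c = math.sqrt(norm(a) / norm(b))
    return [c * x for x in b], [x / c for x in a]

# A badly scaled pair: b is large, a is tiny, yet the product is moderate.
b = [100.0, 200.0]
a = [0.01, 0.02, 0.03]

b_bal, a_bal = balance_rank1(b, a)
delta_before = outer(b, a)
delta_after = outer(b_bal, a_bal)
```

Both factor pairs represent the same adapted weights, but only the rescaled pair has matching norms. For higher ranks, the analogous step would need a matrix factorization rather than a scalar rescaling.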
Results
BaLoRA consistently outperformed standard LoRA in various experiments involving large language models and datasets, demonstrating faster convergence and improved performance metrics. The authors also reformulated BaLoRA iterations as an intrinsic optimization scheme, providing a clearer geometric interpretation of the algorithm.
Implications
The findings suggest that BaLoRA can significantly enhance the efficiency of fine-tuning large language models, making it a valuable tool for researchers and practitioners in natural language processing and related fields. Its compatibility with existing frameworks allows for easy adoption in practical applications.
Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments
Reinforcement Learning
Optimization
Theory
- Identification of 'zero collapse' as a failure mode in policy gradient methods due to discontinuous reward landscapes.
- Mechanistic explanation of how flat zero-reward regions lead to vanishing gradient signals and sample inefficiency.
- Empirical demonstration of zero collapse across multiple policy gradient methods.
- Proposed mitigation strategies to enhance stability and learning speed in reinforcement learning.
Summary
This paper investigates a critical failure mode of policy gradient methods in reinforcement learning, termed 'zero collapse', which occurs in environments characterized by discontinuous reward landscapes. The authors focus on bidding in repeated auctions as a case study, where rewards are structured in a thresholded manner. In such environments, agents receive no reward until their actions exceed a certain threshold, leading to large flat regions of zero reward separated by sharp transitions to high-reward areas. The authors demonstrate that policy gradient methods, driven by stochastic exploration, can overshoot optimal regions and become trapped in these flat zero-reward areas, resulting in ineffective learning dynamics. The paper provides a mechanistic explanation for this phenomenon, highlighting the interaction between policy stochasticity and step size, and empirically validates the occurrence of zero collapse across various policy gradient methods, including REINFORCE and actor-critic variants. The authors propose practical strategies to mitigate this issue, such as improved initialization schemes and architectural choices, and introduce a formal framework for reinforcement learning in auction environments, emphasizing the unique structural properties of these settings.
Methodology
The authors conducted theoretical analysis and empirical experiments to explore the zero collapse phenomenon. They examined the interaction between policy stochasticity and step size, and tested various policy gradient methods, including REINFORCE and actor-critic approaches, in environments with discontinuous reward structures. They also proposed practical strategies to mitigate the identified issues.
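The vanishing-gradient mechanism can be reproduced in a toy setting. The sketch below is not the paper's auction environment: it assumes a hypothetical first-price-style reward with a fixed winning threshold, a Gaussian bidding policy, and a plain REINFORCE estimator without a baseline.

```python
import random

THRESHOLD = 0.5  # hypothetical minimum winning bid
VALUE = 1.0      # hypothetical value of the item to the bidder

def reward(bid):
    """Zero below the threshold, value minus bid once the bid wins."""
    return VALUE - bid if bid >= THRESHOLD else 0.0

def reinforce_grad(mu, sigma, n_samples, rng):
    """Score-function estimate of d/d(mu) E[reward(a)] for a ~ Normal(mu, sigma)."""
    total = 0.0
    for _ in range(n_samples):
        a = rng.gauss(mu, sigma)
        total += reward(a) * (a - mu) / sigma ** 2
    return total / n_samples

rng = random.Random(0)

# Policy centred above the threshold: the gradient pushes the bid down,
# toward the discontinuity.
grad_active = reinforce_grad(mu=0.6, sigma=0.02, n_samples=200_000, rng=rng)

# Policy centred deep in the flat region: every sampled reward is zero,
# so the estimate carries no learning signal at all.
grad_flat = reinforce_grad(mu=0.1, sigma=0.02, n_samples=200_000, rng=rng)
```

With a large enough step size, the negative gradient in the first case can carry the policy mean past the threshold, after which the second case applies and the estimator returns exactly zero.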
Results
The study found that policy gradient methods are susceptible to zero collapse, particularly in environments with discontinuous rewards. The empirical results showed that once policies enter flat zero-reward regions, recovery is highly sample-inefficient, leading to stalled learning. The proposed mitigation strategies improved the stability and learning speed of the agents in these challenging environments.
Implications
The findings have significant implications for the design of reinforcement learning algorithms, particularly in applications involving auction environments and other decision-making scenarios with discontinuous rewards. The proposed strategies can help improve the robustness and efficiency of learning in such settings, potentially enhancing performance in real-world applications like digital advertising.
Improving Selective Classification with Pairwise Queries for Binary Classification
NLP
Large Language Models
Theory
- Selective classification can waste expert resources if confidence estimates are unreliable.
- Pairwise queries provide a more accurate measure of sample quality than confidence estimates.
- The proposed method improves accuracy on non-rejected samples while reducing costs.
- Theoretical conditions for the effectiveness of pairwise queries are established.
Summary
This paper addresses the challenges of selective classification in binary classification tasks, particularly when using large language models (LLMs). Selective classification lets a model predict labels for samples it is confident about while abstaining on uncertain samples, which are then labeled by experts at a cost. The authors observe that model confidence estimates are often inconsistent with the actual predictions, leading to high error rates on non-rejected samples. To mitigate this, they propose pairwise queries, in which the model is asked to compare two unlabeled samples and determine which is closer to a specific label. This method is shown to be more reliable than using raw confidence estimates. The authors establish theoretical conditions under which pairwise queries outperform traditional confidence estimates. Extensive experiments on synthetic and real datasets confirm that the proposed approach yields a better accuracy-cost tradeoff than existing methods that rely solely on confidence estimates.
Methodology
The authors propose a pairwise query approach for selective classification in binary classification tasks. They establish theoretical conditions under which this method outperforms traditional confidence-based approaches. The methodology involves sending pairs of unlabeled data points to the model and asking which sample is closer to the target label, thereby leveraging the model's comparative judgment rather than relying on potentially flawed confidence scores.
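One way to picture how comparative judgments could drive abstention is sketched below. This is a guess at the general shape, not the authors' algorithm: the comparison oracle `closer_to_positive`, the win-count ranking, and the fixed accept and reject counts are all hypothetical, and the oracle is simulated with hidden scores instead of an LLM call.

```python
from itertools import combinations

def rank_by_pairwise_wins(items, closer_to_positive):
    """Order items from most to least positive-looking by counting pairwise wins."""
    wins = {item: 0 for item in items}
    for x, y in combinations(items, 2):
        if closer_to_positive(x, y):
            wins[x] += 1
        else:
            wins[y] += 1
    return sorted(items, key=lambda item: wins[item], reverse=True)

def selective_labels(ranked, n_pos, n_neg):
    """Label the top n_pos as 1, the bottom n_neg as 0, abstain (None) on the rest."""
    labels = {}
    for i, item in enumerate(ranked):
        if i < n_pos:
            labels[item] = 1
        elif i >= len(ranked) - n_neg:
            labels[item] = 0
        else:
            labels[item] = None  # would be routed to an expert
    return labels

# Simulated oracle: hidden scores stand in for the model's comparative judgment.
hidden_score = {"s1": 0.95, "s2": 0.10, "s3": 0.55, "s4": 0.80, "s5": 0.45, "s6": 0.05}

def closer_to_positive(x, y):
    return hidden_score[x] > hidden_score[y]

ranked = rank_by_pairwise_wins(list(hidden_score), closer_to_positive)
labels = selective_labels(ranked, n_pos=2, n_neg=2)
```

Comparing all pairs costs a number of queries quadratic in the number of samples; a practical system would presumably need a cheaper query schedule, which this sketch does not attempt.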
Results
The experiments conducted on one synthetic dataset and four real-world binary classification datasets show that the pairwise query method significantly improves the accuracy-cost tradeoff compared to methods that use raw confidence estimates. The results indicate that the proposed approach effectively reduces errors on non-rejected samples, validating the theoretical claims made by the authors.
Implications
The findings suggest that pairwise queries can enhance selective classification strategies, particularly in applications involving large language models where traditional confidence measures may fail. This approach could be beneficial in domains such as healthcare, finance, and any area where expert labeling is costly and selective classification is critical.
UR-JEPA: Uniform Rectifiability as a Regularizer for Joint-Embedding Predictive Architectures
Computer Vision
Theory
Optimization
- UR-JEPA introduces a new regularization method based on uniform rectifiability to prevent representation collapse in JEPAs.
- The method targets a uniformly n-rectifiable measure, contrasting with the isotropic Gaussian target of LeJEPA.
- Empirical results show UR-JEPA outperforms LeJEPA in terms of accuracy and lower seed variance across multiple datasets.
- The geometric properties of the embeddings produced by UR-JEPA are significantly different from those of LeJEPA, indicating a more structured representation.
Summary
This paper introduces UR-JEPA, a novel approach to Joint-Embedding Predictive Architectures (JEPAs) that addresses the issue of representation collapse during training. The existing method, LeJEPA, utilizes Sketched Isotropic Gaussian Regularization (SIGReg) to enforce an isotropic Gaussian target on embeddings, which conflicts with the manifold hypothesis that suggests embeddings should concentrate on a lower-dimensional subset of the ambient space. UR-JEPA proposes a new regularization strategy that targets a uniformly n-rectifiable measure of local tangent dimension at small scales, implemented through a Gaussian-kernel smoothed Carleson-type square function (LCGLT) and a complementary Jones β-number formulation. Empirical evaluations on datasets such as Inet10 and EuroSAT demonstrate that UR-JEPA achieves improved accuracy and reduced variance compared to LeJEPA, while maintaining distinct geometric properties in the embedding space. The findings indicate that UR-JEPA effectively captures the geometric structure of the data, leading to more robust representations.
Methodology
UR-JEPA employs a Gaussian-kernel smoothed Carleson-type square function and a Jones β-number formulation to regularize the embeddings, ensuring they adhere to a uniformly n-rectifiable measure. This approach is designed to maintain the embeddings' geometric structure while preventing collapse during training.
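For intuition about what a Jones β-number measures, the sketch below computes an empirical L2 version for the simplest case: points in the plane and a one-dimensional target (n = 1). It is a simplified illustration, not the UR-JEPA regularizer; the function name and the hard ball cutoff are assumptions, and the paper's Gaussian-kernel smoothing is not reproduced.

```python
import math

def beta2_line(points, center, radius):
    """Empirical L2 beta-number for n = 1 in the plane.

    RMS distance from the points inside the ball to their best-fit line,
    divided by the ball radius. Zero means the local points are collinear.
    """
    local = [p for p in points if math.dist(p, center) <= radius]
    if len(local) < 3:
        return 0.0
    m = len(local)
    mx = sum(p[0] for p in local) / m
    my = sum(p[1] for p in local) / m
    sxx = sum((p[0] - mx) ** 2 for p in local) / m
    syy = sum((p[1] - my) ** 2 for p in local) / m
    sxy = sum((p[0] - mx) * (p[1] - my) for p in local) / m
    # Smallest eigenvalue of the 2x2 covariance matrix equals the mean
    # squared distance to the least-squares line through the centroid.
    trace = sxx + syy
    det = sxx * syy - sxy ** 2
    lam_min = (trace - math.sqrt(max(trace ** 2 - 4.0 * det, 0.0))) / 2.0
    return math.sqrt(max(lam_min, 0.0)) / radius

# Points on a line: locally one-dimensional, so beta is zero.
on_line = [(float(t), 2.0 * t) for t in range(-3, 4)]
beta_line = beta2_line(on_line, center=(0.0, 0.0), radius=10.0)

# Points filling a square grid: no line fits well, so beta is clearly positive.
grid = [(float(x), float(y)) for x in range(-3, 4) for y in range(-3, 4)]
beta_grid = beta2_line(grid, center=(0.0, 0.0), radius=10.0)
```

A regularizer built on this quantity would penalize embeddings whose local neighborhoods are far from flat, in contrast to an isotropic Gaussian target that spreads mass across all directions.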
Results
On the Inet10 dataset, UR-JEPA achieved an accuracy of 0.9141 ± 0.0014, a +0.83 percentage point improvement over LeJEPA, with approximately 30% lower standard deviation across seeds. On other datasets such as Galaxy10 SDSS and EuroSAT, UR-JEPA maintained competitive accuracy while showing lower variance in results.
Implications
The findings suggest that UR-JEPA could be applied in various self-supervised learning scenarios where representation collapse is a concern. Its ability to produce structured embeddings may enhance performance in tasks requiring robust feature extraction, such as image classification and remote sensing.
Calibrated Preference Learning: The Case of Label Ranking
Theory
Reinforcement Learning
- Introduces calibration notions specifically for probabilistic label ranking, extending beyond multi-class classification.
- Establishes a theoretical framework showing the relationships between different calibration notions.
- Empirically evaluates the calibration properties of popular label ranking models, revealing significant calibration issues.
- Finds a strong correlation between calibration and benchmark accuracy in RLHF reward models.
Summary
This paper addresses the issue of calibration in probabilistic label ranking (ProLR), which has not been formally studied despite its importance for reliable decision-making. Calibration ensures that predicted probabilities align with true outcome frequencies, a concept well-explored in classification and regression but lacking in label ranking. The authors introduce a hierarchy of calibration notions that encompass full rankings, sub-rankings, and top-k rankings, proving that full-rank calibration implies the others but not vice versa. They empirically demonstrate that popular label ranking models often exhibit poor calibration, highlighting significant differences between sub-ranking and top-k metrics. The study also applies its calibration framework to reinforcement learning from human feedback (RLHF) reward models, revealing a strong correlation between calibration and benchmark accuracy, indicating that calibration captures a meaningful quality dimension beyond mere top-1 accuracy. These findings underscore the need for further research into the effects of miscalibration and the development of correction methods.
Methodology
The authors develop a hierarchy of calibration notions for ProLR and theoretically investigate the relationships between them. They conduct empirical evaluations of popular label ranking models, assessing their calibration properties and comparing sub-ranking and top-k metrics. The framework is then applied to RLHF reward models to analyze the correlation between calibration and accuracy.
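As a point of reference for the simplest case, the sketch below measures calibration of pairwise preference probabilities with a binned expected calibration error (ECE). It is not the paper's hierarchy of notions for full rankings: the Bradley-Terry link and the equal-width binning are assumptions chosen because they are common for reward models, and the probabilities and outcomes are toy values.

```python
import math

def preference_prob(reward_a, reward_b):
    """Bradley-Terry probability that response a is preferred over response b."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Binned ECE: weighted mean gap between predicted probability and observed frequency."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        freq = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / len(probs) * abs(mean_p - freq)
    return ece

# Calibrated toy case: predicts 0.8 and is right 4 times out of 5.
ece_good = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])

# Overconfident toy case: predicts 0.9 but is right only half the time.
ece_bad = expected_calibration_error([0.9] * 4, [1, 1, 0, 0])
```

Both toy models could share the same top-1 decisions while differing sharply in ECE, which is the sense in which calibration captures a quality dimension beyond accuracy.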
Results
The study finds that popular label ranking models are often poorly calibrated, with substantial differences observed between sub-ranking and top-k calibration metrics. The empirical analysis shows that calibration correlates strongly with benchmark accuracy, suggesting that it captures a significant quality dimension beyond just top-1 accuracy.
Implications
The findings suggest that improving calibration in label ranking models could enhance their reliability and effectiveness in applications such as reinforcement learning from human feedback. Understanding miscalibration's downstream effects could lead to better decision-making processes in various machine learning applications.
Benchmarking Machine Learning Uncertainty Quantification Methodologies for Predicting Turbine Gas Temperature Degradation
Time Series
- The paper benchmarks five uncertainty quantification (UQ) methodologies for turbine gas temperature (TGT) prediction in engine health management.
- A unified experimental framework is used for hyperparameter selection and performance evaluation.
- Distinct trade-offs in interval coverage, width, and stability are identified among the methods.
- The results provide practical guidance for selecting UQ methods in real-world applications.
Summary
This paper addresses the critical need for accurate turbine gas temperature (TGT) predictions and robust uncertainty quantification (UQ) methodologies in the context of engine health management (EHM). The authors benchmark five prominent UQ approaches—Delta method, Bayesian Monte Carlo Dropout, Bootstrap method, Lower–Upper Bound Estimation, and Mean–Variance Estimation—within a unified experimental framework. The study employs cross-validation for hyperparameter tuning and multiple metrics, including Coverage Probability, Normalized Mean Prediction Interval Width, and Coverage Width-based Criterion, to evaluate the performance of each method. The experiments utilize a representative dataset of turbine gas temperatures, revealing distinct trade-offs among the methods in terms of interval coverage, width, and stability. The findings serve as a practical guide for selecting and tuning prediction interval methods, enhancing the interpretability and precision of TGT predictions in real-world applications, particularly in aerospace operations where safety and reliability are paramount.
Methodology
The authors implemented five UQ methods within a unified framework that included cross-validation for hyperparameter selection and repeated train-test splits for robustness. They evaluated the methods using metrics such as Coverage Probability, Normalized Mean Prediction Interval Width, and Coverage Width-based Criterion to assess the reliability and sharpness of prediction intervals.
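The three interval metrics can be written down compactly. The sketch below uses the forms most commonly seen in the prediction-interval literature; the paper's exact variants, the nominal coverage level, and the penalty strength `eta` are assumptions here, and the data are toy values.

```python
import math

def coverage_probability(y, lower, upper):
    """Fraction of targets that fall inside their prediction interval."""
    return sum(lo <= v <= hi for v, lo, hi in zip(y, lower, upper)) / len(y)

def normalized_mean_width(y, lower, upper):
    """Mean interval width divided by the range of the targets."""
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / len(y)
    return mean_width / (max(y) - min(y))

def coverage_width_criterion(y, lower, upper, nominal=0.9, eta=50.0):
    """Width-based score with an exponential penalty when coverage is below nominal."""
    cov = coverage_probability(y, lower, upper)
    width = normalized_mean_width(y, lower, upper)
    gamma = 1.0 if cov < nominal else 0.0
    return width * (1.0 + gamma * math.exp(-eta * (cov - nominal)))

# Toy targets and intervals: the third interval misses its target.
y = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 1.5, 3.5, 3.5]
upper = [1.5, 2.5, 4.5, 4.5]

cov = coverage_probability(y, lower, upper)
width = normalized_mean_width(y, lower, upper)
cwc_strict = coverage_width_criterion(y, lower, upper, nominal=0.9)
cwc_lenient = coverage_width_criterion(y, lower, upper, nominal=0.5)
```

The criterion equals the normalized width whenever coverage meets the nominal level and grows sharply otherwise, which is why it exposes the coverage-versus-width trade-off the study reports.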
Results
The experiments demonstrated that each UQ method exhibited unique strengths and weaknesses regarding interval coverage, width, and stability. The findings highlighted the necessity of selecting appropriate UQ methodologies based on specific operational contexts and requirements.
Implications
The study's findings have significant implications for engine health management, particularly in aerospace, where accurate TGT predictions and uncertainty quantification are crucial for ensuring safety and reliability. The insights can guide practitioners in making informed maintenance decisions and risk assessments.
Rethinking the Role of Temperature in Large Language Model Distillation
NLP
Large Language Models
Theory
- Temperature plays a crucial role in the effectiveness of distillation objectives in LLMs.
- Forward KL (FKL) can outperform reverse KL (RKL) at higher temperatures, contradicting the common belief that RKL is inherently superior.
- Temperature enhances knowledge transfer by enriching non-dominant token signals in FKL.
- The impact of temperature extends beyond FKL to improve various KL-based distillation objectives.
Summary
This paper revisits the role of temperature in the distillation of large language models (LLMs), challenging the prevailing preference for reverse Kullback-Leibler (RKL) divergence over forward KL (FKL) divergence. The authors argue that previous comparisons have largely ignored the impact of temperature, which is crucial for softening teacher distributions and enhancing knowledge transfer. Through theoretical analysis, they demonstrate that temperature significantly alters the dynamics between FKL and RKL. Specifically, for RKL temperature primarily rescales the gradients, whereas FKL benefits more from temperature scaling, particularly at higher temperatures. This leads to the surprising conclusion that FKL can outperform RKL in LLM distillation when temperature is increased. Furthermore, the authors show that temperature positively influences a broader range of KL-based distillation objectives, enabling simpler methods to achieve competitive performance against state-of-the-art approaches. Overall, the findings suggest that temperature is a critical factor in the design and evaluation of distillation methods for LLMs.
Methodology
The authors conducted a theoretical analysis of the effects of temperature on FKL and RKL divergence in the context of knowledge distillation. They examined the gradients of both objectives under varying temperature settings and established the conditions under which FKL outperforms RKL. The analysis included empirical evaluations across instruction-following benchmarks to validate their theoretical findings.
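The two objectives and the softening effect of temperature can be sketched for a single token position. This is a generic illustration rather than the authors' training code: the logits are toy values, both teacher and student are assumed to be softened with the same temperature, and no loss rescaling is applied.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward_kl(teacher_logits, student_logits, temperature):
    """FKL: KL(teacher || student), weighted by the teacher distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return kl(p, q)

def reverse_kl(teacher_logits, student_logits, temperature):
    """RKL: KL(student || teacher), weighted by the student distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return kl(q, p)

teacher = [4.0, 1.0, 0.0, -1.0]  # toy logits over a 4-token vocabulary
student = [2.0, 2.0, 0.0, 0.0]

# Probability mass the teacher assigns to non-dominant tokens.
tail_mass_t1 = 1.0 - max(softmax(teacher, 1.0))
tail_mass_t4 = 1.0 - max(softmax(teacher, 4.0))

fkl_t4 = forward_kl(teacher, student, 4.0)
rkl_t4 = reverse_kl(teacher, student, 4.0)
```

Raising the temperature moves teacher mass from the top token onto the remaining tokens, which is the "non-dominant token signal" that FKL, being weighted by the teacher distribution, can then transmit to the student.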
Results
The analysis revealed that while RKL performs better at a temperature of 1, FKL consistently surpasses RKL at higher temperatures across various benchmarks. Additionally, the study demonstrated that temperature improves not only FKL but also other KL-based objectives, allowing simpler methods to achieve competitive performance against advanced LLM distillation techniques.
Implications
The findings suggest that researchers and practitioners should reconsider the role of temperature in LLM distillation, potentially leading to improved distillation strategies and better performance of smaller models trained on larger teacher models. This could have significant implications for model compression and efficiency in natural language processing tasks.
Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't
Theory
- Padded transformers are robust to changes in attention type, model width, and uniformity.
- Numeric precision and model depth are the main factors affecting expressivity.
- Polynomially padded L-uniform constant-precision transformers are equivalent to L-uniform AC0.
- Increasing width or precision beyond logarithmic levels does not enhance expressivity.
Summary
This paper investigates the expressivity of padded transformers, which utilize filler symbols in their input to enhance computational capabilities. The authors analyze how various architectural choices, such as attention type, model width, and uniformity, affect the expressivity of these transformers. They find that padded transformers exhibit surprising robustness to these changes, with numeric precision and model depth emerging as the primary factors influencing expressivity. The study establishes that polynomially padded L-uniform constant-precision transformers are equivalent to L-uniform AC0, while those with growing precision achieve L-uniform TC0, independent of width. Additionally, the paper reveals that increasing width or precision beyond logarithmic levels does not enhance expressivity. The authors also demonstrate that looping mechanisms allow for sequential processing akin to circuits, leading to significant expressivity results. Overall, the findings suggest that certain architectural choices may not significantly impact expressivity, simplifying theoretical analyses and potentially guiding practical implementations.
Methodology
The authors conducted a comprehensive analysis of padded transformers by examining various architectural configurations, including attention types (softmax and average hard attention), numeric precision, model width, and uniformity. They established theoretical equivalences to boolean circuit classes and explored the implications of these configurations on expressivity through mathematical proofs and comparisons.
Results
The study found that padded transformers maintain their expressivity across different configurations: constant-precision transformers are limited to AC0, while growing-precision transformers achieve TC0. Log-precision padded transformers are thus more expressive than constant-precision ones, and expressivity is not significantly affected by changes in attention type or model width once logarithmic precision is reached.
Implications
The findings have significant implications for both theoretical research and practical applications of transformers. They suggest that researchers can focus on simpler models for analysis without losing expressivity insights. Practitioners may also benefit from understanding which architectural choices are critical for performance, potentially leading to more efficient transformer designs in real-world applications.
When Data Is Scarce: Scaling Sparse Language Models with Repeated Training
NLP
Large Language Models
Efficient ML
- Introduction of a scaling law for sparse training under data constraints.
- Demonstration of delayed data saturation, making multi-epoch training more effective.
- Identification of resource trade-offs between loss-optimal and compute-optimal sparsity.
- Sparsity improves both data utilization and parameter efficiency.
Summary
This paper investigates the scaling behavior of sparse language models under data-constrained conditions, where the availability of unique training tokens is limited. The authors propose a new scaling law that models loss as a function of active parameters, unique tokens, data repetition, and sparsity. They conduct extensive experiments with models up to 1.92 billion parameters, achieving sparsity levels of up to 93.75% and utilizing unique data budgets of up to 2.6 billion tokens across 16 epochs. The findings reveal that sparse training enhances data utilization efficiency, postpones diminishing returns from repeated data, and provides insights into optimal sparsity levels for efficient pre-training. The study highlights that sparsity is not merely a tool for efficiency but also a mechanism to improve scaling trade-offs in scenarios with limited data, suggesting that the optimal sparsity level can vary based on the data scale and compute resources available.
Methodology
The authors conducted experiments varying model size, sparsity levels, and data budgets. They analyzed the scaling behavior of sparse training through a series of controlled experiments, fitting scaling laws to the performance metrics observed across different configurations. The study also involved extrapolating results to larger models to validate the findings.
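The headline configuration numbers can be unpacked with a little arithmetic. The sketch below only restates figures quoted in the summary; it assumes that "sparsity" means the fraction of parameters inactive per token and that the largest unique-token budget was repeated for the full 16 epochs, neither of which the summary states explicitly.

```python
# Figures quoted in the summary.
sparsity = 0.9375            # 93.75% sparsity
unique_tokens = 2.6e9        # largest unique-token budget
epochs = 16                  # number of passes over the unique data

# Assumed reading: the remaining fraction of parameters is active per token.
active_fraction = 1.0 - sparsity          # 1/16 of the parameters
inverse_active = 1.0 / active_fraction    # one active parameter in every 16

# Assumed reading: total tokens processed when the budget is repeated each epoch.
total_tokens_seen = unique_tokens * epochs
repetition_factor = total_tokens_seen / unique_tokens
```

Under these readings the model sees 16 times more tokens than it has unique data, which is the regime where the reported delay in data saturation matters.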
Results
The experiments showed that sparse training increases the effective data saturation scale, leading to slower saturation rates and improved performance. Sparse models demonstrated a significant reduction in the parameter-to-token ratio, achieving comparable accuracy to dense models while using substantially fewer training resources. The findings suggest that compute-optimal sparsity can facilitate large-scale training at a lower cost.
Implications
The insights from this research could inform the design of future language models, particularly in scenarios where data is scarce. By leveraging sparsity, researchers and practitioners can optimize model training processes, potentially leading to more efficient and effective language models in various applications.
Regularized Large Neighborhood Search
Optimization
Theory
Efficient ML
- Introduces Regularized Large Neighborhood Search (RLNS), bridging large neighborhood search (LNS) with Gibbs sampling for combinatorial optimization in neural networks.
- Proves that RLNS under entropic regularization performs exact block Gibbs sampling.
- Demonstrates the ability to interpolate between pseudolikelihood and exact maximum likelihood estimation.
- Evaluates RLNS on multiple NP-hard combinatorial problems, showing its practical applicability.
Summary
This paper introduces Regularized Large Neighborhood Search (RLNS), a novel approach that integrates large neighborhood search (LNS) heuristics with combinatorial optimization in neural networks. Traditional methods often rely on exact global solutions, which are computationally intractable for NP-hard problems. RLNS addresses this limitation by transforming LNS into an efficient Markov Chain Monte Carlo (MCMC) sampler over feasible solutions, utilizing entropic regularization. The authors demonstrate that RLNS performs exact block Gibbs sampling under this regularization and allows for interpolation between pseudolikelihood and exact maximum likelihood estimation. The approach is evaluated on various combinatorial problems, including k-subset selection, generalized assignment, and stochastic vehicle scheduling, showcasing its effectiveness in providing scalable solutions without the need for global solvers.
Methodology
The authors develop RLNS by incorporating local regularization into LNS subproblems, allowing it to function as a stochastic optimization method. They prove the connection between RLNS and Gibbs sampling, enabling the use of local oracles instead of global ones. The methodology includes perturbation-based regularization and iterative refinement of solutions through local optimization.
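The link between a neighborhood move and a block Gibbs update can be sketched on k-subset selection with a linear objective. This is an illustration of exact block Gibbs sampling under an entropic (softmax) regularization, not the authors' implementation: the scores, block size, temperature, and uniform block choice are all assumptions.

```python
import itertools
import math
import random

def block_gibbs_step(selected, scores, k, block_size, temperature, rng):
    """Resample one random block exactly, keeping the subset size equal to k.

    Target distribution: p(S) proportional to exp(sum of scores in S / temperature)
    over subsets of size k. Items outside the block stay fixed.
    """
    block = rng.sample(range(len(scores)), block_size)
    fixed = selected - set(block)
    need = k - len(fixed)  # how many items the block must contribute
    candidates = list(itertools.combinations(block, need))
    energies = [sum(scores[i] for i in c) / temperature for c in candidates]
    top = max(energies)  # subtract the max for numerical stability
    weights = [math.exp(e - top) for e in energies]
    choice = rng.choices(candidates, weights=weights, k=1)[0]
    return fixed | set(choice)

rng = random.Random(0)
scores = [3.0, 8.0, 1.0, 6.0, 2.0, 7.0, 5.0, 4.0]  # toy item scores
selected = {0, 2, 4}                                # poor starting subset

# A low temperature makes each block update nearly greedy, like classic LNS.
for _ in range(500):
    selected = block_gibbs_step(selected, scores, k=3, block_size=4,
                                temperature=0.01, rng=rng)
```

Each step only enumerates solutions inside a small block, so it needs a local oracle rather than a global solver; raising the temperature turns the same loop into a sampler that explores instead of converging.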
Results
The evaluation of RLNS on k-subset selection, generalized assignment, and stochastic vehicle scheduling problems demonstrates its capability to effectively approximate solutions to NP-hard problems. The results indicate that RLNS can achieve performance comparable to exact methods while maintaining scalability and efficiency.
Implications
The proposed RLNS framework has significant implications for operations research and machine learning, particularly in scenarios where exact solutions are infeasible. It opens avenues for integrating combinatorial optimization into neural networks, enhancing their ability to handle structured prediction tasks with complex constraints.
A Unifying View of Variational Generative Wasserstein Flows
Generative Models
Optimization
Theory
- Introduction of Generative Wasserstein Flows (GWF) as a unified framework for generative modeling.
- Derivation of various generative methods as instances of parametric JKO schemes for f-divergences.
- Extension of the JKO framework to Integral Probability Metrics and squared Maximum Mean Discrepancy.
- Empirical analysis of JKO regularization effects on generative model training.
Summary
This paper presents a unified theoretical framework for generative modeling based on Wasserstein gradient flows, termed Generative Wasserstein Flows (GWF). The authors demonstrate that a wide class of existing generative methods can be derived from parametric Jordan–Kinderlehrer–Otto (JKO) schemes for f-divergence objectives. They establish equivalences between various recently proposed algorithms and extend the framework to Integral Probability Metrics and squared Maximum Mean Discrepancy, leading to new JKO-based generative algorithms. The paper also empirically studies the impact of JKO regularization across a range of objectives and analyzes parametric Wasserstein flows, where the dynamics are constrained to distributions induced by parameterized maps. This work aims to clarify the connections between different generative modeling approaches, including GANs, and provides a comprehensive understanding of their underlying geometric structures.
Methodology
The authors utilize Wasserstein gradient flows and the Jordan–Kinderlehrer–Otto (JKO) scheme to derive generative algorithms from f-divergence minimization. They extend their framework to include Integral Probability Metrics and squared Maximum Mean Discrepancy, and conduct empirical studies to assess the impact of JKO regularization on generative models.
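For reference, the standard JKO step that the framework builds on can be written as follows. The first line is the classical scheme; the second is one plausible reading of the "parametric" variant described in the summary, with the base distribution \mu, the map T_\theta, the step size \tau, and the target \pi introduced here as assumed notation.

```latex
% Classical JKO step for a functional F (here an f-divergence to a target \pi):
\rho_{k+1} \in \operatorname*{arg\,min}_{\rho \in \mathcal{P}_2(\mathbb{R}^d)}
  \; \mathcal{F}(\rho) + \frac{1}{2\tau} W_2^2(\rho, \rho_k),
\qquad \mathcal{F}(\rho) = D_f(\rho \,\|\, \pi).

% Assumed parametric form: restrict \rho to pushforwards of \mu by a map T_\theta.
\theta_{k+1} \in \operatorname*{arg\,min}_{\theta}
  \; \mathcal{F}\big((T_\theta)_{\#}\mu\big)
  + \frac{1}{2\tau} W_2^2\big((T_\theta)_{\#}\mu,\, (T_{\theta_k})_{\#}\mu\big).
```

The quadratic Wasserstein term acts as a proximal regularizer: a small \tau keeps each update close to the current distribution, which is the "JKO regularization" whose effect the authors study empirically.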
Results
The paper establishes theoretical equivalences between various generative modeling methods and demonstrates that the proposed GWF framework can encompass and clarify these connections. Empirical results indicate that JKO regularization positively influences the training of generative models across multiple objectives.
Implications
This work provides a comprehensive understanding of generative modeling techniques, potentially guiding the design of more effective generative algorithms. The unification of various methods under the GWF framework may lead to improved performance and faster sampling in generative tasks.
Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting
Graph Learning
Time Series
- GC-MoE introduces a dual-pathway router that combines static topology features with dynamic input representations for expert selection.
- The framework leverages a diverse set of frozen pretrained experts, so only about 17,000 parameters are trained on top of 1.5 million frozen expert weights.
- An optional output refinement layer can enhance performance at minimal additional parameter cost.
- The study includes an ablation analysis to evaluate the effectiveness of lightweight extensions and their interaction with routing mechanisms.
Summary
The paper presents GC-MoE, a novel framework for spatio-temporal forecasting on sensor graphs, particularly for traffic prediction. Traditional approaches often apply a single backbone architecture uniformly across all nodes, which may not capture the distinct dynamics exhibited by different road segments. GC-MoE addresses this limitation with a graph-conditioned mixture-of-experts strategy, where each node is assigned a personalized combination of frozen forecasting experts based on the graph's topology and recent traffic inputs. The framework integrates multiple pretrained spatio-temporal graph neural network (GNN) experts with a lightweight, input-aware routing mechanism that adapts to current traffic conditions. The authors also explore an optional output refinement layer and node-adaptive ST-LoRA adapters for further performance enhancement. Experimental results across four standard benchmarks show that GC-MoE significantly improves mean absolute error (MAE) over a zero-parameter ensemble baseline while remaining competitive in root mean square error (RMSE) and mean absolute percentage error (MAPE). It does so while training only about 17,000 parameters on top of 1.5 million frozen expert weights.
Methodology
GC-MoE employs a modular framework that pretrains multiple diverse spatio-temporal GNN experts, freezes them, and learns a routing mechanism that assigns expert weights based on both static and dynamic inputs. The routing mechanism is designed to adapt to current traffic conditions, enhancing the model's ability to handle varying dynamics across different nodes.
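The routing idea described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the linear router, the names `route_node` and `mix_experts`, and all toy numbers are assumptions made for the example. Each node gets softmax weights computed from a static (topology) feature vector plus a dynamic (recent traffic) feature vector, and the frozen experts' forecasts are mixed with those weights.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_node(static_feat, dynamic_feat, w_static, w_dynamic):
    """Return one mixing weight per expert for a single node.

    w_static[e] and w_dynamic[e] are hypothetical router weight vectors
    for expert e (the only trainable parts in this sketch).
    """
    logits = []
    for ws, wd in zip(w_static, w_dynamic):
        s = sum(a * b for a, b in zip(ws, static_feat))   # static pathway
        d = sum(a * b for a, b in zip(wd, dynamic_feat))  # dynamic pathway
        logits.append(s + d)
    return softmax(logits)


def mix_experts(weights, expert_preds):
    """Weighted sum of frozen expert forecasts (one list per expert)."""
    horizon = len(expert_preds[0])
    return [sum(w * p[t] for w, p in zip(weights, expert_preds))
            for t in range(horizon)]


# Toy example: 3 frozen experts, 2 forecast steps, made-up numbers.
static_feat = [1.0, 0.0]
dynamic_feat = [0.5]
w_static = [[0.2, 0.1], [-0.3, 0.4], [0.0, 0.0]]
w_dynamic = [[1.0], [0.0], [-1.0]]
expert_preds = [[60.0, 58.0], [55.0, 54.0], [50.0, 52.0]]

weights = route_node(static_feat, dynamic_feat, w_static, w_dynamic)
forecast = mix_experts(weights, expert_preds)
print(weights, forecast)
```

Because the experts stay frozen, only the router weights would receive gradients in a setup like this, which is consistent with the small trainable-parameter count reported in the summary.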
Results
The experimental evaluation on four benchmarks (PEMS04, PEMS07, METR-LA, and PEMS-BAY) shows that GC-MoE outperforms a zero-parameter ensemble baseline in terms of MAE, while also achieving competitive RMSE and MAPE scores. The model effectively utilizes only about 17,000 trainable parameters in conjunction with 1.5 million frozen expert weights.
Implications
The findings suggest that personalized expert selection based on graph topology and traffic conditions can significantly improve traffic forecasting accuracy. This approach could be applied to other domains where heterogeneous dynamics exist, enhancing predictive performance in urban analytics and beyond.
Minimax-Optimal Policy Regret in Partially Observable Markov Games
Reinforcement Learning
Theory
Efficient ML
- Introduces a unified framework for learning in partially observable Markov games against adaptive adversaries.
- Proves that an epoch-based optimistic maximum-likelihood algorithm achieves Õ(√T) policy regret.
- Establishes a matching lower bound for policy regret, confirming the optimality of the upper bound.
- Extends the framework to handle horizon-adaptive guarantees and adversaries with fading memory.
Summary
This paper addresses the challenge of sequential decision-making in partially observable environments where the learner faces strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The primary focus is on learning latent dynamics from partial observations while contending with adversaries whose behavior is influenced by the learner's strategy. The author introduces an epoch-based optimistic maximum-likelihood algorithm that achieves a policy regret of Õ(√T) for fixed problem parameters, with dependencies on horizon, adversary memory, confidence radius, and the aggregate Eluder dimension of the observable-operator class. The algorithm operates by selecting a policy per geometrically growing epoch and utilizes confidence sets built from past data, allowing for efficient comparison of adversary responses. Furthermore, a matching lower bound is established, confirming the optimality of the proposed upper bound. The framework is also extended to accommodate horizon-adaptive guarantees and adversaries with geometrically fading memory, providing a comprehensive theoretical treatment of policy regret minimization in POMGs.
Methodology
The paper employs an epoch-based optimistic maximum-likelihood estimation approach to minimize policy regret in POMGs. The algorithm defines geometrically growing epochs, computes confidence sets from past data, and executes a single optimistic policy throughout each epoch. Because the epochs grow geometrically, the policy changes only O(log T) times over T rounds, which keeps the cost of comparing adversary responses across policies logarithmic.
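The epoch schedule can be illustrated with a short sketch. The doubling factor, the function name `geometric_epochs`, and the starting length are assumptions chosen for the example; the summary only states that epochs grow geometrically. The point is that the number of epochs, and hence the number of policy changes, grows like log T.

```python
def geometric_epochs(T, first_len=1, growth=2):
    """Split rounds 0..T-1 into consecutive epochs of geometrically growing length.

    Returns a list of (start, end) pairs with end exclusive. In the scheme
    described above, one optimistic policy would be chosen at each `start`
    (from a confidence set built on data before `start`) and played until `end`.
    """
    epochs = []
    start, length = 0, first_len
    while start < T:
        end = min(start + length, T)  # the last epoch is truncated at T
        epochs.append((start, end))
        start = end
        length *= growth
    return epochs


epochs = geometric_epochs(1000)
print(len(epochs), epochs[:4], epochs[-1])
```

With doubling epochs, 1,000 rounds need 10 epochs and 1,000,000 rounds need only 20.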
Results
The proposed algorithm achieves a policy regret bound of Õ(√T) with explicit dependencies on the horizon, adversary memory, confidence radius, and aggregate Eluder dimension. A matching lower bound shows that any algorithm must incur at least Ω(√(d_E T)) policy regret, confirming the optimality of the proposed approach. The framework is adaptable to unknown time horizons and can handle adversaries with geometrically fading memory.
Implications
The findings have significant implications for applications in multi-agent systems, such as autonomous vehicles, algorithmic trading, and cybersecurity, where decision-making occurs under uncertainty and adversarial conditions. The results provide a theoretical foundation for developing efficient learning algorithms in strategic environments.
Policy and World Modeling Co-Training for Language Agents
NLP
Large Language Models
Reinforcement Learning
- Identifies next observations in on-policy rollouts as a valuable source of action-conditioned world modeling supervision.
- Introduces PaW, the first framework for joint policy optimization and world modeling supervision during RL training.
- Adds action-entropy-based WM data selection, a noise-tolerant WM loss, and reward-adaptive loss balancing to keep WM supervision informative and stable.
- Demonstrates consistent performance improvements across multiple agentic tasks and RL algorithms.
Summary
This paper introduces a novel framework called PaW (Policy and World modeling co-training) aimed at enhancing the performance of language agents through reinforcement learning (RL) by integrating world modeling (WM) directly into the RL training process. Traditional RL methods optimize actions based solely on reward maximization, often neglecting the consequences of those actions, which can lead to brittle agent behavior. The authors propose that on-policy RL rollouts inherently provide valuable world modeling supervision, as each transition pairs an action with its resulting next observation. By leveraging this observation, PaW allows for simultaneous policy optimization and world modeling without requiring additional simulators or training stages. The framework incorporates three innovative components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing to ensure informative and stable WM supervision. The authors validate their approach through experiments on various agentic tasks, demonstrating consistent improvements over strong RL baselines, thereby suggesting that standard RL rollouts can effectively serve as a source of WM supervision for training language agents.
Methodology
The PaW framework operates by reusing on-policy RL rollouts to append next-observation tokens and applying an auxiliary next-token-prediction loss. This approach allows for the joint training of policy and world modeling within the same model, maintaining the standard inference paradigm. The methodology includes action-entropy-based WM data selection to prioritize informative transitions, a clipped mean absolute error (MAE) loss to mitigate the impact of noisy observations, and a reward-adaptive loss balancing mechanism to ensure effective learning.
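To see why an MAE-style loss is more tolerant of noisy targets than cross-entropy, consider the per-token comparison below. This is a generic illustration, not PaW's loss: the summary does not say how the clipping is applied, so no clipping is shown, and the function names are made up for the example. For a one-hot target, the L1 distance between the target and the predicted distribution equals 2 * (1 - p_target), which is bounded, whereas cross-entropy grows without bound as p_target approaches zero.

```python
import math


def cross_entropy(p_target):
    """Standard next-token loss: unbounded as p_target -> 0."""
    return -math.log(p_target)


def mae_loss(p_target):
    """L1 distance between a one-hot target and the predicted distribution.

    |1 - p_target| plus the mass on all other tokens (1 - p_target)
    gives 2 * (1 - p_target), which never exceeds 2.
    """
    return 2.0 * (1.0 - p_target)


# A confident hit, a miss, and a badly mispredicted (possibly noisy) token.
for p in (0.9, 0.1, 0.001):
    print(f"p={p}: CE={cross_entropy(p):.3f}  MAE={mae_loss(p):.3f}")
```

A single unpredictable observation token can therefore contribute at most a constant to the MAE-style loss, while it can dominate a cross-entropy batch average.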
Results
Experiments conducted on agentic task benchmarks (ALFWorld and WebShop) show that the PaW framework consistently outperforms strong RL baselines across various models and RL algorithms. The improvements are achieved with negligible additional training overhead, indicating the effectiveness of the proposed co-training approach.
Implications
The findings suggest that integrating world modeling into the RL training process can significantly enhance the robustness and performance of language agents. This approach could lead to more reliable and capable agents in interactive decision-making tasks, with potential applications in areas such as conversational AI, automated customer service, and interactive gaming.
Quantifying the Energy Floor: Direct Measurement and Replay Buffer Bias in SAC-Based HVAC Control on sbsim
Reinforcement Learning
Optimization
- The energy floor for SAC-based HVAC control is measured at $35.51/day, dominated by electrical loads.
- Replay buffer initialization is identified as the main cause of sub-optimal performance, accounting for about 96% of the cost gap to the energy floor.
- Expanding the supply water temperature range has negligible impact on cost savings and may violate physical constraints.
- A discount factor coupling reduces the effective planning horizon, highlighting a potential issue in benchmark configurations.
Summary
This paper investigates the energy floor in HVAC control using the Soft Actor-Critic (SAC) algorithm on the sbsim calibrated building simulator. The authors directly measure the energy floor, defined as the minimum achievable cost under action space constraints, which is found to be $35.51/day, primarily driven by continuous electrical loads. The standard SAC baseline, initialized with transitions from a schedule-policy replay buffer, achieves a cost of $37.18/day, indicating a 4.7% increase above the energy floor. The study identifies replay buffer initialization as the main source of sub-optimality, with training from an empty buffer reducing costs to $35.57/day and eliminating 96% of the gap. The paper also explores the effects of expanding the supply water temperature range, which yields minimal additional savings and can lead to physical constraint violations. Furthermore, a discount factor coupling is documented, which reduces the effective planning horizon significantly. Systematic ablation studies confirm that action space constraints are the primary bottleneck rather than algorithmic design choices.
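The two percentages quoted above follow directly from the three reported daily costs. The short check below uses only those numbers; the variable names are chosen for the example.

```python
floor = 35.51         # measured energy floor, $/day
baseline = 37.18      # SAC initialized from the schedule-policy replay buffer
empty_buffer = 35.57  # SAC trained from an empty buffer

gap = baseline - floor                               # cost above the floor
pct_above_floor = 100.0 * gap / floor                # baseline vs. floor
gap_closed = 100.0 * (baseline - empty_buffer) / gap  # share of gap removed

print(f"gap: ${gap:.2f}/day")
print(f"baseline is {pct_above_floor:.1f}% above the floor")
print(f"empty-buffer training closes {gap_closed:.0f}% of the gap")
```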
Methodology
The authors employed minimum-action experiments to directly measure the energy floor and conducted systematic ablation studies to analyze the impact of replay buffer initialization, planning horizon, and action space constraints on HVAC control performance using the SAC algorithm.
Results
The direct measurement of the energy floor revealed a cost of $35.51/day, while the SAC baseline achieved $37.18/day. Training from an empty buffer reduced the cost to $35.57/day, closing 96% of the gap and demonstrating the impact of replay buffer bias. The study also found that expanding the temperature range provided minimal savings and that action space constraints are the primary limiting factor.
Implications
The findings suggest that optimizing replay buffer initialization can lead to significant cost reductions in HVAC control systems. Additionally, the study highlights the importance of understanding action space constraints in energy optimization for smart buildings, which could inform future research and practical applications in building management systems.
A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation
Computer Vision
- Current open-set TTA methods inadequately balance InD and OOD accuracy.
- A new baseline method using sigmoid outputs improves the trade-off between InD recognition and OOD rejection.
- The proportion of OOD data in batches significantly affects the performance of TTA methods.
- Existing evaluations of TTA methods often overlook OOD performance metrics.
Summary
This paper investigates the performance of open-set test-time adaptation (TTA) methods, focusing on the balance between in-distribution (InD) accuracy and out-of-distribution (OOD) accuracy. The authors benchmark several TTA methods, including SAR, OSTTA, UniEnt, and SoTTA, using CIFAR-10-C and ImageNet-C datasets, assessing their ability to accurately detect OOD classes while maintaining InD accuracy. The study reveals that existing methods struggle to effectively filter OOD data, leading to a high rate of false positives. A new baseline method is proposed, which replaces the traditional softmax output with a sigmoid/multi-label output to better manage the trade-off between InD and OOD recognition. The authors also explore the impact of varying OOD proportions in batches and the influence of normalization layers on stability during test-time updates. Overall, the paper provides a comprehensive empirical analysis of open-set TTA, highlighting the need for improved methods that can adapt to real-world scenarios where OOD data is prevalent.
Methodology
The authors benchmarked various TTA methods on CIFAR-10-C and ImageNet-C datasets, utilizing OOD data from SVHN, CIFAR-100, ImageNet-O, and Textures. They evaluated accuracy and confidence metrics for InD and OOD recognition, and proposed a new baseline that modifies the output layer to better handle the InD/OOD trade-off. The impact of varying OOD proportions and normalization layers on TTA stability was also examined.
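The motivation for swapping softmax for sigmoid outputs can be seen in a toy example. This is an illustrative sketch under assumed logits and an assumed 0.5 threshold, not the paper's baseline. Softmax scores must sum to one, so some class always looks plausible even when every logit is low; independent sigmoid scores can all be low at once, which makes thresholded rejection possible.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]


def predict_or_reject(scores, threshold=0.5):
    """Return the top class index, or None to reject the input as OOD."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] >= threshold else None


ood_logits = [-2.0, -4.0, -4.0]  # no class fits this input well
ind_logits = [4.0, -3.0, -2.5]   # class 0 clearly fits

print(predict_or_reject(softmax(ood_logits)))  # softmax still picks a class
print(predict_or_reject(sigmoid(ood_logits)))  # sigmoid scores all fall below 0.5
print(predict_or_reject(sigmoid(ind_logits)))
```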
Results
The analysis demonstrated that existing TTA methods struggle to effectively filter OOD data, resulting in high false positive rates. The proposed baseline method showed promise in improving the balance between InD and OOD accuracy. The study also highlighted the significant influence of OOD proportions on the performance of TTA methods, revealing that current assumptions about batch composition are often unrealistic.
Implications
The findings suggest that improving OOD detection is critical for deploying machine learning models in real-world applications, particularly in fields like robotics and online moderation, where misclassifying OOD data can have serious consequences. The proposed methods and benchmarks can guide future research in developing more robust TTA techniques.
Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies
Reinforcement Learning
Generative Models
Robotics
- LP-DS is a lightweight adaptation framework that enhances frozen generative policies without the need for full decoder fine-tuning.
- The method employs a Lagrangian trust-region objective to dynamically constrain perturbation magnitudes, balancing reward maximization and preservation of the pretrained latent prior.
- LP-DS effectively mitigates mode collapse, preserving action-space diversity while achieving strong performance across various benchmarks.
- The framework is validated beyond compact diffusion policies, demonstrating effectiveness in diverse robotics applications and physical robot experiments.
Summary
This paper introduces Lagrangian Perturbation Diffusion Steering (LP-DS), a novel method aimed at enhancing frozen generative policies through a lightweight adaptation framework. Traditional behavior cloning with high-capacity generative policies often suffers from issues such as limited demonstration coverage and distribution shift. While direct reinforcement learning (RL) fine-tuning can enhance performance, it tends to be unstable and sample inefficient, especially when dealing with large action decoders. LP-DS addresses these challenges by learning a compact noise-space perturbation that optimizes a Lagrangian trust-region objective, allowing for improved downstream value while constraining deviations from the latent prior. The method demonstrates significant improvements in sample efficiency, success rates, and returns across various benchmarks, including RoboMimic manipulation, OpenAI Gym locomotion, and Adroit dexterous manipulation. Additionally, LP-DS maintains higher action-space entropy compared to unconstrained noise-space steering, thus preserving behavioral diversity. The authors validate LP-DS across different backbone classes, including flow-matching models and physical robot deployments, showcasing its versatility beyond compact diffusion policies.
Methodology
LP-DS improves frozen generative policies by learning a state-conditioned residual in latent noise space. It shifts Gaussian noise inputs and optimizes the perturbation using a Lagrangian trust-region objective, which constrains the deviation from the latent prior while allowing for task-directed policy improvement.
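A Lagrangian trust-region update of this kind can be sketched on a scalar toy problem. Everything here is an assumption for illustration: the quadratic value function, the squared-norm constraint, the step sizes, and the name `steer` are not taken from the paper. The perturbation `delta` ascends the value while a multiplier `lam` rises whenever the perturbation leaves the trust region.

```python
def steer(value_grad, eps=1.0, lr=0.01, lr_dual=0.01, steps=5000):
    """Primal-dual ascent for: maximize V(delta) subject to delta**2 <= eps.

    value_grad(delta) returns dV/d(delta). The Lagrangian is
    V(delta) - lam * (delta**2 - eps) with multiplier lam >= 0.
    """
    delta, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: ascend the Lagrangian in the perturbation.
        delta += lr * (value_grad(delta) - 2.0 * lam * delta)
        # Dual step: raise lam when the constraint is violated, keep lam >= 0.
        lam = max(0.0, lam + lr_dual * (delta * delta - eps))
    return delta, lam


# Toy value V(delta) = -(delta - 3)^2: the unconstrained optimum (delta = 3)
# lies outside the trust region delta^2 <= 1.
delta, lam = steer(lambda d: -2.0 * (d - 3.0))
print(delta, lam)
```

In this toy case the perturbation settles on the boundary of the trust region (delta near 1) with a positive multiplier (lam near 2), rather than drifting to the unconstrained optimum. That is the kind of dynamic constraint on perturbation magnitude the summary describes.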
Results
LP-DS shows consistent improvements in sample efficiency, success rates, and returns, achieving up to 25% improvement over prior baselines. It maintains higher action-space entropy, indicating better preservation of behavioral diversity compared to other noise-space steering methods.
Implications
The findings suggest that LP-DS can be effectively applied in various robotics and continuous control tasks, potentially leading to more robust and efficient generative policies in real-world applications.
Bounded Behavioral Indistinguishability for Black-Box LLM Distillation
Large Language Models
NLP
Theory
- Introduction of bounded behavioral indistinguishability for black-box LLM distillation.
- Development of an empirical evaluation methodology that combines learned discriminators, semantic similarity metrics, category-wise probes, policy-level measurements, and pairwise teacher-identification judges.
- Demonstration that LoRA distillation improves semantic similarity but does not fully eliminate distinguishability.
- Identification of residual behavioral artifacts in style, format, and domain-specific prompts.
Summary
This paper introduces the concept of bounded behavioral indistinguishability for black-box LLM distillation, emphasizing that mere output similarity between a teacher model and a student model does not guarantee behavioral indistinguishability. The author formalizes this concept as (ϵ, q, t, A)-behavioral indistinguishability, where ϵ represents the distinguishing advantage, q the oracle query budget, t the computational budget, and A the adversary class. The methodology involves evaluating teacher-student pairs from Qwen and Llama using a controlled behavioral probe suite of 5,000 prompts. The study finds that while LoRA distillation improves semantic similarity, it does not eliminate behavioral differences, as evidenced by adversarial evaluations. The results indicate that learned discriminators still retain some distinguishing advantage, particularly in areas such as style, format, and domain-specific prompts. The paper concludes that while semantic fidelity is important, it is insufficient for ensuring indistinguishability in black-box LLM distillation, necessitating a more comprehensive evaluation approach that includes adversarial and category-aware assessments.
Methodology
The methodology involves formalizing bounded behavioral indistinguishability and employing a suite of 5,000 controlled prompts to evaluate teacher-student pairs. The evaluation combines learned discriminators, semantic similarity metrics, category-wise probes, policy-level measurements, and pairwise teacher-identification judges to assess behavioral indistinguishability.
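One common way to estimate a distinguishing advantage from a discriminator's guesses is shown below. This follows the usual cryptographic-style convention and is an assumption about the estimator; the summary does not give the paper's exact formula, and the query budget q and compute budget t are not modeled. The counts are toy values, not results from the paper.

```python
def distinguishing_advantage(guesses_on_teacher, guesses_on_student):
    """|Pr[D = 1 | teacher output] - Pr[D = 1 | student output]|.

    Each argument is a list of 0/1 discriminator outputs, where 1 means
    "I think this response came from the teacher".
    """
    p_teacher = sum(guesses_on_teacher) / len(guesses_on_teacher)
    p_student = sum(guesses_on_student) / len(guesses_on_student)
    return abs(p_teacher - p_student)


def is_indistinguishable(advantage, eps):
    """Check the advantage against a tolerance eps for one fixed adversary."""
    return advantage <= eps


# Toy counts: the discriminator says "teacher" on 58/100 teacher responses
# and on 42/100 student responses.
on_teacher = [1] * 58 + [0] * 42
on_student = [1] * 42 + [0] * 58
adv = distinguishing_advantage(on_teacher, on_student)
print(round(adv, 2), is_indistinguishable(adv, eps=0.1))
```

An advantage of zero corresponds to a discriminator that does no better than chance. A bound of this kind only holds for the adversary class and budgets that were actually tested.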
Results
LoRA distillation increased semantic similarity scores for Qwen from 0.788 to 0.862 and for Llama from 0.814 to 0.874. However, adversarial evaluations revealed that learned discriminators still maintained a non-zero advantage, indicating residual behavioral differences. The distinguishing advantage for Qwen dropped from 0.158 for the base student to 0.081 after LoRA distillation, showing improved indistinguishability but not complete elimination.
Implications
The findings suggest that while distillation techniques can enhance the performance of smaller models, they must be evaluated through a lens that considers behavioral indistinguishability to ensure that critical behavioral characteristics are preserved. This has implications for the deployment of LLMs in sensitive applications where behavioral fidelity is crucial.
Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards
Reinforcement Learning
Robotics
Generative Models
- Introduces coherent soft imitation learning (CSIL) as a method for fine-tuning large behavior models using learned dense rewards.
- Demonstrates that CSIL outperforms traditional RL approaches in terms of sample efficiency and performance retention.
- Achieves a success rate of ≥90% on five out of six complex manipulation tasks, showcasing the effectiveness of the proposed method.
- Avoids the initial performance drop typical of RL finetuning by ensuring the pretrained policy is already optimal for the learned reward and critic.
Summary
This paper addresses the challenge of improving large behavior models (LBMs) for robotic control through reinforcement learning (RL) while avoiding the common pitfalls of sample inefficiency and performance degradation. The authors propose using inverse reinforcement learning (IRL), specifically coherent soft imitation learning (CSIL), to fine-tune LBMs with a learned dense reward function derived from expert demonstrations. This approach aims to mitigate the issues associated with sparse reward tasks in traditional RL. The paper demonstrates that CSIL can maintain or enhance the performance of pretrained policies across various manipulation tasks, achieving a success rate of at least 90% on five out of six complex tasks. The method circumvents the initial performance drop typically seen in RL finetuning by ensuring that the initial pretrained policy is optimal for the learned reward and critic, thus enabling faster improvements in policy performance.
Methodology
The authors use coherent soft imitation learning (CSIL), an entropy-regularized inverse reinforcement learning algorithm, to fine-tune large behavior models. This method employs a specially designed reward structure that allows for effective policy improvement without unlearning previously acquired behaviors. The implementation incorporates recent advancements in deep actor-critic methods, including batch normalization and weight normalization, to enhance performance.
Results
The experimental results indicate that the CSIL method maintains or improves the performance of the pretrained policy (pi-0.5) across six sparse manipulation tasks. Notably, it achieves a success rate of at least 90% on five out of six complex tasks, significantly outperforming RL-based baselines that rely on sparse rewards.
Implications
The findings suggest that using learned dense rewards through inverse reinforcement learning can lead to more efficient and effective policy improvements in robotic control tasks. This approach may have broader applications in areas requiring robust adaptation to novel states, such as autonomous robotics and complex manipulation scenarios.
TASER: Task-Aware Stein Regularisation for Geometry-Driven Robustness
Theory
- TASER introduces a geometry-aware regularisation framework that penalises model sensitivity based on the data distribution.
- The method provides a principled alternative to isotropic gradient regularisation by aligning sensitivity with the structure of the data.
- Theoretical insights link Stein residual minimisation to reduced sensitivity under distributional perturbations.
- TASER enhances adversarial robustness by controlling sensitivity in directions that diverge from high-density regions.
Summary
The paper introduces TASER (Task-Aware Stein Regularisation), a novel training-time regularisation framework aimed at enhancing the robustness of deep neural networks against distribution shifts and adversarial perturbations. Traditional regularisation methods often treat input sensitivity uniformly, which can lead to vulnerabilities in directions that deviate from high-density regions of the data distribution. TASER addresses this by penalising pointwise Stein residuals, which are derived from Langevin Stein operators, thereby promoting a geometry-aware smoothness that aligns model sensitivity with the underlying data structure. The authors establish a theoretical connection between Stein regularisation and reduced first-order sensitivity to distributional shifts, demonstrating that TASER can effectively suppress sensitivity in directions that lead away from high-density areas. The method is scalable, architecture-agnostic, and can be integrated with existing training frameworks, including adversarial training. Experimental results on CIFAR-10 show that TASER significantly improves adversarial robustness without causing a statistically significant drop in clean accuracy.
Methodology
TASER employs pointwise Stein residuals derived from Langevin Stein operators to impose geometry-aware constraints on model sensitivity. The total loss function combines the task-specific loss with a regularisation term that penalises the Stein residuals, effectively shaping the sensitivity of the model according to the data distribution. The method requires access to input gradients and an estimate of the score field, which can be obtained from modern score-matching techniques.
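For reference, the textbook Langevin Stein operator and a generic form of the regularised objective are written out below. The operator and Stein's identity are standard. The second line is only a schematic reading of the description above: the test function g_\theta built from the model, the penalty \phi, and the weight \lambda are placeholders assumed here, since the summary does not specify them.

```latex
\begin{align}
(\mathcal{A}_p g)(x) &= \big\langle \nabla_x \log p(x),\, g(x) \big\rangle + \nabla_x \cdot g(x),
  \qquad \mathbb{E}_{x \sim p}\big[(\mathcal{A}_p g)(x)\big] = 0, \\
\mathcal{L}_{\text{total}}(\theta) &= \mathcal{L}_{\text{task}}(\theta)
  + \lambda \, \mathbb{E}_{x}\Big[\phi\big((\mathcal{A}_p g_\theta)(x)\big)\Big].
\end{align}
```

The score \nabla_x \log p(x) is the quantity that would be supplied by a score-matching estimate, and the divergence term is where input gradients of the model enter.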
Results
In experiments conducted on CIFAR-10, TASER consistently outperformed established training methods in terms of adversarial robustness, while maintaining comparable clean accuracy. The results indicate that the geometry-aware regularisation effectively reduces sensitivity to adversarial perturbations without compromising the model's performance on clean data.
Implications
The introduction of TASER has significant implications for the development of more robust machine learning models, particularly in applications where adversarial attacks and distribution shifts are prevalent. By integrating geometry-aware regularisation into training pipelines, practitioners can enhance model stability and reliability in real-world scenarios.
Fixed Universal Transformers
Theory
- Introduces the notion of universal transformers that can simulate any transformer in a class via input embeddings.
- Provides explicit constructions of sparse universal transformers and shows that randomly initialized transformers are almost surely universal.
- Establishes lower bounds on the embedding dimensions required for universality, particularly for transformers with multiple heads.
- Empirical evaluations on parenthesis balancing and multi-hop reasoning support the theoretical claims.
Summary
This paper introduces the concept of universal transformers, which are fixed transformers capable of simulating any transformer within a specific class through appropriate input embeddings. The authors draw an analogy to universal Turing machines, where the input embedding serves as a program that encodes the parameters of the target transformer while keeping the internal parameters of the universal transformer fixed. The paper presents explicit sparse constructions that achieve universality when the embedding dimension is sufficiently large and demonstrates that randomly initialized transformers are almost surely universal. Empirical validation is conducted on tasks such as parenthesis balancing and multi-hop reasoning, suggesting that a significant portion of a transformer's expressive power may derive from its input representation rather than its learned weights.
Methodology
The authors formalize the concept of universal transformers and provide explicit constructions with fixed parameters. They analyze the conditions under which these transformers can simulate target transformers and establish theoretical lower bounds. Empirical evaluations are conducted on specific tasks to validate the theoretical claims.
Results
The paper shows that a fixed universal transformer can simulate any target transformer with appropriate embeddings, achieving universality under certain conditions. The empirical results indicate high accuracy in tasks like parenthesis balancing and multi-hop reasoning, supporting the theoretical findings.
Implications
The findings suggest that universal transformers can significantly enhance model reprogramming techniques and expand the potential for transfer learning in deep learning applications. The emphasis on input representation may lead to new approaches in designing transformer architectures.
VLBM: Variational Latent Basis Modeling for OOD Robust Multivariate Time Series Forecasting
Time Series
- VLBM effectively separates stable dynamics from OOD deviations in multivariate time series forecasting.
- The model uses a shared latent basis to capture stable ID dynamics and decomposes inputs into basis components and orthogonal residuals.
- VLBM achieves state-of-the-art performance in OOD robustness and ID accuracy across multiple real-world tasks.
- The framework addresses a critical reliability issue in forecasting under mixed ID/OOD conditions.
Summary
The paper introduces VLBM (Variational Latent Basis Model), a novel framework aimed at improving the robustness of multivariate time series forecasting in the presence of out-of-distribution (OOD) events. Traditional forecasting models often struggle with OOD events due to their rarity and the dominance of in-distribution (ID) patterns during training. VLBM addresses this by separating stable dynamics from OOD deviations through a latent space approach. It learns a shared latent basis that captures stable ID dynamics while explicitly decomposing inputs into basis components and orthogonal residuals. The model aligns a future-aware posterior with a future-blind prior, ensuring that test-time latent inference relies solely on historical data. The authors validate VLBM across 12 benchmark tasks, including newly constructed OOD traffic datasets, demonstrating significant improvements in OOD robustness and ID accuracy, achieving average gains of 15.08% in MAE and 7.74% in MSE over the strongest baseline. The results underscore the effectiveness of latent structured forecasting for reliable predictions under mixed ID and OOD conditions.
Methodology
VLBM employs a variational latent basis approach to time series forecasting. It learns a shared latent basis for stable dynamics, decomposes historical representations into basis and residual components, and utilizes an Orthogonal Base-Residual generator to manage stable dynamics and OOD deviations separately. The model aligns a future-aware posterior with a future-blind prior to facilitate test-time inference based solely on historical inputs.
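The split into a basis component and an orthogonal residual can be illustrated with plain linear algebra. This is a sketch of the decomposition idea only: the basis is fixed and orthonormal here, whereas VLBM learns its basis, and the function name `decompose` is an assumption for the example.

```python
import math


def decompose(h, basis):
    """Split h into its projection onto span(basis) and the orthogonal residual.

    basis is a list of orthonormal vectors with the same length as h.
    """
    coeffs = [sum(b_i * h_i for b_i, h_i in zip(b, h)) for b in basis]
    base = [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(len(h))]
    residual = [h_i - p_i for h_i, p_i in zip(h, base)]
    return coeffs, base, residual


s = 1.0 / math.sqrt(2.0)
basis = [[s, s, 0.0], [0.0, 0.0, 1.0]]  # orthonormal "stable dynamics" directions
h = [3.0, 1.0, 2.0]                     # a toy history representation

coeffs, base, residual = decompose(h, basis)
dot = sum(b * r for b, r in zip(base, residual))
print(base, residual, dot)
```

Here the basis component comes out close to [2, 2, 2] and the residual close to [1, -1, 0]. In the setting described above, the residual is the part of the input that the stable basis cannot explain, which is where OOD deviations would show up.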
Results
VLBM demonstrates significant improvements in forecasting accuracy, achieving average MAE and MSE gains of 15.08% and 7.74%, respectively, over the strongest baseline across 12 benchmark tasks. It also shows superior performance in tracking OOD pulse recovery on synthetic datasets, establishing its effectiveness in handling mixed ID/OOD conditions.
Implications
The findings suggest that VLBM can enhance the reliability of forecasting models in critical applications such as transportation, energy, and environmental monitoring, where OOD events can have significant impacts. The approach may also inspire further research into latent structured forecasting techniques for other complex time series problems.
Fast Generalization after Interpolation via Critically Damped Momentum Optimization
Optimization
Theory
Efficient ML
- GROKtimizer is introduced as a biphasic optimization strategy that enhances generalization in high-dimensional settings.
- The paper links the post-interpolation phase of training to damped dynamics, providing a theoretical foundation for the proposed method.
- Critically Damped Momentum (CDM) is shown to accelerate convergence towards low-norm solutions, which are associated with better generalization.
- The method is shown theoretically to achieve a quadratic speedup over classical gradient descent and to be optimal among first-order optimizers.
Summary
This paper addresses the challenge of generalization in machine learning models, particularly in high-dimensional, low-sample scenarios where models can achieve low training error but fail to generalize well to unseen data. The authors introduce GROKtimizer, a biphasic optimization strategy that combines rapid convergence to interpolation with Critically Damped Momentum (CDM) for post-interpolation norm minimization. They theoretically demonstrate that this approach provides a quadratic speedup over classical gradient descent and is optimal among first-order optimizers. The study emphasizes the importance of post-interpolation dynamics and characterizes the post-interpolation regime as a local quadratic dynamical system, linking delayed generalization to damped oscillator behavior along flat directions of the loss landscape. The authors validate their method on synthetic benchmarks and real-world datasets, highlighting its effectiveness in selecting low-norm interpolating solutions and improving generalization performance.
Methodology
The authors propose a biphasic optimization schedule that first drives models to interpolation using standard techniques, followed by a switch to Critically Damped Momentum dynamics for post-interpolation training. They characterize the post-interpolation regime as a local quadratic dynamical system and analyze the optimization dynamics to select low-norm solutions.
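To make the damped-oscillator picture concrete, the sketch below compares plain gradient descent with a heavy-ball update on a single flat direction of a quadratic loss, f(x) = 0.5 * lam * x^2. It is an illustration under assumptions, not the GROKtimizer schedule itself. The function names and the values of `lam`, `lr`, and `steps` are hypothetical. The momentum coefficient `beta = (1 - sqrt(lr * lam))**2` is the textbook choice that gives the heavy-ball recurrence a double characteristic root, which is the discrete analogue of critical damping.

```python
import math


def gradient_descent(lam, lr, steps, x0=1.0):
    """Plain gradient descent on f(x) = 0.5 * lam * x**2."""
    x = x0
    for _ in range(steps):
        x -= lr * lam * x
    return x


def critically_damped_momentum(lam, lr, steps, x0=1.0):
    """Heavy-ball update with momentum tuned for critical damping."""
    # Recurrence: x_{k+1} = x_k - lr*lam*x_k + beta*(x_k - x_{k-1}).
    # With this beta the characteristic polynomial has a double root
    # at 1 - sqrt(lr*lam), so the iterate decays without oscillating.
    beta = (1.0 - math.sqrt(lr * lam)) ** 2
    x_prev, x = x0, x0
    for _ in range(steps):
        x, x_prev = x - lr * lam * x + beta * (x - x_prev), x
    return x


lam, lr, steps = 0.01, 1.0, 100   # hypothetical flat direction and step size
x_gd = gradient_descent(lam, lr, steps)
x_cdm = critically_damped_momentum(lam, lr, steps)
print(abs(x_gd), abs(x_cdm))
```

Gradient descent contracts by a factor of 1 - lr*lam per step, while the critically damped update contracts at roughly 1 - sqrt(lr*lam). The number of steps needed therefore scales like 1/sqrt(lr*lam) instead of 1/(lr*lam), which is the sense in which the speedup is quadratic in this toy case.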
Results
GROKtimizer achieves a quadratic speedup in convergence over classical gradient descent and is provably optimal among first-order optimizers. Empirical results on synthetic benchmarks and real-world datasets indicate significant improvements in generalization performance, particularly on high-dimensional datasets.
Implications
The findings suggest that optimizing post-interpolation dynamics can lead to more reliable generalization in machine learning models, especially in fields with high-dimensional data and limited samples, such as genomics, medicine, and finance. This approach could inform future optimization strategies and regularization methods.
Perturbative methods for non-parametric instrumental variable
Theory
- Introduces a perturbative approach to NPIV estimation that improves accuracy in high-dimensional settings.
- Demonstrates significant reduction in prediction error (up to 99%) compared to standard kernel ridge regression.
- Addresses the curse of dimensionality by systematically correcting kernel ridge solutions with higher-order perturbations.
- Shows that the method is particularly effective when the dimensionality grows rapidly with sample size.
Summary
This paper presents a perturbative approach to non-parametric instrumental variable (NPIV) estimation, inspired by perturbation theory in physics. The authors extend standard kernel ridge regression with systematic higher-order perturbation corrections, which improve estimation accuracy in high-dimensional settings where traditional methods suffer from the curse of dimensionality. The method introduces mixing between different eigenmodes of the expectation integral operator, which is crucial for handling the ill-posed integral equations that arise in NPIV. The authors report that first-order perturbative corrections can reduce prediction error by up to 99% relative to standard kernel ridge regression in high-dimensional cases (β > 0.7 in the paper's notation). Performance is maintained across dimensionality regimes, and the improvement grows as dimensionality increases. The resulting perturbative renormalization approach preserves the desirable properties of kernel methods while managing ill-conditioning. In extensive experiments the approach outperforms traditional methods, most clearly with fractional Brownian kernels, suggesting broader applications of perturbative techniques in machine learning.
Methodology
The authors develop a perturbative expansion that introduces higher-order moment interactions to capture complex dependencies in high-dimensional spaces. They employ a renormalization procedure to adaptively rescale higher-order terms and a resurgence technique to manage divergent series. The approach is applied to NPIV estimation using kernel ridge regression, focusing on the spectral properties of the expectation integral operator.
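The general idea of correcting a ridge-type solution perturbatively can be shown with a first-order Neumann-series correction to a regularized linear solve. This is a generic sketch, not the authors' expansion: the Gaussian kernel, the ridge strength of 1.0, the random perturbation `e`, and all variable names are assumptions chosen for illustration. The eigenmode mixing, renormalization, and resurgence steps described above are not modelled.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=(n, 3))   # hypothetical inputs
y = rng.normal(size=n)        # hypothetical targets

# Unperturbed system: Gaussian-kernel Gram matrix plus a ridge term.
sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
a = np.exp(-0.5 * sq_dists) + 1.0 * np.eye(n)

# Small symmetric perturbation, scaled to spectral norm 0.1.
s = rng.normal(size=(n, n))
s = (s + s.T) / 2
e = 0.1 * s / np.linalg.norm(s, 2)

alpha_exact = np.linalg.solve(a + e, y)   # solution of the perturbed system
alpha_0 = np.linalg.solve(a, y)           # zeroth order: ignore e entirely
# First order: (a + e)^-1 y is approximately a^-1 y - a^-1 e a^-1 y.
alpha_1 = alpha_0 - np.linalg.solve(a, e @ alpha_0)

err_0 = np.linalg.norm(alpha_0 - alpha_exact)
err_1 = np.linalg.norm(alpha_1 - alpha_exact)
print(err_0, err_1)
```

Because the ridge term keeps the inverse of `a` bounded and `e` is small, the first-order error is a fraction of the zeroth-order error. Higher-order terms of the same series would shrink it further, as long as the perturbation stays small relative to the regularized operator.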
Results
The experimental results indicate that the proposed perturbative corrections substantially reduce mean squared error, particularly with fractional Brownian kernels, with reductions of up to 99% in prediction error in high-dimensional scenarios. The method performs consistently across dimensionality regimes and outperforms traditional NPIV methods.
Implications
The findings suggest that perturbative methods can effectively address challenges in causal inference, particularly in high-dimensional contexts. This approach may open new avenues for research and applications in machine learning, especially in areas requiring robust estimation under ill-conditioning.
Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete
Theory
Large Language Models
- Positional encoding is not necessary for Turing completeness in sliding-window transformers.
- The sliding window mechanism introduces temporal asymmetry that breaks permutation symmetry.
- The HIST model demonstrates that a finite control state and token-count histogram can achieve universal computation.
- The paper provides a theoretical foundation for understanding the expressiveness of transformers without relying on positional encodings.
Summary
This paper challenges the conventional belief that positional encoding (PE) is essential for transformers to process ordered sequences and achieve Turing completeness. The authors argue that in the context of long-form reasoning, where generation occurs through a finite sliding context window, the window mechanism itself introduces a form of temporal asymmetry that breaks permutation symmetry. They introduce the HIST model, an abstract autoregressive model that operates solely on a constant-size internal state and the token-count histogram within the current window. The authors prove that this model is Turing complete by demonstrating that the evolution of the sliding window can reveal tokens that have exited the window, thus enabling the simulation of Turing-complete Post machines. Furthermore, they construct a sliding-window transformer that operates without PE and show that it can simulate the HIST model, indicating that positional encodings are not necessary for universal computation. This work separates the concepts of positional encoding and sequential structure, suggesting that the dynamic nature of the sliding window itself provides sufficient information for universal computation.
Methodology
The authors introduce the HIST model, which relies on a finite control state and the histogram of token counts within a sliding window. They prove the Turing completeness of this model by simulating Post machines and demonstrating how the sliding window can reveal information about tokens that have exited. They also construct a sliding-window transformer that operates without positional encodings and show that it can simulate the HIST model.
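The claim that a moving window "reveals" tokens that have exited can be illustrated with histograms alone. This is a minimal sketch, assuming the observer sees the previous window's histogram, the newly generated token, and the current window's histogram. The function name `exited_tokens` and the toy sequence are hypothetical, and this is not the paper's Post-machine construction.

```python
from collections import Counter


def exited_tokens(sequence, window):
    """Recover, from histograms only, the token leaving the window at each step."""
    recovered = []
    prev_hist = Counter(sequence[:window])
    for t in range(window, len(sequence)):
        new_token = sequence[t]
        curr_hist = Counter(sequence[t - window + 1 : t + 1])
        # (previous histogram + new token) - current histogram leaves a
        # count of 1 for exactly one token: the one that slid out.
        diff = prev_hist + Counter([new_token])
        diff.subtract(curr_hist)
        (token,) = [tok for tok, count in diff.items() if count == 1]
        recovered.append(token)
        prev_hist = curr_hist
    return recovered


sequence = list("abcabba")   # hypothetical token stream
window = 3
recovered = exited_tokens(sequence, window)
print(recovered)
```

No position information is used inside the window, only counts. Even so, the order in which tokens leave the window is recovered, which is the kind of temporal asymmetry the summary attributes to the sliding mechanism.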
Results
The paper establishes that sliding-window transformers can achieve Turing completeness without positional encoding. The authors successfully demonstrate that the sliding window's dynamic nature provides sufficient sequential structure for universal computation, effectively separating the roles of positional encoding and computational expressiveness.
Implications
This research suggests that transformer architectures can be designed more efficiently by potentially omitting positional encodings, which could lead to reduced model complexity and improved performance in long-form reasoning tasks. It opens avenues for further exploration of autoregressive models and their capabilities in various applications.