AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 47
Update frequency: 8h
Days of history: 7
CalArena: A Large-Scale Post-Hoc Calibration Benchmark
Computer Vision
Theory
Optimization
- Introduction of a large-scale benchmark for post-hoc calibration covering diverse tasks and models.
- Standardized implementations of numerous calibration methods for reproducible comparisons.
- Proposed Post-Hoc Improvement (PHI) metric for evaluating calibration methods.
- Empirical findings indicate smooth calibration functions outperform binning methods.
Summary
The paper introduces CalArena, a large-scale benchmark for evaluating post-hoc calibration methods. Reliable probability estimates matter in applications such as medical diagnosis and fraud detection, yet classifiers are often poorly calibrated. The benchmark comprises nearly 2000 experiments covering binary and multiclass classification in both tabular and computer vision domains. It integrates predictions from a wide array of classical and modern models and offers standardized implementations of numerous calibration methods. The authors propose a new evaluation approach, Post-Hoc Improvement (PHI), which assesses both calibration quality and any degradation in predictive performance. The empirical study finds that smooth calibration functions outperform binning-based methods and that dedicated multiclass approaches are necessary in high-dimensional settings. The authors release all data, code, and tools to support future research, aiming to establish a reliable foundation for developing and comparing calibration methods.
Methodology
The authors constructed a suite of benchmarks that includes nearly 2000 classification experiments across various data modalities and task types. They standardized and evaluated implementations of multiple post-hoc calibration methods, using validation set predictions to fit calibration functions and test set predictions for performance evaluation. The PHI metric was developed to assess both calibration error reduction and predictive performance degradation.
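The fit-on-validation, evaluate-on-test protocol described above can be sketched as follows. This is a minimal illustration, not CalArena's code: temperature scaling stands in for one smooth calibration function, expected calibration error (ECE) stands in for the calibration metric, and the data is synthetic. The PHI metric is not reproduced because its definition is not given here; `make_split` and all parameter values are assumptions made for the example.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, temperature):
    p = softmax(logits, temperature)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(val_logits, val_labels, grid=None):
    # Grid search for the temperature that minimizes validation NLL.
    grid = np.linspace(0.25, 5.0, 96) if grid is None else grid
    losses = [nll(val_logits, val_labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

def ece(probs, labels, n_bins=15):
    # Expected calibration error with equal-width confidence bins.
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(total)

def make_split(rng, n, n_classes=5, overconfidence=3.0):
    # Hypothetical data: labels are drawn from softmax(true_logits), then the
    # logits are inflated so the "model" is overconfident by a known factor.
    true_logits = rng.normal(scale=2.0, size=(n, n_classes))
    p = softmax(true_logits)
    labels = np.array([rng.choice(n_classes, p=row) for row in p])
    return true_logits * overconfidence, labels

rng = np.random.default_rng(0)
val_logits, val_labels = make_split(rng, 2000)
test_logits, test_labels = make_split(rng, 2000)

T = fit_temperature(val_logits, val_labels)          # fit on validation predictions
ece_before = ece(softmax(test_logits), test_labels)  # evaluate on test predictions
ece_after = ece(softmax(test_logits, T), test_labels)
print(f"T={T:.2f}  ECE before={ece_before:.3f}  after={ece_after:.3f}")
```

Temperature scaling divides every logit by the same positive constant, so the predicted class never changes; only the confidence does. That is why a metric that also tracks predictive degradation is relevant mainly for calibrators that can reorder classes.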
Results
The comprehensive empirical study revealed that smooth calibration functions consistently outperformed binning-based approaches. Additionally, dedicated multiclass calibration methods were found to be essential in high-dimensional settings, while generic models without calibration-specific designs were less competitive. The results were aggregated into leaderboards for both binary and multiclass benchmarks.
Implications
The findings from this study have significant implications for practitioners in machine learning, particularly in high-stakes applications where reliable probability estimates are crucial. The benchmark and tools provided can guide the selection and development of effective calibration methods, enhancing the trustworthiness of machine learning systems.
The Sample Complexity of Multiclass and Sparse Contextual Bandits
Theory
Reinforcement Learning
Efficient ML
- Introduces tight sample complexity bounds for contextual bandits with sparse rewards.
- Achieves significant improvement over previous bounds by eliminating high-degree polynomial dependencies.
- Utilizes two complementary approaches: exploration-by-optimization and low-variance exploration.
- Establishes minimax optimality of the proposed bounds up to logarithmic factors.
Summary
This paper investigates the sample complexity of contextual bandits in the stochastic i.i.d. setting, focusing on the s-sparse scenario where the reward vector has a bounded ℓ1-norm. The authors propose algorithms that find an ε-optimal policy with sample complexity Õ((s/ε² + |A|/ε) log(|Π|/δ)), significantly improving upon previous bounds that included a Θ(|A|⁹) dependence. The study closes a gap in the existing literature by providing tight sample complexity bounds for contextual bandits with sparse rewards, particularly for the bandit multiclass classification problem. The authors use two main approaches: an exploration-by-optimization algorithm informed by the decision-estimation coefficient (DEC) and a low-variance exploration technique that leads to tractable algorithms. The new bounds are minimax optimal up to logarithmic factors: the number of actions enters linearly and only in the lower-order term, while the dominant term is governed by the sparsity parameter. This work sharpens the understanding of sample complexity in contextual bandits and offers improved guarantees for practical applications.
Methodology
The authors develop algorithms based on two main approaches: first, they analyze contextual bandits through structured observations and design an exploration-by-optimization algorithm guided by the decision-estimation coefficient (DEC). Second, they create a low-variance exploration technique that leads to concrete algorithms, extending the results to contextual combinatorial semi-bandits.
Results
The paper presents algorithms that achieve an ε-optimal policy with sample complexity Õ((s/ε² + |A|/ε) log(|Π|/δ)), which is a substantial improvement over prior work. The results include a matching lower bound, confirming the optimality of the proposed sample complexity bounds.
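To see how the two terms of the bound (s/ε² + |A|/ε) log(|Π|/δ) trade off, the short sketch below evaluates each one separately. It ignores the constants and logarithmic factors hidden by the Õ notation, so the numbers are only indicative; the function name and the example values are assumptions for illustration.

```python
import math

def bound_terms(s, n_actions, eps, n_policies, delta):
    # Returns (sparsity term, action term) of the bound, up to hidden constants.
    log_term = math.log(n_policies / delta)
    return (s / eps**2) * log_term, (n_actions / eps) * log_term

# Small target error: the s/eps^2 term dominates.
sparse_term, action_term = bound_terms(
    s=1, n_actions=100, eps=0.001, n_policies=10**6, delta=0.05
)

# Coarse target error: the |A|/eps term dominates.
coarse_sparse, coarse_action = bound_terms(
    s=1, n_actions=100, eps=0.1, n_policies=10**6, delta=0.05
)

print(sparse_term > action_term, coarse_action > coarse_sparse)
```

The two terms are equal at ε = s/|A|, so for any accuracy finer than that the number of actions only affects the lower-order cost. The example uses s = 1, which corresponds to a reward vector with a single unit entry, as one would expect for multiclass classification with bandit feedback.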
Implications
The findings have significant implications for online learning and decision-making applications, such as recommendation systems and adaptive experimentation, where efficient sample usage is crucial. The improved sample complexity guarantees can enhance the performance of algorithms in practical scenarios involving contextual bandits.
MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference
Large Language Models
Efficient ML
- MarginGate selectively verifies only low-margin decoding steps, significantly reducing computational overhead.
- The approach restores deterministic decoding in LLMs while maintaining high performance on high-margin steps.
- The method demonstrates a substantial reduction in latency compared to existing per-token verification methods.
- Empirical results show that token flips during batch processing are rare, allowing for targeted verification.
Summary
The paper introduces MarginGate, a novel approach for ensuring deterministic inference in large language models (LLMs) during batch processing. Traditional methods for achieving determinism in BF16 LLM inference often incur significant computational costs due to their broad application across all decoding steps. MarginGate addresses this by focusing verification efforts exclusively on steps where token flips are likely to occur, which the authors found to be sparse (0.3-1.3% of decode steps) across multiple models and datasets. By leveraging low top-1/top-2 logit margins as indicators of potential flips, MarginGate selectively invokes a deterministic verifier only for low-margin steps while maintaining high-margin steps on the faster BF16 path. This selective verification significantly reduces the number of verifier triggers compared to existing methods like LLM-42, leading to improved efficiency without sacrificing determinism. The authors demonstrate that MarginGate achieves 100% sequence-level deterministic decoding on models such as Llama-3.1-8B and Qwen2.5-14B, with substantial reductions in latency compared to traditional verification methods.
Methodology
The authors conducted empirical measurements across five LLMs to analyze the frequency of batch-induced token flips and the conditions under which they occur. They developed MarginGate as a selective verification policy that utilizes logit margins to determine when to invoke a deterministic verifier. The implementation was tested on multiple datasets, calibrating thresholds for optimal performance.
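The selective verification policy can be illustrated with a small sketch. It is not the authors' implementation: `verifier` is a hypothetical callable that recomputes the logits on a deterministic path, and the threshold value is arbitrary rather than calibrated.

```python
import numpy as np

def top2_margin(logits):
    # Gap between the largest and second-largest logit.
    second, first = np.partition(logits, -2)[-2:]
    return float(first - second)

def gated_decode_step(fast_logits, verifier, threshold):
    """Return (token_id, verified). The verifier runs only on low-margin steps."""
    if top2_margin(fast_logits) >= threshold:
        return int(np.argmax(fast_logits)), False
    exact_logits = verifier()  # hypothetical deterministic recomputation
    return int(np.argmax(exact_logits)), True

# High-margin step: the fast path is trusted as is.
confident = np.array([5.0, 1.0, 0.0])
token_a, verified_a = gated_decode_step(confident, lambda: confident, threshold=0.1)

# Low-margin step: batch-dependent rounding could flip the argmax, so verify.
ambiguous = np.array([2.00, 1.95, -1.0])
exact = np.array([1.98, 2.00, -1.0])
token_b, verified_b = gated_decode_step(ambiguous, lambda: exact, threshold=0.1)

print(token_a, verified_a, token_b, verified_b)
```

Passing the verifier as a callable means the expensive deterministic computation is never started on high-margin steps, which is where the savings would come from when low-margin steps are rare.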
Results
MarginGate achieved 100% deterministic decoding on Llama-3.1-8B and Qwen2.5-14B with verifier trigger rates of 18.56% and 15.05%, respectively. It reduced the latency increment of LLM-42 by 2.23× and 1.99× for these models. On DSR1-Distill-Qwen-7B, the method reached determinism with a higher trigger rate of 49.50%, indicating its effectiveness even in more challenging scenarios.
Implications
The findings suggest that selective verification can significantly enhance the efficiency of LLM inference in production environments, making it feasible to deploy high-performance models without sacrificing determinism. This approach could be applied to various applications requiring reliable and reproducible outputs from LLMs, such as automated content generation, code synthesis, and interactive AI systems.
BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference
NLP
Large Language Models
Generative Models
Efficient ML
- BlockBatch introduces block-size diversity as a new axis for improving dLLM inference efficiency.
- The framework allows for simultaneous processing of multiple block sizes, enhancing parallelism and accuracy.
- Confidence-gated merging and leader-based synchronization optimize the use of computational resources.
- Periodic full-sequence refreshes correct accumulated errors in the KV cache, ensuring consistency.
Summary
The paper introduces BlockBatch, an innovative framework designed to enhance the efficiency of diffusion language model (dLLM) inference by leveraging multi-scale consensus decoding. Traditional block-wise dLLM inference faces a trade-off between block size and the number of denoising steps required, with smaller blocks preserving local context but necessitating more steps, while larger blocks allow for greater parallelism but risk premature commitments and cache errors. The authors argue that block size should not be treated as a fixed hyperparameter but as a branching dimension that can provide complementary decoding trajectories. BlockBatch executes multiple block-size branches in parallel during a single batched forward pass, coordinating them through confidence-gated token merging, leader-based synchronization, and periodic full-sequence refreshes. This approach allows for the preservation of useful information across branches while correcting for inconsistencies in the KV cache. The framework was tested across three dLLMs and four datasets, demonstrating a significant reduction in denoising function evaluations (NFEs) by an average of 26.6% and achieving a 1.33× speedup over existing methods while maintaining accuracy.
Methodology
BlockBatch employs a training-free online inference strategy that executes multiple branches of varying block sizes within a single batched model invocation. It utilizes confidence-gated token merging to ensure compatibility between branches, leader-based synchronization to optimize computation, and periodic full-sequence refreshes to maintain a consistent KV state across branches.
Results
The implementation of BlockBatch resulted in an average reduction of 26.6% in denoising function evaluations (NFEs) and a 1.33× increase in end-to-end inference speed compared to Fast-dLLM, while preserving the accuracy of the generated text.
Implications
The findings suggest that exploring block-size diversity can lead to more efficient dLLM inference methods, potentially influencing future research and applications in natural language processing and generative models. This could enhance the performance of various applications, including text generation, dialogue systems, and other NLP tasks.
Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging
Large Language Models
Efficient ML
Optimization
- Introduction of MergePipe as a budget-aware execution layer for LLM merging.
- Reformulation of model merging as an expert access-set problem to optimize I/O operations.
- Demonstration of significant reductions in expert-read I/O and substantial speed improvements.
- Establishment of theoretical bounds for omitted updates and budget soundness in merging operations.
Summary
This paper introduces MergePipe, a novel execution layer designed for weight-space model merging, particularly in the context of large language models (LLMs). Traditional merging methods treat model checkpoints as opaque files, leading to inefficiencies as the number of expert parameters increases. MergePipe reframes the merging process as an expert access-set problem, where the goal is to optimize which expert delta blocks to read under a specified I/O budget. The authors present a systematic approach to index parameter blocks, create deterministic access plans, and execute merges that are budget-aware. They demonstrate that MergePipe can significantly reduce expert-read I/O by up to an order of magnitude and achieve speedups of up to 11 times on merging tasks involving Qwen and Llama checkpoint families. The method maintains a low parameter deviation from full-read merges and shows no degradation in downstream performance, thus providing a practical solution for scalable model merging in LLMs.
Methodology
The authors developed MergePipe, which indexes parameter blocks and constructs deterministic access plans based on a budgeted resource model. The merging process is executed with a focus on minimizing expert reads while ensuring that the merging operation remains consistent with full-read merges when the budget allows. The methodology includes theoretical proofs of budget soundness and error bounds for omitted updates in additive merges.
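A toy version of budget-aware additive merging is sketched below. The greedy score-per-byte selection rule, the block layout, and every name in the code are assumptions made for illustration; the summary does not state how MergePipe ranks blocks. The bound checked at the end is simply the triangle inequality on the skipped deltas, which may differ from the bound proved in the paper.

```python
import numpy as np

def plan_reads(blocks, budget_bytes):
    # Deterministic greedy plan: best score per byte first, ties broken by name.
    order = sorted(blocks, key=lambda b: (-b["score"] / b["bytes"], b["name"]))
    plan, used = [], 0
    for block in order:
        if used + block["bytes"] <= budget_bytes:
            plan.append(block["name"])
            used += block["bytes"]
    return plan, used

def additive_merge(base, deltas, plan):
    # Add only the expert deltas that the plan chose to read.
    merged = base.copy()
    for name in plan:
        merged += deltas[name]
    return merged

rng = np.random.default_rng(0)
base = rng.normal(size=1024)
scales = {"expert_a": 1.0, "expert_b": 0.1, "expert_c": 0.01}
deltas = {name: scale * rng.normal(size=1024) for name, scale in scales.items()}
blocks = [
    {"name": name, "bytes": d.nbytes, "score": float(np.linalg.norm(d))}
    for name, d in deltas.items()
]

# Full budget: every block is read, so the result is the ordinary full merge.
full_plan, _ = plan_reads(blocks, budget_bytes=sum(b["bytes"] for b in blocks))
full = additive_merge(base, deltas, full_plan)

# Budget for two of the three blocks.
budget = 2 * base.nbytes
plan, used = plan_reads(blocks, budget)
partial = additive_merge(base, deltas, plan)

skipped = [name for name in deltas if name not in plan]
deviation = float(np.linalg.norm(full - partial))
bound = sum(float(np.linalg.norm(deltas[name])) for name in skipped)
print(plan, used, deviation, bound)
```

In this toy setting the deviation from the full merge is exactly the norm of what was skipped, so reading the largest updates first keeps the budgeted result close to the full-read merge.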
Results
MergePipe achieved up to a 10× reduction in expert-read I/O and up to an 11× speedup in merging tasks on Qwen and Llama models. The approach maintained a parameter deviation on the order of 10⁻³ from full-read merges and did not exhibit monotonic degradation in downstream benchmarks, indicating its effectiveness and reliability.
Implications
The findings suggest that MergePipe can facilitate more efficient model merging in large-scale LLM applications, potentially leading to faster deployment and reduced computational costs. This approach could be particularly beneficial in scenarios where multiple expert models need to be integrated without incurring the high costs associated with full retraining or ensemble methods.
Inferring the Size of Large Language Models From Popular Text Memorization
Large Language Models
NLP
- Introduces a method to infer lower bounds on LLM sizes using text memorization signals.
- Develops two complementary inference methods: a pairwise statistical test and a scaling-law estimator.
- Validates the methods on both open-weight and closed-weight models, achieving high accuracy.
- Reveals differences in scaling strategies among major LLM developers based on inferred sizes.
Summary
This paper addresses the challenge of inferring the parameter counts of large language models (LLMs), which are often undisclosed by their developers. The author proposes a black-box method that utilizes the memorization of popular texts to derive conservative lower bounds on model size based solely on generated text outputs. The methodology is founded on the observation that widely-circulated texts are present in most pretraining corpora, and the accuracy of next-word predictions across varying text fragments serves as a reliable indicator of a model's memorization capacity, which correlates with its parameter count. The author introduces two inference methods: a pairwise statistical test to compare the sizes of two models and a scaling-law estimator that uses Principal Component Analysis (PCA) to project accuracy profiles into a latent index representing parameter counts. The methods are validated on a range of open-weight models, achieving high accuracy, and are then applied to popular closed-weight models, revealing insights into industry scaling strategies and internal product hierarchies.
Methodology
The methodology involves aggregating a model's accuracy in predicting the next word from popular texts into an accuracy profile vector. Two inference methods are derived: (1) a scaling-law estimator that uses PCA to project the accuracy profiles into a latent index, and (2) a pairwise statistical test to compare the sizes of two models. These methods are calibrated and validated against a set of open-weight models before being applied to closed-weight models.
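The scaling-law estimator can be sketched as PCA followed by a one-dimensional linear fit. The version below is an illustration on synthetic accuracy profiles, not the author's code: the profile dimensions, the linear relation between accuracy and log parameter count, and the noise level are all assumptions, and the conservative lower-bound construction is not reproduced.

```python
import numpy as np

def fit_size_estimator(profiles, log_params):
    # PCA via SVD: project centered accuracy profiles onto the first component,
    # then regress log10(parameter count) on that one-dimensional index.
    mean = profiles.mean(axis=0)
    centered = profiles - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    index = centered @ direction
    slope, intercept = np.polyfit(index, log_params, 1)
    return mean, direction, slope, intercept

def predict_log_params(profile, model):
    mean, direction, slope, intercept = model
    return float(slope * ((profile - mean) @ direction) + intercept)

# Synthetic calibration set: 12 hypothetical open-weight models, 1B to 100B parameters.
rng = np.random.default_rng(0)
log_params = np.linspace(9.0, 11.0, 12)
base = np.array([0.30, 0.40, 0.50, 0.60])   # accuracy per text bucket at 10B
gains = np.array([0.05, 0.10, 0.15, 0.20])  # accuracy gain per decade of parameters
noise = rng.normal(scale=0.01, size=(12, 4))
profiles = base + np.outer(log_params - 10.0, gains) + noise

model = fit_size_estimator(profiles, log_params)
fitted = np.array([predict_log_params(p, model) for p in profiles])
ss_res = np.sum((log_params - fitted) ** 2)
ss_tot = np.sum((log_params - log_params.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Estimate for an unseen model whose true size in this toy setup is 10**10.5.
held_out = predict_log_params(base + 0.5 * gains, model)
print(f"R^2={r2:.3f}  predicted log10(params)={held_out:.2f}")
```

The sign of a principal component is arbitrary, but the regression slope absorbs it, so the predictions do not depend on which orientation the SVD returns.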
Results
The scaling-law estimator achieved an R² of 0.95 when validated on 19 open-weight models, while the pairwise test demonstrated around 90% precision and recall. The framework successfully produced lower bounds for 21 closed-weight models, revealing significant variations in inferred sizes based on the provider.
Implications
This work provides a novel approach for estimating LLM sizes, which can enhance the understanding of model capabilities and inform comparisons between open and closed models. It also highlights the potential for probing hidden design choices in LLMs, which could influence future model development and deployment strategies.
When RL Suppresses Its Own Vocabulary: Recovering Reasoning Diversity in Puzzle-to-Math Transfer
Reinforcement Learning
Large Language Models
- RL on constraint-satisfaction puzzles can transfer to hard mathematics benchmarks, isolating RL's contribution from SFT.
- A primitive- and motif-level analysis framework reveals a depth-recovery tradeoff in reasoning capabilities.
- Introducing a perplexity-based novelty bonus during RL training restores suppressed reasoning primitives and improves performance.
- The proposed method increases the hard-math capability ceiling from 16.0% to 36.0% without using math problems in training.
Summary
This paper investigates the transfer of reasoning capabilities from reinforcement learning (RL) trained on constraint-satisfaction puzzles to hard mathematics benchmarks, specifically focusing on a 7B model that underwent supervised fine-tuning (SFT) and RL without exposure to mathematical problems. The authors introduce a reasoning primitive-level framework that segments reasoning processes into nine primitives, allowing for an analysis of how these primitives evolve during training. The study finds that SFT on puzzles enhances a reasoning-primitive vocabulary, leading to significant performance improvements on math benchmarks. However, the vanilla RL stage suppresses exploratory reasoning primitives such as hypothesizing and backtracking. To counteract this suppression, the authors propose a novelty bonus that rewards diverse rollouts, which successfully restores the suppressed primitives while maintaining depth in reasoning chains. The end result is a notable increase in performance on hard math tasks, demonstrating that RL on puzzles can effectively enhance mathematical reasoning capabilities without direct training on math problems.
Methodology
The authors employed a 7B model trained with supervised fine-tuning (SFT) on constraint-satisfaction puzzles, followed by reinforcement learning (RL) using a novelty bonus. They developed a reasoning primitive-level framework that classifies reasoning into nine primitives and analyzes their evolution through motif extraction.
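One way a perplexity-based novelty bonus could be wired into the reward is sketched below. The exact form used in the paper is not given in this summary, so everything here is an assumption: the bonus is the z-scored log-perplexity of each rollout within its sampling group, scaled by a hypothetical coefficient `beta`, and the token log-probabilities are made-up values.

```python
import math

def perplexity(token_logprobs):
    # exp of the mean negative log-likelihood per token.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def shaped_rewards(task_rewards, rollout_logprobs, beta=0.1):
    # Add a group-normalized novelty bonus: rollouts that are less likely under
    # the scoring model (higher perplexity) receive a larger bonus.
    log_ppl = [math.log(perplexity(lp)) for lp in rollout_logprobs]
    mean = sum(log_ppl) / len(log_ppl)
    var = sum((v - mean) ** 2 for v in log_ppl) / len(log_ppl)
    std = math.sqrt(var) or 1.0  # avoid dividing by zero when all rollouts match
    return [r + beta * (v - mean) / std for r, v in zip(task_rewards, log_ppl)]

# Two rollouts that both solve the puzzle: one routine, one unusual.
task_rewards = [1.0, 1.0]
rollout_logprobs = [
    [-0.1, -0.2, -0.1],  # high-probability, routine reasoning
    [-1.5, -2.0, -1.0],  # low-probability, more exploratory reasoning
]
shaped = shaped_rewards(task_rewards, rollout_logprobs)
print(shaped)
```

Because the bonus is centered within the group, it redistributes reward toward unusual rollouts without changing the group's total, which keeps the task reward as the main signal.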
Results
The study demonstrated a 20 percentage point increase in performance on the OlymMATH-Hard benchmark, rising from 16.0% to 36.0% after SFT and RL on puzzles. The novelty bonus contributed a further 7 percentage points, indicating its effectiveness in preserving diverse reasoning capabilities.
Implications
The findings suggest that RL can enhance reasoning capabilities in domains where direct training data is not available, potentially informing future approaches to training large language models for complex problem-solving tasks. This could lead to improved performance in various applications requiring mathematical reasoning.
A Geometric View of SRC: Learning Representations for Stable Residual Inference
Theory
- Introduces a geometric framework for understanding the stability of Sparse Representation Classification (SRC).
- Establishes a strict separation between training and inference, treating SRC as a fixed inference rule.
- Identifies geometric obstructions that can collapse the residual margin, affecting classification reliability.
- Derives a quantitative lower bound on the residual margin under specific geometric conditions.
Summary
This paper presents a geometric perspective on Sparse Representation Classification (SRC), focusing on the stability of reconstruction-based inference through the lens of learned representations. The author emphasizes a strict separation between training and inference, treating SRC as a fixed rule applied at test time without optimization during training. The study formalizes the concept of residual-ordering stability, introducing a residual margin that quantifies the reliability of class assignments based on reconstruction residuals. The paper identifies geometric obstructions such as span overlap and dominance that can lead to ambiguous residual comparisons. By analyzing class-conditional spans and their arrangements, the author derives a quantitative lower bound on the residual margin under specific coverage and separation assumptions. To enhance stability, geometry-shaping objectives are proposed that promote self-expressiveness within classes and discourage cross-class interactions. The experiments conducted on various datasets (images, text, and EEG connectivity) validate the effectiveness of the proposed methods, demonstrating improvements in residual margins and geometric diagnostics compared to traditional approaches.
Methodology
The methodology involves a geometric analysis of class-conditional spans and their arrangements to formalize residual-ordering stability. The author derives a lower bound on the residual margin based on geometric properties and introduces training objectives that shape the learned representations to promote stability in SRC inference.
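The role of class-conditional spans can be made concrete with a small sketch. It simplifies in two labeled ways: residuals come from a per-class least-squares projection rather than the sparse coding used in classical SRC, and the residual margin is taken to be the gap between the two smallest class residuals, which is an assumption about the paper's definition. The dictionaries are toy examples.

```python
import numpy as np

def class_residuals(x, dictionaries):
    # Reconstruction residual of x from the span of each class dictionary.
    residuals = []
    for D in dictionaries:
        coef, *_ = np.linalg.lstsq(D, x, rcond=None)
        residuals.append(float(np.linalg.norm(x - D @ coef)))
    return np.array(residuals)

def residual_margin(residuals):
    # Gap between the best and second-best class residual (assumed definition).
    ordered = np.sort(residuals)
    return float(ordered[1] - ordered[0])

e = np.eye(4)
x = np.array([1.0, 0.5, 0.1, 0.0])  # lies mostly in span{e1, e2}

# Well-separated classes: the class spans share no direction.
separated = [e[:, [0, 1]], e[:, [2, 3]]]
# Overlapping classes: the second class also contains e1.
overlapping = [e[:, [0, 1]], e[:, [0, 2]]]

res_sep = class_residuals(x, separated)
res_ovl = class_residuals(x, overlapping)
print(res_sep, residual_margin(res_sep))
print(res_ovl, residual_margin(res_ovl))
```

Both arrangements assign x to the first class, but the shared direction lets the second class explain part of x, which shrinks the margin. This is the kind of span-overlap obstruction that geometry-shaping objectives aim to reduce.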
Results
The experiments demonstrate that the proposed geometry-shaping objectives lead to improved residual margins and better geometric diagnostics across various datasets, including COIL-100, TREC, and EEG connectivity, compared to traditional SRC approaches.
Implications
The findings suggest that enhancing the geometric properties of learned representations can significantly improve the reliability of reconstruction-based inference methods like SRC. This has potential applications in various fields, including computer vision and signal processing, where accurate classification based on reconstruction is critical.
Momentum Based Reward Design for Low Emission Traffic Signal Control
Reinforcement Learning
Optimization
- Introduction of a Momentum-Based Reward Function (MBRF) for traffic signal control.
- MBRF promotes continuous vehicle movement rather than penalizing congestion.
- Evaluation conducted using SUMO with standard traffic metrics.
- Results indicate improved throughput-emission trade-offs and stable learning behaviors.
Summary
This paper addresses the issue of urban traffic congestion, which significantly impacts commute times and environmental pollution. Traditional traffic signal control systems struggle to adapt to dynamic traffic conditions, leading to inefficiencies. The authors propose a novel Momentum-Based Reward Function (MBRF) aimed at enhancing Deep Reinforcement Learning (DRL) for traffic signal control. Unlike conventional delay or queue-based rewards that can lead to short-sighted policies, MBRF encourages continuous vehicle movement by promoting sustained flow dynamics. The methodology is evaluated using the SUMO simulation platform, where standard traffic metrics such as waiting time, queue length, throughput, and CO2 emissions are analyzed. The results demonstrate that MBRF yields better throughput-emission trade-offs and more stable learning behaviors compared to existing reward structures and classical controllers like Max Pressure and LQF. This innovative approach aligns the learning objectives with real-world traffic behaviors, ultimately improving traffic efficiency and reducing emissions without explicitly optimizing for environmental metrics.
Methodology
The authors formulated the traffic signal control problem as a Markov Decision Process (MDP) and implemented the MBRF within a DRL framework. The MBRF incentivizes sustained vehicle motion by rewarding phase persistence based on observed discharge efficiency, contrasting with traditional penalty-based reward systems.
Results
The proposed MBRF outperformed traditional delay and queue-based rewards, leading to enhanced throughput and lower CO2 emissions. The learning behavior exhibited greater stability compared to classical traffic controllers, demonstrating the effectiveness of the momentum-based approach.
Implications
This research has significant implications for urban traffic management, suggesting that adaptive traffic signal control can be improved through innovative reward designs. The findings could inform the development of more efficient and environmentally friendly traffic systems, potentially reducing congestion and emissions in urban areas.
Parallax: Parameterized Local Linear Attention for Language Modeling
NLP
Large Language Models
Efficient ML
- Introduction of Parallax, a scalable parameterized Local Linear Attention mechanism.
- Demonstrated improvements in perplexity and downstream accuracy over traditional Softmax Attention.
- Development of a hardware-aware algorithm that enhances computational efficiency.
- Identification of a strong interaction between the Parallax architecture and the Muon optimizer.
Summary
This paper introduces Parallax, a novel parameterized Local Linear Attention (LLA) mechanism designed to enhance the efficiency and scalability of large language models (LLMs). Traditional attention mechanisms, particularly Softmax Attention, have remained largely unchanged despite the growing demand for efficient alternatives. Parallax addresses the limitations of LLA, which had not been effectively scaled for LLM pretraining due to computational and numerical stability issues. By eliminating the need for a numerical solver and incorporating a query-like projector that probes key-value covariance, Parallax achieves a better bias-variance tradeoff. The authors propose a hardware-aware algorithm that increases arithmetic intensity, allowing Parallax to outperform existing methods like FlashAttention across various batch sizes and context lengths. The empirical results demonstrate consistent improvements in perplexity and downstream task performance when pretrained with Parallax, showcasing a significant interaction between architecture and optimizer, particularly with the Muon optimizer. This work represents a significant advancement in attention mechanisms, providing a unified interpretation of nonparametric and parametric designs and highlighting the importance of architecture-optimizer co-design.
Methodology
The authors developed Parallax by modifying the Local Linear Attention mechanism to eliminate the numerical solver and introduce a query-like projector for probing key-value covariance. They implemented a hardware-aware streaming algorithm to optimize performance and conducted extensive pretraining experiments at two scales (0.6B and 1.7B parameters) to evaluate the effectiveness of Parallax against traditional attention mechanisms.
Results
Parallax consistently outperformed Softmax Attention in perplexity and downstream accuracy across various experiments. The improvements were maintained under both parameter-matched and compute-matched controls, indicating a Pareto improvement. The interaction with the Muon optimizer was particularly beneficial, leading to substantial performance gains.
Implications
The findings suggest that Parallax can significantly enhance the efficiency and effectiveness of large language models, paving the way for more advanced applications in natural language processing and other AI domains. The insights on architecture-optimizer interactions may influence future designs of attention mechanisms and training strategies.
Conf-Gen: Conformal Uncertainty Quantification for Generative Models
Generative Models
Theory
Large Language Models
- Conf-Gen extends conformal risk control to generative models, enabling uncertainty quantification in unsupervised learning.
- The framework relaxes theoretical assumptions of traditional conformal prediction, making it applicable to a wider range of tasks.
- Conf-Gen has been empirically validated, showing superior performance in LLM question answering and other generative tasks.
- A flexible Python package is provided to support the implementation of Conf-Gen across various applications.
Summary
The paper introduces Conf-Gen, a novel framework for applying conformal risk control (CRC) to generative models, addressing the challenge of uncertainty quantification (UQ) in unsupervised learning contexts. Traditional conformal prediction (CP) methods are primarily designed for supervised learning and do not directly apply to generative tasks. Conf-Gen adapts CRC to generative models by relaxing theoretical assumptions, allowing for the quantification of uncertainty in various generative scenarios. The authors demonstrate the framework's flexibility through applications such as ensuring image generators produce non-memorized outputs, verifying that conversational AI systems ask sufficient clarifying questions, and confirming the correctness of AI agent outputs. Additionally, they provide a Python package to facilitate the implementation of Conf-Gen, which includes efficient computation strategies. Empirical results show that Conf-Gen outperforms existing conformal baselines in tasks involving large language models (LLMs) and other generative applications.
Methodology
The authors propose a framework called Conf-Gen that adapts conformal risk control for generative tasks. They relax certain theoretical assumptions, derive new bounds on expected utility, and identify computational patterns to facilitate efficient implementation. The framework is validated through empirical tests across multiple generative applications.
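For orientation, the standard conformal risk control recipe that Conf-Gen builds on can be sketched as follows. This is the textbook procedure for a loss that is bounded and non-increasing in the control parameter, not Conf-Gen's relaxed variant and not the API of its Python package. The generative scenario (draw λ samples per prompt and count a failure if none is correct) and all numbers are illustrative assumptions.

```python
import numpy as np

def crc_threshold(loss_matrix, lambdas, alpha, loss_bound=1.0):
    """Smallest lambda whose inflated empirical risk is at most alpha.

    loss_matrix[i, j] is the loss of calibration example i at lambdas[j];
    losses must be non-increasing in lambda and bounded by loss_bound.
    """
    n = loss_matrix.shape[0]
    risk = loss_matrix.mean(axis=0)
    adjusted = (n / (n + 1)) * risk + loss_bound / (n + 1)
    ok = np.nonzero(adjusted <= alpha)[0]
    if ok.size == 0:
        return None  # no candidate lambda meets the target risk
    return float(lambdas[ok[0]])

# Hypothetical calibration data: for each prompt, the position of the first
# correct sample among repeated generations.
rng = np.random.default_rng(0)
first_correct = rng.geometric(p=0.4, size=500)
lambdas = np.arange(1, 21)  # candidate numbers of samples to draw
# Loss is 1 when none of the first lambda samples is correct.
loss_matrix = (first_correct[:, None] > lambdas[None, :]).astype(float)

alpha = 0.1
lam = crc_threshold(loss_matrix, lambdas, alpha)
print(lam)
```

The loss_bound / (n + 1) term accounts for the unseen test example. Under exchangeability it is what turns the empirical risk on the calibration set into a guarantee on the expected loss of a new prompt.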
Results
Conf-Gen demonstrates improved performance over state-of-the-art conformal baselines in tasks such as LLM question answering. It successfully provides conformal guarantees for generating non-memorized images, ensuring adequate clarifying questions in conversational AI, and producing correct outputs from AI agents.
Implications
The development of Conf-Gen has significant implications for the deployment of generative models in high-stakes applications, such as healthcare and scientific discovery, where reliable uncertainty quantification is critical. It opens avenues for further research in uncertainty quantification for generative models and enhances the reliability of AI systems.
Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts
Theory
Time Series
- Current physics foundation models demonstrate conditional rather than universal generalization capabilities.
- Model performance is significantly affected by physical regime, temporal scale, and initial conditions.
- Increasing training data complexity only partially addresses generalization limitations.
- Pretraining and model scaling do not consistently eliminate biases and can sometimes introduce negative transfer.
Summary
This paper investigates the generalization capabilities of physics foundation models, which are increasingly used for spatiotemporal forecasting in physical systems. The authors argue that existing evaluations often reduce model performance to a single average score under fixed training distributions, obscuring whether models have learned generalizable physical dynamics or merely excel in specific settings. To address this, they construct a comprehensive benchmark featuring 8 physical dynamics, 3 training-data mixtures, and 25 test regimes that account for dynamic-scale and initial-condition complexity shifts. The study evaluates five different physics foundation model architectures and their variants, resulting in 60,000 measurements. The findings reveal that these models exhibit conditional generalization, with performance heavily influenced by factors such as physical regime, temporal scale, initial conditions, and model architecture. The authors conclude that improving these models requires a shift from merely scaling and expanding data to developing mechanisms that better capture transferable physical knowledge across diverse conditions.
Methodology
The authors designed a benchmark that varies key axes of generalization, including physical dynamics, training data mixtures, and test regimes. They generated PDE trajectories using the APEBench/Exponax toolkit and evaluated five different physics foundation model architectures with multiple variants, leading to a large dataset of performance measurements across various conditions.
Results
The results indicate that current physics foundation models are conditional generalists, with performance varying widely based on the physical regime and other factors. For instance, models showed significantly higher error rates under certain dynamics and distribution shifts, with some architectures performing worse when pretrained. The study highlights that simply increasing training complexity does not resolve the underlying issues of generalization.
Implications
The findings suggest that to enhance the generalization capabilities of physics foundation models, future research should focus on developing learning mechanisms that effectively capture transferable physical knowledge across different regimes and conditions, rather than solely relying on data scaling or model size.
In-Context Reward Adaptation for Robust Preference Modeling
Reinforcement Learning
Large Language Models
Theory
- Proposes In-Context Reward Adaptation to model diverse human preferences dynamically.
- Incorporates human response time as an auxiliary signal to enhance preference modeling.
- Demonstrates that traditional binary preference comparisons are insufficient for robust adaptation.
- Validates the method through experiments on synthetic and real-world datasets.
Summary
This paper addresses the limitations of static reward models in Reinforcement Learning from Human Feedback (RLHF), which often fail to generalize across diverse human preferences. The authors propose a novel framework called In-Context Reward Adaptation, leveraging the in-context learning capabilities of transformer models to adaptively infer reward structures from a small set of preference demonstrations. The study highlights that traditional approaches, which rely solely on binary preference comparisons, are insufficient for adapting to unseen human preferences. To overcome this, the authors introduce human response time as an auxiliary input, which provides additional information about preference strength. The theoretical analysis demonstrates that incorporating response time restores the identifiability of reward parameters, enabling effective adaptation to new preference domains. Experiments on both synthetic and real-world datasets validate the robustness of the proposed method, showing significant improvements in handling preference distribution shifts. Overall, the findings suggest that scalable human-AI alignment requires adaptive learning mechanisms and richer forms of human feedback beyond binary comparisons.
Methodology
The authors utilize a transformer-based architecture to implement In-Context Reward Adaptation, allowing the model to infer reward structures from a few preference demonstrations. They introduce human response time as an auxiliary input to enhance the model's ability to adapt to unseen preferences, addressing the limitations of traditional binary preference data.
Results
The proposed method shows substantial improvements in robustness when adapting to previously unseen human preferences, outperforming traditional static reward models. The incorporation of response time significantly enhances the model's ability to identify and adapt to diverse reward structures, demonstrating effective handling of preference distribution shifts.
Implications
This work suggests that future RLHF systems should integrate adaptive learning mechanisms and consider richer forms of human feedback to improve alignment with diverse human preferences. The findings could lead to more flexible and scalable AI systems capable of better understanding and responding to human values.
Molecular Lead Optimization via Agentic Tool Planning
Optimization
Large Language Models
- TRACE introduces a trajectory-aware approach to molecular lead optimization, improving decision-making over traditional methods.
- The agent effectively coordinates multiple optimization tools while adhering to structural constraints.
- In-context self-correction enhances the stability and reliability of the optimization process.
- Experimental results show significant improvements in ADMET-related properties compared to baseline models.
Summary
This paper addresses the critical stage of molecular lead optimization in drug discovery, which involves refining lead compounds to enhance their ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties while maintaining their binding affinity to target proteins. The authors introduce TRACE, a trajectory-aware agent that utilizes large language model (LLM) reasoning to optimize molecular leads through sequential decision-making. Unlike traditional one-step optimization methods, TRACE considers the long-term implications of design decisions, allowing for more effective and robust optimization. The agent coordinates a diverse set of optimization tools under structural constraints, employs an in-context self-correction mechanism to stabilize its operations, and utilizes a similarity-guided trajectory reuse strategy to enhance efficiency. The experimental results demonstrate that TRACE outperforms baseline models in terms of optimization success, property improvements, and structural validity, while maintaining molecular similarity. This approach not only enhances the lead optimization process but also offers a scalable solution to the challenges faced in drug discovery.
Methodology
TRACE employs a large language model to guide the optimization process, formulating tool selection as a sequential decision-making problem. It integrates an in-context self-correction mechanism to refine tool instructions based on execution outcomes, and utilizes a multi-step evolutionary exploration strategy to accumulate beneficial modifications. Additionally, it incorporates a similarity-guided trajectory reuse strategy to leverage historical optimization experiences.
Results
The experiments conducted on various ADMET optimization tasks reveal that TRACE achieves higher optimization success rates, larger improvements in relevant properties, and greater structural validity compared to existing baseline models, while also preserving molecular similarity.
Implications
The TRACE framework has significant implications for drug discovery, providing a more efficient and effective method for lead optimization. Its ability to balance optimization effectiveness with computational efficiency could streamline the drug development process, potentially reducing costs and time associated with bringing new drugs to market.
HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization
Large Language Models
Efficient ML
Optimization
- HARP is a learnable two-sided orthogonal processor designed for extreme low-bit PTQ.
- It adapts the quantization basis to each layer and backend, improving robustness against outliers.
- HARP maintains full-precision equivalence and is compatible with existing Hadamard-based PTQ pipelines.
- The method shows significant improvements in perplexity and accuracy over fixed RHT across various model sizes.
Summary
The paper introduces HARP (Hadamard-preconditioned Adaptive Rotation Processor), a novel approach to post-training quantization (PTQ) for large language models (LLMs). HARP addresses the challenges of extreme low-bit quantization, which is sensitive to activation outliers and anisotropic weight curvature. Traditional methods using fixed randomized Hadamard transforms (RHTs) improve robustness but lack adaptability to specific layers or quantization distributions. HARP replaces fixed Hadamard mixing with a learnable structured orthogonal processor that adapts to each layer and quantizer backend while preserving full-precision equivalence. The processor utilizes sparse butterfly-like block-orthogonal stages and supports non-power-of-two dimensions. It is initialized to the RHT processor and refined using calibration data. The results demonstrate that HARP consistently improves perplexity and zero-shot accuracy across various bit settings (2-4 bits) on models ranging from 1B to 70B parameters, while maintaining high deployment efficiency.
Methodology
HARP employs a structured two-sided orthogonal processor that utilizes sparse butterfly-like block-orthogonal stages. It is initialized to a randomized Hadamard processor and refined using calibration data to adapt to specific layers and quantization backends. The methodology includes layerwise PTQ objectives that minimize output error using curvature approximations derived from calibration activations.
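The full-precision equivalence mentioned above can be checked numerically for the fixed randomized Hadamard baseline that HARP is initialized from. The sketch below is not HARP's learnable two-sided processor: it rotates only the input dimension, and the sizes, seed, and injected outlier are assumptions chosen for illustration.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def randomized_hadamard(n, rng):
    """Orthogonal matrix Q = H D / sqrt(n), with D a random sign diagonal."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) * signs / np.sqrt(n)  # scales column j by signs[j]

rng = np.random.default_rng(0)
n = 64
Q = randomized_hadamard(n, rng)
W = rng.standard_normal((16, n))
x = rng.standard_normal(n)
x[3] = 50.0  # a single activation outlier

# Rotating weights and activations by the same orthogonal Q leaves the
# full-precision output unchanged: (W Q)(Q^T x) = W x.
y_ref = W @ x
y_rot = (W @ Q) @ (Q.T @ x)

# The rotation spreads the outlier across coordinates, shrinking the range
# that a low-bit quantizer has to cover.
peak_before = np.abs(x).max()
peak_after = np.abs(Q.T @ x).max()
```

Because any orthogonal `Q` satisfies the same identity, the rotation can be replaced by a learned orthogonal transform without changing the full-precision model, which is the property the summary attributes to HARP.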
Results
HARP demonstrated consistent quality gains over fixed RHT, improving perplexity and zero-shot accuracy across 2-4 bit quantization settings on models ranging from 1B to 70B parameters. It also preserved high throughput, achieving 128 tokens per second, significantly faster than FP16.
Implications
The development of HARP has significant implications for deploying large language models in resource-constrained environments, enabling efficient quantization without sacrificing performance. This can facilitate broader adoption of LLMs in applications where memory and bandwidth are critical constraints.
Optimal Gap-Dependent Regret for Private Stochastic Decision-Theoretic Online Learning
Theory
- Introduces a horizon-free algorithm for stochastic decision-theoretic online learning under pure differential privacy.
- Achieves an optimal regret bound that resolves a COLT open problem regarding gap-dependent regret rates.
- Utilizes a novel approach of randomizing prefix lengths to control privacy and regret effectively.
- Clarifies the separation between statistical and privacy costs in the context of online learning.
Summary
This paper addresses the problem of stochastic decision-theoretic online learning under the constraints of pure differential privacy. Specifically, it tackles an open problem posed by Hu and Mehta regarding the optimal gap-dependent regret rate in this context. The authors present a horizon-free algorithm that achieves a regret bound of Reg_T ≤ 1000 · (log K / Δ_min + log K / ε), where K is the number of actions, Δ_min is the minimum gap between the best and second-best actions, and ε is the privacy parameter. The proposed algorithm operates by partitioning time into blocks of exponentially increasing sizes, where a single action is played throughout each block. The next action is determined using an exponential mechanism applied to a data-independent random prefix of the previous block, effectively converting block regret into a sum of softmax selection errors. The authors provide a direct factorization proof to ensure privacy, demonstrating that the statistical cost of identifying the best action and the privacy cost of eliminating suboptimal actions are distinct and clearly defined. This work not only resolves the gap-dependent pure-DP rate but also extends its applicability beyond the standard product-over-actions stochastic model to cases with arbitrary dependence among loss vectors.
Methodology
The authors developed a randomized-prefix version of the Follow-The-Leader algorithm, partitioning time into dyadic blocks. Within each block, a single action is repeated, and at the end of the block, a random prefix length is sampled to apply the exponential mechanism for action selection. This method ensures that privacy is maintained while minimizing regret through careful control of selection errors.
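The block structure described above can be sketched as follows. This is a rough illustration, not the paper's algorithm: the uniform sampling of the prefix length, the exponential-mechanism scale `eps / 2` (sensitivity 1 for losses in [0, 1]), and the function name `private_block_ftl` are assumptions, and the sketch does not reproduce the privacy or regret analysis.

```python
import numpy as np

def private_block_ftl(losses, eps, rng):
    """Sketch of randomized-prefix Follow-The-Leader over dyadic blocks.

    losses: array of shape (T, K) with entries in [0, 1].
    Returns the list of actions played, one per round.
    """
    T, K = losses.shape
    actions = []
    action = int(rng.integers(K))  # arbitrary data-independent first action
    start, size = 0, 1
    while start < T:
        end = min(start + size, T)
        actions.extend([action] * (end - start))  # one action per block
        # Data-independent random prefix length of the block just played
        # (uniform sampling is an assumption of this sketch).
        m = int(rng.integers(1, end - start + 1))
        prefix_loss = losses[start:start + m].sum(axis=0)
        # Exponential mechanism: P(i) proportional to exp(-eps * loss_i / 2).
        logits = -eps * prefix_loss / 2.0
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        action = int(rng.choice(K, p=probs))
        start, size = end, 2 * size
    return actions

# Toy problem (hypothetical): action 0 has the lowest mean loss.
rng = np.random.default_rng(0)
T, K = 127, 3
means = np.array([0.2, 0.8, 0.8])
losses = (rng.random((T, K)) < means).astype(float)  # Bernoulli losses
actions = private_block_ftl(losses, eps=1.0, rng=rng)
```

Each round belongs to exactly one block, so a single loss vector can influence at most one exponential-mechanism draw in this sketch.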
Results
The main result is the establishment of a regret bound of Reg_T ≤ 1000 · (log K / Δ_min + log K / ε), which matches the known lower bound for the problem, thus closing the gap in the literature regarding optimal rates for private stochastic decision-making.
Implications
This work has significant implications for the design of algorithms in online learning scenarios where privacy is a concern, particularly in applications involving sensitive data. It provides a framework for achieving optimal performance while adhering to privacy constraints, which is crucial in fields such as finance, healthcare, and personalized services.
Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models
Generative Models
Multimodal
Computer Vision
- Introduces a reward-free framework for text-to-image alignment in diffusion models.
- Addresses limitations of existing contrastive learning methods by providing explicit guidance for negative pairs.
- Achieves significant improvements in semantic consistency and counting accuracy in generated images.
- Compatible with existing diffusion model architectures, enhancing their performance without extensive retraining.
Summary
This paper introduces Alignment-Guided Score Matching (AGSM), a novel framework aimed at improving text-to-image alignment in diffusion models. Traditional diffusion models excel in generating realistic images but often fail to align these images accurately with corresponding text prompts. Existing post-training methods rely heavily on external rewards or human preferences, which can lead to inconsistencies based on the quality of these signals. The authors propose a reward-free approach that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective of diffusion models. This method addresses the limitations of previous contrastive learning approaches, which can excessively penalize negative pairs, leading to issues like over-counting and repetition in generated images. By employing a Plackett-Luce model for preference learning, AGSM provides explicit guidance for both positive and negative text-image pairs, enhancing the stability and coherence of the generated outputs. The experimental results demonstrate that AGSM matches the performance of the existing SoftREPA method while significantly reducing its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. The method is lightweight, model-agnostic, and can be seamlessly integrated into existing diffusion architectures.
Methodology
The authors propose a post-training framework that optimizes soft tokens using a Plackett-Luce model to derive alignment preferences from the diffusion model's internal log-likelihood. This approach integrates contrastive alignment guidance directly into the score-matching objective, allowing for explicit score-level guidance for both positive and negative text-image pairs, thus preventing off-manifold drift and enhancing generative fidelity.
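As background for the preference model named above, the sketch below computes the log-probability of a ranking under a Plackett-Luce model. The scores are placeholder numbers; in AGSM they are described as coming from the diffusion model's internal log-likelihood, which this sketch does not attempt to compute.

```python
import numpy as np

def plackett_luce_log_prob(scores, ranking):
    """Log-probability of `ranking` (best first) under Plackett-Luce.

    P(ranking) = prod_k exp(s_{r_k}) / sum_{j >= k} exp(s_{r_j}).
    """
    s = np.asarray(scores, dtype=float)[list(ranking)]
    log_p = 0.0
    for k in range(len(s)):
        tail = s[k:]
        # Log-sum-exp over the items that have not been placed yet.
        lse = tail.max() + np.log(np.exp(tail - tail.max()).sum())
        log_p += s[k] - lse
    return log_p

# Hypothetical alignment scores: one matching prompt, two mismatched ones.
scores = [2.0, 0.5, -1.0]
log_p_correct = plackett_luce_log_prob(scores, ranking=(0, 1, 2))
log_p_reversed = plackett_luce_log_prob(scores, ranking=(2, 1, 0))
```

With only two items the model reduces to a logistic (Bradley-Terry) comparison of the two scores, so the ranking form can be read as its generalization to several negatives.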
Results
The proposed AGSM method matches the performance of SoftREPA while significantly reducing its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. The method is shown to be effective across various diffusion model backbones, including SD1.5, SDXL, and SD3.
Implications
The findings suggest that AGSM can be effectively utilized to enhance text-to-image generation tasks in various applications, such as content creation, digital art, and interactive media, where precise alignment between text and visual content is crucial. Additionally, the lightweight and model-agnostic nature of AGSM allows for broader applicability across different generative models.
Sequential Physics-Constrained Neural Operator Forward Modeling for the Norne Reservoir System
Theory
Efficient ML
Time Series
- Development of a physics-constrained neural operator framework for reservoir modeling.
- Rigorous mathematical formulation addressing stability and convergence issues.
- Empirical validation shows high accuracy and efficiency compared to traditional simulators.
- Demonstration of the self-reinforcing cycle between physics constraints and training efficiency.
Summary
This paper presents a novel framework for sequential surrogate modeling of three-phase black-oil reservoir dynamics using neural operators, specifically focusing on Fourier Neural Operators (FNO) and their physics-informed variant (PINO). The study targets the Norne benchmark reservoir, characterized by a heterogeneous grid and a production history of 30 timesteps over 3298 days. The authors address four interrelated theoretical challenges: (1) embedding the black-oil system in a product-Sobolev-space framework, proving well-posedness and establishing local Lipschitz estimates; (2) analyzing covariate shift and distributional divergence, showing how the Wasserstein-2 distance between true and predicted states grows; (3) demonstrating that PINO training reduces the spectral radius of the learned operator’s Jacobian, enhancing stability; and (4) formalizing K-step truncated backpropagation through time (TBPTT) as a biased stochastic gradient estimator, deriving optimal window sizes for training. Empirical results validate the theoretical predictions, showing high R² values for oil and gas saturation and pressure, and significant speedup over traditional simulators, enabling practical applications in reservoir management and uncertainty quantification.
Methodology
The authors employ Fourier Neural Operators (FNO) and their physics-informed variant (PINO) to model reservoir dynamics. They establish a mathematical framework using product-Sobolev spaces, analyze covariate shifts, and derive stability conditions for the learned operators. The methodology includes K-step truncated backpropagation through time (TBPTT) for training the autoregressive models.
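The link between the Jacobian's spectral radius and rollout stability can be illustrated on a linear toy map. This is only a sketch of the general principle, not the paper's analysis: the symmetric random matrix, the radii 0.9 and 1.1, and the helper names are assumptions, and symmetry is used so that the spectral radius controls the error norm directly.

```python
import numpy as np

def rollout_error(A, e0, steps):
    """Norm of a perturbation e0 after `steps` applications of the linear map A."""
    e = e0.copy()
    for _ in range(steps):
        e = A @ e
    return np.linalg.norm(e)

def with_spectral_radius(M, rho):
    """Rescale a symmetric matrix so that its spectral radius equals rho."""
    return M * (rho / np.abs(np.linalg.eigvalsh(M)).max())

rng = np.random.default_rng(0)
B = rng.standard_normal((8, 8))
S = (B + B.T) / 2            # symmetric stand-in for a one-step Jacobian
e0 = rng.standard_normal(8)
e0 /= np.linalg.norm(e0)     # unit-norm initial perturbation

steps = 30  # matches the 30-timestep history mentioned in the summary
stable = rollout_error(with_spectral_radius(S, 0.9), e0, steps)
unstable = rollout_error(with_spectral_radius(S, 1.1), e0, steps)
# For a symmetric map, the error is bounded by rho**steps, so a radius
# below 1 damps the perturbation over the autoregressive rollout.
```

In this toy setting the two rollouts differ by a factor of (1.1 / 0.9) ** 30, which shows how strongly a modest change in spectral radius compounds over 30 autoregressive steps.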
Results
The empirical results indicate that autoregressive PINO surrogates achieve R² values greater than 0.99 for oil saturation, over 0.90 for gas saturation, and approximately 0.80 for pressure, with continuous improvement in water saturation predictions. The models demonstrate significant computational efficiency, achieving a 10,000-fold speedup over traditional simulators, enabling rapid Bayesian inversion and uncertainty quantification.
Implications
This research has significant implications for the petroleum industry, particularly in enhancing the efficiency of reservoir simulations, enabling faster decision-making processes in reservoir management, and facilitating advanced uncertainty quantification workflows.
Mean-Field Diffuser: Scaling Offline MARL to Thousands of Agents
Reinforcement Learning
Generative Models
Theory
- MF-Diffuser effectively scales offline MARL to thousands of agents by utilizing mean-field approximations.
- The framework combines trajectory planning with generative modeling in the Wasserstein space.
- Hierarchical coarse-to-fine planning allows for efficient population growth during the denoising process.
- Theoretical guarantees provide insights into suboptimality and convergence of the generated policies.
Summary
The paper introduces MF-Diffuser, a novel framework designed to scale offline multi-agent reinforcement learning (MARL) to environments with thousands of agents. Traditional diffusion-based planning methods excel in single-agent settings but struggle with the curse of dimensionality in multi-agent scenarios. MF-Diffuser addresses this challenge by modeling trajectory planning in the Wasserstein space of trajectory distributions, leveraging the propagation of chaos to approximate the behavior of a large population using a small representative subset of agents. The framework incorporates a value-weighted chaotic entropy objective that balances generative fidelity with return maximization, and employs a hierarchical coarse-to-fine strategy to incrementally increase the agent population during the denoising process. The authors establish theoretical guarantees regarding suboptimality and convergence, demonstrating that the mean-field approximation error scales as O(H²/√N) and that the generated policy approximates a mean-field Nash equilibrium. Experimental results across three benchmarks indicate that MF-Diffuser outperforms existing methods, particularly in scenarios with suboptimal offline data and large agent populations.
Methodology
MF-Diffuser employs a diffusion-based planning approach that models the trajectory distribution of agents in the Wasserstein space. It utilizes a mean-field value score matching objective to unify distributional fidelity and return maximization. The framework also implements a hierarchical planning strategy that starts with a small representative group of agents and progressively expands to the full population during the denoising process. Theoretical analysis is conducted to derive suboptimality bounds and convergence guarantees.
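The 1/√N dependence in the stated approximation error can be seen in a toy Monte Carlo experiment, where a small sample of agents stands in for the population. This is an illustration of the scaling only, under the assumption of i.i.d. standard normal agent states; it does not model trajectories, the horizon H, or the paper's bound.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_field_error(n_agents, n_trials=200):
    """Average gap between an N-agent empirical mean and the population mean.

    Agent states are i.i.d. standard normal here (a toy stand-in), so the
    population mean is exactly 0.
    """
    samples = rng.standard_normal((n_trials, n_agents))
    return np.abs(samples.mean(axis=1)).mean()

errors = {n: mean_field_error(n) for n in (100, 400, 1600)}
# Quadrupling N should roughly halve the error (1 / sqrt(N) scaling).
```

The same scaling is why a small representative subset can approximate a large population: the subset's error shrinks with the square root of its size rather than requiring every agent to be modeled.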
Results
The experiments demonstrate that MF-Diffuser achieves the highest returns in most tested scenarios, particularly excelling with suboptimal offline data and at extreme scales (N ≥ 1000). The theoretical analysis confirms that the mean-field approximation error diminishes with larger populations, and the generated policies converge to a mean-field Nash equilibrium.
Implications
The findings suggest that MF-Diffuser can be applied to various real-world multi-agent systems, such as traffic control, swarm robotics, and financial modeling, where decision-making is influenced by a large number of interacting agents. The framework's ability to handle large-scale data efficiently opens new avenues for research and application in offline reinforcement learning.
Unveiling Multi-regime Patterns in SciML: Distinct Failure Modes and Regime-specific Optimization
Optimization
Theory
- Identification of a consistent three-regime structure in SciML models: Well-Trained, Under-Trained, and Over-Trained.
- Optimization methods exhibit regime-specific effectiveness, necessitating tailored approaches for different training regimes.
- Fine-grained failure modes in SciML challenge traditional loss-landscape interpretations.
- Development of a regime-aware diagnostic framework that connects optimizer behavior to loss landscape features.
Summary
This paper investigates the multi-regime behavior of scientific machine learning (SciML) models, revealing that different hyper-parameter settings lead to distinct training regimes characterized by consistent performance within regimes and qualitative differences across them. The authors introduce a regime-aware diagnostic framework that analyzes model performance, training dynamics, and loss-landscape geometry. They identify a three-regime structure—Well-Trained, Under-Trained, and Over-Trained—common across various SciML models, optimization methods, and physical systems. The study finds that optimization effectiveness is regime-specific, with no single method performing optimally across all regimes. Additionally, the paper highlights fine-grained failure modes in SciML models that challenge conventional interpretations of loss-landscape metrics. The findings are validated on widely-used SciML models, including physics-informed neural networks, neural operators, and neural ordinary differential equations, across benchmarks of ordinary and partial differential equations. The results provide a unified perspective on failure modes in SciML and offer regime-aware guidance for enhancing model robustness.
Methodology
The authors developed a regime-aware diagnostic framework that systematically varies both physical and training regimes. They conducted empirical evaluations by analyzing model performance, training dynamics, and loss-landscape structures to create regime maps that reveal consistent patterns across different SciML models.
Results
The study confirmed the existence of a three-regime structure in SciML models, with distinct boundaries on training and test-error plots. The analysis showed that different optimization methods are effective only within specific regimes, and the models exhibited complex loss landscape structures that differ from typical computer vision models.
Implications
The findings suggest that a deeper understanding of regime-specific behaviors can lead to improved robustness in SciML applications. The regime-aware framework can guide practitioners in selecting appropriate optimization strategies and model designs based on the identified training regimes.
RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood
Reinforcement Learning
NLP
Large Language Models
- Introduction of RL2ML, a family of finite-rollout surrogate objectives.
- Development of a closed-form, unbiased gradient estimator for RLVR.
- Identification of a subcritical-supercritical update-scale transition affecting training dynamics.
- Establishment of a one-dimensional optimization framework for selecting surrogate objectives.
Summary
The paper introduces RL2ML, a novel family of finite-rollout surrogate objectives that bridges the gap between reinforcement learning (RL) and maximum likelihood (ML) training. It addresses the challenges faced by correctness-based Reinforcement Learning with Verifiable Rewards (RLVR), particularly the conflation of the objective optimized in expectation and the stochastic update geometry induced by finite rollout groups. RL2ML provides a closed-form, unbiased gradient estimator that maintains estimator-objective alignment under a fixed rollout budget. The work reveals a subcritical-supercritical update-scale transition, which is crucial for understanding how empirical success counts influence stochastic updates. The analysis demonstrates that the optimal choice of surrogate objective is not solely determined by its proximity to maximum likelihood but is influenced by the evaluation metric, local sensitivity, and estimator variance. This leads to the formulation of the choice of the surrogate objective as a one-dimensional optimization problem, rather than treating it as an unconstrained hyperparameter.
Methodology
The paper develops the RL2ML framework by defining a truncated power-likelihood surrogate objective. It derives an unbiased estimator under finite rollout conditions and analyzes the update-scale geometry. The methodology includes calibrated local-gain analysis and variance decomposition to understand the effects of different surrogate objectives on training outcomes.
Results
The results indicate that the best choice of surrogate objective is context-dependent, influenced by the evaluation metric and local conditions rather than being a fixed parameter. The analysis shows that supercritical weighting of low-success prompts does not necessarily lead to optimal performance in finite-horizon training, suggesting a more nuanced approach to objective selection.
Implications
The findings have significant implications for the design of training objectives in RL, particularly in scenarios with limited rollout budgets. The insights can enhance the efficiency and effectiveness of training language models and other systems that rely on binary feedback, potentially improving their performance in real-world applications.
Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content
NLP
Large Language Models
Efficient ML
- Introduction of Opir, an efficient multi-task safety classification model for LLMs.
- Development of a comprehensive three-level safety taxonomy with 996 categories.
- Release of edge variants with fewer than 100M parameters for binary safe/unsafe classification.
- Competitive performance against eight contemporary guardrail systems with lower latency.
Summary
The paper presents Opir, a family of encoder-based guardrail models designed for real-time safety filtering in large language model (LLM) applications. Opir aims to efficiently classify unsafe prompts, toxic language, jailbreak attempts, and harmful content while maintaining a lower cost profile compared to existing large guardrail models. The models are built on the GLiClass architecture and include multi-task capabilities for binary safe/unsafe classification, multi-label toxicity classification, jailbreak detection, and zero-shot categorization of unsafe prompts and responses. The training data is derived from a comprehensive three-level safety taxonomy with 996 categories, combining unsafe prompts, adversarial hard negatives, and benign examples. The authors also release an evaluation harness for benchmarking against contemporary guardrail systems. Results indicate that Opir models are competitive with larger models across various safety classification tasks while achieving significantly lower latency, making them suitable for both cloud and edge deployments.
Methodology
Opir employs the GLiClass architecture to enable efficient multi-task classification. The models are trained on a diverse dataset that includes taxonomy-grounded unsafe prompts, adversarially mined hard negatives, and benign examples. The evaluation harness supports various safety classification tasks and benchmarks against existing guardrail systems.
Results
Opir models demonstrate competitive performance on 12 safety-classification tasks and 17 category tasks, often outperforming larger models while maintaining a smaller deployment footprint. Latency measurements show that encoder variants achieve sub-30 ms latency at 1024 tokens, with the smallest edge model achieving below 10 ms latency.
Implications
The development of Opir has significant implications for real-time content moderation in LLM applications, providing a cost-effective and efficient solution for detecting harmful content. Its deployment can enhance user safety in chatbots and other interactive AI systems, making it easier to implement robust guardrails without the high costs associated with larger models.
Striding Across Reynolds Numbers: Representation Geometry in Neural PDE Generalisation
Theory
- Representation geometry significantly influences cross-Reynolds number generalization in neural PDE solvers.
- ConvAE-Relay demonstrates effective state matching without requiring target-regime fitting.
- Local and multi-scale representations outperform global linear or spectral representations.
- Autoregressive drift is identified as a primary bottleneck in predictive accuracy.
Summary
This paper investigates the challenges of cross-Reynolds number generalization in neural PDE solvers, particularly focusing on the forced 2D Navier–Stokes benchmark. The study reveals that a trained Fourier Neural Operator (FNO) achieves a relative L2 error of 46.68% under a 10× Reynolds-number shift, which is outperformed by zero-forward-model retrieval methods. This suggests that representation geometry plays a crucial role in the performance of different methods. The author introduces ConvAE-Relay, a convolutional autoencoder that matches states in a latent space and utilizes dynamics from a source-regime database, achieving a relative error of 38.34±0.07% without any target-regime fitting. The paper conducts ablation studies to isolate the impact of matching quality over update rules and confirms that the directions of source-regime dynamics remain transferable. Additionally, a U-Net with multi-scale skip connections is tested, yielding a relative error of 34.72±0.60%. The findings indicate that local and multi-scale representations are more effective than global linear or spectral representations for cross-Reynolds transfer. The study concludes by identifying autoregressive drift as a significant bottleneck in predictive performance, emphasizing the importance of representation geometry in neural PDE generalization.
Methodology
The paper employs a combination of learned predictors and zero-parameter retrieval methods to evaluate performance on the forced 2D Navier–Stokes benchmark. It introduces ConvAE-Relay for state matching in a convolutional autoencoder latent space and conducts ablation studies to analyze the effects of representation geometry and update rules. Oracle experiments are also performed to dissect error sources.
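The retrieval idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `encode` stands in for the trained convolutional encoder, `db_latents` and `db_deltas` are a hypothetical source-regime database of latent codes and their one-step state changes, and the additive update is an assumed update rule.

```python
import numpy as np

def relative_l2_error(pred, target):
    """Relative L2 error ||pred - target|| / ||target||, returned as a fraction."""
    return float(np.linalg.norm(pred - target) / np.linalg.norm(target))

def relay_step(state, encode, db_latents, db_deltas):
    """Advance one step by borrowing dynamics from the nearest source-regime state.

    encode     : callable mapping a state to a 1-D latent vector (assumed)
    db_latents : (N, d) latent codes of source-regime states (assumed)
    db_deltas  : (N, ...) one-step changes observed after each of those states (assumed)
    """
    z = encode(state)
    dists = np.linalg.norm(db_latents - z, axis=1)  # distance to every database entry
    nearest = int(np.argmin(dists))                 # best match in latent space
    return state + db_deltas[nearest]               # reuse the matched source dynamics

def rollout(state, encode, db_latents, db_deltas, n_steps):
    """Autoregressive rollout; matching errors feed into later steps and can drift."""
    trajectory = [state]
    for _ in range(n_steps):
        state = relay_step(state, encode, db_latents, db_deltas)
        trajectory.append(state)
    return np.stack(trajectory)
```

No parameters are fitted on the target regime here; only the matching space (the encoder) determines which source dynamics are reused.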
Results
The ConvAE-Relay method achieved a relative L2 error of 38.34±0.07%, outperforming PCA relay and demonstrating the effectiveness of learned representations. The U-Net model achieved a relative error of 34.72±0.60%. The study found that autoregressive drift accounted for approximately 12 percentage points of error, highlighting the challenges in long-horizon predictions.
Implications
The findings suggest that improving representation geometry could enhance the performance of neural PDE solvers in fluid dynamics and other engineering applications. The insights into autoregressive drift may inform future research on mitigating error accumulation in long-term predictions.
On-Policy Replay for Continual Supervised Fine-Tuning
NLP
Large Language Models
- Introduces On-Policy Replay (OPR) to enhance continual SFT without auxiliary losses.
- Demonstrates that on-policy signals can effectively reduce catastrophic forgetting in LLMs.
- Provides a label-free scoring method (OPR-SC) for constructing replay buffers.
- Achieves significant improvements in backward transfer metrics across multiple LLMs.
Summary
This paper introduces On-Policy Replay (OPR), a novel approach to continual supervised fine-tuning (SFT) of large language models (LLMs) that addresses the issue of catastrophic forgetting. Traditional SFT methods often lead to significant performance degradation on previously learned tasks when new tasks are introduced. The authors argue that existing on-policy methods, which utilize a teacher model for supervision, inherit drawbacks such as additional computational overhead and potential stylistic drift. OPR circumvents these issues by directly utilizing the model's own outputs as training data. The method involves rolling out the latest model checkpoint on a limited set of historical prompts, filtering the generated responses based on task-specific rewards, and replaying the selected (prompt, response) pairs as standard SFT examples without the need for a teacher or auxiliary loss. The authors demonstrate that OPR consistently reduces forgetting across multiple instruction-tuned LLMs on the TRACE continual-learning benchmark, achieving significant improvements in backward transfer metrics compared to baseline methods. The findings suggest that the on-policy distribution of data is crucial for mitigating forgetting, rather than the quality of responses alone.
Methodology
The OPR method involves rolling out the current model checkpoint on a small budget of historical prompts, scoring the generated outputs based on task rewards, and retaining only the top-performing (prompt, response) pairs for subsequent training. This approach eliminates the need for a teacher model and auxiliary loss terms, maintaining the integrity of the SFT training loop.
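The loop described above might look roughly like the following sketch. The names `generate` and `reward`, the per-prompt sample count, and the reward threshold are all hypothetical; the summary does not specify how many rollouts are drawn or how the cut-off is chosen.

```python
from typing import Callable, List, Tuple

def build_on_policy_replay(
    prompts: List[str],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    samples_per_prompt: int = 4,
    min_reward: float = 0.5,
) -> List[Tuple[str, str]]:
    """Build a replay buffer from the current model's own outputs.

    generate : samples a response from the latest checkpoint (assumed interface)
    reward   : task-specific score for a (prompt, response) pair (assumed interface)
    """
    buffer = []
    for prompt in prompts:
        # Roll out the current checkpoint several times on a historical prompt.
        candidates = [generate(prompt) for _ in range(samples_per_prompt)]
        scored = [(reward(prompt, c), c) for c in candidates]
        best_reward, best_response = max(scored, key=lambda rc: rc[0])
        # Keep only responses that clear the reward filter.
        if best_reward >= min_reward:
            buffer.append((prompt, best_response))
    return buffer
```

The resulting pairs are mixed into the next task's data as ordinary SFT examples, so the training loop itself needs no teacher model or extra loss term.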
Results
OPR was tested on three instruction-tuned LLMs (Qwen2.5-7B-Instruct, Qwen3-8B, Llama3.1-8B-Instruct) and consistently improved backward transfer (BWT), where values closer to zero indicate less forgetting. In the most challenging scenario, OPR improved BWT from -13.93 to -2.29 at a 1% replay budget; relative to a tuned Vanilla Replay baseline, forgetting was reduced by 46%. OPR also outperformed existing methods such as SDFT by 4.2 accuracy points without requiring a teacher forward pass.
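For readers unfamiliar with the metric, backward transfer is commonly computed from a matrix of per-task scores recorded after each training stage. The sketch below uses that common definition; whether the paper uses exactly this variant is an assumption, and the numbers in the usage note are made up.

```python
from typing import Sequence

def backward_transfer(acc: Sequence[Sequence[float]]) -> float:
    """Average change on earlier tasks after training on the final task.

    acc[i][j] is the score on task j measured right after training on task i.
    Negative values mean earlier tasks got worse (forgetting).
    """
    n_tasks = len(acc)
    if n_tasks < 2:
        raise ValueError("backward transfer needs at least two tasks")
    final = acc[n_tasks - 1]
    # Compare the final score on each earlier task with the score right after learning it.
    return sum(final[j] - acc[j][j] for j in range(n_tasks - 1)) / (n_tasks - 1)
```

With the illustrative matrix `[[80, 0, 0], [70, 90, 0], [60, 85, 75]]`, the first task drops by 20 points and the second by 5, giving a BWT of -12.5.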
Implications
The findings suggest that OPR can be a practical solution for deploying LLMs in dynamic environments where continual adaptation is necessary. By reducing catastrophic forgetting, OPR can enhance the performance of LLMs in real-world applications, particularly in scenarios with limited task annotations.
MōLe-Λ: Learning the Coupled-Cluster Response State for Energies, Gradients, and Properties
Theory
Efficient ML
- MōLe-Λ extends MōLe by predicting both T and Λ amplitudes, enhancing the range of molecular properties that can be computed.
- The model maintains the symmetry constraints of the original MōLe architecture while adding new readouts for Λ amplitudes.
- MōLe-Λ achieves CC-quality accuracy for energies and forces while recovering higher-order properties that standard models cannot.
- The computational efficiency of MōLe-Λ is significantly improved, being over two orders of magnitude faster than full CCSD calculations.
Summary
The paper introduces MōLe-Λ, an advanced machine learning framework that extends the capabilities of Molecular Orbital Learning (MōLe) to predict both right-hand (T) and left-hand (Λ) coupled-cluster amplitudes from localized Hartree-Fock molecular orbitals. Coupled-cluster (CC) theory is recognized as the gold standard in quantum chemistry for its accuracy in predicting molecular properties, but its high computational cost has limited its application to small systems. MōLe-Λ addresses this limitation by efficiently learning the CC response state, enabling the simultaneous prediction of energies, forces, and a wide range of molecular properties such as dipoles, quadrupoles, and polarizabilities. The architecture of MōLe-Λ retains the symmetry constraints of the original MōLe model while incorporating new readouts for Λ amplitudes. The empirical results demonstrate that MōLe-Λ not only achieves CC-quality accuracy in energies and forces but also significantly accelerates the computation compared to traditional CC methods, making it a promising approach for developing surrogate models in correlated quantum chemistry.
Methodology
MōLe-Λ employs a machine learning architecture that predicts coupled-cluster amplitudes from localized Hartree-Fock molecular orbitals. It integrates T and Λ amplitudes into a single model, preserving the symmetry constraints of the original MōLe framework. The model is trained on data derived from quantum chemical calculations to learn the relationships between molecular orbitals and the corresponding amplitudes.
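The reason both amplitude sets are needed can be seen from standard coupled-cluster response theory. The equations below are textbook background rather than anything taken from the paper, and orbital-relaxation contributions are left out for brevity.

```latex
\begin{align}
E_{\mathrm{CC}} &= \langle \Phi_0 | e^{-T} H e^{T} | \Phi_0 \rangle \\
\mathcal{L}(T, \Lambda) &= \langle \Phi_0 | (1 + \Lambda)\, e^{-T} H e^{T} | \Phi_0 \rangle \\
\gamma_{pq} &= \langle \Phi_0 | (1 + \Lambda)\, e^{-T} a_p^{\dagger} a_q e^{T} | \Phi_0 \rangle \\
\langle O \rangle &= \sum_{pq} O_{pq}\, \gamma_{pq}
\end{align}
```

The energy depends only on T, whereas the one-particle density matrix, and with it one-electron properties such as dipoles, also requires Λ.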
Results
The MōLe-Λ model demonstrates high accuracy in predicting correlated energies and forces, achieving results competitive with established machine learning interatomic potentials. Additionally, it successfully recovers various molecular properties, including dipoles, quadrupoles, polarizabilities, and electron densities, which are critical for understanding molecular behavior. The model's speed advantage allows it to compute these properties much faster than traditional coupled-cluster methods.
Implications
The development of MōLe-Λ has significant implications for computational chemistry, as it provides a more efficient and accurate method for predicting a wide range of molecular properties. This could facilitate the exploration of larger and more complex molecular systems, ultimately advancing fields such as drug discovery, materials science, and molecular modeling.
Bridging Chemists and AI: An Expert-Augmented Framework for Interpretable Route Evaluation
Interpretability
- Introduces an expert-augmented framework that combines machine learning with chemists' expertise for route evaluation.
- Utilizes a DeepSets-based model to assess synthetic routes based on tree edit distance and expert evaluations.
- Achieves a Spearman correlation coefficient of 0.78 and a Pearson correlation of 0.77 in category assessments.
- Demonstrates a top-1 ranking accuracy of 60.2%, significantly outperforming previous baselines.
Summary
This paper addresses the challenge of selecting efficient multi-step synthetic routes in organic synthesis, particularly in medicinal and process chemistry. Traditional data-driven assessment systems often oversimplify the complex, multi-objective nature of synthesis design and rely on proxy datasets, which may not reflect optimal chemical reasoning. The authors propose an expert-augmented, data-driven scoring framework that integrates machine learning with chemists' domain knowledge to provide both numerical scores and interpretable qualitative assessments of synthetic routes. The framework employs a DeepSets-based model trained on tree edit distance between reference and machine-generated routes, which is then fine-tuned using expert evaluations. This approach allows for the prediction of feasibility and quality of synthetic routes while capturing the nuances of expert judgment. The resulting system demonstrates significant improvements in predictive accuracy compared to previous baselines, making it a valuable tool for chemists in route evaluation.
Methodology
The authors developed a DeepSets-based scoring model that processes synthetic routes by comparing them to reference routes using tree edit distance. The model is initially trained on a large dataset of patent routes and then fine-tuned with expert evaluations to enhance its predictive capabilities. This dual approach allows the model to output both quantitative scores and qualitative assessments of route feasibility.
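The defining property of a DeepSets-style scorer is that the output does not depend on the order of its inputs. A minimal sketch, assuming a route is represented as a set of per-step feature vectors and using random placeholder weights (`W_phi` and `w_rho` are hypothetical, not the paper's trained parameters):

```python
import numpy as np

def deep_sets_score(step_features, W_phi, w_rho):
    """Score a set of feature vectors as rho(sum_i phi(x_i)).

    step_features : (n_steps, d_in) one row per reaction step (assumed encoding)
    W_phi         : (d_in, d_hidden) weights of the per-element map phi
    w_rho         : (d_hidden,) weights of the readout rho
    """
    hidden = np.tanh(step_features @ W_phi)  # phi applied to each element independently
    pooled = hidden.sum(axis=0)              # sum pooling makes the result order-invariant
    return float(pooled @ w_rho)             # rho maps the pooled vector to a scalar score

rng = np.random.default_rng(0)
route = rng.normal(size=(5, 3))
W_phi, w_rho = rng.normal(size=(3, 4)), rng.normal(size=4)
same_score = np.isclose(
    deep_sets_score(route, W_phi, w_rho),
    deep_sets_score(route[::-1], W_phi, w_rho),  # reversed step order
)
```

In the framework described above, a model of this shape would first be trained against tree-edit-distance targets and then fine-tuned on expert ratings.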
Results
The expert-augmented framework achieved a Spearman correlation of 0.782 ± 0.050 and a Pearson correlation of 0.769 ± 0.064 with expert ratings. The model also reached a top-1 ranking accuracy of 60.2% in score prediction, significantly surpassing the previous baseline of 17.5%. Additionally, it provided interpretable qualitative evaluations of synthetic routes, demonstrating its effectiveness in capturing expert judgment.
Implications
This framework has the potential to streamline the evaluation process in organic synthesis, making it more efficient and scalable. By integrating expert knowledge with machine learning, it can assist chemists in making informed decisions about synthetic routes, ultimately enhancing drug discovery and development processes.
The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction
Theory
Graph Learning
Efficient ML
- Restricting regressors to the Markov boundary can improve prediction accuracy, especially in larger and sparser feature spaces.
- Causal discovery methods often fail to effectively recover the Markov boundary for practical predictive tasks.
- The predictive costs of false negatives and false positives in feature selection are not equal, complicating the recovery process.
- Many feature sets can outperform the full feature set, indicating that the exact Markov boundary is not the only viable option.
Summary
This paper investigates the utility of the Markov boundary, the minimal set of features that renders all other features redundant for predicting a target variable, in tabular prediction. The authors analyze the performance of regressors trained on the Markov boundary versus those trained on the full feature set using a synthetic benchmark called SCM3K, which includes 3,450 tasks with varying feature counts. The findings reveal that while restricting regressors to the Markov boundary often improves prediction accuracy, the process of discovering this boundary through causal discovery methods does not yield the expected benefits. The authors identify three main reasons for this discrepancy: causal discovery prioritizes structural recovery over predictive accuracy, the costs of false positives and false negatives in feature selection are asymmetrical, and alternative feature sets can also outperform the full feature set. The paper suggests a need for prediction-aligned feature selection methods and highlights the complexities involved in effectively utilizing causal structures in tabular models.
Methodology
The authors conducted experiments using SCM3K, a controlled benchmark comprising 3,450 synthetic tasks with varying feature counts. They evaluated six different regressors, comparing their performance when trained on the full feature set versus the oracle Markov boundary. The difference in test error between these two approaches was termed the MB gap.
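When the causal graph is known, as in a synthetic benchmark, the oracle Markov boundary can be read off directly: under the usual faithfulness assumption it consists of the target's parents, its children, and the other parents of those children. The sketch below assumes the graph is given as a hypothetical `parents` dictionary; it is not the benchmark's code.

```python
from typing import Dict, Set

def markov_boundary(parents: Dict[str, Set[str]], target: str) -> Set[str]:
    """Markov boundary of `target` in a DAG given as node -> set of parent nodes."""
    # Children are the nodes that list the target among their parents.
    children = {node for node, ps in parents.items() if target in ps}
    # Spouses are the other parents of those children.
    spouses: Set[str] = set()
    for child in children:
        spouses |= parents[child]
    boundary = set(parents.get(target, set())) | children | spouses
    boundary.discard(target)  # the target is a parent of its own children
    return boundary

# Toy graph: D -> A -> Y -> C <- B, with E unconnected.
toy_dag = {"Y": {"A"}, "C": {"Y", "B"}, "A": {"D"}, "B": set(), "D": set(), "E": set()}
toy_boundary = markov_boundary(toy_dag, "Y")
```

In the toy graph the boundary of `Y` is `A`, `B`, and `C`; `D` is screened off by `A`, and `E` is irrelevant.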
Results
The results indicated that restricting to the Markov boundary generally improved prediction for most regressors, with the MB gap increasing as the feature space became larger and sparser. However, attempts to recover the boundary through causal discovery methods often fell short, as these methods consumed computational resources without achieving significant predictive gains.
Implications
The findings suggest that while the Markov boundary has theoretical advantages, practical applications in tabular prediction require more effective feature selection strategies that prioritize predictive performance. This has implications for the design of future machine learning models and methods in feature selection.
Influence-Guided Symbolic Regression: Scientific Discovery via LLM-Driven Equation Search with Granular Feedback
Large Language Models
Interpretability
Optimization
- IGSR improves symbolic regression by providing granular feedback on term contributions.
- The method combines LLM-generated candidate functions with influence scores for effective pruning.
- Integration with MCTS allows for efficient exploration of the equation search space.
- Demonstrated effectiveness on diverse datasets, including real-world biological applications.
Summary
This paper introduces Influence-Guided Symbolic Regression (IGSR), a novel method that enhances the application of Large Language Models (LLMs) in symbolic regression by addressing the limitations of existing search strategies and feedback mechanisms. Traditional approaches often rely on scalar metrics, such as global Mean Squared Error, which do not provide insights into the contributions of individual components of an equation. IGSR proposes a two-step iterative process where an LLM generates candidate basis functions for a linear model, which are then evaluated using granular influence scores that quantify each term's contribution to generalization accuracy. This allows for a systematic pruning process that refines the model structure. The integration of IGSR with a Monte Carlo Tree Search (MCTS) framework facilitates effective navigation of the combinatorial search space, balancing exploration of new functional forms with exploitation of high-influence components. The effectiveness of IGSR is demonstrated across various benchmarks, including pharmacological models and genomic data, culminating in a case study that identified a novel relationship between DNA methylation and RNA Polymerase II pausing, which was validated through wet-lab experiments.
Methodology
The methodology involves an iterative two-step process where an LLM generates candidate basis functions, which are evaluated using granular influence scores to assess their contributions to model accuracy. This feedback informs a pruning process that refines the model structure. The approach is integrated into a Monte Carlo Tree Search framework to balance exploration and exploitation in the search for optimal equations.
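One way to picture per-term feedback is a leave-one-term-out comparison on held-out data. This is a stand-in chosen for illustration, not the paper's influence score, and the fixed `basis` dictionary replaces the LLM proposal step; the MCTS search is omitted entirely.

```python
import numpy as np

def validation_mse(basis, x_tr, y_tr, x_va, y_va):
    """Fit a linear model on the given basis functions and return held-out MSE."""
    if not basis:
        return float(np.mean(y_va ** 2))  # empty model predicts zero
    phi_tr = np.column_stack([f(x_tr) for f in basis.values()])
    phi_va = np.column_stack([f(x_va) for f in basis.values()])
    coef, *_ = np.linalg.lstsq(phi_tr, y_tr, rcond=None)
    return float(np.mean((phi_va @ coef - y_va) ** 2))

def influence_scores(basis, x_tr, y_tr, x_va, y_va):
    """Score each term by how much held-out error grows when it is removed (assumed proxy)."""
    full = validation_mse(basis, x_tr, y_tr, x_va, y_va)
    scores = {}
    for name in basis:
        reduced = {k: f for k, f in basis.items() if k != name}
        scores[name] = validation_mse(reduced, x_tr, y_tr, x_va, y_va) - full
    return scores

def prune(basis, scores, threshold):
    """Keep only the terms whose removal hurts generalization by more than `threshold`."""
    return {k: f for k, f in basis.items() if scores[k] > threshold}
```

Unlike a single global MSE, the per-term scores tell the proposer which components to keep and which to replace in the next round.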
Results
IGSR was validated on a suite of benchmarks, showing significant improvements in discovering interpretable mathematical models. A notable case study revealed a new hypothesis regarding the relationship between DNA methylation and RNA Polymerase II pausing, which was confirmed through subsequent experimental work.
Implications
The findings suggest that IGSR can facilitate scientific discovery in various fields by enabling the identification of interpretable models that capture complex relationships in data. This approach could be particularly beneficial in domains requiring mechanistic insights, such as biology and epidemiology.
Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning
Reinforcement Learning
Large Language Models
NLP
- Introduction of the State Value Estimation Benchmark (SVEB) to assess state value estimation methods in LLMs.
- Identification of limitations in standard PPO approaches, which often yield coarse group-average state values.
- Development of Numca, a heuristic leveraging numerical spans for effective state value estimation.
- Proposal of Hista, a general framework using hidden states for improved state value estimation.
Summary
This paper addresses the challenge of accurate state value estimation in reinforcement learning (RL) for large language models (LLMs). The authors introduce the State Value Estimation Benchmark (SVEB) to evaluate existing RL frameworks, revealing that standard approaches like Proximal Policy Optimization (PPO) often regress to a coarse group-average baseline, undermining their effectiveness. To improve state value estimation, the authors propose two novel techniques: Numca, which utilizes numerical spans as gradable milestones for estimating state values, and Hista, a framework that employs LLM hidden states to compute a weighted average of disjoint rollouts and their returns. Extensive experiments demonstrate that both methods significantly enhance state value accuracy and improve training performance across various RL algorithms and model sizes, without incurring substantial computational costs. The paper concludes with a theoretical guarantee that Hista provides better estimates than traditional methods, thus offering a more effective solution for RL in LLMs.
Methodology
The authors constructed the SVEB to quantify discrepancies in state value estimation methods. They proposed two techniques: Numca, which groups disjoint rollouts based on numerical milestones, and Hista, which uses hidden state embeddings to compute weighted averages of rewards. Theoretical proofs and extensive empirical evaluations were conducted to validate the effectiveness of these methods.
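The hidden-state weighting can be illustrated as follows. The summary says only that a weighted average is computed; the cosine similarity, the softmax, and the `temperature` parameter below are assumptions made for this sketch.

```python
import numpy as np

def similarity_weighted_value(query_hidden, rollout_hiddens, rollout_returns, temperature=0.1):
    """Estimate a state's value from other rollouts, weighted by hidden-state similarity.

    query_hidden    : (d,) hidden state at the position being valued
    rollout_hiddens : (n, d) hidden states taken from other rollouts
    rollout_returns : (n,) final returns of those rollouts
    """
    q = query_hidden / np.linalg.norm(query_hidden)
    h = rollout_hiddens / np.linalg.norm(rollout_hiddens, axis=1, keepdims=True)
    sims = h @ q                               # cosine similarity to each rollout
    logits = sims / temperature
    weights = np.exp(logits - logits.max())    # numerically stable softmax
    weights /= weights.sum()
    return float(weights @ rollout_returns)
```

When all hidden states are identical the weights are uniform and the estimate falls back to the plain group average, which is the coarse baseline the benchmark is designed to expose.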
Results
Both Numca and Hista outperformed standard PPO and other baseline methods in terms of state value accuracy and training performance across different datasets and model sizes. Hista, in particular, demonstrated a theoretical advantage over traditional group-average methods, leading to better convergence and reward outcomes.
Implications
The findings suggest that accurate state value estimation is crucial for effective RL training in LLMs. The proposed methods can be applied to enhance various RL algorithms, potentially leading to more robust and efficient training processes in natural language processing tasks.
Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit
Large Language Models
Theory
Efficient ML
- Introduces a conservative paired MDE bound for quantization comparisons.
- Demonstrates that much of the perceived unreliability in benchmarks is due to binomial sampling noise.
- Presents a Quantization Reliability Index (QRI) to assess signal-to-noise ratios in quantization studies.
- Recommends pre-registering MDE targets and reporting discordant counts for improved reliability.
Summary
This paper is a planning-method note that introduces a conservative minimum detectable effect (MDE) bound for quantization benchmarks, specifically for FP16 vs. NF4 comparisons. The authors adapt classical paired-binary sample-size calculations to create a one-line budget that benchmark designers can use to assess the reliability of their quantization claims before conducting experiments. The study includes a pilot audit involving four models and four benchmarks, demonstrating that much of the reported benchmark unreliability is attributable to binomial sampling noise rather than inherent model issues. The paper also introduces a Quantization Reliability Index (QRI) to help identify where evaluation noise may overshadow quantization signals. The findings suggest that pre-registration of MDE targets and careful reporting of discordant counts can enhance the reliability of quantization studies.
Methodology
The authors derive a paired minimum detectable effect (MDE) bound using classical statistical methods and conduct an empirical audit on four models and benchmarks. They measure accuracy and cross-split standard deviations, comparing observed variances to binomial references to distinguish between sampling noise and actual model performance. Additionally, they introduce the Quantization Reliability Index (QRI) as a diagnostic tool.
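A budget of this kind can be sketched with the classical conservative approximation for paired binary outcomes, in which the variance of the accuracy difference is bounded by the discordant rate divided by the number of items. Whether the paper's bound has exactly this form is an assumption, and the benchmark size of 500 in the usage note is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def paired_mde(discordant_rate: float, n_items: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Conservative minimum detectable accuracy difference for a paired comparison.

    discordant_rate : fraction of items on which the two models disagree in correctness
    n_items         : number of benchmark items scored under both models
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * sqrt(discordant_rate / n_items)

def binomial_sd(accuracy: float, n_items: int) -> float:
    """Standard deviation of an accuracy estimate from sampling noise alone."""
    return sqrt(accuracy * (1 - accuracy) / n_items)
```

For a hypothetical 500-item benchmark, `paired_mde(0.10, 500)` is about 4.0 percentage points and `paired_mde(0.05, 500)` about 2.8, which shows how strongly the detectable effect depends on the assumed disagreement rate.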
Results
The audit revealed that 25 out of 32 observed cross-split standard deviations fell within ±1.5 percentage points of the expected binomial reference, indicating that much of the reported unreliability is due to sampling noise. The study found that the largest observed quantization delta (3.2 percentage points) was below the implied MDE at a disagreement rate of 0.10 but exceeded it at a rate of 0.05, illustrating the trade-offs in planning.
Implications
The findings emphasize the need for rigorous statistical planning in quantization studies, particularly in the context of large language models. By establishing clear MDE targets and understanding the sources of noise in benchmark evaluations, researchers can improve the reliability of their quantization claims and better assess the performance of quantized models.
ExDBSCAN: Explaining DBSCAN with Counterfactual Reasoning -- Additional Material
Interpretability
- ExDBSCAN is the first method specifically designed to generate counterfactual explanations for DBSCAN clustering.
- The method provides both noise-to-cluster and cluster-to-cluster transition explanations.
- ExDBSCAN employs a physics-inspired model to ensure diversity and proximity in counterfactual generation.
- Empirical results show that ExDBSCAN outperforms four baseline methods while achieving perfect validity.
Summary
The paper introduces ExDBSCAN, a novel method aimed at enhancing the explainability of the DBSCAN clustering algorithm through counterfactual reasoning. While clustering techniques like DBSCAN are widely used for their ability to identify clusters and outliers, they often lack interpretability, making it difficult to understand the rationale behind cluster assignments. ExDBSCAN addresses this gap by generating actionable counterfactual explanations that illustrate how slight modifications to data points could change their cluster assignments. The method employs a physics-inspired model that creates a density-connected weighted graph, allowing for the generation of diverse and proximal counterfactuals. The approach ensures that each counterfactual is valid, meaning that DBSCAN actually assigns it to the intended target cluster. Empirical evaluations on 30 tabular datasets demonstrate that ExDBSCAN outperforms existing baseline methods while achieving perfect validity and providing diverse explanations. This work not only enhances the interpretability of DBSCAN but also offers insights into the underlying structure of data, potentially aiding practitioners in decision-making processes related to data preprocessing and feature engineering.
Methodology
ExDBSCAN utilizes a density-connected weighted graph to model the clustering structure of DBSCAN. It treats candidate counterfactuals as charged particles that repel each other to ensure diversity while being attracted to the original instance to maintain proximity. This physical optimization system allows for the generation of valid counterfactuals that respect feature constraints.
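The particle analogy can be made concrete with a single force-based update. This is a loose sketch of the general idea only: the force laws, the step size, and the coefficients are assumptions, and the density-connected graph, feature constraints, and validity check that the method relies on are omitted.

```python
import numpy as np

def particle_step(candidates, origin, step=0.05, attract=1.0, repel=0.1, eps=1e-9):
    """Move candidate counterfactuals one step under attraction and mutual repulsion.

    candidates : (k, d) current candidate points
    origin     : (d,) the instance being explained
    """
    diff = candidates[:, None, :] - candidates[None, :, :]  # diff[i, j] = c_i - c_j
    dist2 = (diff ** 2).sum(axis=-1) + eps
    np.fill_diagonal(dist2, np.inf)                          # no self-interaction
    # Coulomb-like repulsion pushes candidates apart, which encourages diversity.
    repulsion = (diff / dist2[..., None] ** 1.5).sum(axis=1)
    # A spring-like pull toward the original instance keeps candidates proximal.
    attraction = origin - candidates
    return candidates + step * (attract * attraction + repel * repulsion)
```

Iterating such a step spreads the candidates out around the original point; each one would still need to be checked against DBSCAN to confirm that it lands in the intended cluster.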
Results
The empirical evaluation of ExDBSCAN on 30 tabular datasets indicates that it outperforms all baseline methods in terms of generating valid and diverse counterfactuals. The method achieves perfect validity, meaning that every generated counterfactual is assigned to its intended cluster.
Implications
ExDBSCAN has significant implications for enhancing the interpretability of clustering algorithms, particularly in fields where understanding data groupings is crucial, such as bioinformatics, fraud detection, and computer vision. By providing actionable insights into cluster membership, it can assist practitioners in making informed decisions regarding data analysis and model refinement.
TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models
Generative Models
- TaxDistill is a knowledge distillation framework specifically designed for metagenomic taxonomic annotation.
- The framework utilizes GenomeOcean as a teacher network to mitigate label noise from traditional sequence retrieval methods.
- TaxDistill achieves significant improvements in classification performance across various microbial datasets.
- The method incorporates uncertainty awareness, allowing for high reliability in real-world applications.
Summary
The paper introduces TaxDistill, a novel framework aimed at enhancing metagenomic taxonomic annotation through knowledge distillation. Traditional methods for taxonomic annotation often rely on sequence similarity searches, which struggle with high microbial diversity and incomplete reference databases, leading to noisy labels that degrade classification performance. TaxDistill addresses this issue by employing GenomeOcean, a 500M parameter genomic foundation model, as a teacher network to extract deep semantic features and generate soft labels with confidence scores. This soft label information is distilled into a lightweight student network, effectively reducing label noise from initial retrieval tools. The authors conducted comprehensive experiments on seven diverse CAMI2 datasets, demonstrating that TaxDistill consistently outperforms existing baselines, including the Taxometer model. For instance, it improved the F1 score of MMseqs2 from 0.763 to 0.941 on the Gastrointestinal dataset. The study highlights the effectiveness of soft label distillation in providing uncertainty awareness and strict false positive control, making TaxDistill a reliable method for complex metagenomic analysis.
Methodology
TaxDistill employs a knowledge distillation approach where GenomeOcean serves as the teacher network to extract features and generate soft labels. These soft labels are then distilled into a lightweight student network, which retains the architecture of existing models like Taxometer but benefits from reduced label noise and enhanced semantic understanding.
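Soft-label distillation is typically trained by pulling the student's predicted distribution toward the teacher's. The sketch below shows that generic objective as a KL divergence; the exact loss, any temperature scaling, and how confidence scores enter in TaxDistill are not given in the summary, so this is an assumed form.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax with an optional temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_probs, eps=1e-12):
    """Mean KL(teacher || student) over a batch of soft labels.

    student_logits : (batch, n_classes) raw student outputs
    teacher_probs  : (batch, n_classes) teacher soft labels, each row summing to 1
    """
    student_probs = softmax(student_logits)
    kl = teacher_probs * (np.log(teacher_probs + eps) - np.log(student_probs + eps))
    return float(kl.sum(axis=-1).mean())
```

Because the teacher spreads probability over several taxa when it is unsure, a single wrong retrieval label contributes less to the loss than it would under hard-label training.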
Results
Experimental results show that TaxDistill outperforms baseline models in most scenarios, with notable improvements such as an increase in the F1 score from 0.763 to 0.941 on the Gastrointestinal dataset when compared to MMseqs2. The framework demonstrated effectiveness in handling complex microbial environments and achieving reliable taxonomic annotations.
Implications
The findings suggest that TaxDistill can significantly improve the accuracy of metagenomic analyses, particularly in clinical and environmental microbiome studies. Its ability to manage label noise and enhance classification reliability could lead to better pathogen detection and microbial community profiling.
Open World Autoencoding Drift Detection with Novel Class Recognition in Tabular Non-stationary Data Streams
Theory
Time Series
Efficient ML
- Proposes a novel unsupervised method for detecting concept drift and recognizing novel classes in data streams.
- Utilizes mirrored autoencoders for independent adaptation to changing data distributions.
- Demonstrates competitive performance against state-of-the-art methods through experiments on synthetic data.
- Addresses the challenges of real-time data processing in the presence of concept drift and novel class emergence.
Summary
This paper addresses the challenges of concept drift and novel class recognition in non-stationary data streams, which are prevalent in modern machine learning applications. The author proposes an unsupervised method for detecting concept drift by analyzing reconstruction errors from autoencoders, while also recognizing novel class samples through density estimation. The use of mirrored autoencoders allows for independent adaptation to changes in data distribution, facilitating continuous learning and reliable identification of unknown samples. Experiments conducted on synthetic tabular data streams demonstrated the effectiveness of the proposed approach, showing competitive performance compared to existing state-of-the-art unsupervised drift detection and novelty classification methods. The findings highlight the importance of adaptive techniques in handling dynamic data environments, where both known and unknown classes may evolve over time.
Methodology
The proposed method employs autoencoders to detect concept drift by analyzing reconstruction errors and utilizes density estimation for recognizing novel classes. Mirrored autoencoders are implemented to allow for independent incremental adaptation to evolving data distributions.
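Reconstruction-error drift detection can be illustrated with a deliberately simple stand-in. The sketch swaps the paper's mirrored autoencoders for a linear autoencoder obtained from an SVD (equivalent to PCA), uses a mean-plus-k-standard-deviations threshold that is an assumption, and leaves out the density-based novel class recognition.

```python
import numpy as np

class LinearAutoencoder:
    """PCA used as a linear autoencoder; a stand-in for a trained neural autoencoder."""

    def fit(self, X, n_components):
        self.mean_ = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[:n_components]  # principal directions act as the bottleneck
        return self

    def reconstruction_error(self, X):
        centered = X - self.mean_
        recon = centered @ self.components_.T @ self.components_
        return ((centered - recon) ** 2).mean(axis=1)  # per-sample squared error

def detect_drift(model, reference, window, k=3.0):
    """Flag drift when the window's mean error exceeds the reference mean by k std."""
    ref_err = model.reconstruction_error(reference)
    threshold = ref_err.mean() + k * ref_err.std()
    return bool(model.reconstruction_error(window).mean() > threshold)
```

No labels are needed at any point: the detector only asks whether recent samples are still well explained by the structure learned from earlier data.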
Results
The experimental results indicate that the proposed approach effectively detects concept drifts and identifies novel classes, outperforming or matching the performance of existing unsupervised drift detectors and novelty classifiers across various synthetic tabular data streams.
Implications
The findings suggest that the proposed method can be applied in real-time data processing systems where concept drift and novel class emergence are common, such as in financial markets, online recommendation systems, and dynamic user behavior analysis.
Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs
Graph Learning
Time Series
Theory
- Introduction of Dual-Scale Retentive Dynamics (DSRD) framework for dynamic graphs.
- Unified retentive state that captures both temporal and structural dependencies.
- Adaptive decay kernels with learnable parameters for balancing short-term and long-term dependencies.
- Theoretical insights into stability and boundedness of the model.
Summary
This paper addresses the challenges of representation learning on dynamic graphs, which requires capturing complex dependencies that evolve across both time and graph structure. Existing methods often rely on fixed temporal decay schemes and predetermined structural propagation depths, limiting their generalization capabilities across diverse graphs. The authors propose a novel framework called Dual-Scale Retentive Dynamics (DSRD), which integrates temporal memory and structural context through a unified retentive state. DSRD features dual-scale adaptation that models both temporal dynamics and structural propagation within a single recurrent formulation, along with adaptive decay kernels that learn time-sensitivity parameters to balance short-term responsiveness and long-term retention. Theoretical analyses establish the equivalence between event-wise parallel aggregation and efficient recurrent state updates, ensuring stability and boundedness of the learned dynamics. Extensive experiments on 14 real-world benchmarks demonstrate that DSRD outperforms state-of-the-art methods in link prediction and node classification tasks, showcasing strong generalization in both transductive and inductive settings.
Methodology
The DSRD framework employs a retentive state that integrates temporal memory and structural diffusion through dual-scale adaptation. It utilizes learnable decay kernels to adjust the sensitivity of the model to different temporal patterns, allowing for adaptive balancing between short-term and long-term dependencies. The framework is analyzed theoretically to ensure stability and efficiency in representation learning.
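The equivalence between event-wise aggregation and a recurrent state update can be checked numerically for the simplest case of a single exponential decay kernel. The sketch below is only that simplest case: it uses one fixed scalar rate `lam` (learned in DSRD, fixed here), covers the temporal scale only, and its function names are invented for this example.

```python
import numpy as np

def parallel_aggregate(times, feats, lam, t_query):
    """Event-wise form: sum_i exp(-lam * (t_query - t_i)) * x_i."""
    weights = np.exp(-lam * (t_query - times))
    return weights @ feats

def recurrent_aggregate(times, feats, lam, t_query):
    """Recurrent form: decay the state by the elapsed time, then add the event."""
    state = np.zeros(feats.shape[1])
    t_prev = times[0]
    for t, x in zip(times, feats):
        state = np.exp(-lam * (t - t_prev)) * state + x
        t_prev = t
    # Decay once more from the last event to the query time.
    return np.exp(-lam * (t_query - t_prev)) * state

rng = np.random.default_rng(0)
times = np.sort(rng.uniform(0.0, 10.0, size=20))   # event timestamps
feats = rng.normal(size=(20, 4))                   # event features
lam = 0.3                                          # hypothetical decay rate

a = parallel_aggregate(times, feats, lam, 10.0)
b = recurrent_aggregate(times, feats, lam, 10.0)
print(np.allclose(a, b))
```

The recurrent form touches each event once and keeps only a fixed-size state, which is what makes the update efficient for streaming events.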
Results
DSRD consistently achieved state-of-the-art performance on link prediction and node classification tasks across 14 benchmark datasets, demonstrating robust generalization capabilities in both transductive and inductive settings.
Implications
The proposed DSRD framework has significant implications for various applications involving dynamic graphs, such as social network analysis, traffic systems, and knowledge graphs, where understanding temporal and structural dynamics is crucial for effective representation learning.
LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers
Graph Learning
Optimization
Efficient ML
- LoRe introduces a training-free, inference-time wrapper for iterative graph solvers that enforces per-step interaction budgets.
- The method dynamically routes computation to prioritize high-conflict interactions, improving efficiency over static sparsification techniques.
- Empirical results show LoRe achieves up to 15× speedup and 44× memory reduction on the Traveling Salesperson Problem.
- The approach demonstrates cross-task generality and robustness to topology shifts in large-scale combinatorial optimization problems.
Summary
The paper introduces LoRe, a novel approach designed to enhance the efficiency of diffusion-based neural solvers for combinatorial optimization problems, particularly in scenarios where inference time and memory usage are critical. Traditional solvers face challenges due to the need for dense evaluations of interactions, which can lead to high computational costs and memory constraints. LoRe addresses this by implementing a per-step interaction-evaluation budget that dynamically routes computation to high-conflict or high-uncertainty interactions, rather than relying on static sparsification methods. This approach is inspired by methodologies from many-body physics, specifically the Cluster-Bath decomposition, allowing for a more adaptive and efficient evaluation process. The authors validate LoRe through extensive experiments, demonstrating significant improvements in scalability and performance on problems such as the Maximum Independent Set (MIS) and the Traveling Salesperson Problem (TSP). The results indicate that LoRe can extend feasible inference beyond traditional out-of-memory limits while maintaining solution quality, achieving substantial speedups and reductions in peak memory usage.
Methodology
LoRe employs a Cluster-Bath decomposition inspired by many-body physics, where at each iteration, a subset of high-conflict interactions is evaluated exactly while the influence of omitted interactions is approximated through a lightweight global recall signal. This allows for dynamic, state-dependent routing of computations without the need for retraining.
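A per-step interaction budget can be sketched on a toy dense update `h = W @ x`. In the example below, the conflict score, the mean-field form of the correction for omitted pairs, and the name `budgeted_step` are all assumptions chosen for illustration; the paper's scoring rule and recall signal may differ.

```python
import numpy as np

def budgeted_step(W, x, budget):
    """One update h_i = sum_j W[i, j] * x[j] under an interaction budget.

    The `budget` highest-scoring pairs are evaluated exactly; every omitted
    pair is replaced by one shared mean-field term. The dense score below is
    for illustration only: a real solver would need a cheap proxy.
    """
    n = W.shape[0]
    score = np.abs(W) * np.abs(x)[None, :]            # hypothetical conflict score
    top = np.argsort(score, axis=None)[::-1][:budget]
    mask = np.zeros(n * n, dtype=bool)
    mask[top] = True
    mask = mask.reshape(n, n)

    exact = (W * mask) @ x                            # selected pairs, exact
    omitted = ~mask
    mean_w = W[omitted].mean() if omitted.any() else 0.0
    # Global correction: omitted count per row * mean omitted weight * mean state.
    recall = omitted.sum(axis=1) * mean_w * x.mean()
    return exact + recall, int(mask.sum())

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 6))
x = rng.normal(size=6)

full, n_full = budgeted_step(W, x, budget=36)     # full budget: exact update
cheap, n_cheap = budgeted_step(W, x, budget=10)   # only 10 pairs evaluated
print(n_full, n_cheap)
```

Because the selection depends on the current state `x`, the routed pairs change from step to step, in contrast to a sparsity pattern fixed in advance.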
Results
LoRe significantly improves the scalability of iterative graph solvers, extending feasible inference on the Maximum Independent Set problem by more than 3× beyond baseline out-of-memory limits. It achieves approximately 8× speedup and 12× peak-memory reduction while maintaining solution quality. For the Traveling Salesperson Problem, it delivers a 15× speedup and a 44× reduction in memory usage.
Implications
The findings suggest that LoRe can be effectively applied in real-time decision-making systems that require rapid and memory-efficient combinatorial optimization, such as dynamic vehicle routing and resource allocation in data centers. Its ability to adaptively manage computational resources could lead to broader applications in various fields requiring iterative refinement under strict constraints.
Learning to Perturb Hidden Representations for Generalizable Deep Learning
Theory
- Establishes a unified framework for hidden activation perturbation.
- Introduces Learning to Perturb Activations (LPA) for adaptive class-level perturbations.
- Theoretically connects activation perturbation to flat minima and perturbation amplification.
- Demonstrates that LPA outperforms existing methods across various classification scenarios.
Summary
This paper addresses the lack of systematic analysis on perturbations of hidden activations in deep neural networks, which are crucial for the network's computation. The author establishes a unified framework for hidden activation perturbation, showing that existing methods like Dropout, Manifold Mixup, and adversarial feature perturbation can be understood as specific forms of activation perturbation. The paper introduces the Learning to Perturb Activations (LPA) method, which adaptively perturbs activations at a chosen hidden layer using class-level perturbations learned via Projected Gradient Descent (PGD). The author theorizes that expansive perturbation serves as positive augmentation while contractive perturbation acts as negative augmentation, with the perturbation layer influencing whether the effect resembles input-level or logit-level manipulation. The experiments conducted across balanced classification, long-tail classification, and domain generalization demonstrate that LPA consistently outperforms existing methods and complements logit perturbation techniques.
Methodology
The paper proposes Learning to Perturb Activations (LPA), which adaptively perturbs hidden-layer activations using class-level perturbations learned through Projected Gradient Descent (PGD). The perturbation direction and magnitude are determined based on whether each class requires positive or negative augmentation, with a focus on the effects of perturbation at different layers of the network.
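To make the idea of a PGD-learned class-level perturbation concrete, the sketch below learns one shared vector `delta` for all activations of a single class, under an L2-norm bound, using a frozen linear head and cross-entropy loss. The linear head, the L2 ball, the step sizes, and the `direction` flag (+1 raises the class loss, -1 lowers it) are assumptions for illustration; how LPA assigns directions and magnitudes to classes is not reproduced here.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def class_loss_and_grad(H, y, W, delta):
    """Mean cross-entropy of head W on H + delta, and its gradient w.r.t. delta."""
    P = softmax((H + delta) @ W)
    n = len(y)
    loss = -np.log(P[np.arange(n), y]).mean()
    G = P.copy()
    G[np.arange(n), y] -= 1.0           # dL/dlogits for each sample
    grad = (G @ W.T).mean(axis=0)       # delta is shared, so average over samples
    return loss, grad

def pgd_class_perturbation(H, y, W, eps, step, iters, direction=+1):
    """Projected gradient steps on a shared perturbation inside an L2 ball."""
    delta = np.zeros(H.shape[1])
    for _ in range(iters):
        _, g = class_loss_and_grad(H, y, W, delta)
        delta = delta + direction * step * g
        norm = np.linalg.norm(delta)
        if norm > eps:                  # project back onto the ball of radius eps
            delta *= eps / norm
    return delta

rng = np.random.default_rng(0)
H = rng.normal(size=(32, 8))            # hidden activations of one class
y = np.full(32, 1)                      # all samples share label 1
W = 0.5 * rng.normal(size=(8, 3))       # frozen linear head (hypothetical)
eps = 0.5

base_loss, _ = class_loss_and_grad(H, y, W, np.zeros(8))
delta_up = pgd_class_perturbation(H, y, W, eps, step=0.05, iters=50, direction=+1)
delta_down = pgd_class_perturbation(H, y, W, eps, step=0.05, iters=50, direction=-1)
loss_up, _ = class_loss_and_grad(H, y, W, delta_up)
loss_down, _ = class_loss_and_grad(H, y, W, delta_down)
print(loss_down < base_loss < loss_up)
```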
Results
Experiments show that LPA consistently improves performance in balanced classification, long-tail classification, and domain generalization tasks, outperforming existing perturbation methods and providing complementary benefits to logit perturbation techniques.
Implications
The findings suggest that a more nuanced approach to perturbing hidden activations can enhance the generalization and robustness of deep learning models. This could lead to improved performance in various applications, particularly in scenarios involving class imbalance and domain shifts.
Designing Active Tether-Net Systems for Space Debris Capture with Graph-Learning-Aided Mixed-Combinatorial Optimization
Graph Learning
Optimization
Robotics
- Introduces a graph-learning-aided optimization approach for space debris capture.
- Transforms a complex MCNLP problem into a simpler NLP problem using GNNs.
- Demonstrates faster convergence to optimal solutions compared to traditional methods.
- Validates the framework through practical design scenarios for tether-net systems.
Summary
This paper addresses the challenge of capturing large non-cooperative space debris using active tether-net systems. The authors propose a novel computational framework that integrates graph-learning techniques with mixed-combinatorial optimization to efficiently explore the design and control choices of the tether-net system. The complexity of the optimization problem arises from the mixture of continuous, integer, and categorical variables, which traditional binary encoding methods struggle to solve. The authors introduce a Graph Neural Network (GNN) that scores and recommends candidate combinations represented as nodes in a graph, effectively transforming the Mixed Combinatorial Nonlinear Programming (MCNLP) problem into a more manageable Nonlinear Programming (NLP) problem. The GNN-based approach is demonstrated to significantly enhance convergence speed towards optimal solutions when compared to direct MCNLP problem-solving methods. The framework is validated through the design of the net morphology, mass and thruster selection for maneuverable units, and aiming points for the tether-net system's controller, showcasing its potential for improving active debris removal strategies.
Methodology
The authors developed a Graph Neural Network (GNN) to score and recommend candidate design combinations, which are represented as nodes in a graph. This GNN approach allows the transformation of the MCNLP optimization problem into an NLP problem, which can be solved using standard optimization techniques, specifically employing a Particle Swarm Optimization (PSO) algorithm with gradient-based fine-tuning.
Results
The GNN-based recommender system demonstrated significantly faster convergence to optimal solutions for the design of tether-net systems compared to direct solutions of the MCNLP problem. The framework was successfully applied to optimize the morphology of the net, the selection of mass and thrusters for maneuverable units, and the aiming points for the controller.
Implications
The proposed framework has significant implications for the field of active debris removal, offering a systematic approach to design and control that could enhance the effectiveness and efficiency of space debris capture missions. Additionally, the methodology could be adapted for automated guided design in other autonomous spacecraft systems.
The Hamilton-Jacobi Theory of Deep Learning
Theory
Optimization
- Training a neural network is framed as solving Hamilton-Jacobi initial-value problems.
- A single parameter ε unifies different perspectives on neural networks, tropical algebra, and PDEs.
- The paper establishes a minimax optimal generalization rate and certifiable adversarial robustness.
- Backpropagation is shown to correspond to the co-state equation of the Hamiltonian system for residual networks.
Summary
This paper presents a novel perspective on deep learning by framing the training of neural networks as a search through Hamilton-Jacobi initial-value problems. The authors establish that each gradient step corresponds to selecting initial data of a viscous Hamilton-Jacobi equation, with the Hopf-Cole propagator providing the best fit for observations. The paper identifies a precise correspondence between various neural network architectures (including residual networks, transformers, and recurrent networks) and Hamilton-Jacobi equations, revealing that these architectures discretize the same class of equations with architecture-dependent Hamiltonians and viscosity. A single deformation parameter, ε, is introduced to unify the perspectives of neural networks, tropical algebra, viscous PDEs, and convex optimization. The authors derive several quantitative consequences, including a minimax optimal generalization rate, adversarial robustness controlled by ε, and a closed-form influence function for softmax attribution weights. The findings culminate in a comprehensive mathematical theory of deep learning, linking previously isolated concepts and providing actionable design principles for neural network architectures.
Methodology
The authors utilize mathematical frameworks from Hamilton-Jacobi theory, tropical algebra, and convex optimization to establish the connections between neural network architectures and viscous Hamilton-Jacobi equations. They employ the Hopf-Cole linearization and Maslov dequantization to derive relationships between activation functions and PDE solutions, leading to a commutative diagram that encapsulates the theoretical underpinnings of deep learning.
Results
The paper demonstrates that the log-sum-exp activation function serves as a smooth deformation of the tropical max operation, and that every LSE-activated layer encodes the exact Hopf-Cole solution of a viscous Hamilton-Jacobi PDE. The authors derive a minimax optimal generalization rate of O(n^{-1/(d+2)}), establish conditions for adversarial robustness, and provide a closed-form influence function for softmax weights, which exhibits complex bifurcation behavior as ε varies.
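The statement that log-sum-exp deforms the tropical max can be checked numerically with the standard scaled form ε·log Σ exp(x_i/ε), which satisfies max(x) ≤ LSE_ε(x) ≤ max(x) + ε·log n and therefore tends to max(x) as ε → 0. The sketch below assumes this scaling convention; the paper's exact parametrization of ε may differ.

```python
import numpy as np

def lse_eps(x, eps):
    """eps * log(sum(exp(x / eps))), computed stably by factoring out the max."""
    m = x.max()
    return m + eps * np.log(np.exp((x - m) / eps).sum())

x = np.array([0.3, -1.2, 2.0, 1.9])
values = {eps: lse_eps(x, eps) for eps in (1.0, 0.1, 0.01)}

for eps, v in values.items():
    gap = v - x.max()                      # shrinks toward 0 as eps decreases
    bound = eps * np.log(len(x))           # worst-case gap for n inputs
    print(f"eps={eps}: gap={gap:.6f}, bound={bound:.6f}")
```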
Implications
This framework offers a deeper understanding of the mathematical foundations of deep learning, potentially guiding the design of more robust and efficient neural network architectures. The insights into generalization rates and adversarial robustness could inform future research and practical applications in various domains of machine learning.
Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models
NLP
Large Language Models
Generative Models
- Introduction of confidence-induced clusters (CICs) as span-level update units for decoding in MDLMs.
- Development of CLAD, a training-free cluster-level decoder that enhances parallelism in decoding.
- Utilization of self-attention maps to model inter-cluster dependencies and avoid conflicts during decoding.
- Demonstrated significant speedups (1.77×–8.47×) over traditional token-level decoding methods.
Summary
This paper introduces CLAD (Cluster-Level Attention-Guided Decoding), a novel approach for decoding in Masked Diffusion Language Models (MDLMs). Traditional decoding methods operate at the token level, which limits parallelism and efficiency. The authors propose a new granularity for decoding by defining confidence-induced clusters (CICs), which are contiguous spans of high-confidence masked positions. By leveraging self-attention maps, CLAD estimates inter-cluster dependencies to avoid conflicts when committing multiple CICs simultaneously. This allows for more aggressive parallel commitments without sacrificing accuracy. The authors demonstrate that CLAD significantly improves decoding speed while maintaining comparable task performance across various benchmarks, thus showcasing the effectiveness of span-level updates over token-level updates in MDLMs.
Methodology
The methodology involves defining confidence-induced clusters (CICs) as maximal contiguous spans of high-confidence masked positions. CLAD converts token-level confidence estimates into CIC-level candidates and constructs a sparse conflict graph using attention-derived inter-cluster dependencies. The decoder then selects a maximum-weight set of non-conflicting CICs for parallel commitment, optimizing throughput while preserving accuracy.
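The two steps described above, extracting maximal high-confidence spans and then committing a non-conflicting subset, can be sketched as follows. The threshold `tau`, the span weights, and the hand-written conflict set are assumptions for illustration, and the greedy selection is a simple stand-in for the maximum-weight selection the paper describes, not its actual solver.

```python
def confidence_induced_clusters(confidence, is_masked, tau):
    """Maximal contiguous spans (start, end_exclusive) of masked positions
    whose confidence is at least tau."""
    spans, start = [], None
    for i, (c, m) in enumerate(zip(confidence, is_masked)):
        if m and c >= tau:
            if start is None:
                start = i              # open a new span
        elif start is not None:
            spans.append((start, i))   # close the current span
            start = None
    if start is not None:
        spans.append((start, len(confidence)))
    return spans

def greedy_nonconflicting(weights, conflicts):
    """Pick clusters in descending weight, skipping any that conflict with a
    cluster already chosen (greedy stand-in for max-weight selection)."""
    chosen = []
    for idx in sorted(range(len(weights)), key=lambda i: -weights[i]):
        if all((idx, j) not in conflicts and (j, idx) not in conflicts for j in chosen):
            chosen.append(idx)
    return sorted(chosen)

confidence = [0.95, 0.97, 0.40, 0.99, 0.92, 0.91, 0.30, 0.96]
is_masked = [True, True, True, False, True, True, True, True]

spans = confidence_induced_clusters(confidence, is_masked, tau=0.9)
weights = [sum(confidence[s:e]) for s, e in spans]   # hypothetical span weight
conflicts = {(0, 1)}   # pretend attention links clusters 0 and 1
commit = greedy_nonconflicting(weights, conflicts)
print(spans, commit)
```

In CLAD the conflict edges come from self-attention maps rather than a fixed set; here they are hard-coded to keep the example self-contained.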
Results
CLAD achieves speedups of 1.77× to 8.47× compared to Vanilla decoding methods across four benchmarks related to mathematical reasoning and code generation. It also outperforms token-level dependency-aware baselines in most evaluated settings, indicating its effectiveness in improving throughput without significant accuracy loss.
Implications
The proposed approach has potential applications in enhancing the efficiency of language model decoding, particularly in scenarios requiring real-time processing or high throughput. It may also influence future research on decoding strategies in generative models and other NLP tasks.
Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting
NLP
Large Language Models
Efficient ML
- Introduction of a tree-based block diffusion drafting method for speculative decoding.
- Dynamic construction of query-dependent trees to optimize decoding speed and quality.
- Integration of an acceptance surrogate, online latency estimator, and adaptive expansion mechanism.
- Achieves up to 6.61× speedup over standard autoregressive decoding.
Summary
This paper presents BASTION, a novel framework for speculative decoding that enhances the efficiency of generating sequences in large language models (LLMs). Traditional autoregressive decoding methods are limited by their sequential nature, which incurs high computational costs due to the need for multiple forward passes of the target model for each generated token. BASTION addresses this by employing a tree-structured block diffusion drafting approach that allows for parallel predictions of multiple future tokens. Unlike existing methods that utilize static tree topologies, BASTION dynamically constructs query-dependent trees that balance the quality of drafts with hardware constraints. The framework comprises three main components: an acceptance surrogate for estimating expected accepted lengths, an online latency estimator for calibrating a hardware-aware roofline model, and an adaptive best-first expansion mechanism that optimally grows the tree based on marginal gains. The results demonstrate that BASTION achieves significant speedups, outperforming standard autoregressive decoding by up to 6.61 times and surpassing state-of-the-art block-diffusion methods by 39%. This work highlights the potential for dynamic tree structures in improving the efficiency of speculative decoding in LLMs.
Methodology
BASTION employs a tree-structured approach to speculative decoding, where it dynamically constructs trees based on the predicted distributions of future tokens. The methodology includes an acceptance surrogate to estimate the expected length of accepted sequences, an online latency estimator that uses a roofline model to assess hardware constraints, and an adaptive best-first expansion strategy that grows the tree until the cost of verification outweighs the marginal gains.
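The stopping rule for best-first tree growth can be illustrated with a priority queue. In the sketch below, a node's path probability stands in for its marginal gain in expected accepted length, and a constant `cost_per_node` stands in for the latency estimate; both simplifications, along with the `children_fn` interface, are assumptions for this example rather than BASTION's actual surrogate or roofline model.

```python
import heapq

def build_draft_tree(children_fn, cost_per_node, max_nodes):
    """Grow a draft tree best-first by path probability.

    children_fn(path) returns (token, prob) continuations of a path
    (hypothetical draft-model interface). Growth stops once the best
    remaining candidate's gain no longer exceeds the verification cost.
    """
    tree, expected_len = [], 0.0
    heap = [(-p, (tok,)) for tok, p in children_fn(())]
    heapq.heapify(heap)
    while heap and len(tree) < max_nodes:
        neg_p, path = heapq.heappop(heap)
        gain = -neg_p
        if gain <= cost_per_node:
            break                      # not worth verifying any further nodes
        tree.append(path)
        expected_len += gain
        for tok, p in children_fn(path):
            heapq.heappush(heap, (-(gain * p), path + (tok,)))
    return tree, expected_len

def toy_children(path):
    """Toy draft model: two continuations per node, depth capped at 3."""
    return [] if len(path) >= 3 else [("a", 0.7), ("b", 0.3)]

tree, expected_len = build_draft_tree(toy_children, cost_per_node=0.25, max_nodes=16)
print(tree, round(expected_len, 3))
```

With these toy numbers the tree keeps the likely chain `a`, `aa`, `aaa` plus the single alternative `b`, and stops before adding the candidates with path probability 0.21.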
Results
BASTION achieved speedups of up to 6.61× over standard autoregressive decoding and a 39% improvement over existing block-diffusion baselines across various benchmarks, including math, code generation, and chat datasets. The results were consistent across different model architectures and GPU setups.
Implications
The findings suggest that dynamic tree-structured drafting can significantly enhance the efficiency of speculative decoding in LLMs, making it feasible for real-time applications and reducing computational costs. This approach may lead to broader applications in natural language processing tasks where speed and resource efficiency are critical.
FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks
Theory
Large Language Models
- FormInv reveals significant flaws in existing mathematical reasoning benchmarks regarding semantic invariance.
- Accuracy metrics can be misleading, as they do not account for inconsistencies across semantically equivalent paraphrases.
- The proposed invariance framework formalizes semantic invariance using SCR and Cochran’s Q.
- FormInv includes a comprehensive benchmark and an algorithm for model selection based on semantic consistency.
Summary
The paper introduces FormInv, a novel measurement protocol designed to assess semantic invariance in mathematical reasoning benchmarks. The authors highlight significant flaws in existing benchmarks, particularly in their ability to evaluate models' performance across semantically equivalent paraphrases. Through a paraphrase-quality audit of MathCheck, they discovered that 3.1% of paraphrases were semantically incorrect, which affected model rankings significantly. The study reveals that accuracy metrics alone are insufficient, as they can mask substantial discrepancies in semantic consistency rates (SCR). For instance, Claude Haiku 4.5 achieves 86% accuracy but has an SCR of only 50%, indicating that half of its theorems are inconsistently answered when rephrased. The authors propose an invariance framework that includes SCR and Cochran’s Q as primary measures, and they present FormInv, a benchmark consisting of 760 items across various paraphrase families. Additionally, they introduce FormInvSelector, an algorithm for regime-aware model selection based on SCR profiles. The findings underscore the necessity for benchmarks to incorporate logical-equivalence verification and cross-model unanimity to ensure a more accurate assessment of mathematical reasoning capabilities in AI models.
Methodology
The authors conducted a paraphrase-quality audit on existing benchmarks, particularly MathCheck, to identify semantically incorrect paraphrases. They developed the FormInv protocol, which includes SCR and Cochran’s Q as measures of semantic invariance. The benchmark consists of 760 items spanning various paraphrase families, evaluated across nine models. The FormInvSelector algorithm was created to recommend models based on their SCR profiles.
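Both measures can be computed from a binary items-by-paraphrases correctness matrix. In the sketch below, Cochran's Q follows the standard textbook formula, while the reading of SCR as "fraction of items with the same outcome on every paraphrase" is an assumption; the paper's exact definition may differ (for example, it may require every paraphrase to be answered correctly).

```python
import numpy as np

def semantic_consistency_rate(correct):
    """Fraction of items whose 0/1 outcome is identical across all paraphrases.
    This interpretation of SCR is an assumption made for the example."""
    correct = np.asarray(correct)
    return float((correct.min(axis=1) == correct.max(axis=1)).mean())

def cochrans_q(correct):
    """Cochran's Q for k related binary samples (columns = paraphrase variants).

    Q = (k - 1) * (k * sum(C_j^2) - N^2) / (k * N - sum(R_i^2)),
    with column totals C_j, row totals R_i, and grand total N.
    Undefined when every row is constant (zero denominator).
    """
    X = np.asarray(correct, dtype=float)
    k = X.shape[1]
    col, row, N = X.sum(axis=0), X.sum(axis=1), X.sum()
    return float((k - 1) * (k * (col ** 2).sum() - N ** 2) / (k * N - (row ** 2).sum()))

# Toy data: 4 items, each posed in 3 paraphrases (1 = answered correctly).
correct = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
]
scr = semantic_consistency_rate(correct)
q = cochrans_q(correct)
print(scr, q)
```

Under the null hypothesis that all paraphrase variants are equally likely to be answered correctly, Q is approximately chi-squared distributed with k − 1 degrees of freedom.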
Results
The audit found that 3.1% of paraphrases in MathCheck were semantically incorrect, leading to significant changes in model rankings. The study demonstrated that models like Claude Haiku 4.5 and DeepSeek V3 exhibited large discrepancies between accuracy (86% and 96.4%, respectively) and SCR (50% and 82%, respectively). The FormInv protocol achieved 100% recall when replicated on external benchmarks, highlighting its effectiveness in identifying semantic inconsistencies.
Implications
The findings suggest that current benchmarks for mathematical reasoning in AI need to be revised to include measures of semantic invariance. This could lead to more reliable assessments of AI models' reasoning capabilities, ultimately improving their application in formal mathematical contexts. The FormInv protocol could serve as a standard for future evaluations in this domain.
One Mask to Rule Them All: On Hidden Facts after Editing and How to Find Them
NLP
Large Language Models
- Knowledge editing methods ROME and MEMIT modify MLP weights to update facts without retraining.
- A common subset of weights is critical for maintaining edits, which can be isolated using a binary mask.
- The edits suppress original knowledge retrieval rather than overwriting it, leading to limitations in propagating changes.
- Injecting the identified mask prior to editing drastically reduces the success rate of edits.
Summary
This paper investigates the internal mechanisms of knowledge editing methods, specifically ROME and MEMIT, which modify MLP weights in transformer models to update factual associations. The authors propose that these methods do not overwrite original knowledge but rather suppress it by targeting a common subset of weights critical for maintaining edits. They introduce a compact binary mask that isolates these weights, demonstrating that it can reverse over 80% of edits on the training set and over 70% on unseen edits. The analysis reveals that the edits succeed by amplifying attention signals in later layers, rather than erasing original knowledge. Furthermore, injecting the mask prior to editing sharply reduces editing success (from 98% to 38%), indicating that this mechanism is essential for the effectiveness of the edits. The findings suggest limitations in ROME and MEMIT, as they cannot propagate changes to related facts, and highlight pathways for detecting and defending against unwanted edits.
Methodology
The authors trained a compact binary mask over the edited weight matrices to isolate the subset of weights necessary for maintaining edits. They evaluated the effectiveness of this mask by measuring the reversal of edits on both training and test sets, and analyzed the impact of the mask on attention signals in later layers.
Results
The study found that a single mask could reverse over 80% of edits on the training set and over 70% on unseen edits, confirming a shared functional structure across diverse edits. Additionally, the success rate of edits dropped from 98% to 38% when the mask was injected prior to editing, highlighting the necessity of the identified mechanism for successful edits.
Implications
These findings have significant implications for understanding the limitations of current knowledge editing methods and for developing strategies to detect and mitigate unwanted edits in transformer models. They suggest a need for more robust mechanisms that can truly overwrite knowledge rather than merely suppress it.
Learning Robust and Task-Invariant Functional Representation from fMRI through Siamese Self-Supervised Learning
Graph Learning
Multimodal
Efficient ML
- Introduction of BrainSimSiam, a self-supervised framework for fMRI representation learning.
- Outperforms traditional supervised and self-supervised models in various tasks.
- Utilizes positive-only data pairs to avoid the challenges of defining negative samples.
- Integrates voxel-wise and graph-based representations through a joint ROI masking scheme.
Summary
This paper presents BrainSimSiam, a self-supervised learning framework designed to extract robust and task-invariant functional representations from fMRI data. The authors address the challenges posed by small sample sizes, noisy labels, and high dimensionality in fMRI datasets, which often lead to model overfitting. By leveraging a lightweight Siamese architecture, BrainSimSiam utilizes positive-only data pairs to learn generalizable features without the need for negative samples, which are difficult to define in fMRI contexts. The framework demonstrates superior performance across various downstream classification and regression tasks compared to both supervised and existing self-supervised methods. The authors also introduce a joint region of interest (ROI) masking scheme that integrates voxel-wise fMRI data with graph-based functional views, enhancing interpretability and supporting multimodal fusion with structural MRI data. Overall, the study highlights the potential of self-supervised learning in neuroimaging applications, particularly in scenarios with limited data.
Methodology
The authors developed BrainSimSiam, a two-step self-supervised learning framework that employs a Siamese architecture to learn representations from task-based fMRI data. The framework focuses on positive-only data pairs to enhance generalization and robustness. A joint ROI masking scheme is applied during training to unify voxel-wise and graph-based representations, facilitating multimodal integration.
Results
The experiments showed that representations learned by BrainSimSiam achieved strong performance across multiple classification and regression tasks, surpassing fully supervised baselines and approaching the performance of larger models. The framework demonstrated its effectiveness in extracting task-invariant features from diverse fMRI tasks.
Implications
The findings suggest that self-supervised learning can significantly improve the robustness and generalizability of models in neuroimaging, particularly in contexts where data is scarce or labels are noisy. This approach could pave the way for more effective analysis of neurological and psychiatric disorders using fMRI.
Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization
Optimization
Theory
- Introduction of Singularity-aware Adam (S-Adam) optimizer to handle non-smooth optimization challenges.
- Development of the Local Geometric Instability (LGI) metric for estimating local instability in loss landscapes.
- Adaptive damping mechanism that adjusts step sizes based on geometric instability.
- Rigorous convergence guarantees to Clarke stationary points with optimal rates.
Summary
This paper addresses the challenges posed by non-smooth loss landscapes in deep learning optimization, particularly due to components like ReLU activations and quantization operators that lead to gradient chattering in adaptive optimizers such as Adam. The authors introduce a novel optimizer called Singularity-aware Adam (S-Adam), which stabilizes training by dynamically adjusting step sizes based on local geometric instability. The key innovation is the Local Geometric Instability (LGI) metric, which estimates the diameter of the Clarke subdifferential using the variance of randomized directional derivatives. S-Adam employs an adaptive damping mechanism that slows down updates in high-instability regions while allowing for rapid convergence in smoother areas. The paper provides a rigorous convergence analysis, demonstrating that S-Adam converges almost surely to Clarke stationary points at an optimal rate of O(1/√T). Empirical evaluations show that S-Adam outperforms existing optimizers like AdamW and Prox-SGD in challenging scenarios such as Quantization-Aware Training (QAT) and high-noise small-batch learning, achieving significant accuracy improvements on datasets like CIFAR-100 and TinyImageNet.
Methodology
The authors propose S-Adam, which utilizes the LGI metric to assess local geometric instability without requiring Hessian computations. The optimizer incorporates an adaptive damping mechanism that modulates step sizes in real-time, allowing for stable updates in high-instability regions while maintaining fast convergence in smoother areas. A convergence analysis is conducted using differential inclusions to establish theoretical guarantees.
Results
Empirical evaluations demonstrate that S-Adam consistently outperforms AdamW and Prox-SGD, achieving accuracy improvements of up to +6% on CIFAR-100 and +3% on TinyImageNet. The optimizer effectively mitigates gradient oscillations and enhances convergence stability in non-smooth optimization scenarios.
Implications
The proposed S-Adam optimizer offers a robust solution for training deep learning models with non-smooth loss landscapes, making it particularly useful in applications involving quantization and high-noise environments. Its theoretical foundations and empirical performance suggest it could serve as a drop-in replacement for existing adaptive optimizers in various deep learning tasks.
KLAS: Using Similarity to Stitch Neural Networks for Improved Accuracy-Efficiency Tradeoffs
Computer Vision
Efficient ML
- KLAS improves accuracy-efficiency tradeoffs in stitched neural networks by leveraging KL divergence.
- The framework automates stitch selection, overcoming the limitations of heuristic-based approaches.
- KLAS achieves up to 1.21% higher accuracy or a 1.33× reduction in FLOPs compared to existing methods.
- The method is applicable across various model families, including vision transformers and CNNs.
Summary
The paper introduces KLAS, a novel framework for stitching pretrained neural networks to optimize the accuracy-efficiency tradeoff. Existing stitching methods often rely on heuristic approaches that yield suboptimal results and lack generalizability. KLAS addresses these limitations by employing Kullback-Leibler (KL) divergence to measure the similarity between intermediate representations of different models, allowing for automated and principled stitch selection. The framework identifies promising stitch configurations from a vast number of possibilities, enhancing the accuracy-efficiency curve of stitched models without incurring additional finetuning costs. Comprehensive experiments demonstrate that KLAS consistently outperforms existing methods, achieving higher accuracy or reduced computational costs across various model families and datasets, including ImageNet-1K and CIFAR-100. The findings suggest that KLAS not only improves model performance but also offers a flexible solution for deploying models in diverse computational environments.
Methodology
KLAS utilizes KL divergence to evaluate the similarity between intermediate activations of pretrained models, allowing for the automatic selection of optimal stitch configurations. This method replaces heuristic-based selection with a principled approach, significantly enhancing the accuracy-efficiency tradeoff of the resulting stitched networks.
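A minimal sketch of KL-based stitch ranking follows. It rests on several assumptions that this summary does not confirm: activations are turned into distributions with a softmax, both layers produce vectors of the same length, the divergence is computed as KL(A || B), and a lower value marks a more promising stitch. The layer names and the helper `rank_stitches` are hypothetical.

```python
import math


def softmax(values):
    """Turn an activation vector into a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def rank_stitches(acts_a, acts_b):
    """Score every (layer in A, layer in B) pair and sort by ascending KL."""
    scored = []
    for name_a, act_a in acts_a.items():
        for name_b, act_b in acts_b.items():
            score = kl_divergence(softmax(act_a), softmax(act_b))
            scored.append((score, name_a, name_b))
    return sorted(scored)


# Toy activations for two hypothetical pretrained models.
acts_a = {"a.block1": [1.0, 2.0, 3.0], "a.block2": [0.0, 0.0, 5.0]}
acts_b = {"b.block1": [1.0, 2.0, 3.1], "b.block2": [5.0, 0.0, 0.0]}

ranking = rank_stitches(acts_a, acts_b)
best_score, best_a, best_b = ranking[0]
```

In this toy setup the pair with nearly identical activations ranks first. A real pipeline would average the divergence over a batch of inputs and handle layers of different widths, which this sketch leaves out.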
Results
KLAS was tested on ImageNet-1K and CIFAR-100 datasets, showing improvements of up to 1.21% in top-1 accuracy at the same computational cost or a 1.33× reduction in FLOPs while maintaining accuracy. The framework consistently outperformed existing stitching methods across multiple model families.
Implications
The KLAS framework provides a robust solution for optimizing neural network deployment in resource-constrained environments, enabling better performance without sacrificing efficiency. Its principles can be applied to various domains, including computer vision and potentially large language models, enhancing the flexibility of model deployment strategies.
On Distributional Reinforcement Learning in Chaotic Dynamical Systems
Reinforcement Learning
Theory
Optimization
- Distributional RL objectives are smoother than expectation-based objectives in chaotic systems.
- Return distributions are Lipschitz continuous in the 1-Wasserstein metric, even with diverging trajectories.
- Empirical analysis shows that distributional objectives lead to lower variance and better optimization in chaotic environments.
- Distributional Q-learning methods outperform non-distributional approaches in specific chaotic control tasks.
Summary
This paper addresses the challenges posed by chaotic dynamical systems in the context of Reinforcement Learning (RL). Traditional RL methods, which optimize expected returns through scalar value functions, struggle in chaotic environments because of the exponential sensitivity to initial conditions. This sensitivity leads to high-variance bootstrapped targets and poorly conditioned gradient updates. The authors propose that under mild statistical stability assumptions, the return distribution evolves more regularly than individual trajectories when measured using the 1-Wasserstein metric. This observation allows for a smoother distributional Bellman objective, which aligns optimization with the structure of chaotic dynamics. The paper provides a theoretical foundation for the advantages of distributional RL methods in chaotic systems and empirically demonstrates that these methods yield smoother loss landscapes and lower variance targets, ultimately improving performance in chaotic control tasks.
Methodology
The authors analyze the optimization landscape of chaotic systems, focusing on the regularity of return distributions compared to trajectory-level quantities. They utilize the 1-Wasserstein metric to demonstrate the smoother nature of distributional RL objectives. Empirical experiments are conducted to compare distributional Q-learning methods against traditional non-distributional approaches in chaotic control scenarios.
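The contrast between trajectory-level and distribution-level comparisons can be made concrete with the empirical 1-Wasserstein distance. For two one-dimensional samples of equal size, it equals the mean absolute difference between the sorted values. The sketch below uses made-up return values, not data from the paper, and the helper `mean_paired_gap` is a hypothetical stand-in for comparing individual trajectories one by one.

```python
def wasserstein_1(samples_a, samples_b):
    """Empirical 1-Wasserstein distance between equal-size 1D samples."""
    if len(samples_a) != len(samples_b):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(samples_a), sorted(samples_b))
    return sum(abs(a - b) for a, b in pairs) / len(samples_a)


def mean_paired_gap(samples_a, samples_b):
    """Mean absolute difference between returns matched by trajectory index."""
    pairs = zip(samples_a, samples_b)
    return sum(abs(a - b) for a, b in pairs) / len(samples_a)


# Illustrative returns: index i is the i-th trajectory in each ensemble.
returns_a = [0.0, 1.0, 2.0, 3.0]
returns_b = [3.0, 2.0, 1.0, 0.0]  # same values, reshuffled across trajectories
returns_c = [0.5, 1.5, 2.5, 3.5]  # every return shifted by 0.5

paired_gap = mean_paired_gap(returns_a, returns_b)  # trajectories differ a lot
dist_gap = wasserstein_1(returns_a, returns_b)      # distributions are identical
shift_gap = wasserstein_1(returns_a, returns_c)     # a pure shift of 0.5
```

Here `paired_gap` is 2.0 while `dist_gap` is 0.0: individual trajectories can end up far apart even though the distribution of returns is unchanged. A uniform shift of 0.5 moves the distribution by exactly 0.5 in this metric.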
Results
The study finds that distributional RL methods produce smoother loss landscapes and lower variance one-step targets, which leads to improved episodic returns in chaotic environments. The theoretical analysis confirms that the distributional RL objective is more stable and effective for learning in chaotic systems compared to traditional scalar value function approaches.
Implications
The findings suggest that adopting distributional RL methods can enhance learning and control in various applications involving chaotic dynamical systems, such as multi-agent systems, climate modeling, and financial system stabilization. This work provides a theoretical justification for the empirical successes of distributional RL in chaotic contexts.
TRACER: Persistent Regularization for Robust Multimodal Finetuning
Multimodal
Theory
Computer Vision
- Introduces a theoretical framework for multimodal contrastive finetuning with closed-form solutions.
- Identifies the collapse issue of EMA teachers in robust finetuning and proposes WMA teachers as a solution.
- Develops TRACER, a method that combines contrastive learning with WMA-guided distillation.
- Demonstrates consistent improvements in OOD accuracy and calibration across multiple CLIP architectures.
Summary
The paper addresses the challenge of catastrophic forgetting in the finetuning of pretrained multimodal models, which often leads to a degradation of out-of-distribution (OOD) robustness. The authors propose a theoretical framework for multimodal contrastive finetuning that provides closed-form solutions and a geometric decomposition of various strategies. They highlight the limitations of standard Exponential Moving Average (EMA) teachers, which can suffer from collapse, and introduce a Weighted Moving Average (WMA) teacher that maintains a persistent regularizing force. This leads to the development of TRACER (Trajectory-Robust Anchoring for Contrastive Encoder Regularization), which combines contrastive learning with WMA-guided multi-perspective distillation. Extensive experiments on CLIP finetuning demonstrate that TRACER consistently improves OOD accuracy and calibration across multiple architectures, while also being robust to hyperparameter variations. The work contributes to a deeper understanding of the geometric behavior of finetuning strategies and offers a principled approach to mitigating catastrophic forgetting.
Methodology
The authors develop a theoretical framework that reformulates the linearized contrastive loss into a matrix least-squares problem, allowing for closed-form solutions. They analyze the geometric structure of finetuning strategies and propose TRACER, which integrates contrastive learning with dynamic self-distillation using WMA teachers to maintain regularization strength throughout training.
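The collapse behavior of an EMA teacher can be illustrated with a toy example. The sketch assumes that the strength of the distillation signal scales with the distance between teacher and student weights, and it uses short vectors as stand-ins for model parameters. The summary does not specify the weighting scheme behind the WMA teacher, so only the EMA side is shown.

```python
import math


def ema_update(teacher, student, momentum):
    """Standard EMA rule: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def l2_gap(a, b):
    """Euclidean distance between two parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


student = [1.0, -2.0, 0.5]  # stand-in for a student that has stopped moving
teacher = [0.0, 0.0, 0.0]   # stand-in for the pretrained starting point
initial_gap = l2_gap(teacher, student)

gaps = []
for _ in range(100):
    teacher = ema_update(teacher, student, momentum=0.9)
    gaps.append(l2_gap(teacher, student))

final_gap = gaps[-1]
```

Each update multiplies the gap by the momentum, so after 100 steps with momentum 0.9 the teacher is nearly indistinguishable from the student. Under the assumption above, a regularizer driven by that gap fades over training, which is the failure mode a persistent teacher is meant to avoid.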
Results
TRACER shows significant improvements in OOD accuracy and calibration when finetuning CLIP models across three architectures. Extensive ablation studies confirm its robustness to hyperparameter choices and the effectiveness of its design components.
Implications
The findings suggest that TRACER can be applied to enhance the robustness of multimodal models in various applications, particularly in scenarios where OOD performance is critical. This could benefit fields such as computer vision, natural language processing, and any domain relying on pretrained multimodal models.