AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 48
Update frequency: 8h
Days of history: 7
Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)
NLP
Large Language Models
Interpretability
- Failed reasoning traces encode recoverability structure that can guide effective interventions.
- Three trajectory features derived from failed traces help cluster failures and characterize post-training methods.
- A training-free routing rule based on these features improves recovery rates on challenging reasoning problems.
- The approach allows for diagnostic analysis without requiring access to training-time data or model weights.
Summary
This paper examines how failed reasoning traces in language models can indicate whether a failure is recoverable at inference time. The authors argue that common responses to failure, such as retrying with additional compute, overlook the structural nature of certain failures. They propose a framework that treats failed traces as diagnostic objects, making it possible to identify which test-time interventions can rescue a given failure. They derive three trajectory features from the structure of the available interventions: deformation spread, junction concentration, and local displacement budget. These features cluster failures into stable regimes and characterize the failure landscape of various post-training methods. A training-free routing rule based on these features improves recovery rates on challenging problems, specifically those categorized as 'Steerable-Hard', where standard retry methods are insufficient. The approach both clarifies what failed traces encode and gives practitioners actionable guidance for deploying language models.
Methodology
The authors analyze failed reasoning traces to extract three trajectory features that reflect the structure of available interventions. They propose a routing rule that utilizes these features to determine the most effective recovery strategy for each failure type. The methodology is validated through experiments on various reasoning tasks, focusing on the 'Steerable-Hard' subset where standard retry methods fail.
Results
The proposed features and routing rule achieved an accuracy of 84.3±4.3%, which is a 20% improvement over a majority-class baseline. Additionally, the routing rule provided a 12.2% lift in recovery rates for the Steerable-Hard problems, demonstrating its effectiveness in directing compute resources more efficiently.
Implications
This research has significant implications for the deployment of language models in real-world applications, particularly in scenarios where reasoning capabilities are critical. By leveraging failed traces for diagnostic purposes, practitioners can optimize model performance and resource allocation during inference, leading to more robust AI systems.
A Geometric Characterization of the Stationary Plateau for Two-Layer Neural Networks
Theory
Optimization
- Introduces a geometric characterization of stationary plateaus in two-layer neural networks.
- Classifies stationary points on plateaus based on conditions of local minima and saddle points.
- Identifies the 'inner Hessian' matrix as a key factor in determining local geometry.
- Demonstrates how neuron splitting affects the nature of stationary points in wider networks.
Summary
This paper investigates the geometric structure of stationary plateaus in the loss landscape of two-layer neural networks with smooth activation functions. The authors focus on the phenomenon of 'neuron splitting,' where duplicating a hidden neuron results in an affine set of stationary points in a wider network. They provide a comprehensive classification of stationary points on these plateaus, determining conditions under which they are local minima or saddle points. The analysis introduces the 'inner Hessian' matrix, a curvature object that, along with splitting coefficients, influences the local geometry of the plateau. The findings reveal that splitting a local minimum can lead to a mixture of local minima and saddles or an all-saddle plateau, while splitting a saddle point always results in a plateau of saddle points. This work extends previous landscape analyses and offers new insights into how model expansion affects the nature of stationary points in neural networks.
Methodology
The authors conducted a systematic study of the effects of neuron splitting on stationary points in two-layer neural networks. They introduced the inner Hessian matrix to analyze the curvature of stationary points and established conditions for classifying these points as local minima or saddle points based on geometric criteria.
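As a minimal sketch of the neuron-splitting construction described above, the snippet below duplicates one hidden neuron and divides its outgoing weight between the two copies. The scalar-input network, the tanh activation, and the names two_layer, split_neuron, and lam are assumptions chosen for illustration, not the paper's notation. It only checks that the network function is preserved; it does not verify stationarity or classify the resulting points.

```python
import math

def two_layer(x, w, a):
    """Scalar-input two-layer network: f(x) = sum_i a[i] * tanh(w[i] * x)."""
    return sum(a_i * math.tanh(w_i * x) for w_i, a_i in zip(w, a))

def split_neuron(w, a, j, lam):
    """Duplicate hidden neuron j; both copies keep w[j] and share a[j] as lam and (1 - lam)."""
    w_new = w[:j] + [w[j], w[j]] + w[j + 1:]
    a_new = a[:j] + [lam * a[j], (1.0 - lam) * a[j]] + a[j + 1:]
    return w_new, a_new

# Hypothetical narrow network with two hidden neurons.
w, a = [0.5, -1.2], [2.0, 0.7]
w_wide, a_wide = split_neuron(w, a, j=0, lam=0.3)

# The wider network computes the same function for any splitting coefficient.
for x in (-1.0, 0.0, 2.5):
    assert abs(two_layer(x, w, a) - two_layer(x, w_wide, a_wide)) < 1e-9
```

Because lam can be any real number, the split parameters trace out a line in the wider network's parameter space, which is the kind of affine set the summary refers to.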
Results
The study found that splitting a local minimum can produce either a mixed plateau of local minima and saddles or an all-saddle plateau, depending on local geometric conditions. Conversely, splitting a saddle point consistently results in a plateau of saddle points. A concrete region within the plateau was identified where any choice of coefficients leads to saddle points, establishing a clear evolution law for stationary points under network widening.
Implications
These findings provide a deeper understanding of the optimization landscape in neural networks, particularly how expanding network width influences the nature of stationary points. This could inform strategies for network design and optimization, potentially leading to improved training methods and performance in various applications.
QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy
NLP
Large Language Models
Efficient ML
- QuBLAST introduces a block-level compression approach for mixed-precision quantization of LLMs.
- The activation scaling strategy effectively mitigates the impact of activation outliers.
- Experimental results show a significant reduction in model size while maintaining performance.
- QuBLAST is applicable to various LLM architectures, including those with non-conventional attention mechanisms.
Summary
The paper presents QuBLAST, a novel post-training quantization (PTQ) methodology designed to optimize the deployment of large language models (LLMs) on embedded systems. Traditional quantization methods often apply uniform quantization across all attention blocks, which can lead to suboptimal memory savings and performance degradation. QuBLAST addresses these issues by employing a block-level compression approach that allows for mixed-precision quantization tailored to the sensitivity of different attention blocks. Additionally, it introduces an activation scaling strategy to effectively manage activation outliers, which can adversely affect quantization outcomes. The methodology includes a sensitivity analysis based on cross-entropy loss to determine appropriate weight quantization levels for each block. Experimental evaluations demonstrate that QuBLAST achieves a model size reduction of 40%-45.2% across various architectures, including Qwen3-8B, Llama3-8B, Mistral v0.1-8B, and Falcon H1R-7B, while maintaining performance with only a 5% increase in perplexity on the WikiText-2 and WikiText-103 datasets. This indicates that QuBLAST is effective in quantizing diverse LLMs while preserving their performance, thus facilitating their use in resource-constrained environments.
Methodology
QuBLAST employs a post-training quantization methodology that includes a block-level compression approach for mixed-precision quantization and an activation scaling strategy to manage activation outliers. It conducts a sensitivity analysis of attention blocks using cross-entropy loss to determine optimal weight quantization levels, and it utilizes an activation scaling map for each block to control activation ranges.
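The sensitivity-driven choice of per-block precision can be sketched roughly as follows. This is an illustrative assumption about how such a selection might look, not QuBLAST's actual procedure: the function name assign_bits, the tolerance rule, and all numbers are hypothetical, and the activation scaling step is not modeled.

```python
def assign_bits(sensitivity, tolerance):
    """Pick, for each block, the smallest bit-width whose loss increase is within tolerance.

    sensitivity maps block name -> {bit_width: cross-entropy increase when only
    that block is quantized to bit_width}. If no candidate meets the tolerance,
    the block keeps its largest candidate bit-width.
    """
    plan = {}
    for block, by_bits in sensitivity.items():
        acceptable = [bits for bits, delta in by_bits.items() if delta <= tolerance]
        plan[block] = min(acceptable) if acceptable else max(by_bits)
    return plan

# Made-up sensitivity measurements for three attention blocks.
sensitivity = {
    "block_0": {4: 0.30, 8: 0.02},  # sensitive: needs 8 bits
    "block_1": {4: 0.01, 8: 0.00},  # robust: 4 bits suffice
    "block_2": {4: 0.90, 8: 0.20},  # nothing meets the tolerance: fall back to 8
}
plan = assign_bits(sensitivity, tolerance=0.05)
print(plan)
```

A threshold rule like this is only one possible policy; a budgeted search over the same sensitivity table would be another.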
Results
QuBLAST reduces model size by 40% to 45.2% across multiple LLM architectures, while perplexity on the WikiText-2 and WikiText-103 datasets increases by only 5%.
Implications
The findings suggest that QuBLAST can significantly enhance the deployment of large language models in embedded systems, making advanced NLP capabilities more accessible in resource-constrained environments. This could lead to broader applications of LLMs in personalized systems and other areas where computational efficiency is critical.
Two-Action Apple Tasting with Switching Costs
Theory
Optimization
- The two-action apple-tasting problem is analyzed with switching costs against an oblivious adversary.
- The expected minimax regret is established as Θ(√T), disproving the conjectured Ω(T^(2/3)) lower bound.
- A normalized formulation of the problem simplifies the analysis and allows for a more straightforward algorithm design.
- The proposed algorithm utilizes a simple alternating strategy between blind and inspection modes to achieve O(√T) regret.
Summary
This paper investigates the two-action apple-tasting problem under the influence of switching costs when facing an oblivious adversary. The problem involves a learner who must choose between a revealing action, which provides no reward but reveals the hidden value of a blind action, and a blind action, which yields a reward but provides no information. The learner incurs a cost each time they switch actions, and the goal is to minimize regret against the best fixed action in hindsight. The authors establish that the expected minimax regret for this problem is Θ(√T), disproving the conjecture that a lower bound of Ω(T^(2/3)) exists for this scenario. They reformulate the problem into a normalized version, allowing for a clearer analysis of regret. The proposed algorithm achieves O(√T) regret by alternating between blind and inspection modes, utilizing Bernoulli random variables to decide when to switch actions. This approach effectively balances the exploration of the revealing action with the exploitation of the blind action, leading to a significant reduction in regret compared to previous methods.
Methodology
The authors reformulate the apple-tasting problem into a normalized version, where the learner's actions are analyzed in terms of cumulative rewards. They develop an algorithm that alternates between blind and inspection modes, using independent Bernoulli random variables to determine when to switch actions. The analysis of regret is conducted by examining the contributions from completed inspection runs and the cumulative rewards from blind actions.
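To make the two-mode structure concrete, here is a rough sketch of a schedule that alternates between blind and inspection modes using independent Bernoulli draws. It is not the paper's algorithm: how the learner uses the values seen during inspection is omitted, and the function name, the switching probabilities, and the seed are all assumptions. It only illustrates that switching with probability about 1/√T per round keeps the expected number of switches on the order of √T.

```python
import math
import random

def two_mode_schedule(T, p_to_inspect, p_to_blind, seed=0):
    """Return the mode played in each round and the number of mode switches.

    After every round an independent Bernoulli draw decides whether to flip
    the mode for the next round (hypothetical rule, for illustration only).
    """
    rng = random.Random(seed)
    mode = "blind"
    modes, switches = [], 0
    for _ in range(T):
        modes.append(mode)
        flip_prob = p_to_inspect if mode == "blind" else p_to_blind
        if rng.random() < flip_prob:
            mode = "inspect" if mode == "blind" else "blind"
            switches += 1
    return modes, switches

T = 10_000
p = 1.0 / math.sqrt(T)  # placeholder switching probability
modes, switches = two_mode_schedule(T, p, p)
print(switches, "switches over", T, "rounds; sqrt(T) =", int(math.sqrt(T)))
```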
Results
The main result demonstrates that the expected minimax regret for the two-action apple-tasting problem with switching costs is Θ(√T). The proposed algorithm achieves O(√T) regret, which is a significant improvement over previous upper bounds of Õ(T^(2/3)).
Implications
The results have important implications for the classification of feedback graphs in the presence of switching costs, as they clarify the limitations of existing theories and provide a foundation for future research in this area. Additionally, the findings may influence the design of algorithms in similar decision-making scenarios where switching costs are a factor.
STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations
NLP
Large Language Models
Interpretability
- STRIDE models TDA as a sparse recovery problem in activation space, bypassing the need for retraining LLMs.
- The framework uses lightweight steering operators to mimic changes in model predictions caused by training data subsets.
- STRIDE outperforms existing TDA methods in terms of accuracy and computational efficiency.
- The approach facilitates practical applications such as data selection and contamination detection.
Summary
The paper introduces STRIDE, a novel framework for Training Data Attribution (TDA) that addresses the challenges of tracing model predictions back to training data without the computational burden of retraining large language models (LLMs). Traditional TDA methods rely on causal interventions and gradient-based approximations, which are computationally expensive and often impractical for LLMs. STRIDE shifts the focus from parameter changes to modeling the functional effects of training data in the activation space. It formulates TDA as a sparse recovery problem, utilizing lightweight 'steering operators' that simulate the behavioral shifts caused by training on data subsets. By applying these operators to test predictions, STRIDE effectively recovers the influences of individual training examples through sparse linear decomposition. The framework demonstrates significant improvements in efficiency and accuracy, achieving the highest Linear Datamodeling Score (LDS) compared to existing methods while being over 12 times faster. Additionally, STRIDE's practical utility is validated through applications in data selection, contamination detection, and qualitative analysis, highlighting its potential for enhancing model interpretability and dataset curation.
Methodology
STRIDE operates in two stages: first, it learns low-rank steering operators that approximate the output changes of a model trained on specific subsets of data. Second, during inference, these operators are applied to test queries to generate perturbation response vectors, which are then used to recover individual training influences through sparse recovery techniques.
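The sparse recovery step can be illustrated with a toy example. The sketch below uses plain matching pursuit as a stand-in, since the summary does not say which solver STRIDE uses. The subset-membership matrix, the response vector, and the function name are made up for illustration: each row marks which training examples belong to a perturbed subset, and the response is assumed to be the sum of the influences of the examples in that subset.

```python
def matching_pursuit(A, y, max_atoms, tol=1e-9):
    """Greedy sparse recovery of x from y ≈ A x (A given as a list of rows)."""
    m, n = len(A), len(A[0])
    cols = [[A[r][c] for r in range(m)] for c in range(n)]
    norms_sq = [sum(v * v for v in col) for col in cols]
    x = [0.0] * n
    residual = [float(v) for v in y]
    for _ in range(max_atoms):
        corr = [sum(cv * rv for cv, rv in zip(col, residual)) for col in cols]
        # Choose the column best aligned with the current residual.
        best = max(
            range(n),
            key=lambda c: abs(corr[c]) / norms_sq[c] ** 0.5 if norms_sq[c] > 0 else 0.0,
        )
        if abs(corr[best]) <= tol:
            break
        coef = corr[best] / norms_sq[best]
        x[best] += coef
        residual = [rv - coef * cv for rv, cv in zip(residual, cols[best])]
    return x

# Rows: four perturbed subsets; columns: four training examples (toy data).
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
]
# Responses generated by a hypothetical influence vector [0, 2, 0, -1].
y = [2, 2, -1, -1]
print(matching_pursuit(A, y, max_atoms=4))
```

In this toy case only two training examples carry influence, and the greedy procedure attributes the response to exactly those two.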
Results
STRIDE achieves the highest Linear Datamodeling Score (LDS) among competing methods while running more than 12 times faster. The framework's effectiveness is validated through various downstream applications, confirming its potential to improve model interpretability and data quality.
Implications
The development of STRIDE has significant implications for auditing model behavior, detecting data memorization, debugging harmful outputs, and curating high-quality datasets for future training. Its efficiency and accuracy make it a valuable tool for researchers and practitioners working with large language models.
Multi-Modal Graph Neural Network with Transformer-Guided Adaptive Diffusion for Preclinical Alzheimer Classification
Graph Learning
Multimodal
Interpretability
- Introduction of GTAD framework that combines diffusion processes with transformer-guided attention.
- Improved classification performance for preclinical Alzheimer's disease using multi-modal imaging data.
- Enhanced interpretability in identifying critical brain regions associated with Alzheimer's disease.
- Demonstration of the model's effectiveness on structural brain networks from the ADNI study.
Summary
This paper presents a novel framework called Graph Neural Network with Transformer-guided Adaptive Diffusion (GTAD) aimed at improving the classification of preclinical Alzheimer's disease (AD) using multi-modal brain imaging data. The authors address limitations in existing Graph Neural Networks (GNNs) that struggle with aggregating information from distant nodes and retaining critical characteristics from pivotal nodes. The proposed GTAD framework integrates a diffusion process guided by a transformer, allowing for the aggregation of both short- and long-range properties of brain graphs. The model effectively captures local and global graph characteristics, enhancing the interpretability of the results, particularly in identifying key regions of interest (ROIs) associated with early AD symptoms. Experimental results demonstrate that GTAD outperforms state-of-the-art methods in preclinical AD classification, showcasing its potential for early diagnosis and intervention.
Methodology
The GTAD framework utilizes a combination of adaptive convolution blocks and multi-head self-attention mechanisms. It first generates locally effective representations of each node based on various imaging modalities using a heat-kernel approach. These representations are then aggregated through a transformer to achieve a globally effective representation for classification. The model learns node-centric parameters for the diffusion kernel, allowing for adaptive scaling based on the importance of each modality.
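For intuition about the heat-kernel step, the sketch below diffuses a signal over a tiny graph by approximating exp(-scale * L) applied to the signal, where L is the graph Laplacian. It is a simplified illustration under stated assumptions: a single global scale instead of the learned node-centric parameters, a three-node path graph instead of a brain network, and a truncated Taylor series instead of whatever solver the authors use.

```python
def heat_diffuse(laplacian, signal, scale, terms=30):
    """Approximate exp(-scale * L) @ signal with a truncated Taylor series."""
    n = len(signal)
    result = list(signal)
    term = list(signal)
    for k in range(1, terms):
        # term_k = (-scale / k) * L @ term_{k-1}
        term = [
            -scale / k * sum(laplacian[i][j] * term[j] for j in range(n))
            for i in range(n)
        ]
        result = [r + t for r, t in zip(result, term)]
    return result

# Laplacian of the path graph 0 - 1 - 2 (toy example).
L = [
    [1.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 1.0],
]
signal = [1.0, 0.0, 0.0]  # a feature concentrated on node 0
print(heat_diffuse(L, signal, scale=0.5))
```

A small scale keeps the representation local, while a larger one spreads information to distant nodes, which is why learning the scale per node can balance short- and long-range aggregation.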
Results
The GTAD model significantly improves classification accuracy for preclinical Alzheimer's disease compared to existing GNN methods. It successfully identifies key ROIs that correlate with early signs of the disease, providing insights into the underlying brain network alterations associated with AD.
Implications
The findings suggest that the GTAD framework could be instrumental in developing tools for early diagnosis and monitoring of Alzheimer's disease, potentially leading to better patient outcomes through timely interventions. The model's interpretability also aids in understanding the complex relationships within brain networks, which is crucial for neuroimaging studies.
When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models
Large Language Models
Graph Learning
Interpretability
- Graph sink tokens emerge as activation-level outliers but do not effectively convey graph information.
- High activation levels of graph tokens do not correlate with their importance for downstream tasks.
- Current GLMs exhibit a decoupling between token saliency and semantic utility, indicating potential architectural limitations.
Summary
This paper investigates the internal processing of graph tokens in Graph Language Models (GLMs) and their effectiveness in conveying graph structure. The authors analyze how GLMs, which adapt Large Language Models (LLMs) for graph learning tasks, interpret graph tokens and whether these tokens serve as effective carriers of graph information. The study reveals that graph sink tokens, identified by their high activation levels in specific hidden-state dimensions, do not correlate with meaningful graph information utilization. Despite their prominence, these tokens do not attract significant attention from query tokens, and targeted interventions show that they contribute little to downstream performance. The findings indicate a decoupling between activation-level saliency and graph-semantic utility, suggesting limitations in current graph-token construction and alignment mechanisms.
Methodology
The authors conducted a mechanistic analysis of two GLM architectures (LLaGA and TEA-GLM) to study graph token behavior. They defined graph sink tokens based on activation patterns and performed targeted interventions (pruning, repositioning, and swapping) to assess the impact of these tokens on model performance.
Results
The analysis showed that graph sink tokens are biased towards early positions in the token sequence and are characterized by high activation in a limited number of hidden dimensions. However, these tokens do not dominate attention mechanisms and contribute minimally to the model's predictive capabilities, revealing a disconnect between their activation prominence and their semantic relevance.
Implications
The findings suggest that existing GLMs may not effectively leverage graph structures in their internal representations, which could hinder their performance in graph-related tasks. This highlights the need for improved methods in graph-token construction and alignment to enhance the utility of GLMs in processing graph data.
AdaWeather: Adaptively Mixing Probabilistic Weather Forecasts with Logarithmic Regret
Time Series
Theory
Optimization
- Introduces AdaWeather, a framework combining probabilistic weather forecasts adaptively.
- Achieves logarithmic regret compared to the best static mixture of experts.
- Utilizes a U-Net model for historical pattern learning to enhance forecast accuracy.
- Demonstrates improved performance in temperature forecasting over existing methods.
Summary
The paper introduces AdaWeather, an adaptive framework designed to combine multiple probabilistic weather forecasts to enhance prediction accuracy and robustness. Traditional forecasting methods often struggle with consistent performance across various contexts, motivating the need for adaptive combination techniques. AdaWeather integrates machine learning with a mixture of experts approach, allowing it to dynamically adjust forecasts based on historical data. The authors extend existing regret analysis by demonstrating that their method achieves logarithmic regret compared to the best static mixture of experts, rather than just the best single expert. The framework employs a U-Net model trained on historical weather data to provide side-information for the aggregation process, leading to improved forecasting results, particularly in temperature predictions. The paper also discusses the theoretical underpinnings of the method, including a novel regret bound that aligns with optimal minimax bounds under specific conditions. Empirical evaluations show that AdaWeather outperforms existing forecasting methods, highlighting its potential for practical applications in various fields reliant on accurate weather predictions.
Methodology
AdaWeather combines offline and online methods by training a U-Net model on historical weather data to learn patterns. This model is then incorporated as an expert in an online aggregation algorithm that dynamically adjusts predictions based on incoming data. The framework is designed to handle distribution shifts and improve robustness in weather forecasting.
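The online aggregation idea can be sketched with the classical Bayesian mixture (exponential weights under log loss). This is a simpler stand-in, not AdaWeather's algorithm: it is only guaranteed to compete with the best single expert, whereas the paper's bound is against the best static mixture. The function name and the toy forecasts are assumptions.

```python
import math

def mix_forecasts(expert_probs, outcomes):
    """Aggregate probabilistic forecasts online under log loss.

    expert_probs[t][i][k] is the probability expert i assigns to class k at round t;
    outcomes[t] is the class that actually occurred.
    Returns the final expert weights and the mixture's cumulative log loss.
    """
    n = len(expert_probs[0])
    weights = [1.0 / n] * n
    total_log_loss = 0.0
    for probs_t, y in zip(expert_probs, outcomes):
        # Probability the current mixture assigned to the realized outcome.
        p_mix = sum(w * p[y] for w, p in zip(weights, probs_t))
        total_log_loss += -math.log(p_mix)
        # Bayesian update: reweight each expert by how well it predicted.
        weights = [w * p[y] / p_mix for w, p in zip(weights, probs_t)]
    return weights, total_log_loss

# Toy data: expert 0 puts 0.8 on the realized class, expert 1 is uninformative.
outcomes = [0, 1, 0, 0, 1]
expert_probs = [
    [[0.8, 0.2] if y == 0 else [0.2, 0.8], [0.5, 0.5]] for y in outcomes
]
weights, loss = mix_forecasts(expert_probs, outcomes)
print(weights, loss)
```

In a setup like the one described, a learned model such as the U-Net would simply appear as one more entry in the expert list.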
Results
Empirical results indicate that AdaWeather significantly outperforms existing probabilistic weather forecasting methods, particularly in temperature predictions. The framework's theoretical guarantees on regret provide a strong foundation for its effectiveness in adaptive forecasting.
Implications
The AdaWeather framework has significant implications for fields such as agriculture, energy management, and disaster response, where accurate weather forecasting is critical. Its adaptive nature allows for better handling of changing conditions, potentially leading to more reliable decision-making based on weather predictions.
A Goal-Set Characterization of Task Composition in the Boolean Task Algebra
Reinforcement Learning
Theory
Efficient ML
- Formalization of a representational limitation in the BTA framework, reducing training costs from O(log₂ |G|) to constant.
- Introduction of a new method for task composition that relies on goal sets and requires only array lookups.
- Empirical validation showing that additional base tasks do not enhance performance upon convergence.
- Identification of limitations in deterministic BTA composition when applied to stochastic MDPs, necessitating consideration of exponentially many policies.
Summary
This paper revisits the Boolean Task Algebra (BTA) framework for zero-shot task composition in reinforcement learning. The authors identify a structural simplification in the space of optimal extended Q-value functions, demonstrating that in deterministic Markov Decision Processes (MDPs), these functions can be fully determined by just two tasks: the universal task and the empty task. This insight leads to a new goal-set-based composition method that allows for logical operations on goal sets, significantly reducing the learning costs and composition time compared to the original BTA formulation. The proposed method eliminates the need for a logarithmic set of base tasks, instead relying on array lookups to reconstruct composed value functions. Empirical evaluations across various environments show that adding more base tasks does not improve performance, and the authors also explore the limitations of this approach in stochastic settings, where optimal composition may require considering an exponential number of policies. Overall, the paper presents a more efficient and simplified approach to task composition in reinforcement learning, while also highlighting the challenges posed by stochastic environments.
Methodology
The authors formalize the collapse in the BTA framework by demonstrating that optimal extended Q-value functions can be derived from just two components: the universal and empty tasks. They propose a goal-set-based composition method that allows for efficient logical operations on goal sets, reconstructing composed value functions through simple array lookups rather than complex operations on learned value functions. Empirical evaluations are conducted across various environments to validate the proposed method.
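One possible reading of the goal-set composition is sketched below: Boolean operations act on sets of goal indices, and the composed extended Q-table is rebuilt by copying, for each goal, the slice from either the universal task or the empty task. The table layout q[state][goal][action], the function names, and all values are assumptions for illustration; the paper's exact construction may differ.

```python
def compose_q(q_universal, q_empty, goal_set):
    """Build a composed extended Q-table by per-goal lookup.

    Goals inside goal_set take their values from the universal task,
    all other goals take them from the empty task.
    """
    composed = []
    for s in range(len(q_universal)):
        row = []
        for g in range(len(q_universal[s])):
            source = q_universal if g in goal_set else q_empty
            row.append(list(source[s][g]))
        composed.append(row)
    return composed

def greedy_action(q, state):
    """Act greedily with respect to the best (goal, action) pair."""
    _, best_action = max(
        ((g, a) for g in range(len(q[state])) for a in range(len(q[state][g]))),
        key=lambda ga: q[state][ga[0]][ga[1]],
    )
    return best_action

# Toy tables: one state, three goals, two actions (made-up numbers).
q_universal = [[[1.0, 0.5], [0.2, 0.9], [0.4, 0.3]]]
q_empty = [[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]]

all_goals = {0, 1, 2}
task_a, task_b = {0, 1}, {1, 2}

q_and = compose_q(q_universal, q_empty, task_a & task_b)    # A AND B
q_not = compose_q(q_universal, q_empty, all_goals - task_a)  # NOT A
print(greedy_action(q_and, 0), greedy_action(q_not, 0))
```

Under this reading no additional base tasks need to be learned: a new composition costs only a set operation plus lookups.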
Results
The results indicate that the new composition method significantly reduces learning costs and composition time without sacrificing policy performance. The empirical studies confirm that learning additional base tasks does not lead to better performance, and the authors provide a counterexample illustrating the limitations of deterministic BTA-style composition in stochastic MDPs, where the number of policies that must be considered may grow exponentially with the number of goals.
Implications
This work has significant implications for the design of reinforcement learning agents, particularly in environments where task composition is essential. The findings suggest that a more efficient approach to task composition can lead to faster learning and better resource utilization, while also highlighting the need for careful consideration of stochastic environments in future research.
Edge of Stability Selectively Shapes Learning Across the Data Distribution
Optimization
Theory
- EoS is a selective mechanism that redistributes learning across the training data distribution.
- Two necessary conditions for benefiting from EoS are gradient alignment with the top Hessian eigenvector and sustained gradient magnitude.
- Controlled perturbations can isolate the effects of alignment and persistence on learning outcomes.
- The geometric composition of the training data influences which subsets benefit from EoS, affecting generalization behavior.
Summary
This paper investigates the edge of stability (EoS) in deep learning, revealing that it is not merely a global property of optimization but a selective mechanism that redistributes learning across different subsets of the training data. The authors demonstrate that the stability constraint can amplify learning for certain groups while suppressing it for others. They identify two critical conditions for a group to benefit from EoS: alignment of the aggregate gradient with the top Hessian eigenvector and the maintenance of non-vanishing gradient magnitude over time. Through controlled perturbations, the authors show that disrupting alignment or gradient persistence diminishes the EoS advantage. Their findings suggest that the geometry of the training data influences which subsets benefit from EoS, with implications for generalization and robustness in machine learning models. The paper contributes to the understanding of how optimization dynamics affect learning across data distributions, challenging classical intuitions about stability and curvature in deep learning.
Methodology
The authors employed a branching intervention technique to enter or exit the EoS regime from a shared training trajectory. They conducted controlled perturbations to isolate the effects of gradient alignment and persistence on learning. The study involved experiments using multi-layer perceptrons (MLPs) on the CIFAR-10 dataset to analyze the impact of EoS on different subsets of the training data.
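The alignment condition can be made concrete with a small sketch: estimate the top Hessian eigenvector by power iteration, then measure the cosine between a group's aggregate gradient and that direction. The 2x2 Hessian, the gradients, and the function names are toy assumptions, and the Hessian is assumed positive semidefinite so that power iteration finds the largest eigenvalue's direction. The paper's measurement procedure on real networks is not shown.

```python
import math

def top_eigenvector(H, iters=200):
    """Power iteration for the dominant eigenvector of a symmetric PSD matrix."""
    n = len(H)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(H[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def alignment(gradient, direction):
    """Absolute cosine between a gradient and a unit-norm direction."""
    dot = sum(g * d for g, d in zip(gradient, direction))
    norm = math.sqrt(sum(g * g for g in gradient))
    return abs(dot) / norm

# Toy Hessian whose sharpest direction is the first coordinate axis.
H = [[4.0, 0.0], [0.0, 1.0]]
v_top = top_eigenvector(H)

grad_group_a = [3.0, 0.0]  # aligned with the sharpest direction
grad_group_b = [0.0, 2.0]  # orthogonal to it
print(alignment(grad_group_a, v_top), alignment(grad_group_b, v_top))
```

In the summary's terms, only a group like the first one, whose gradient also stays non-vanishing over time, would be expected to benefit from EoS.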
Results
The results indicate that EoS selectively benefits subsets of the training data whose gradients align with the top Hessian eigenvector and maintain non-vanishing magnitudes. The experiments showed that varying the geometric composition of the training distribution shifts which subset benefits from EoS, leading to improved adversarial robustness and out-of-distribution generalization depending on the proximity of examples to the decision boundary.
Implications
These findings have significant implications for the design of training regimes in deep learning, suggesting that understanding the geometry of the training data and the dynamics of EoS can lead to better optimization strategies and improved model performance in terms of generalization and robustness.
Learning Empirically Admissible Neural Heuristics for Combinatorial Search
Reinforcement Learning
Optimization
- Introduces a framework for learning admissible neural heuristics that guarantees path optimality.
- Utilizes an underestimating Admissible Bellman Operator and an Asymmetric Loss function to prevent overestimations.
- Implements a post-hoc calibration safety offset to ensure empirical admissibility.
- Achieves significant reductions in search node expansions across various combinatorial puzzles.
Summary
This paper addresses the challenge of finding optimal solution paths for combinatorial puzzles such as the Rubik's Cube and Lights Out, which are traditionally solved using heuristic search algorithms like A*. The authors highlight that these algorithms require admissible heuristics that do not overestimate the true cost-to-go. They propose a novel framework for learning validation-calibrated admissible neural heuristics that combines an underestimating Admissible Bellman Operator with an Asymmetric Loss function to penalize overestimations. Additionally, a post-hoc calibration safety offset is introduced to account for residual errors in neural function approximation. The framework is evaluated across three combinatorial domains, demonstrating that the calibrated heuristics maintain admissibility and optimality while significantly reducing search node expansions compared to standard analytical methods.
Methodology
The methodology involves formulating combinatorial puzzles as deterministic, discrete-time Markov Decision Processes (MDPs). The authors employ an underestimating Admissible Bellman Operator to ensure that the neural network targets remain bounded below the true optimal cost-to-go. They also apply an Asymmetric Loss function to skew predictions towards underestimation and introduce a post-hoc calibration safety offset to mitigate local approximation errors.
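The asymmetric loss and the post-hoc safety offset can be sketched as follows. The squared-error form, the overestimation weight, and the choice of offset as the largest validation overestimate are illustrative assumptions, not the paper's exact definitions, and the Admissible Bellman Operator is not modeled.

```python
def asymmetric_loss(pred, target, over_weight=10.0):
    """Squared error that penalizes overestimation more than underestimation."""
    err = pred - target
    return over_weight * err * err if err > 0 else err * err

def calibration_offset(preds, true_costs):
    """Largest overestimate observed on held-out validation states (0 if none)."""
    return max(0.0, max(p - c for p, c in zip(preds, true_costs)))

def calibrated_heuristic(pred, offset):
    """Shift the raw prediction down by the offset, never below zero."""
    return max(0.0, pred - offset)

# Hypothetical validation predictions and true optimal costs-to-go.
val_preds = [3.0, 5.5, 2.0]
val_costs = [4.0, 5.0, 2.0]
offset = calibration_offset(val_preds, val_costs)

# After calibration no validation state is overestimated.
for p, c in zip(val_preds, val_costs):
    assert calibrated_heuristic(p, offset) <= c
```

An offset chosen this way guarantees admissibility only on the states it was calibrated against, which matches the "empirical" qualifier in the title.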
Results
The proposed calibrated neural heuristics demonstrated no admissibility violations during evaluation and preserved path optimality. The framework achieved reductions in search node expansions by up to 83.0% on a 2 × 2 Rubik’s Cube, 19.9% on a 3 × 3 Lights Out grid, and 1.9% on an 8-Puzzle compared to standard analytical baselines.
Implications
The findings suggest that the proposed framework can significantly enhance the efficiency of heuristic search algorithms in solving complex combinatorial problems, making it applicable in various fields such as robotics, logistics, and bioinformatics where optimal pathfinding is crucial.
Reinforcement Learning from Rich Feedback with Distributional DAgger
Reinforcement Learning
Large Language Models
Theory
- DistIL leverages rich feedback beyond binary correctness, improving credit assignment in reinforcement learning.
- The proposed method guarantees monotonic policy improvement, addressing limitations of existing self-distillation techniques.
- Empirical results indicate significant performance gains across diverse domains compared to traditional RLVR and self-distillation methods.
Summary
This paper addresses the limitations of traditional reinforcement learning from verifiable rewards (RLVR), which typically relies on binary feedback for model training. The authors propose a novel approach called DistIL, a distributional variant of the DAgger imitation learning algorithm, which utilizes richer forms of feedback such as execution traces and expert corrections. DistIL optimizes a forward cross-entropy loss that allows for effective credit assignment by propagating future expert-student disagreements back to earlier decisions. The authors demonstrate that existing self-distillation methods based on f-divergences, such as reverse KL and Jensen-Shannon divergence, do not guarantee monotonic policy improvement and can lead to suboptimal actions. In contrast, DistIL ensures monotonic improvement and provides theoretical guarantees on regret. Empirical results show that DistIL outperforms RLVR and self-distillation baselines across various domains, including scientific reasoning, coding, and complex mathematical problem-solving.
Methodology
The authors introduce DistIL, which optimizes a forward cross-entropy loss between the feedback-conditioned teacher and the student policy. This method allows for black-box teacher integration and employs future-aware credit assignment to enhance learning efficiency.
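To make the loss concrete, here is a minimal sketch of a forward cross-entropy between a teacher distribution and a student policy at a single decision step. The three-action distributions are invented for illustration, and the sketch ignores how DistIL conditions the teacher on feedback or propagates credit over time.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_cross_entropy(teacher_probs, student_logits):
    """H(teacher, student) = -sum_a teacher(a) * log student(a)."""
    student_probs = softmax(student_logits)
    return -sum(
        t * math.log(s) for t, s in zip(teacher_probs, student_probs) if t > 0
    )


# Hypothetical teacher distribution over three actions.
teacher = [0.7, 0.2, 0.1]

# A student that matches the teacher exactly, and one that prefers the wrong action.
aligned = forward_cross_entropy(teacher, [math.log(p) for p in teacher])
mismatched = forward_cross_entropy(teacher, [0.0, 0.0, 3.0])
print(aligned, mismatched)
```

The loss is smallest when the student matches the teacher, where it equals the teacher's entropy. Because the expectation is taken under the teacher, only the teacher's probabilities are needed, which is consistent with treating the teacher as a black box.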
Results
DistIL consistently outperformed RLVR and self-distillation baselines in validation performance across multiple scientific reasoning domains, demonstrating higher stability and improved metrics such as Pass@N.
Implications
The findings suggest that incorporating richer feedback mechanisms can significantly enhance the performance of reinforcement learning models, particularly in complex reasoning tasks where traditional methods struggle. This approach could be applied in various fields requiring nuanced decision-making and learning from expert feedback.
Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets with Deep Reinforcement Learning
Reinforcement Learning
Time Series
Optimization
- Introduction of a hybrid trading strategy combining statistical arbitrage with DRL execution policies.
- Development of a hierarchical pair selection methodology to isolate high-conviction anomalies.
- Demonstration of significant outperformance of the DRL-enhanced strategy over traditional heuristics.
- Establishment of a framework for safe reinforcement learning via deterministic shielding.
Summary
This study investigates the application of Deep Reinforcement Learning (DRL) to enhance pair trading strategies in the volatile cryptocurrency markets. Traditional pair trading methods, while effective in equities, face challenges in high-variance environments due to rigidity and divergence risks. The authors propose a novel Dynamic Multi-Pair Trading Strategy that incorporates a hierarchical 'Filter-then-Rank' pair selection methodology and a 'Fixed Risk, Adaptive Mean' execution model. A Proximal Policy Optimization (PPO) agent with a Long Short-Term Memory (LSTM) layer is employed to make execution decisions while adhering to strict risk management protocols. The strategy is evaluated using 1-hour interval data from the Binance USD-M Futures market, demonstrating significant outperformance compared to heuristic baselines. The results indicate that the hybrid model effectively captures mean-reverting regimes and adapts to market changes, yielding superior risk-adjusted returns. The study contributes to quantitative finance literature by introducing a hybrid architecture that combines statistical arbitrage with DRL execution policies, providing a framework for safer reinforcement learning in high-noise environments.
Methodology
The methodology involves a hierarchical 'Filter-then-Rank' pair selection framework and a 'Fixed Risk, Adaptive Mean' execution model. A PPO agent with an LSTM architecture is utilized for execution decisions, operating within strict risk management boundaries. The empirical analysis is conducted on 1-hour interval data from the Binance USD-M Futures market, with a focus on Out-Of-Sample performance evaluation.
Results
The optimized RL policy substantially outperformed the heuristic baseline Out-Of-Sample. A robustness check found the agent's risk-adjusted outperformance to be statistically significant at the 10% level, supporting the proposed strategy despite the high volatility of cryptocurrency markets.
Implications
The findings suggest that integrating DRL into pair trading strategies can enhance profitability and risk management in cryptocurrency markets. The proposed framework may serve as a foundation for future research and applications in algorithmic trading, particularly in high-noise environments.
ConTraIRL: Factorized Contrastive Abstractions for Transferable IRL
Reinforcement Learning
Robotics
- Introduces a dual-encoder architecture for factorized representations of dynamics and goals.
- Utilizes a dual contrastive objective to enhance reward transfer in IRL.
- Demonstrates effective few-shot transfer capabilities to unseen dynamics-goal combinations.
- Empirical results show significant improvements over traditional IRL methods.
Summary
The paper presents ConTraIRL, a novel framework aimed at improving reward transfer in Inverse Reinforcement Learning (IRL) by addressing the challenge of compositional generalization. Traditional IRL methods struggle to generalize to unseen combinations of environment dynamics and task goals, leading to unreliable reward recovery. ConTraIRL introduces a dual-encoder architecture that learns decoupled latent representations for dynamics and goals through a dual contrastive objective. This factorization allows for effective reward inference in scenarios where dynamics and goals are recombined in novel ways. The framework incorporates few-shot supervision, utilizing partial expert states from target environments to anchor reward recovery. Empirical evaluations on continuous control benchmarks demonstrate that ConTraIRL significantly outperforms existing transfer IRL baselines, showcasing improved sample efficiency and robustness in reward recovery under unseen dynamics-goal pairings.
Methodology
ConTraIRL employs a dual-encoder architecture that separately encodes dynamics and goals into distinct latent spaces. The training process involves a dual contrastive objective that encourages the dynamics encoder to learn goal-invariant structures while the goal encoder captures dynamics-invariant features. Temporal alignment is also incorporated to ensure that the representations reflect comparable progress within behaviors. The framework is trained across multiple environments with partial expert states available for target contexts, facilitating reward recovery in unseen dynamics-goal pairings.
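One common way to implement a contrastive objective of this kind is an InfoNCE-style loss, sketched below. Whether ConTraIRL uses exactly this form is an assumption; the embeddings, the temperature, and the way positives are chosen are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax score of the positive pair."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Hypothetical dynamics embeddings. For the dynamics encoder, a positive would be
# a trajectory with the same dynamics but a different goal; the goal encoder
# would use the mirrored rule (same goal, different dynamics).
anchor = [1.0, 0.0]
same_dynamics = [0.9, 0.1]
other_dynamics = [[0.0, 1.0], [-1.0, 0.0]]

good = info_nce(anchor, same_dynamics, other_dynamics)
bad = info_nce(anchor, other_dynamics[0], [same_dynamics, other_dynamics[1]])
print(good, bad)
```

The loss is low when the anchor is closest to its positive and high when a negative is closer. Applying one such loss per encoder, with the positive rule swapped, is one way to encourage goal-invariant dynamics codes and dynamics-invariant goal codes.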
Results
The experiments conducted on MuJoCo benchmarks reveal that ConTraIRL consistently outperforms baseline methods in terms of reward recovery and transfer robustness when faced with unseen combinations of dynamics and goals. The results highlight the framework's ability to improve sample efficiency and effectively leverage few-shot supervision.
Implications
ConTraIRL has potential applications in robotics and other domains where agents need to adapt to new environments with varying dynamics and goals. The ability to generalize rewards across different contexts can enhance the efficiency of training autonomous systems and improve their adaptability to real-world scenarios.
dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats
NLP
Large Language Models
Efficient ML
- Introduces a differentiable framework for mixed-precision quantization of floating-point formats.
- Utilizes continuous optimization to avoid abrupt transitions in bit-width assignments.
- Implements a temperature-based annealing mechanism for discretizing learned bit-widths.
- Achieves superior performance over traditional layer-selection heuristics in LLMs.
Summary
The paper introduces dMX, a differentiable mixed-precision quantization framework designed for optimizing the bit-width assignment of low-precision floating-point formats in large language models (LLMs). Traditional quantization methods often apply a uniform bit-width across all layers, which can lead to suboptimal performance and accuracy. dMX addresses this by formulating the per-layer bit-width assignment as a continuous optimization problem, allowing for a smoother optimization landscape. The framework utilizes a temperature-based annealing schedule to progressively discretize the learned offsets, ensuring compatibility with hardware-specific formats while minimizing abrupt transitions between training and inference. A target-aware regularization term is included to guide the average bit-width towards a user-defined budget, balancing model quality with deployment efficiency. Experiments conducted on various LLMs, including Llama and Qwen3, demonstrate that dMX consistently outperforms existing layer-selection heuristics, achieving better trade-offs between model quality and average bit-width.
Methodology
The dMX framework employs a gradient-based optimization approach to learn per-layer floating-point bit-widths. It parameterizes the bit-widths as continuous values, which are then mapped to discrete formats through a temperature-based annealing process. This allows for smooth transitions during training and ensures that the final configurations are compatible with hardware requirements. A regularization term is included to manage the average bit-width according to a specified budget.
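The annealing idea can be sketched as a temperature-controlled soft assignment over candidate bit-widths. The candidate set, the squared-distance scoring, and the quadratic budget penalty below are assumptions chosen for illustration and are not taken from the paper.

```python
import math

CANDIDATE_BITS = [4, 6, 8]  # hypothetical set of hardware-supported bit-widths


def soft_assignment(continuous_bits, temperature):
    """Softmax over negative squared distance to each candidate bit-width."""
    scores = [-((continuous_bits - b) ** 2) / temperature for b in CANDIDATE_BITS]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_bits(continuous_bits, temperature):
    """Soft bit-width: the assignment-weighted average of the candidates."""
    weights = soft_assignment(continuous_bits, temperature)
    return sum(w * b for w, b in zip(weights, CANDIDATE_BITS))


def budget_penalty(per_layer_bits, target_avg):
    """Quadratic penalty on the gap between average bit-width and the budget."""
    avg = sum(per_layer_bits) / len(per_layer_bits)
    return (avg - target_avg) ** 2


# As the temperature drops, the soft value snaps to the nearest candidate.
for temperature in (10.0, 1.0, 0.01):
    print(temperature, round(expected_bits(5.2, temperature), 3))
```

At high temperature the assignment is spread over all candidates, which keeps gradients smooth; at low temperature it approaches a hard choice, so the training-time and inference-time configurations coincide.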
Results
The experiments show that dMX yields Pareto-dominating models across various LLMs, improving perplexity on WikiText-2 and accuracy on multiple zero-shot reasoning benchmarks. The framework outperforms Kullback-Leibler divergence-based heuristics, effectively navigating the trade-offs between model quality and average bit-width.
Implications
The dMX framework has significant implications for the efficient deployment of large language models, enabling better resource utilization while maintaining model performance. Its compatibility with existing quantization methods suggests potential for broader applications in optimizing model architectures for various hardware environments.
Multi-component Causal Tracing in Large Language Models
NLP
Large Language Models
Interpretability
- Introduces a unified framework for multi-component causal tracing in LLMs.
- Develops an efficient algorithm that converts combinatorial search into a continuous optimization problem.
- Demonstrates the identification of critical model components that influence performance metrics.
- Highlights the non-linear interactions among components, challenging previous linear assumptions.
Summary
This paper introduces a unified framework for multi-component causal tracing in large language models (LLMs), addressing the limitations of previous studies that focused on single components. The proposed framework systematically identifies critical subsets of model components, such as attention heads and multi-layer perceptron neurons, that significantly impact performance metrics like accuracy and fairness. By employing flexible interventions and a novel algorithm that transforms the combinatorial search problem into a continuous one, the authors efficiently identify influential components. Experimental results demonstrate that this method outperforms existing baseline approaches in identifying components that enhance target metrics, revealing the non-linear interactions among multiple components that traditional methods may overlook. The findings underscore the importance of a multi-component perspective in causal tracing, providing insights into the underlying mechanisms of LLMs and their behavior.
Methodology
The authors propose a framework for causal tracing that allows for systematic interventions on multiple components of LLMs. They design an efficient algorithm that leverages soft interventions and metric transformations to address the combinatorial complexity of selecting components. This approach enables the identification of subsets of components that collectively maximize a chosen performance metric.
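A toy version of the relaxation is sketched below: each component gets a continuous gate in (0, 1), the gates are optimized by gradient ascent, and the top-k gates are read off as the selected subset. The metric, the sparsity penalty, and the finite-difference gradients are all hypothetical stand-ins; the paper's actual algorithm and metric transformations are not reproduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_metric(gates):
    """Hypothetical metric: components 0 and 1 only help when enabled together."""
    return gates[0] * gates[1] + 0.1 * gates[2]


def objective(thetas, sparsity=0.2):
    """Metric under soft gates, minus a penalty that discourages enabling everything."""
    gates = [sigmoid(t) for t in thetas]
    return toy_metric(gates) - sparsity * sum(gates)


def optimize_gates(num_components, steps=200, lr=1.0, eps=1e-5):
    """Gradient ascent on gate logits using forward finite differences."""
    thetas = [0.0] * num_components
    for _ in range(steps):
        base = objective(thetas)
        grads = []
        for i in range(num_components):
            bumped = list(thetas)
            bumped[i] += eps
            grads.append((objective(bumped) - base) / eps)
        thetas = [t + lr * g for t, g in zip(thetas, grads)]
    return [sigmoid(t) for t in thetas]


gates = optimize_gates(3)
selected = sorted(range(3), key=lambda i: gates[i], reverse=True)[:2]
print(gates, selected)
```

In this toy metric, components 0 and 1 contribute nothing individually but are selected jointly, which is the kind of interaction a single-component analysis would miss.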
Results
The experimental results indicate that the proposed multi-component causal tracing method effectively identifies subsets of model components with significant impacts on target metrics, outperforming existing baseline approaches. The results also reveal pronounced non-linear effects when multiple components are jointly intervened upon, suggesting that interactions among components are crucial for understanding LLM behavior.
Implications
The findings have significant implications for improving the interpretability and performance of LLMs. By systematically analyzing the causal pathways within these models, researchers can better understand and mitigate issues such as bias and factual inaccuracy, leading to safer and more reliable applications of LLMs in various domains.
Regime-Arrival Uncertainty in Generalization Bounds under Distribution Shift
Theory
- Introduces a framework for analyzing generalization bounds under regime-switching environments.
- Quantifies the risk due to regime composition mismatch using a two-state Markov process.
- Extends generalization bounds to beta-mixing data with effective sample size considerations.
- Empirical results show strong correlation between the proposed penalty and deployment gaps.
Summary
This paper addresses the limitations of standard generalization bounds in machine learning, which typically assume static training and deployment distributions. It introduces a framework that accounts for regime-switching environments, specifically focusing on the mismatch between training and deployment distributions characterized by a two-state Markov process (calm and crisis regimes). The author quantifies the additional risk incurred due to this regime composition mismatch and provides an exact decomposition that separates regime mismatch from regime sensitivity. The framework extends existing bounds to beta-mixing data, incorporating an effective sample size adjusted for the spectral gap. Empirical validation on synthetic data and 25 years of global equity indices demonstrates that the proposed penalty, derived from realized future crisis fractions, correlates with actual deployment gaps. However, the framework is not designed for forecasting future regime compositions, highlighting the need for improved forecasting methods in the context of regime changes.
Methodology
The paper develops a theoretical framework based on a two-state Markov process to model regime dynamics. It derives an exact decomposition of future risk related to regime mismatch and establishes upper and lower bounds for deployment risk. The analysis incorporates concepts from domain adaptation and dependent learning theory, utilizing H∆H-divergence for quantifying regime discrepancies.
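The basic quantities of a two-state chain are easy to compute directly, as sketched below. The transition probabilities and regime risks are made-up numbers. The effective-sample-size formula is a standard heuristic for a two-state chain and is not necessarily the paper's definition, and the decomposition shown is the elementary mixture identity rather than the paper's full bound.

```python
def stationary_crisis_fraction(p_calm_to_crisis, p_crisis_to_calm):
    """Long-run fraction of time spent in the crisis state."""
    return p_calm_to_crisis / (p_calm_to_crisis + p_crisis_to_calm)


def second_eigenvalue(p_calm_to_crisis, p_crisis_to_calm):
    """Non-unit eigenvalue of the 2x2 transition matrix."""
    return 1.0 - p_calm_to_crisis - p_crisis_to_calm


def effective_sample_size(n, p_calm_to_crisis, p_crisis_to_calm):
    """Heuristic n * (1 - lam) / (1 + lam) for averages of the regime indicator."""
    lam = second_eigenvalue(p_calm_to_crisis, p_crisis_to_calm)
    return n * (1.0 - lam) / (1.0 + lam)


def mixture_risk(crisis_fraction, risk_calm, risk_crisis):
    """Risk when regimes are mixed in the given proportion."""
    return (1.0 - crisis_fraction) * risk_calm + crisis_fraction * risk_crisis


# Hypothetical numbers: training saw 20% crisis, deployment sees 40%.
risk_calm, risk_crisis = 0.10, 0.35
gap = mixture_risk(0.4, risk_calm, risk_crisis) - mixture_risk(0.2, risk_calm, risk_crisis)

# The same gap as (regime mismatch) x (regime sensitivity).
decomposed = (0.4 - 0.2) * (risk_crisis - risk_calm)

print(stationary_crisis_fraction(0.05, 0.2), effective_sample_size(1000, 0.05, 0.2))
print(gap, decomposed)
```

The identity makes the two factors explicit: a model pays no composition penalty if the crisis fraction is unchanged, or if its risk is the same in both regimes.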
Results
The framework successfully demonstrates that the mismatch between training and deployment regimes increases future deployment risk. The empirical validation shows a Spearman correlation of 0.729 between the penalty computed from realized future crisis fractions and actual train-to-deployment gaps, confirming the theoretical predictions.
Implications
The findings suggest that understanding regime dynamics is crucial for improving model performance in changing environments. The framework serves as a diagnostic tool for analyzing deployment failures rather than a forecasting system, emphasizing the need for better methods to predict future regime compositions.
Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots
Reinforcement Learning
Large Language Models
Efficient ML
- Introduces PivotTrace, a framework for efficient RLVR that selects unlabeled samples without prior supervision.
- Utilizes metacognitive pivots to quantify uncertainty and guide adaptive data routing.
- Achieves superior performance with significantly fewer annotated samples compared to fully supervised models.
- Converges 2.75 times faster during training.
Summary
This paper addresses the challenges of Reinforcement Learning with Verifiable Rewards (RLVR) in training large reasoning models (LRMs) efficiently. Traditional RLVR methods require extensive annotated datasets, which are costly and time-consuming to obtain. The authors propose a novel approach called PivotTrace, which operates under a 'pick in the dark' paradigm, allowing for the selection of unlabeled samples that are most beneficial for training without prior supervision. PivotTrace utilizes a three-way data triage framework that leverages attention dynamics to trace metacognitive pivots, critical points in reasoning where the model reassesses its inferences. By quantifying uncertainty through pivot density, PivotTrace enables automated data routing, enhancing both annotation and training efficiency. Empirical results show that PivotTrace outperforms fully supervised models using only 29.3% of annotated samples and achieves 2.75 times faster convergence, demonstrating its effectiveness in improving RLVR efficiency.
Methodology
The authors developed PivotTrace, which employs a three-way data triage framework that identifies metacognitive pivots during reasoning. It uses peak detection on attention dynamics to derive pivot counts as a proxy for uncertainty and incorporates an automated threshold calibration module for optimal data partitioning. This allows for adaptive routing of data into different training pipelines, enhancing efficiency in both annotation and training.
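A minimal sketch of the counting and routing steps follows. The signal values, the peak rule (a strict local maximum above a height threshold), the two density thresholds, and the bucket names are assumptions for illustration; the paper's attention-derived signal and calibrated thresholds are not reproduced.

```python
def count_pivots(signal, min_height):
    """Count strict local maxima whose value is at least min_height."""
    return sum(
        1
        for i in range(1, len(signal) - 1)
        if signal[i] >= min_height
        and signal[i] > signal[i - 1]
        and signal[i] > signal[i + 1]
    )


def pivot_density(signal, min_height):
    """Pivots per reasoning step, used here as an uncertainty proxy."""
    return count_pivots(signal, min_height) / len(signal)


def triage(density, low, high):
    """Route a sample into one of three buckets by pivot density."""
    if density < low:
        return "low"
    if density > high:
        return "high"
    return "medium"


# Hypothetical per-step attention-shift signal for one reasoning trace.
signal = [0.1, 0.8, 0.2, 0.3, 0.9, 0.1, 0.2, 0.15]
density = pivot_density(signal, min_height=0.5)
print(count_pivots(signal, 0.5), density, triage(density, low=0.1, high=0.3))
```

Each bucket would then feed a different pipeline; how the paper assigns annotation and training to the three buckets is not specified in this summary.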
Results
PivotTrace outperformed the strongest baseline by +1.6% in-domain and +2.4% out-of-domain average accuracy. It surpassed fully supervised training with only 29.3% of the dataset annotated and achieved 2.75 times faster training convergence, showcasing its effectiveness in RLVR.
Implications
The findings suggest that PivotTrace can significantly reduce the costs associated with data annotation and training in RLVR, making it applicable in domains where labeled data is scarce or expensive to obtain, such as healthcare and finance. This approach could lead to more efficient deployment of large reasoning models in real-world applications.
Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees
Reinforcement Learning
Generative Models
Optimization
- Introduces a dual chance-constrained program (CCP) for safety guarantees in RL.
- Utilizes a variational autoencoder (VAE) to encode state-space distributions and barrier certificates.
- Focuses on robust exploration to minimize the risk of encountering unsafe states.
- Demonstrates the effectiveness of the proposed method through experimental results.
Summary
This paper addresses the critical challenge of ensuring safety in reinforcement learning (RL) agents, particularly in real-world applications where policies may fail to generalize to novel scenarios. The authors propose an approach to policy verification that constructs probabilistic barrier certificates, which separate safe from unsafe behavior, by sampling policy trajectories. The method utilizes a variational autoencoder (VAE) to approximate the state-space distribution, allowing for the optimization of upper- and lower-bound barrier certificates. This dual optimization problem enables the identification of regions of known safe behavior with high confidence. The authors introduce a scenario-based approach to robust exploration, focusing on states that lie within the symmetric difference of the upper and lower bounds, thereby tightening the probabilistic guarantees on safety. The paper demonstrates the effectiveness of this approach through experimental validation, showing that the method can improve the robustness of RL policies by generating tentatively unsafe states for further training.
Methodology
The authors formulate a dual optimization problem to create upper- and lower-bound barrier certificates using sampled trajectories from the learned policy. A variational autoencoder (VAE) is employed to approximate the state-space distribution, enabling the identification of safe and unsafe regions. The approach emphasizes robust exploration by sampling states within the symmetric difference of the upper and lower bounds, thereby tightening safety guarantees.
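The symmetric-difference idea can be illustrated with two toy safety predicates on a one-dimensional state. The predicates, their thresholds, and the uniform sampler (standing in for the VAE) are hypothetical and only show which states would be targeted.

```python
import random


def certified_safe_lower(x):
    """Conservative certificate: states it accepts are treated as known safe."""
    return abs(x) <= 1.0


def possibly_safe_upper(x):
    """Optimistic certificate: states it rejects are treated as known unsafe."""
    return abs(x) <= 2.0


def disagreement_states(num_samples, seed=0):
    """Sample states and keep those on which the two certificates disagree."""
    rng = random.Random(seed)
    states = [rng.uniform(-3.0, 3.0) for _ in range(num_samples)]
    return [x for x in states if certified_safe_lower(x) != possibly_safe_upper(x)]


candidates = disagreement_states(1000)
print(len(candidates), min(abs(x) for x in candidates), max(abs(x) for x in candidates))
```

States in this band are the ones whose safety status is still undetermined, so exploring them is what narrows the gap between the two bounds.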
Results
The experimental results indicate that the proposed method effectively narrows the difference between upper and lower bounds on safety constraint violations, leading to sharper probabilistic guarantees. The approach also successfully generates tentatively unsafe states that contribute to the robustness of the RL policy.
Implications
This work has significant implications for the deployment of RL agents in safety-critical domains such as healthcare, aviation, and robotics, where ensuring compliance with safety constraints is paramount. The methodology can enhance the reliability of RL systems in real-world applications by providing a framework for safe exploration and policy verification.
Do Transformers Need Three Projections? Systematic Study of QKV Variants
Efficient ML
Large Language Models
Computer Vision
- Systematic evaluation of QKV projection-sharing strategies across diverse tasks.
- Q-K=V configuration reduces KV cache size by 50% with minimal performance degradation.
- Projection sharing is complementary to head sharing, enabling significant memory efficiency gains.
- Task-dependent efficacy of projection-sharing strategies observed.
Summary
This paper investigates the necessity of the three separate projections (Query, Key, Value) in Transformer architectures, proposing and systematically evaluating three projection-sharing strategies: Q-K=V (shared key-value), Q=K-V (shared query-key), and Q=K=V (single projection). The authors conduct extensive experiments across various tasks, including synthetic reasoning, computer vision, and language modeling, to assess the performance and efficiency of these variants. The findings reveal that certain projection-sharing configurations can significantly reduce parameter counts and computational overhead while maintaining or even improving performance. Notably, the Q-K=V configuration achieves a 50% reduction in KV cache size with only a 3.1% increase in perplexity for large language models. The study also highlights the complementary nature of projection sharing with head sharing techniques, leading to substantial gains in memory efficiency. Overall, the research provides valuable insights into architectural efficiency in Transformers, suggesting that projection sharing is a promising avenue for improving model efficiency, particularly in edge deployment scenarios.
Methodology
The authors benchmarked three projection-sharing architectures across 12 tasks, including synthetic reasoning, computer vision datasets (MNIST, CIFAR, TinyImageNet), and large language model pre-training. They analyzed the performance of these architectures in terms of parameter count, computational efficiency, and memory usage, particularly focusing on KV cache reductions and perplexity metrics.
Results
The study found that the Q-K=V projection-sharing configuration maintained competitive performance with the standard QKV transformer, achieving a 50% reduction in KV cache size and only a 3.1% increase in perplexity for 300M-parameter models. At a larger scale (1.2B parameters), the Q-K=V configuration continued to perform well, and when combined with head sharing techniques, it led to cache reductions of up to 96.9%. The results indicate that projection sharing can reduce memory and computational overhead with only a small loss in model quality.
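The cache arithmetic behind these figures can be sketched as follows. The layer count, head count, head dimension, and sequence length are hypothetical and are not the paper's configurations; only the ratios matter.

```python
def kv_cache_elements(layers, kv_heads, head_dim, seq_len, tensors_per_token):
    """Number of cached values: one tensor each for K and V, or one shared tensor."""
    return layers * kv_heads * head_dim * seq_len * tensors_per_token


def reduction(new, old):
    """Fractional saving relative to the baseline cache."""
    return 1.0 - new / old


# Hypothetical model shape.
layers, heads, head_dim, seq_len = 24, 16, 64, 2048

baseline = kv_cache_elements(layers, heads, head_dim, seq_len, tensors_per_token=2)
shared_kv = kv_cache_elements(layers, heads, head_dim, seq_len, tensors_per_token=1)
shared_kv_one_head = kv_cache_elements(layers, 1, head_dim, seq_len, tensors_per_token=1)

print(reduction(shared_kv, baseline))           # 0.5
print(reduction(shared_kv_one_head, baseline))  # 0.96875
```

Sharing K and V halves the cache regardless of model shape. Head sharing multiplies that saving by the ratio of query heads to cached heads, which is why the two techniques compound; with the assumed 16 heads reduced to one cached head, the combined factor is 32x, or a 96.875% reduction.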
Implications
The findings suggest that adopting projection-sharing strategies can enhance the efficiency of Transformer models, making them more suitable for real-time inference and deployment on edge devices. This research could influence future architectural designs in Transformers and promote further exploration of weight tying in attention mechanisms.
OpenRFM: Dissecting Relational In-Context Learning
Theory
Efficient ML
Graph Learning
- OpenRFM addresses the performance gap between open RFMs and commercial models by enhancing relational in-context learning.
- The dual-stage ICL architecture combines relation-level and batch-level learning to improve label coverage.
- A homophily-aware pre-training approach is introduced, mixing synthetic and real data for better model performance.
- OpenRFM shows an approximately 30% improvement in average task performance over the RT backbone and surpasses KumoRFMv1 in multiple tasks.
Summary
The paper investigates the performance gap between open Relational Foundation Models (RFMs) and their commercial counterparts, focusing on the Relational Transformer (RT) framework. The authors identify two main issues: the model's reliance on relation-level in-context learning (ICL) and the limitations of its pre-training sources. They propose OpenRFM, which incorporates a dual-stage ICL architecture that combines relation-level and batch-level ICL to address label scarcity. Additionally, they introduce a homophily-aware pre-training strategy that mixes synthetic and continual real data, enhancing the model's ability to learn from relational structures. OpenRFM demonstrates a significant performance improvement, achieving approximately 30% better average task performance than the RT backbone and outperforming the commercial model KumoRFMv1 across various evaluation tasks.
Methodology
The authors dissect the RT framework to understand its limitations, focusing on relation-level ICL and the effects of pre-training sources. They propose a dual-stage ICL architecture that integrates a batch-level ICL layer from a pre-trained tabular foundation model. Furthermore, they develop a homophily-aware pre-training strategy that combines synthetic data generation with continual training on real-world databases, supported by prototype-based regularization.
Results
OpenRFM achieves an average task performance improvement of approximately 30% over the RT backbone and outperforms the commercial model KumoRFMv1 across a wide range of evaluation tasks, demonstrating its effectiveness in relational predictive modeling.
Implications
The findings suggest that enhancing relational in-context learning and improving pre-training strategies can significantly boost the performance of open RFMs, making them more competitive with commercial models. This has implications for various applications in predictive modeling across domains such as healthcare, finance, and recommendation systems.
Mitigating Spurious Correlations with Memorization-Guided Dataset De-Biasing
Computer Vision
Theory
Efficient ML
- Proposes a two-stage sample scoring function to mitigate spurious correlations in datasets.
- Introduces the TCSL and TCSL-CS algorithms for effective sample selection.
- Demonstrates improved worst-group accuracy using only 10% of the original training data.
- Highlights the limitations of existing coreset selection methods in handling spurious features.
Summary
This paper addresses the challenge of spurious correlations in real-world datasets that can mislead machine learning models, particularly affecting the classification of minority samples. The authors propose a novel two-stage sample scoring function that disentangles the learning dynamics of core and spurious features, allowing for a more accurate evaluation of their difficulty. This approach leads to the development of the Two-Stage Cumulative Sample Loss (TCSL) and the TCSL-guided Coreset Selection (TCSL-CS) algorithm, which prioritizes informative samples regardless of the presence of spurious correlations. The experiments conducted demonstrate that models trained on samples selected using TCSL-CS achieve superior performance compared to existing debiasing techniques, while utilizing only 10% of the original training data. This work highlights the inadequacies of traditional coreset selection methods on datasets with spurious correlations and provides a promising solution that maintains standard empirical risk minimization (ERM) training.
Methodology
The authors developed a two-stage sample scoring function that separately evaluates core and spurious features. The TCSL algorithm computes scores for each feature type, which are then used in the TCSL-CS algorithm to select a coreset of informative samples. This method does not require access to group labels, making it applicable in scenarios where such labels are unavailable.
Results
The experiments showed that models trained on the coreset selected by the TCSL-CS algorithm achieved state-of-the-art worst-group accuracy on the Waterbirds dataset, outperforming existing debiasing methods while using only 10% of the total training data.
Implications
This research has significant implications for improving model robustness in real-world applications where spurious correlations are prevalent. It provides a framework for better sample selection that can enhance model performance across diverse groups, potentially leading to fairer and more accurate machine learning systems.
MAdam: Metric-Aware Multi-Objective Adam
Optimization
- Identification of systematic mismatches between MOO solvers and the Adam optimizer.
- Introduction of MAdam, a metric-aware optimization method that resolves these mismatches.
- Demonstrated improvements in performance across multiple application domains.
- MAdam maintains compatibility with existing MOO solvers and the Adam optimizer.
Summary
The paper introduces MAdam, a novel optimization method designed to address systematic mismatches between multi-objective optimization (MOO) solvers and the Adam optimizer. The authors identify two primary issues: a weighting mismatch, where Adam's second-moment denominator conflates the time-varying preference vector with gradient statistics, and a geometric mismatch, where Adam's adaptive metric distorts the Euclidean geometry assumed by MOO solvers. MAdam serves as a drop-in wrapper that preconditions the reconciled direction using a preference-conditioned curvature of the scalarized objective, allowing Adam to operate with a more appropriate metric. The empirical validation demonstrates that MAdam consistently outperforms Adam across various applications, including multi-task learning, Pareto-front recovery, physics-informed neural networks, and medical imaging, indicating its effectiveness in improving the performance of MOO solvers.
Methodology
MAdam operates by deriving the preference-conditioned diagonal Fisher information matrix of the scalarized objective and applying it as a preconditioner to the reconciled direction before the Adam update. This approach ensures that the first and second moments of Adam remain consistent while allowing the update to be governed by the appropriate preference-conditioned metric.
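The order of operations described above (precondition the direction, then hand it to an unmodified Adam update) can be sketched as below. This is not the paper's exact update: the Fisher proxy (a preference-weighted sum of squared task gradients), the damping term, the use of a plain weighted sum as the "reconciled direction", and all numbers are assumptions for illustration.

```python
import math


def scalarized_gradient(task_grads, preferences):
    """Preference-weighted sum of per-task gradients (stand-in for an MOO solver)."""
    dim = len(task_grads[0])
    return [sum(w * g[i] for w, g in zip(preferences, task_grads)) for i in range(dim)]


def diagonal_fisher_proxy(task_grads, preferences):
    """Assumed proxy: preference-weighted sum of squared per-task gradients."""
    dim = len(task_grads[0])
    return [sum(w * g[i] ** 2 for w, g in zip(preferences, task_grads)) for i in range(dim)]


def precondition(direction, fisher_diag, damping=1e-3):
    """Rescale each coordinate by the inverse of the damped diagonal metric."""
    return [d / (f + damping) for d, f in zip(direction, fisher_diag)]


def adam_step(params, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update with bias correction."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    params = [p - lr * mh / (math.sqrt(vh) + eps) for p, mh, vh in zip(params, m_hat, v_hat)]
    return params, m, v


# Hypothetical gradients for two tasks and a preference vector over them.
task_grads = [[1.0, -2.0], [3.0, 0.5]]
preferences = [0.25, 0.75]

direction = scalarized_gradient(task_grads, preferences)
fisher = diagonal_fisher_proxy(task_grads, preferences)
preconditioned = precondition(direction, fisher)
params, m, v = adam_step([0.0, 0.0], preconditioned, [0.0, 0.0], [0.0, 0.0], t=1)
print(direction, fisher, params)
```

Because Adam's moment estimates are computed from the already preconditioned direction, the Adam code itself is untouched, which matches the "drop-in wrapper" description.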
Results
The empirical results show that MAdam consistently improves upon the standard Adam optimizer across various multi-objective optimization tasks, including multi-task learning, Pareto-front recovery, physics-informed neural networks, and medical imaging applications, indicating its robustness and versatility.
Implications
MAdam's approach could significantly enhance the performance of multi-objective optimization in various fields, including machine learning, medical imaging, and any domain requiring the simultaneous optimization of multiple objectives. Its compatibility with existing solvers makes it a practical choice for researchers and practitioners.
Topology-Aware Gaussian Graph Repair for Robust Graph Neural Networks
Graph Learning
- TAGR improves GNN robustness against noisy and missing edges.
- The framework combines a Gaussian feature-neighborhood graph with a topology-aware residual correction.
- TAGR maintains compatibility with standard GNN architectures.
- The method is lightweight and avoids the complexities of dense graph structure learning.
Topology-Aware Gaussian Graph Repair for Robust Graph Neural Networks
Summary
The paper addresses the challenges that imperfect graph topologies, including noisy and missing edges, pose for Graph Neural Networks (GNNs). Existing methods primarily focus on edge removal or on learning new graph structures, which can be complex and may not restore missing connections. The authors propose Topology-Aware Gaussian Graph Repair (TAGR), a framework that improves GNN robustness by constructing a sparse feature-neighborhood graph with an adaptive Gaussian kernel. This introduces auxiliary edges between feature-similar nodes and applies a topology-aware residual correction to the observed graph, preserving its original topology while reweighting it according to local feature and structural consistency. TAGR can be integrated directly with standard GNN architectures without modification. Experimental results on benchmark citation networks show that TAGR significantly improves GNN robustness under noisy and missing edges. The findings indicate that the Gaussian feature-neighborhood repair is the main source of robustness, while the residual correction contributes to stability on incomplete graphs, suggesting that graph robustness can be achieved through lightweight sparse graph repair rather than complex dense graph structure learning.
Methodology
The TAGR framework constructs a sparse adjacency matrix before GNN training, utilizing an adaptive Gaussian kernel to create a feature-neighborhood graph. It introduces auxiliary edges based on feature similarity and applies a residual correction to adjust the weights of observed edges according to local feature agreement and structural consistency.
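A minimal sketch of the feature-neighborhood step, under stated assumptions: the bandwidth of each node is taken to be its distance to its k-th nearest neighbor, and the repaired adjacency is simply the observed graph plus a scaled copy of the feature graph. The function names and the mixing weight `lam` are hypothetical, and the topology-aware residual correction is not reproduced here.

```python
import numpy as np

def gaussian_feature_graph(X, k=3):
    """Sparse kNN graph with Gaussian weights and a per-node bandwidth.

    Assumption: node i uses sigma_i^2 = squared distance to its k-th nearest
    neighbor; the paper's adaptive rule may differ.
    """
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)                  # exclude self-loops
    nbrs = np.argsort(d2, axis=1)[:, :k]          # k nearest neighbors per node
    sigma2 = d2[np.arange(n), nbrs[:, -1]] + 1e-12
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(-d2[rows, cols] / sigma2[rows])
    return np.maximum(W, W.T)                     # symmetrize

def repair_adjacency(A_obs, X, k=3, lam=0.5):
    """Add auxiliary feature-similarity edges to the observed adjacency."""
    return A_obs + lam * gaussian_feature_graph(X, k)
```

Because the result is an ordinary weighted adjacency matrix built before training, it could be passed to a standard GNN layer unchanged, which matches the drop-in claim in the summary.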
Results
Extensive experiments reveal that TAGR enhances the robustness of GNNs under both noisy-edge and missing-edge conditions. The Gaussian feature-neighborhood repair was found to be the primary contributor to robustness, while the topology-aware residual correction improved stability in incomplete graphs.
Implications
The proposed TAGR framework can be applied in various domains where graph data is prevalent, such as social networks, citation networks, and recommendation systems, providing a more reliable foundation for GNN applications in real-world scenarios.
GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning
NLP
Large Language Models
Optimization
- GRZO reduces variance in zeroth-order optimization by utilizing multiple pseudo-independent perturbations per mini-batch.
- The method maintains inference-level memory efficiency while improving convergence rates for large language models.
- Experimental results show significant improvements in accuracy and memory usage compared to existing ZO methods like MeZO.
- GRZO can be integrated with other optimization techniques to further enhance performance.
GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning
Summary
The paper introduces GRZO, a Group-Relative Zeroth-Order (ZO) optimizer designed to improve the fine-tuning of large language models (LLMs) by addressing the high variance of traditional ZO optimization methods. GRZO employs a single pseudo-independent perturbation per mini-batch example and aggregates the per-example losses through group-relative normalization. This increases the gradient-direction count from one to the batch size without additional forward computation, so memory use stays at inference level. The authors prove that GRZO is directionally unbiased and exhibits variance reduction proportional to the batch size, leading to a tighter nonconvex convergence bound than existing methods such as MeZO. Experimental results demonstrate that GRZO improves accuracy on various LLMs, achieving a +3.0 increase in average accuracy on Llama3-8B while using 23% less peak GPU memory. Furthermore, GRZO serves as a drop-in replacement for MeZO, improving the performance of sparse, low-rank, and quantized ZO variants by an average of +6.0.
Methodology
GRZO employs a two-forward-pass strategy that draws one pseudo-independent perturbation for each mini-batch example, giving as many perturbation directions per mini-batch as there are examples. It aggregates the resulting per-example losses using group-relative normalization, which reduces variance without additional forward passes. The theoretical framework establishes the directional unbiasedness of the method and quantifies the variance reduction in relation to batch size.
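The idea can be sketched on a toy linear model. This is not the authors' implementation: Gaussian perturbations, two-point loss differences, and a GRPO-style standardization (subtract the batch mean, divide by the batch standard deviation) are all assumptions made for illustration, as are the function names.

```python
import numpy as np

def example_losses(Theta, X, y):
    """Squared error of a linear model; row i of Theta is used for example i."""
    return 0.5 * (np.sum(X * Theta, axis=1) - y) ** 2

def group_relative_zo_direction(theta, X, y, eps=1e-3, rng=None):
    """Toy zeroth-order direction with one perturbation per example."""
    rng = np.random.default_rng() if rng is None else rng
    B, d = X.shape
    Z = rng.normal(size=(B, d))                       # one direction per example
    loss_plus = example_losses(theta + eps * Z, X, y)   # first pass over the batch
    loss_minus = example_losses(theta - eps * Z, X, y)  # second pass over the batch
    diff = (loss_plus - loss_minus) / (2 * eps)       # directional derivatives
    adv = (diff - diff.mean()) / (diff.std() + 1e-8)  # group-relative scores
    # Average of score-weighted directions; a descent step subtracts this.
    return (adv[:, None] * Z).mean(axis=0)
```

Both passes touch each example once, so the batch yields as many directions as it has examples for the cost of two evaluations of the batch. In an LLM the per-example perturbations have to be realized inside shared forward passes, which this toy model sidesteps.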
Results
GRZO outperformed MeZO and its variants across multiple tasks and language models, notably improving average accuracy on Llama3-8B by +3.0 while reducing peak GPU memory usage by 23%. As a drop-in replacement for MeZO, it also enhanced the performance of sparse, low-rank, and quantized ZO variants by +6.0 on average.
Implications
The development of GRZO has significant implications for the efficient fine-tuning of large language models, particularly in scenarios where memory constraints are critical. Its ability to reduce variance and improve convergence can lead to more effective training processes in various NLP applications.
Demystifying Pipeline Parallelism: First Theory for PipeDream
Theory
Optimization
Efficient ML
- Introduction of Randomized PipeDream (RPD) with a nonconvex convergence guarantee for pipeline model parallelism.
- Analysis of the scaling behavior of PipeDream, showing that delay grows quadratically with the number of stages.
- Comparison of PipeDream and LocalSGD, highlighting performance differences based on the training objective.
- Experimental results indicating that the choice of method depends on the specific task and scaling conditions.
Demystifying Pipeline Parallelism: First Theory for PipeDream
Summary
This paper addresses the challenges of pipeline model parallelism in training large-scale machine learning models, focusing on the PipeDream (PD) method. The authors introduce Randomized PipeDream (RPD), a theoretical framework that provides the first nonconvex convergence guarantees for PD-style methods. They analyze the scaling behavior of PD, demonstrating that the delay induced by steady-state PD grows quadratically with the number of stages, leading to a significant impact on convergence rates. Additionally, the paper compares PD with LocalSGD, revealing that the performance of each method varies depending on the specific objective and scaling regime. Through simulated-time experiments, the authors find that PD excels in quadratic objectives and language modeling tasks, while LocalSGD performs better in logistic regression as the number of stages increases. Overall, the paper contributes to a deeper understanding of pipeline parallelism and its optimization challenges, providing insights that can guide future developments in distributed training strategies.
Methodology
The authors develop a theoretical framework for Randomized PipeDream (RPD) to analyze the convergence behavior of pipeline parallelism. They derive convergence rates and scaling laws based on the number of stages in the pipeline. Simulated-time experiments are conducted to compare the performance of PD and LocalSGD across different objectives.
Results
The paper establishes that the convergence rate of PD is affected by the number of stages, with the stale-read contribution growing as Θ(γ²S⁴). In experiments, PD outperforms LocalSGD on quadratic objectives and language modeling tasks, while LocalSGD shows superior performance in logistic regression as the number of stages increases.
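Why stale reads matter can be seen in a toy delayed-gradient iteration. The sketch below is not RPD or PipeDream; it applies gradient descent to f(x) = x²/2 with gradients evaluated at an iterate that is `delay` steps old (the function name, the step size, and the delays are illustrative assumptions).

```python
import numpy as np

def delayed_gradient_descent(delay, step_size=0.1, steps=3000):
    """Iterate x_{t+1} = x_t - step_size * f'(x_{t-delay}) for f(x) = x^2 / 2."""
    history = [1.0] * (delay + 1)       # x is held at 1.0 until gradients arrive
    for _ in range(steps):
        stale_x = history[-(delay + 1)]  # iterate read `delay` steps ago
        history.append(history[-1] - step_size * stale_x)
    return np.array(history)

fresh = delayed_gradient_descent(delay=0)    # converges to 0
stale = delayed_gradient_descent(delay=20)   # same step size, now unstable
```

With no delay the iterate contracts by a factor of 0.9 per step, while the same step size with a delay of 20 produces growing oscillations. A delay that grows with the number of stages therefore forces a smaller step size to keep training stable.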
Implications
The findings provide a theoretical foundation for understanding the optimization behavior of pipeline parallelism, which can inform the design of more efficient distributed training algorithms. The insights gained may lead to improved training strategies for large-scale machine learning models, particularly in scenarios where model parallelism is necessary.
Multi-Modal Machine Learning for Breast Cancer Recurrence Prediction
Multimodal
- Multi-modal integration of clinical data improves breast cancer recurrence prediction accuracy.
- The study employs a rule-based extraction mechanism to recover tumor characteristics from unstructured data.
- Performance is benchmarked against traditional single-source models, showing significant enhancements.
- Data harmonization addresses issues of fragmentation and inconsistency in electronic health records.
Multi-Modal Machine Learning for Breast Cancer Recurrence Prediction
Summary
This paper addresses the challenge of predicting breast cancer recurrence, a significant factor in long-term mortality among survivors. Traditional predictive models often rely solely on structured data, which limits their ability to capture the full clinical context. The authors propose a multi-modal machine learning framework that integrates various clinical data sources, including treatment records, pathology reports, and clinician notes. By employing a rule-based regular expression extraction mechanism combined with a precedence-based conflict reconciliation strategy, the study effectively retrieves critical tumor characteristics from unstructured pathology narratives. The performance of this multi-modal approach is benchmarked against existing single-source models, demonstrating that the integration of diverse data types consistently enhances predictive accuracy. The research utilizes data from the University of Tennessee Medical Center and highlights the importance of harmonizing fragmented clinical data to improve the completeness and reliability of predictive models. Overall, the findings suggest that multi-modal integration can significantly advance breast cancer recurrence prediction, offering a more nuanced understanding of patient risk.
Methodology
The authors developed a multi-modal data harmonization framework that integrates treatment summaries, registry abstracts, and pathology reports. They utilized rule-based regular expression extractors to extract prognostic variables from unstructured text and applied precedence-based logic to resolve discrepancies across data sources. Various machine learning models were evaluated to assess the predictive performance of the integrated data.
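The extraction and reconciliation steps can be sketched with the standard `re` module. The patterns, field names, and precedence order below are hypothetical stand-ins; the paper's actual rules are not given in this summary.

```python
import re

# Hypothetical patterns; a real pathology corpus would need many more variants.
PATTERNS = {
    "tumor_size_cm": re.compile(r"tumou?r size[:\s]+(\d+(?:\.\d+)?)\s*cm", re.I),
    "er_status": re.compile(r"\bER[:\s-]+(positive|negative)\b", re.I),
}

# Assumed precedence: earlier sources win when values disagree.
SOURCE_PRECEDENCE = ["pathology_report", "registry_abstract", "treatment_summary"]

def extract(text):
    """Return every field whose pattern matches the free text."""
    found = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[field] = match.group(1).lower()
    return found

def reconcile(documents):
    """Merge per-source extractions, keeping the highest-precedence value."""
    merged = {}
    for source in SOURCE_PRECEDENCE:
        for field, value in extract(documents.get(source, "")).items():
            merged.setdefault(field, (value, source))  # first source seen wins
    return merged
```

Storing the winning source next to each value keeps the reconciliation auditable, which matters when structured and unstructured records disagree.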
Results
The results indicate that the multi-modal approach consistently outperforms single-modal methods in predictive accuracy for breast cancer recurrence. The integration of structured and unstructured data sources leads to improved data completeness and enhances the model's ability to capture critical clinical nuances.
Implications
The findings suggest that adopting multi-modal machine learning frameworks can significantly improve risk assessment in breast cancer care, leading to better-informed treatment decisions and personalized patient management strategies. This approach could be extended to other areas of oncology and healthcare where data fragmentation is a challenge.
ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services
Large Language Models
Efficient ML
Optimization
- ReLoRA enables efficient adaptation of LoRA adapters for evolving LLMs.
- The framework utilizes Bayesian optimization for adaptive initialization.
- Scheduled regularization is employed to enhance fine-tuning efficiency.
- ReLoRA significantly reduces time-to-readiness and improves accuracy.
ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services
Summary
The paper presents ReLoRA, a novel framework designed to facilitate the rapid adaptation of Low-Rank Adaptation (LoRA) adapters for evolving Large Language Models (LLMs). As LLMs are frequently updated, existing LoRA adapters may become incompatible, necessitating a costly retraining process. ReLoRA addresses this challenge by introducing a two-step optimization approach: first, it employs Bayesian optimization to create a compatibility-aware initialization for the LoRA adapter that integrates knowledge from both the previous adapter and the updated model. Second, it fine-tunes the adapter using scheduled regularization, which initially applies strong regularization to quickly reach a high-quality performance region, followed by relaxed regularization for task-specific adjustments. The experimental results demonstrate that ReLoRA significantly reduces the time required for service readiness by up to 8.9 times and enhances accuracy by up to 4.6% compared to traditional methods. This approach not only streamlines the adaptation process but also maintains or improves service quality, making it a valuable solution for service providers managing multiple LLM applications.
Methodology
ReLoRA consists of two main components: (1) Adaptive LoRA initialization, which uses Bayesian optimization to create a starting point that accounts for both the previous adapter and the changes in the base model, and (2) Fine-tuning with scheduled regularization, which first applies strong regularization to quickly navigate to a high-quality performance region, followed by a phase of relaxed regularization for fine-tuning.
Results
The implementation of ReLoRA resulted in a reduction of time-to-readiness by up to 8.9 times and an improvement in accuracy by up to 4.6% compared to baseline methods, demonstrating its effectiveness in quickly adapting to evolving LLM services.
Implications
ReLoRA has significant implications for service providers utilizing LLMs, as it allows for faster updates and maintenance of task-specific services without the need for extensive retraining, thereby reducing operational costs and improving service delivery.
Identifying Gems from Roman RAPIDly
Time Series
- Introduction of RuBR model for classifying astronomical transients.
- Development of three model variations to handle different training scenarios.
- Methodology emphasizes domain adaptation for transitioning to real data.
- Experimental results show effectiveness in distinguishing genuine detections.
Identifying Gems from Roman RAPIDly
Summary
The paper presents a machine learning framework for identifying genuine astronomical transients and variable objects among the alerts generated by the upcoming Nancy Grace Roman Space Telescope (Roman). Scheduled for launch in September 2026, Roman will conduct extensive infrared imaging surveys, producing millions of transient detections. Because no real Roman data exist yet, automated pipelines for filtering out spurious detections have to be developed on simulations. The authors introduce a model named RuBR, which includes three variations: RuBRcomb, RuBRloc, and RuBRDA, each trained on simulated data to distinguish between real and bogus detections. The methodology uses domain adaptation techniques to prepare for the transition to real observational data, which is expected to differ from the simulated datasets. The experimental results indicate that the proposed models classify transients and variables effectively, demonstrating their potential for robust real-bogus classification in the context of the Roman mission. This work lays the groundwork for future astronomical studies and is intended to let Roman identify and classify transients as soon as it begins operations.
Methodology
The authors developed a machine learning model called RuBR, which includes three variations: RuBRcomb (trained on combined simulated transients), RuBRloc (trained on locally injected transients), and RuBRDA (combining local and some simulated transients for domain adaptation). The models were trained using simulated data to prepare for the classification of real observations from the Roman telescope, employing techniques to adapt to the expected differences in real data artifacts.
Results
The experimental results demonstrated that the RuBR models effectively classify genuine transients from bogus detections, outperforming several baseline models. The models showed promise in handling the challenges posed by the high density of sources and the imbalance between genuine and spurious detections.
Implications
The findings have significant implications for the upcoming Roman Space Telescope mission, ensuring that astronomers can quickly and accurately identify and classify transient events. This will facilitate timely follow-up observations and enhance the understanding of various astronomical phenomena.
Be Fair! Can Machine Learning Engineering Agents Adhere to Fairness Constraints?
Theory
- Existing benchmarks for MLE agents do not adequately assess their adherence to fairness and responsibility constraints.
- The proposed evaluation framework emphasizes domain-centric design and the impact of technical expertise on MLE agent usability.
- Exploratory experiments show that MLE agents underperform compared to human-designed pipelines in terms of fairness and predictive quality.
- The study highlights the importance of human oversight in the development of ML pipelines, especially in sensitive applications.
Be Fair! Can Machine Learning Engineering Agents Adhere to Fairness Constraints?
Summary
This paper investigates the ability of Machine Learning Engineering (MLE) agents to adhere to fairness constraints in sensitive domains, particularly in the context of melanoma classification. The authors highlight a responsibility gap where end-users may lack insight into the design choices of MLE agents, which can impact the correctness, robustness, and fairness of ML pipelines. They propose a new evaluation framework focused on responsibility properties and conduct an exploratory study comparing two recent MLE agents against manually designed baselines. The findings reveal that agent-generated pipelines exhibit high variance and consistently underperform in both predictive quality and fairness, even when prompted to prioritize fairness. The authors emphasize the need for further research to redesign MLE agents to better incorporate human oversight and ensure compliance with fairness standards.
Methodology
The authors propose a responsibility-centered evaluation framework and conduct an exploratory study on melanoma classification. They evaluate two MLE agents against manually designed baselines, focusing on fairness across skin tones as a key responsibility constraint.
Results
The exploratory study found that the pipelines generated by MLE agents showed high variance and consistently underperformed compared to manually designed baselines in both predictive quality and fairness, despite being prompted to prioritize fairness.
Implications
The findings suggest that MLE agents may not be suitable for deployment in sensitive domains without significant redesign to incorporate human oversight and ensure adherence to fairness and regulatory standards. This has implications for the development and deployment of ML systems in healthcare and other regulated fields.
LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models
Efficient ML
Theory
- Introduction of RaBEL, a Radial Basis Embedding Layer that improves feature representation and conditioning.
- Reordering of attention mechanisms to enhance the aggregation of column-level statistics before feature-level attention.
- LimiX-2M achieves superior performance compared to larger models while being more computationally efficient.
- Identification and quantification of low-rank collapse issues in traditional tabular foundation models.
LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models
Summary
The paper introduces LimiX-2M, a tabular foundation model (TFM) designed to address two inefficiencies in existing models: low-rank collapse and attention bottlenecks. Traditional TFMs often use linear scalar tokenization, which limits the expressiveness of feature representations and leads to redundant hidden states. The authors propose a unified tokenize-and-route framework that includes a Radial Basis Embedding Layer (RaBEL) to improve the conditioning and effective rank of shallow layers by expanding scalar inputs into localized RBF features. Additionally, they reorder the attention blocks so that sample-attention comes before feature-attention, allowing column-level statistics to be aggregated before feature mixing. The resulting architecture, LimiX-2M, is a 2M-parameter model that outperforms larger models such as TabPFN-v2 and TabICL on standard tabular benchmarks while also reducing computational costs. The findings suggest that improved tokenization and attention routing can significantly improve the accuracy-efficiency trade-off in TFMs.
Methodology
The authors developed a new embedding layer (RaBEL) that replaces traditional linear projections with localized RBF features, enhancing the model's ability to capture complex relationships in data. They also redesigned the attention mechanism to prioritize sample-level attention before feature-level attention, allowing for better contextual understanding and feature aggregation.
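The rank argument can be checked numerically. The sketch below is not RaBEL itself; it compares a linear scalar tokenizer with a fixed Gaussian RBF expansion on a single column (evenly spaced centers, a shared width `sigma`, and the function names are assumptions made for illustration).

```python
import numpy as np

def linear_tokens(x, dim, rng):
    """Linear scalar tokenization: each scalar x becomes x * w + b."""
    w, b = rng.normal(size=dim), rng.normal(size=dim)
    return x[:, None] * w[None, :] + b[None, :]

def rbf_tokens(x, centers, sigma):
    """Localized Gaussian RBF features of a scalar input."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 50)          # one numeric column, 50 rows
centers = np.linspace(-2.0, 2.0, 8)     # assumed fixed, evenly spaced centers
rank_linear = np.linalg.matrix_rank(linear_tokens(x, dim=8, rng=rng))
rank_rbf = np.linalg.matrix_rank(rbf_tokens(x, centers, sigma=0.3))
```

Across rows, the linear tokens span at most two directions (`w` and `b`) regardless of the embedding width, whereas the RBF tokens use all eight dimensions. This is the sense in which a localized expansion raises the effective rank of the first layer.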
Results
LimiX-2M, with only 2 million parameters, outperformed larger models such as TabPFN-v2 (7 million parameters) and TabICL (27 million parameters) across various tabular benchmarks, demonstrating significant improvements in both accuracy and computational efficiency.
Implications
The advancements presented in LimiX-2M could lead to more efficient and effective tabular data processing in various applications, including finance, healthcare, and any domain relying on structured data analysis. The findings also suggest pathways for future research in improving model architectures for tabular data.
The Right Measure for Physics-Constrained Generation: A Co-Area Correction for Posterior-Consistent PDE Inverse Problems
Generative Models
Theory
- Identifies a systematic bias in physics-constrained generative posteriors due to the omission of the co-area correction.
- Demonstrates that the bias can inflate posterior errors significantly, particularly in heterogeneous constraint sensitivities.
- Introduces CoCoS, a new constrained sampler that accurately targets the correct co-area posterior.
- Validates the necessity of the Fixman correction through controlled experiments against an i.i.d. ground-truth arbiter.
The Right Measure for Physics-Constrained Generation: A Co-Area Correction for Posterior-Consistent PDE Inverse Problems
Summary
This paper addresses a bias that arises when generative models are used to solve partial differential equation (PDE) inverse problems under hard PDE constraints without the co-area correction. The authors argue that conditioning a generative prior on a hard PDE constraint leads to sampling from the wrong distribution because the constraint manifold has measure zero. Existing methods, which project or guide samples onto the constraint set, omit a Jacobian factor that accounts for the sensitivity of the constraint. This omission can inflate posterior errors significantly, and the authors show that the bias grows with the heterogeneity of the constraint sensitivity. To rectify this, they introduce CoCoS, a measure-aware constrained sampler that targets the correct co-area posterior. The paper validates the necessity of this correction, which it refers to as the Fixman correction, through controlled experiments showing that omitting it can lead to substantial errors in uncertainty quantification. The findings emphasize that satisfying physical constraints does not equate to correctly sampling the posterior, and they provide a principled approach for uncertainty-aware scientific inference.
Methodology
The authors analyze the effects of conditioning on a hard PDE constraint and derive the necessary co-area correction using measure theory. They introduce CoCoS, a constrained sampler that incorporates the Fixman correction to target the correct posterior distribution. The methodology includes controlled experiments to validate the bias introduced by existing methods and to demonstrate the effectiveness of CoCoS.
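For reference, the standard co-area formula behind such a correction can be written as follows. The notation is chosen here and is not taken from the paper: p is the prior density on R^n, h maps R^n to R^m and encodes the constraint h(x) = 0, J_h is its Jacobian, and H^{n-m} is the (n-m)-dimensional Hausdorff measure.

```latex
% Co-area formula for a smooth constraint map h : R^n -> R^m
\int_{\mathbb{R}^n} f(x)\,\sqrt{\det\!\big(J_h(x)\,J_h(x)^{\top}\big)}\;dx
  \;=\; \int_{\mathbb{R}^m} \int_{h^{-1}(c)} f(x)\; d\mathcal{H}^{n-m}(x)\; dc

% Consequence: density of the prior conditioned on h(x) = 0,
% taken with respect to Hausdorff measure on the constraint set
\pi\big(x \mid h(x) = 0\big) \;\propto\;
  \frac{p(x)}{\sqrt{\det\!\big(J_h(x)\,J_h(x)^{\top}\big)}}
```

A sampler that targets p restricted to the constraint set without the denominator is unbiased only when the Jacobian factor is constant along the set, which is consistent with the reported bias growing as the constraint sensitivity becomes more heterogeneous.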
Results
The results indicate that omitting the co-area correction can inflate posterior errors to as much as 20 times the sampling noise floor. The minimal-displacement projection method is shown to be biased at 9 times the noise floor. CoCoS matches the gold-standard posterior to within sampling noise, confirming the importance of the Fixman correction.
Implications
The findings have significant implications for scientific inference in fields governed by PDEs, suggesting that existing generative models may require adjustments to ensure accurate uncertainty quantification. The introduction of CoCoS provides a new tool for researchers to improve the reliability of their models in physics-constrained scenarios.
The Impact of Temporal Granularity on Socio-Demographic Inference from Household Load Profiles
Time Series
- Coarser temporal granularity reduces predictive accuracy but reveals stable performance plateaus.
- Handcrafted and tsfresh features are competitive with CNN-based embeddings, with XGBoost being the most effective classifier.
- Static attributes can be inferred from coarse data, while dynamic attributes require fine-grained signals.
- The study provides insights into the privacy-utility trade-off in smart metering.
The Impact of Temporal Granularity on Socio-Demographic Inference from Household Load Profiles
Summary
This paper investigates how the temporal granularity of household load profiles affects the predictability of socio-demographic attributes, addressing a gap in the understanding of privacy risks associated with smart meter data. The authors analyze load profiles with granularities ranging from 15 minutes to 7 days, using a dataset of 1,589 households over one year. They introduce an evaluation framework that trains classifiers on year-round data and tests them on arbitrary weeks, which requires generalization across seasonal and weekly variations. The findings reveal that while coarser granularity reduces predictive accuracy, there are stable performance plateaus from 15 minutes to 1 hour and again from 1 to 7 days, indicating opportunities for data minimization without significant loss of utility. The study also compares feature extraction methods, finding that handcrafted features and tsfresh features perform competitively with CNN-based embeddings, while XGBoost consistently outperforms other classifiers. Additionally, feature importance analysis shows that static attributes like dwelling size can be inferred from coarse data, whereas dynamic attributes like swimming pool usage require finer granularity. Overall, the research highlights the interplay between temporal resolution, feature extraction, and classifier choice in socio-demographic inference, contributing to the understanding of the privacy-utility trade-off in smart metering.
Methodology
The authors systematically analyzed household load profiles at varying temporal granularities (15 minutes to 7 days) to predict eight socio-demographic attributes. They trained classifiers on a year-long dataset and tested them on arbitrary weeks to assess generalization. Several feature extraction methods were compared, including handcrafted features, tsfresh, and CNN-based embeddings, alongside multiple machine learning models for robustness testing.
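Coarsening a load profile can be sketched in a few lines of NumPy. Averaging within each interval is an assumption made here (summing energy would be the other natural choice), and the function name is hypothetical.

```python
import numpy as np

def downsample(load_15min, factor):
    """Average consecutive 15-minute readings into coarser intervals.

    factor=4 gives hourly values and factor=96 daily values. Trailing
    readings that do not fill a complete interval are dropped.
    """
    load_15min = np.asarray(load_15min, dtype=float)
    usable = (load_15min.size // factor) * factor
    return load_15min[:usable].reshape(-1, factor).mean(axis=1)
```

For one week of 15-minute data (672 readings), `downsample(week, 4)` returns 168 hourly values and `downsample(week, 96)` returns 7 daily values, so the same classifier pipeline can be rerun at each granularity.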
Results
The study found that predictive accuracy decreases with coarser granularity, but stable performance is observed between 15 minutes and 1 hour and again from 1 to 7 days. XGBoost outperformed other classifiers, and feature importance analysis indicated that dwelling size could be inferred from coarse data, while swimming pool usage required finer granularity.
Implications
The findings underscore the importance of balancing data utility with privacy in smart metering. The ability to infer socio-demographic characteristics from load profiles poses risks of discriminatory practices in insurance, advertising, and financial assessments. This research informs future smart metering deployments to mitigate privacy concerns while maintaining analytical utility.
Trading Human Curation for Synthetic Augmentation in RLVR
Reinforcement Learning
Large Language Models
Efficient ML
- Synthetic task augmentation can effectively substitute for human curation in RLVR.
- The cost-adjusted trade rate (ρcost) between augmented and human-authored tasks is established and measured.
- High-share augmentation retains generalization performance comparable to a larger set of human-authored tasks.
- The study provides insights into the economics of task generation for RLVR.
Trading Human Curation for Synthetic Augmentation in RLVR
Summary
This paper addresses the challenge of scaling high-quality training tasks for reinforcement learning from verifiable rewards (RLVR) in agentic language models. The authors highlight that the current reliance on human curation for task generation is economically impractical due to the extensive time and resources required. To mitigate this bottleneck, the study investigates the use of synthetic task augmentations as a substitute for additional human-authored tasks. The authors formalize a cost-adjusted trade rate (ρcost) to evaluate the effectiveness of augmented tasks compared to human-authored ones. Through controlled ablation studies, they demonstrate that a small base of hand-authored tasks can be effectively augmented to achieve similar performance levels as a larger set of human-authored tasks. The findings indicate that high-share augmentation can maintain generalization performance across various benchmarks while significantly reducing data curation costs. This research contributes to the understanding of how synthetic data can be leveraged in RLVR, providing a pathway to more scalable training methodologies.
Methodology
The authors conducted a controlled ablation study to measure the trade rate (ρ) between augmented and human-authored tasks. They defined various experimental arms with different combinations of hand-authored and augmented tasks, and evaluated their performance on a ten-benchmark suite. The study also characterized the pipeline calibration regime and assessed the cost-effectiveness of the augmentation strategy.
Results
The results showed that 80 augmented variants achieved performance within 0.20 percentage points of 97 fully hand-authored tasks, while 319 augmented variants outperformed the human-only baseline by +0.96 percentage points. The cost-adjusted trade rate (ρcost) remained between 1.4× and 11.6× across different human/augmented task cost ratios, indicating a significant cost advantage for using augmented tasks.
Implications
This research suggests that synthetic task augmentation can be a viable strategy for scaling RLVR training, potentially leading to more efficient and cost-effective training pipelines. It opens avenues for further exploration into synthetic data generation techniques in reinforcement learning contexts.
Testing the Test: Score-Direction Instability in Class-Split Anomaly Detection
Computer Vision
Theory
- Identification of score-direction instability in class-split anomaly detection due to normal-anomaly overlap.
- Introduction of a training-free diagnostic tool, neighborhood class leakage, to assess the reliability of class-split evaluations.
- Empirical validation across various datasets and representation methods to demonstrate the diagnostic's effectiveness.
- Highlighting the fragility of benchmark conclusions in anomaly detection when relying on class-split evaluations.
Summary
This paper investigates the reliability of within-dataset class-split evaluation for anomaly detection (AD), particularly when the anomaly class overlaps with the normal class in representation space. The authors show that such overlap makes the direction of the anomaly score unstable: depending on which class is held out as the unknown anomaly, AUROC may collapse toward chance or even invert. They introduce a training-free diagnostic called neighborhood class leakage, which quantifies the extent of class overlap in the representation space. Through empirical validation across multiple datasets (Fashion-MNIST, CIFAR-10, and Imagenette) and representations (pixel space and VAE latent space), the authors demonstrate that their diagnostic can predict when class-split benchmarks are unreliable. The findings suggest that class-split AD benchmarks should be viewed as geometry-dependent stress tests rather than definitive evidence of anomaly detection capability.
Methodology
The authors propose a diagnostic index based on local neighborhood class leakage to quantify the overlap between normal and anomaly classes in representation space. They analyze datasets using both pixel space and VAE latent representations, employing multiple scoring methods (kNN, Isolation Forest, Local Outlier Factor) to evaluate the stability of anomaly detection scores.
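A rough illustration of what a neighborhood-based leakage index can look like follows. The summary does not give the paper's exact definition, so this sketch assumes the simplest variant: the average fraction of each point's k nearest neighbours that carry a different class label. The function name `neighborhood_class_leakage` and the toy data are hypothetical.

```python
import math


def neighborhood_class_leakage(points, labels, k=2):
    """Mean fraction of each point's k nearest neighbours with a different label.

    Assumed definition for illustration only; brute-force O(n^2) search.
    """
    n = len(points)
    total = 0.0
    for i in range(n):
        # Sort all other points by Euclidean distance to point i.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        total += sum(labels[j] != labels[i] for j in neighbours) / k
    return total / n


# Two well-separated clusters: neighbours stay within their own class.
separated = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
separated_labels = [0, 0, 0, 1, 1, 1]

# Classes interleaved along a line: most neighbours belong to the other class.
mixed = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
mixed_labels = [0, 1, 0, 1, 0, 1]

low = neighborhood_class_leakage(separated, separated_labels)
high = neighborhood_class_leakage(mixed, mixed_labels)
print(low, high)
```

Under this assumed definition, a value near zero means the classes occupy distinct regions, while a value near one means the held-out class sits inside the normal data, which is the regime where the summary reports unstable scores.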
Results
The study finds that high values of the neighborhood class leakage index correlate with instability in anomaly detection scores, including AUROC collapse and inversion. The diagnostic successfully predicts when class-split evaluations are likely to yield unreliable results, demonstrating its utility across different datasets and representations.
Implications
The findings have significant implications for the evaluation of anomaly detection methods, suggesting that researchers should be cautious when interpreting results from class-split benchmarks. The proposed diagnostic can aid in identifying unreliable evaluations, ultimately leading to more robust anomaly detection systems.
Pruning Deep Neural Networks via the Marchenko–Pastur Distribution
Efficient ML
Theory
Computer Vision
- Introduces a Marchenko–Pastur random-matrix approach for pruning DNNs.
- Achieves accuracy retention with minimal post-pruning fine-tuning (one epoch).
- Provides deterministic data-path certificates for effective pruning.
- Demonstrates significant MAC reduction while maintaining competitive accuracy on ImageNet-1k.
Summary
This paper introduces an approach to pruning deep neural networks (DNNs) using the Marchenko–Pastur (MP) distribution, with the goal of retaining accuracy under minimal post-pruning fine-tuning and short calibration schedules. The theoretical framework provides deterministic data-path certificates: if the removed components have a small propagated logit effect, the pruning will not adversely affect the model's performance. The study emphasizes the practical application of this method on various models, particularly Vision Transformers (ViTs) and CNNs, demonstrating significant reductions in computational complexity without substantial accuracy drops. The results show that after minimal fine-tuning, the pruned models achieve competitive performance on the ImageNet-1k dataset, showcasing the effectiveness of the MP distribution in guiding the pruning process.
Methodology
The authors utilize a random-matrix theory approach, specifically the Marchenko–Pastur distribution, to analyze and allocate pruning masks without requiring validation or test access. The methodology includes theoretical proofs for the effectiveness of pruning based on the propagated logit effect and the use of structured sparsity to enhance model performance.
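For orientation, the Marchenko–Pastur bulk edges that a random-matrix pruning criterion would typically compare a layer's spectrum against can be computed in a few lines. The sketch below assumes an n_rows × n_cols matrix with i.i.d. entries of variance `sigma2` and eigenvalues taken from the covariance normalized by n_rows. How the paper converts the spectrum into pruning masks is not described in this summary, so the counting rule shown here is a generic heuristic and both function names are hypothetical.

```python
import math


def mp_edges(n_rows, n_cols, sigma2=1.0):
    """Lower and upper edges of the Marchenko-Pastur bulk for aspect ratio n_cols / n_rows."""
    gamma = n_cols / n_rows
    lower = sigma2 * (1.0 - math.sqrt(gamma)) ** 2
    upper = sigma2 * (1.0 + math.sqrt(gamma)) ** 2
    return lower, upper


def count_signal_eigenvalues(eigenvalues, n_rows, n_cols, sigma2=1.0):
    """Count eigenvalues above the MP upper edge (treated here as non-noise directions)."""
    _, upper = mp_edges(n_rows, n_cols, sigma2)
    return sum(ev > upper for ev in eigenvalues)


# Toy spectrum for a 400 x 100 layer: gamma = 0.25, so the bulk is [0.25, 2.25].
edges = mp_edges(400, 100)
kept = count_signal_eigenvalues([0.3, 1.0, 2.0, 2.5, 9.0], 400, 100)
print(edges, kept)
```

In this toy case two eigenvalues exceed the upper edge, so a rank-based rule of this kind would keep two directions and treat the rest as indistinguishable from noise.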
Results
The empirical results indicate that the ViT-B/16 model achieves 83.41% top-1 accuracy with a 59.81% reduction in sparse-execution MAC after just three distillation epochs. Other models, including ConvNeXtV2-Base and ResNet architectures, also demonstrate competitive performance with minimal accuracy loss, confirming the effectiveness of the proposed pruning strategy.
Implications
This research has significant implications for the deployment of DNNs in resource-constrained environments, as it allows for efficient model compression without sacrificing performance. The findings can be applied to various applications in computer vision and beyond, where model efficiency is critical.
Pseudospectral Bounds for Transient Amplification in Coupled Gradient Descent
Optimization
Theory
- Establishes Kreiss-constant bounds for block-triangular Jacobians in coupled gradient descent.
- Identifies a critical coupling threshold for spectral instability.
- Introduces a finite-horizon iteration complexity bound for stochastic coupled descent.
- Demonstrates the significance of transient amplification in high-dimensional learning dynamics.
Summary
This paper addresses the dynamics of coupled gradient descent, in which the update of one parameter block depends on another, as commonly found in bilevel optimization, two-time-scale stochastic approximation, and generative adversarial networks. The author develops a pseudospectral theory for block-triangular Jacobians, establishing Kreiss-constant bounds that quantify transient amplification before convergence. The results reveal that even when the spectral radius indicates asymptotic stability, transient amplification can be significant because of non-normality in high-dimensional learning scenarios. The paper introduces a critical coupling threshold for spectral instability and extends the theory to nearly self-referential systems using a Neumann-series perturbation framework. The findings include a finite-horizon iteration complexity bound and scaling laws for stochastic optimization, which highlight the non-asymptotic, instance-dependent nature of learning dynamics that traditional spectral-radius analysis does not capture. Experimental validations on linear-quadratic problems and neural network training support the theoretical claims.
Methodology
The methodology involves a block-wise resolvent analysis of the Kreiss constant for block-triangular matrices. The author derives bounds for the transient amplification of coupled gradient descent dynamics by analyzing the spectral properties of the Jacobian matrix. The theory is extended to nearly self-referential systems using a Neumann-series perturbation framework, and the results are validated through experiments on linear-quadratic problems and neural networks.
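The phenomenon being bounded can be seen on a two-dimensional toy problem. The sketch below is not the paper's analysis: it simply takes a hypothetical upper-triangular iteration matrix with spectral radius 0.9 and a large off-diagonal coupling, then tracks the Frobenius norm of its powers. The matrix, the coupling value, and the helper names are all illustrative assumptions.

```python
import math


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def frobenius(m):
    """Frobenius norm of a 2x2 matrix."""
    return math.sqrt(sum(x * x for row in m for x in row))


def power_norms(a, steps):
    """Norms of a^1, a^2, ..., a^steps."""
    norms = []
    p = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(steps):
        p = matmul2(p, a)
        norms.append(frobenius(p))
    return norms


coupling = 5.0
# Both eigenvalues equal 0.9, so the iteration is asymptotically stable.
jacobian = [[0.9, coupling], [0.0, 0.9]]

norms = power_norms(jacobian, 200)
peak = max(norms)
print(f"first step: {norms[0]:.2f}, peak: {peak:.2f}, after 200 steps: {norms[-1]:.2e}")
```

The norm climbs to several times its initial value before the 0.9 decay takes over, which is the kind of growth a spectral-radius argument alone cannot see. By the Kreiss matrix theorem, the Kreiss constant is a lower bound on the supremum of these power norms.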
Results
The paper presents several key results, including: (1) Kreiss-constant bounds for block-triangular Jacobians, (2) a minimax lower bound for the Kreiss constant, (3) a critical coupling threshold for spectral instability, (4) a perturbative extension for nearly self-referential systems, and (5) a sample-complexity scaling law for stochastic coupled descent. The experiments confirm the theoretical predictions regarding transient amplification and iteration complexity.
Implications
The findings have significant implications for the design and analysis of optimization algorithms in machine learning, particularly in high-dimensional settings where transient dynamics can impact convergence rates. The results can inform the development of more robust optimization methods that account for non-normal dynamics and transient amplification effects.
An Ensembled Latent Factor Model via Differential Evolution and Gradient Descent Optimization
Optimization
- Introduction of ELFM-DEGDO, combining differential evolution and gradient descent for latent factor modeling.
- The model addresses the limitations of traditional gradient descent-only approaches in handling HDI data.
- Empirical results show ELFM-DEGDO outperforms multiple advanced latent factor models on real datasets.
- The proposed self-adaptive weighting mechanism effectively fuses strengths from both optimization methods.
Summary
This paper addresses the challenges posed by high-dimensional and incomplete (HDI) data, which are common in real-world applications. Traditional latent factor models, while effective for representation learning, often rely solely on gradient descent optimization, leading to biased and incomplete representations. The authors propose a novel Ensembled Latent Factor Model via Differential Evolution and Gradient Descent Optimization (ELFM-DEGDO) that combines two distinct optimization strategies to enhance representation learning. The model independently employs differential evolution and gradient descent to create two diverse latent factor models, which are then fused using a self-adaptive weighting mechanism. This approach leverages the strengths of both optimization methods to produce more accurate and less biased latent representations. The effectiveness of ELFM-DEGDO is validated through extensive empirical studies on three HDI datasets, demonstrating superior performance compared to several existing latent factor models. The findings indicate that the proposed model can significantly improve the extraction of informative latent features from HDI data, making it a valuable contribution to the field of representation learning.
Methodology
The proposed ELFM-DEGDO model consists of two main components: (1) two latent factor models trained independently, one with differential evolution and one with gradient descent, and (2) a self-adaptive weighting mechanism that combines the outputs of these models. The optimization objective minimizes the reconstruction error on the HDI matrix, with regularization to prevent overfitting.
Results
The empirical evaluation on three HDI datasets reveals that ELFM-DEGDO consistently outperforms existing latent factor models, demonstrating its capability to produce more comprehensive and less biased representations of high-dimensional and incomplete data.
Implications
The findings suggest that ELFM-DEGDO can be effectively applied in various domains requiring representation learning from HDI data, such as recommender systems, online services, and other applications involving sparse and incomplete datasets.
Staying Alive: Uncensored Survival Analysis with Tabular Foundation Models
Time Series
- Introduces a training-free method for survival regression using Tabular Foundation Models.
- Constructs an Accelerated Failure Time model with a single scalar parameter.
- Implements a non-parametric in-context estimator for imputing right-censored data.
- Demonstrates competitive performance against traditional survival regression models.
Summary
This paper addresses the challenges of survival analysis (SA), particularly the issue of right censoring in time-to-event data, by leveraging Tabular Foundation Models (TFMs). The author proposes a training-free method for survival regression that utilizes TFMs to predict event times and iteratively impute right-censored data. The method constructs an Accelerated Failure Time (AFT) model with a single scalar parameter, requiring no dataset-specific training. Additionally, it introduces a non-parametric in-context estimator based on the Buckley-James estimator to handle right-censored data. Experiments conducted on standard survival analysis benchmarks demonstrate that this approach achieves competitive performance compared to traditional parametric and semi-parametric survival regression models, such as Cox regression and parametric AFT models. The findings suggest that TFMs can effectively be applied to survival analysis, providing a novel solution to the limitations posed by right censoring.
Methodology
The proposed method frames survival regression as an in-context prediction task using TFMs. It constructs an AFT model by regressing the target through in-context learning, leaving a single scalar parameter to estimate. The method also employs an iterative mechanism similar to the Buckley-James estimator to impute censored data, allowing for effective handling of right-censored observations.
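A single Buckley-James style imputation step can be sketched as follows. This is the classical construction, not the paper's in-context estimator: each censored log-time is replaced by the model prediction plus the expected residual beyond the censoring point, with the expectation taken under a Kaplan-Meier estimate of the residual distribution. The predictions are passed in as plain numbers here (in the paper they would come from the TFM), ties are ignored, and the function names are hypothetical.

```python
def km_masses(residuals, observed):
    """Kaplan-Meier probability mass placed on each residual (no tie handling).

    The largest residual is treated as an event so that the masses sum to one,
    a common convention in Buckley-James implementations.
    """
    order = sorted(range(len(residuals)), key=lambda i: residuals[i])
    n = len(order)
    masses = [0.0] * n
    surv = 1.0
    for rank, i in enumerate(order):
        at_risk = n - rank
        if observed[i] or rank == n - 1:
            masses[i] = surv / at_risk
            surv -= masses[i]
    return masses


def buckley_james_impute(log_times, observed, predictions):
    """Replace censored log-times by prediction + E[residual | residual > censored residual]."""
    residuals = [t - p for t, p in zip(log_times, predictions)]
    masses = km_masses(residuals, observed)
    imputed = []
    for i, (t, p) in enumerate(zip(log_times, predictions)):
        if observed[i]:
            imputed.append(t)
            continue
        tail = [(masses[j], residuals[j]) for j in range(len(residuals)) if residuals[j] > residuals[i]]
        tail_mass = sum(m for m, _ in tail)
        if tail_mass == 0.0:
            imputed.append(t)  # no mass beyond the censoring point: keep the censored value
        else:
            imputed.append(p + sum(m * r for m, r in tail) / tail_mass)
    return imputed


# Toy data on the log-time scale; the second subject is right-censored.
log_times = [-1.0, 0.0, 1.0, 2.0]
observed = [True, False, True, True]
predictions = [0.0, 0.0, 0.0, 0.0]

imputed = buckley_james_impute(log_times, observed, predictions)
print(imputed)
```

In an iterative scheme, the imputed targets would be fed back to the regressor, the residuals recomputed, and the step repeated until the imputations stabilize.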
Results
The experiments on five widely used survival analysis benchmarks indicate that the proposed method achieves performance comparable to classical parametric and semi-parametric survival models that require training. This suggests that the approach is viable for practical applications in survival analysis.
Implications
The findings of this paper could lead to improved methodologies in survival analysis across various domains, including healthcare and churn prediction, by providing a robust framework that accommodates right censoring without the need for extensive training.
A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners
Large Language Models
NLP
Interpretability
- Supervised fine-tuning improves LLMs' ability to encode action validity and state predicates.
- LLMs can learn internal representations that differentiate valid and invalid actions even when classification from output probabilities is poor.
- Fine-tuning with diverse state space coverage leads to better world model recovery.
- The study provides insights into the representation and reasoning capabilities of LLMs in planning contexts.
Summary
This paper investigates whether supervised fine-tuning (SFT) allows large language models (LLMs) to effectively represent and reason about classical planning problems. The authors conduct interpretability experiments to assess world model recovery in fine-tuned LLMs, focusing on both internal representations and generative capabilities. They find that SFT on valid action sequences enables LLMs to encode action validity and some state predicates. Additionally, models may learn internal representations that distinguish valid from invalid actions, even if they struggle with output probabilities. The study highlights that broader state space coverage during fine-tuning enhances the recovery of the underlying world model. The findings contribute to understanding how knowledge is represented in LLMs and provide a framework for applying interpretability techniques to planning tasks.
Methodology
The authors fine-tune a collection of gemma2-9b-instruct models using supervised fine-tuning on end-to-end plan generation tasks. They evaluate world model recovery through two definitions: internal representations and generative capabilities. Internal representations are assessed using linear probes on hidden states, while generative capabilities are evaluated by classifying action validity based on generated token probabilities. The fine-tuning process is explored with different data distributions and chain-of-thought techniques.
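The linear-probe idea can be illustrated without a language model. The sketch below trains a logistic-regression probe on synthetic four-dimensional vectors standing in for hidden states, where one coordinate is constructed to encode action validity. The data generator, dimensions, and function names are assumptions made for illustration; the paper's probes operate on real model activations.

```python
import math
import random


def train_linear_probe(features, targets, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, targets):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_predict(w, b, x):
    """Return 1 if the probe scores the state as 'valid action', else 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


random.seed(0)


def fake_hidden_state(valid):
    # Hypothetical stand-in for an activation: coordinate 0 carries validity, the rest is noise.
    sign = 1.0 if valid else -1.0
    return [sign * (1.0 + abs(random.gauss(0, 1)))] + [random.gauss(0, 1) for _ in range(3)]


labels = [i % 2 for i in range(100)]
states = [fake_hidden_state(y == 1) for y in labels]

# Train on the first 80 states and evaluate on the held-out 20.
w, b = train_linear_probe(states[:80], labels[:80])
accuracy = sum(probe_predict(w, b, x) == y for x, y in zip(states[80:], labels[80:])) / 20
print(accuracy)
```

High held-out probe accuracy is read as evidence that the property is linearly decodable from the representation, independently of whether the model's output probabilities classify it well.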
Results
The experiments reveal that SFT on valid action sequences allows LLMs to learn internal representations that encode action validity and some state predicates. Furthermore, LLMs that struggle with probability-based classification may still possess high-quality internal representations. The study also shows that fine-tuning on planning data with comprehensive state space coverage significantly enhances world model recovery, both in-distribution and out-of-distribution.
Implications
The findings suggest that LLMs can be effectively fine-tuned to improve their planning capabilities, which could have applications in automating complex workflows, software development, and scientific discovery. The insights into internal representations may also inform future research on enhancing reasoning capabilities in LLMs.
Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease Classification
Graph Learning
- Introduction of two graph-guided Universum learning models for Alzheimer's disease classification.
- Utilization of mild cognitive impairment (MCI) subjects as Universum data to enhance classification between AD and CN.
- Construction of a graph using Gaussian similarity and Minimum Spanning Tree to capture geometric relationships among Universum samples.
- Demonstrated superior performance of the proposed models over existing methods, particularly under noisy conditions.
Summary
This paper addresses the challenge of early and accurate detection of Alzheimer's disease (AD) through improved classification techniques using structural MRI data. The authors propose two novel models, UG-GEPSVM and IUG-GEPSVM, which enhance the Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM) framework by incorporating graph-guided Universum learning. Unlike traditional methods that treat Universum samples as independent points, the proposed models leverage the geometric relationships among these samples, specifically using mild cognitive impairment (MCI) subjects as Universum data to bridge the gap between cognitively normal (CN) and AD classes. A graph is constructed based on Gaussian similarity and Minimum Spanning Tree connectivity, from which a Laplacian matrix is derived to capture the geometric structure of MCI samples. This Laplacian-based regularization is integrated into the learning process, leading to improved classification performance. Experimental results on the ADNI MRI dataset demonstrate that both UG-GEPSVM and IUG-GEPSVM consistently outperform existing GEPSVM and Universum-based methods, achieving an average AUC of 88.07% and maintaining stability under various noise levels. Statistical tests confirm the significance of these improvements, indicating the effectiveness of the proposed graph-guided approach in AD classification.
Methodology
The proposed models, UG-GEPSVM and IUG-GEPSVM, incorporate graph-guided Universum regularization into the GEPSVM framework. A graph is constructed over MCI Universum samples, and a Laplacian matrix is derived to capture their geometric structure. This regularization is integrated into the learning process, allowing the models to exploit both Universum information and its underlying geometric relationships. UG-GEPSVM formulates the problem as a generalized eigenvalue problem, while IUG-GEPSVM extends the numerically stable IGEPSVM framework using a standard eigenvalue formulation.
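The graph construction step can be sketched in plain Python. The summary does not say exactly how Gaussian similarity and the Minimum Spanning Tree are combined, so this example assumes one plausible reading: connectivity comes from the MST over Euclidean distances, edge weights come from a Gaussian kernel, and the Laplacian is the usual degree matrix minus the weight matrix. The bandwidth `sigma`, the function names, and the toy points are hypothetical.

```python
import math


def mst_edges(points):
    """Prim's algorithm on Euclidean distances; returns the n - 1 tree edges as (i, j)."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        _, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


def graph_laplacian(points, sigma=1.0):
    """Laplacian L = D - W with Gaussian weights on MST edges only (assumed combination)."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i, j in mst_edges(points):
        sim = math.exp(-math.dist(points[i], points[j]) ** 2 / (2.0 * sigma ** 2))
        weights[i][j] = weights[j][i] = sim
    laplacian = [[-weights[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        laplacian[i][i] = sum(weights[i])  # node degree on the diagonal
    return laplacian


# Three stand-in Universum samples on a line; the MST links 0-1 and 1-2.
universum = [(0.0,), (1.0,), (3.0,)]
lap = graph_laplacian(universum)
for row in lap:
    print([round(v, 4) for v in row])
```

A regularizer built from such a Laplacian penalizes projections that separate Universum samples which are neighbours on the graph, which is how geometric structure among the MCI samples can enter the objective.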
Results
The experiments conducted on the ADNI MRI dataset show that both UG-GEPSVM and IUG-GEPSVM outperform existing GEPSVM and Universum-based methods, achieving an average AUC of 88.07%. The models maintain stable performance across five different noise levels, and statistical tests confirm the significance of the observed improvements.
Implications
The proposed graph-guided Universum learning approach has the potential to significantly enhance the accuracy of Alzheimer's disease classification, which is crucial for early diagnosis and intervention. This methodology could be applied to other medical imaging classification tasks where intermediate classes exist, improving overall diagnostic accuracy.
Folded Transport MCMC: Certifiable Quotient Posterior Computation for Symmetric Bayesian Models
Theory
- Introduces Folded Transport MCMC (FolT-MCMC) for symmetric Bayesian models.
- Directly samples from the quotient posterior to avoid label-switching issues.
- Proves theoretical guarantees for the method's convergence and certification.
- Demonstrates significant empirical improvements in convergence diagnostics.
Summary
This paper introduces Folded Transport MCMC (FolT-MCMC), an approach to inference in Bayesian models with a finite symmetry group, such as mixture models with exchangeable components. Traditional MCMC methods struggle with label-switching and redundant multimodality, which can degrade convergence diagnostics. FolT-MCMC addresses these issues by directly sampling from the quotient posterior, which is defined on the fundamental domain of the symmetry group. The method utilizes a symmetrized normalizing flow to construct proposal densities, avoiding the complications associated with label-switching. The paper establishes theoretical guarantees for the method, demonstrating that the LCNF oscillation-based certification framework can be adapted to the quotient metric, leading to improved convergence diagnostics. Empirical results show significant improvements in the quantile-core certified lower bounds across various experiments, including synthetic Gaussian mixtures and real-world accelerometer data recorded during Typhoon Mangkhut, where FolT-MCMC outperformed traditional methods.
Methodology
FolT-MCMC constructs an independence sampler on the fundamental domain of the symmetry group, utilizing a symmetrized normalizing flow to create proposal densities. The method leverages the LCNF oscillation-based certification framework to provide theoretical guarantees and improve convergence diagnostics.
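The folding idea behind a fundamental domain can be shown for the simplest case. For a one-dimensional mixture whose components are exchangeable, one common choice of fundamental domain is the set of parameter vectors with means in increasing order, and any draw can be mapped into it by sorting. The sketch below shows only this canonicalization step; the paper's actual domain, proposal flow, and certification are not described in enough detail here to reproduce, and the function name is hypothetical.

```python
def fold_to_fundamental_domain(means, weights):
    """Map mixture parameters to the representative with increasing component means."""
    order = sorted(range(len(means)), key=lambda k: means[k])
    return [means[k] for k in order], [weights[k] for k in order]


# Two label-switched draws describing the same mixture.
draw_a = ([2.0, -1.0, 0.5], [0.2, 0.5, 0.3])
draw_b = ([-1.0, 0.5, 2.0], [0.5, 0.3, 0.2])

folded_a = fold_to_fundamental_domain(*draw_a)
folded_b = fold_to_fundamental_domain(*draw_b)
print(folded_a, folded_b)
```

Both draws land on the same point after folding, so a sampler that lives on the folded space sees one mode where an unfolded sampler would see one copy per permutation of the labels.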
Results
FolT-MCMC achieved quantile-core certified improvement ratios ranging from 2× to 145× on symmetric Gaussian mixtures and demonstrated the ability to produce non-vacuous quantile-core certificates on real accelerometer data, where traditional methods yielded vacuous results.
Implications
The proposed method has the potential to enhance inference in various symmetric Bayesian models, improving the reliability of convergence diagnostics and enabling more effective analysis of complex data structures, particularly in fields requiring robust statistical modeling under symmetry constraints.
QUIVER: Quantum-Informed Views for Enhanced Representations in Large ML Models
Multimodal
Theory
Graph Learning
- QUIVER integrates quantum Fisher information into classical machine learning models to enhance feature representation.
- The method is architecture-agnostic, allowing integration into various model types such as transformers and graph neural networks.
- Significant performance improvements were observed on QM9 and JETCLASS benchmark datasets compared to classical baselines.
- The quantum Fisher view exposes higher-order correlations that classical representations may overlook.
Summary
The paper introduces QUIVER (QUantum-Informed Views for Enhanced Representations), an approach that enhances classical machine learning models by integrating quantum Fisher information. This method provides a geometrically motivated, basis-independent summary of higher-order correlations through a variational quantum circuit (VQC). Unlike traditional feature augmentation, the quantum Fisher information matrix captures the intrinsic geometry of the learned quantum state manifold, revealing statistical structure that classical representations may overlook. The authors demonstrate the effectiveness of QUIVER on two benchmark datasets: QM9 for molecular property prediction and JETCLASS for jet flavor classification at the Large Hadron Collider (LHC). The core contribution of the paper is the architecture-agnostic nature of QUIVER, allowing it to be integrated into various model architectures through targeted modifications. The results indicate that quantum-geometric features extracted from simulated variational circuits can significantly enhance performance in standard machine learning tasks, paving the way for future applications even before the availability of fault-tolerant quantum hardware.
Methodology
The authors utilize a variational quantum circuit (VQC) to map classical data into a quantum state, extracting the quantum Fisher information matrix (QFIM) to capture geometric and correlation structures. They propose a fusion of the quantum Fisher view with classical data through modifications in model architectures, including cross-attention mechanisms for transformers and modulation of graph messages for GNNs.
Results
QUIVER consistently outperformed classical models, including a state-of-the-art jet tagger and a leading model for molecular property prediction, demonstrating measurable improvements in performance metrics across both benchmark datasets.
Implications
The findings suggest that incorporating quantum-geometric features can enhance the capabilities of classical machine learning models, potentially leading to advancements in fields such as high-energy physics and molecular chemistry. This work also lays the groundwork for future research into quantum-enhanced machine learning techniques.
ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information
Reinforcement Learning
Large Language Models
Optimization
- ASymPO normalizes token loss to stabilize training without behavior-policy probabilities.
- Identifies scale-imbalance failure mode in current-policy-only asynchronous RL.
- Proposes Scaled Policy Optimization (SPO) as a simpler baseline method.
- Empirical evaluations show ASymPO's effectiveness in asynchronous mathematical reasoning tasks.
Summary
The paper introduces ASymPO, an innovative approach to stabilize asynchronous reinforcement learning (RL) for language model post-training without relying on behavior information. Traditional methods to mitigate distribution drift during asynchronous training require behavior-policy probabilities, which can complicate the training pipeline. The authors identify a critical failure mode where stale responses can lead to scale imbalances in loss contributions, destabilizing training. ASymPO addresses this by normalizing token loss based on current average token negative log-probabilities, effectively balancing positive and negative loss contributions without needing behavior-policy probabilities. The paper also presents Scaled Policy Optimization (SPO) as a simpler baseline. The authors empirically evaluate both ASymPO and SPO in the context of asynchronous mathematical reasoning post-training, demonstrating that ASymPO maintains a compact rollout-learner interface while ensuring stability and effectiveness in training.
Methodology
The authors propose ASymPO, which normalizes the loss of each response based on its current average token negative log-probability. This adaptive scaling helps balance the contributions of positive and negative responses during training. They also introduce SPO as a fixed scaling method for comparison. Both methods are evaluated in the context of asynchronous mathematical reasoning post-training.
Results
The empirical evaluation indicates that ASymPO effectively stabilizes training and preserves a nonzero learning signal, outperforming naive current-policy training and behavior-corrected methods in terms of stability and performance across three model families.
Implications
The findings suggest that ASymPO can simplify the training of large language models in asynchronous settings, potentially leading to more efficient and effective post-training processes without the complexities introduced by behavior information.
Toward Multi-Domain and Long-Tailed Quantization via Feature Alignment and Scaling
Efficient ML
- Introduction of EmaQ for efficient multi-domain quantization addressing domain shifts.
- Extension to EmaQ-LT for long-tailed quantization, mitigating majority-class overconfidence.
- Theoretical convergence guarantees for the proposed quantization methods.
- Sensitivity-aware weight aggregation to harmonize convergence across diverse domains.
Summary
This paper addresses the challenges of quantizing deep neural networks in multi-domain and long-tailed scenarios, which are often overlooked in existing methods that assume single-domain and balanced data. The authors propose Efficient Multi-Domain Alignment Quantization (EmaQ), which aligns domain distributions through a cumulative distribution function (CDF)-based projection and employs sensitivity-aware weight aggregation to stabilize the quantization process across multiple domains. To tackle long-tailed quantization, they extend EmaQ to EmaQ-LT, introducing class-conditioned variance scaling and confidence-based logit adjustment to reduce the overconfidence of majority classes. Theoretical analyses provide convergence guarantees for the proposed methods, ensuring stability during training. Experimental results on various benchmarks demonstrate that both EmaQ and EmaQ-LT achieve strong low-bit performance, effectively handling domain shifts and class imbalances while maintaining accuracy.
Methodology
The methodology involves two main components: Efficient Multi-Domain Alignment Quantization (EmaQ) and its extension EmaQ-LT. EmaQ utilizes Domain Alignment Quantization Training (DAQT) for aligning domain distributions and employs an alignment quantization gradient descent (AQGD) mechanism for stable gradient approximation. Sensitivity-aware Weight Aggregation (SWA) is introduced to manage varying sensitivities across domains. EmaQ-LT incorporates class-conditioned variance scaling and a homogenized loss function to address class imbalances in long-tailed data.
Results
The experiments conducted on multi-domain datasets (Office-31, Digits) and long-tailed datasets (SynDigits-LT, CIFAR-10-LT, CIFAR-100-LT) show that EmaQ and EmaQ-LT significantly outperform existing quantization methods, achieving robust low-bit performance even in the presence of domain shifts and class imbalance.
Implications
The proposed methods have significant implications for deploying deep learning models on resource-constrained devices, particularly in real-world applications where data is often multi-domain and imbalanced. This research can enhance the efficiency and reliability of machine learning models in various fields, including mobile computing and edge AI.
RESCAST-100K: A Comprehensive Dataset for Cross-Domain Residential Load and Indoor Temperature Forecasting
Time Series
- Introduction of RESCAST-100K, a comprehensive dataset for residential energy forecasting.
- Dataset includes 100,000 simulated homes with detailed time series data and exogenous variables.
- Configuration-driven interface allows for systematic evaluation across various domains.
- Benchmarking shows that cross-attention and MLP-mixer models outperform traditional architectures.
Summary
The paper introduces RESCAST-100K, a large-scale dataset designed to enhance the forecasting of residential energy load and indoor temperature across diverse domains. Recognizing the challenges posed by data scarcity and heterogeneity in residential settings, the authors developed a benchmark that includes approximately 100,000 EnergyPlus-simulated homes in the U.S. The dataset provides 15-minute time series data for total load, HVAC load, and indoor temperature, along with weather data and static building characteristics. A unique feature of RESCAST-100K is its configuration-driven interface, allowing researchers to evaluate forecasting models under various domain shifts, including geography and climate. The authors benchmark several model architectures, including recurrent, attention-based, and MLP-mixer models, demonstrating that cross-attention and MLP-mixer models outperform traditional approaches in zero-shot performance and sim-to-real evaluations. This dataset aims to support advancements in machine learning for residential energy forecasting, facilitating improved energy management and grid-level efficiency.
Methodology
The authors created RESCAST-100K by integrating high-fidelity physics-based simulations with real-world datasets, providing a unified schema for controlled experimentation. They employed a configuration-driven approach to define domain splits and benchmarked various forecasting models, including recurrent networks, transformers, and MLP-mixers, under different conditions, including missing input data.
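A configuration-driven domain split of the kind described above might look like the following. This is a hedged sketch only: `SplitConfig`, `make_split`, and the metadata field `climate_zone` are hypothetical names chosen for illustration and are not taken from the RESCAST-100K interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SplitConfig:
    """Hypothetical split definition: which metadata field separates the domains."""
    field: str
    train_values: frozenset
    test_values: frozenset

def make_split(homes, config):
    """Partition home records into train and test domains according to the config."""
    train = [h for h in homes if h[config.field] in config.train_values]
    test = [h for h in homes if h[config.field] in config.test_values]
    return train, test

# Made-up metadata records; real records would carry time series as well.
homes = [
    {"id": 1, "climate_zone": "hot-humid"},
    {"id": 2, "climate_zone": "cold"},
    {"id": 3, "climate_zone": "hot-humid"},
    {"id": 4, "climate_zone": "marine"},
]
config = SplitConfig(
    field="climate_zone",
    train_values=frozenset({"hot-humid", "marine"}),
    test_values=frozenset({"cold"}),
)
train_homes, test_homes = make_split(homes, config)
```

Keeping the split in a declarative object means a different domain shift (for example geography instead of climate) only requires a new config, not new evaluation code.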
Results
The results indicated that cross-attention and MLP-mixer models consistently achieved the highest accuracy in forecasting tasks across different domains, particularly under conditions of domain shift and missing data. This highlights the effectiveness of these architectures in handling the complexities of residential energy forecasting.
Implications
RESCAST-100K is expected to significantly advance research in residential energy forecasting, enabling better energy management systems, improved grid demand response strategies, and enhanced community-level energy efficiency programs. It provides a robust framework for future studies on transfer learning and domain adaptation in energy forecasting.
APIC: Amortized Physics-Informed Calibration using Neural Processes
Theory
Generative Models
Time Series
- APIC combines the probabilistic rigor of Kennedy–O’Hagan (KOH) calibration with the efficiency of amortized inference.
- The dual-latent architecture effectively disentangles physical parameters from model discrepancies.
- APIC enables rapid calibration from sparse observations while quantifying uncertainty.
- Experimental results show improved predictive performance and reliable parameter recovery across multiple dynamical systems.
Summary
The paper introduces Amortized Physics-Informed Calibration (APIC), a novel framework that enhances the Kennedy–O’Hagan (KOH) calibration method by leveraging Neural Processes for scalable Bayesian inference across related systems. Traditional KOH methods, while effective in modeling discrepancies between physical simulations and real-world observations, are limited by their non-amortized approach, requiring recalibration for each new instance. APIC addresses this limitation by employing a dual-latent architecture that separates instance-specific physical parameters from shared structural discrepancies, enabling efficient calibration from sparse observations. The framework integrates differentiable physics into an amortized inference model, allowing for rapid calibration while providing uncertainty quantification. The authors validate APIC through experiments on various dynamical systems, including a damped spring oscillator and the Lotka–Volterra system, demonstrating significant improvements in parameter recovery and consistent identification of systemic discrepancies compared to existing calibration methods.
Methodology
APIC employs a dual-latent encoder architecture to model a population of related system realizations, capturing instance-specific physical parameters and structured discrepancies. The training process occurs in two stages: first, learning the mapping from observations to physical parameters using simulator-generated data, followed by training on real observations while regularizing the discrepancy component.
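For orientation, the standard Kennedy–O’Hagan observation model that this framework builds on can be written as follows. The notation is the conventional one rather than the paper's own, and the simulator scale factor that appears in some formulations is omitted here.

```latex
\begin{align}
  y_i &= \eta(x_i, \theta) + \delta(x_i) + \varepsilon_i,
  \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
\end{align}
```

Here $y_i$ is a field observation at input $x_i$, $\eta$ is the simulator evaluated at calibration parameters $\theta$, $\delta$ is the model discrepancy, and $\varepsilon_i$ is observation noise. In the terms of the summary, $\theta$ would play the role of the instance-specific physical parameters and $\delta$ the shared structural discrepancy; how APIC parameterizes each latent is not specified here.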
Results
The experiments conducted on three dynamical systems revealed that APIC outperforms traditional calibration methods in terms of predictive accuracy, uncertainty quantification, and the recovery of both parameters and discrepancies, maintaining physical interpretability.
Implications
APIC has the potential to enhance the calibration of physical models in various scientific fields, enabling more accurate predictions and better understanding of model discrepancies. Its efficient approach could be applied to a wide range of dynamical systems, improving the integration of physics-based simulations with data-driven methodologies.
Reconciling Causality and Non-Equilibrium Thermodynamics with Hamiltonian Causal Models
Theory
- Introduction of Hamiltonian Causal Models (HCMs) for trajectory-level causal modeling.
- Separation of immutable equations of motion from intervenable mechanisms.
- Entropy production as a measurable causal observable that quantifies irreversibility.
- HCMs accommodate time-dependent, adaptive interventions and feedback loops.
Summary
This paper introduces Hamiltonian Causal Models (HCMs), a novel framework for causal modeling that addresses the complexities of physical temporal phenomena, such as interventions along trajectories, nonstationary laws, and path-dependent effects. HCMs conceptualize trajectories as the primary causal entities, allowing interventions to be represented as time-dependent controls acting on Hamiltonian mechanisms. The framework distinguishes between immutable equations of motion and intervenable mechanisms, defining causal effects as discrepancies between interventional path laws. A significant aspect of HCMs is their integration with non-equilibrium thermodynamics, where entropy production serves as a key causal observable, quantifying the irreversibility of processes and revealing causal effects that are not captured by traditional average treatment effects. The authors argue that causality in physical systems is inherently tied to the thermodynamic arrow of time, emphasizing that causal relationships emerge from the non-invertibility of these processes. The paper provides a structured approach to modeling physical systems without the need for ad-hoc assumptions, facilitating the study of causal effects over finite time horizons and accommodating feedback and adaptive interventions.
Methodology
The authors develop Hamiltonian Causal Models (HCMs) that utilize a trajectory-level framework to model controlled Hamiltonian systems. The methodology includes defining causal effects through discrepancies in interventional path laws and employing thermodynamic principles to analyze causal interactions and entropy production rates.
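To make the idea of entropy production as a measure of irreversibility concrete, here is a small sketch for a stationary discrete-time Markov chain, using the standard stochastic-thermodynamics expression (the sum over state pairs of the forward flux times the log ratio of forward to backward flux). This is a simplified stand-in, not the trajectory-level HCM formulation, and the function names are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi @ P = pi with sum(pi) = 1 via least squares."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def entropy_production_rate(P):
    """Entropy production per step of a stationary Markov chain with transition matrix P."""
    pi = stationary_distribution(P)
    flux = pi[:, None] * P  # flux[i, j] = probability of seeing the transition i -> j
    # Only pairs with nonzero flux in both directions contribute finite terms.
    mask = (flux > 0) & (flux.T > 0)
    return float(np.sum(flux[mask] * np.log(flux[mask] / flux.T[mask])))

# A symmetric chain obeys detailed balance, so it is reversible.
reversible = np.array([[0.5, 0.5],
                       [0.5, 0.5]])
# A biased three-state cycle has a net probability current, so it is irreversible.
driven = np.array([[0.0, 0.9, 0.1],
                   [0.1, 0.0, 0.9],
                   [0.9, 0.1, 0.0]])

sigma_reversible = entropy_production_rate(reversible)
sigma_driven = entropy_production_rate(driven)
```

The reversible chain yields a rate of zero while the driven cycle yields a strictly positive one, which is the sense in which entropy production can distinguish dynamics that look alike in their averages.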
Results
The paper demonstrates that work rates can identify local causal interactions when the Hamiltonian is known, and that entropy production serves as a data-estimable witness of causal effects. The authors also establish a local causal influence criterion for nonstationary dynamics based on local entropy production rates.
Implications
The findings suggest that HCMs can provide a robust framework for understanding causality in various physical systems, potentially influencing fields such as statistical mechanics, control theory, and causal inference in machine learning. The integration of thermodynamic concepts into causal modeling may lead to new insights and methodologies in both theoretical and applied contexts.