AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 48
Update frequency: 8h
Days of history: 7
Forecasting what Matters: Decision-Focused RL for Controlled EV Charging with Unknown Departure Times
Reinforcement Learning
Optimization
Time Series
- Introduces a decision-focused reinforcement learning framework for EV charging.
- Addresses the challenge of unknown departure times in EV charging optimization.
- Implements end-to-end training of the forecaster and RL agent to improve decision quality.
- Demonstrates up to a 14% improvement in total reward and a 55% reduction in unsupplied energy.
Summary
This paper addresses the challenges posed by the increasing adoption of electric vehicles (EVs) on power systems, particularly focusing on the optimization of EV charging when departure times are unknown. Traditional reinforcement learning (RL) methods struggle with this uncertainty, as they typically rely on accurate forecasting of key features like departure times. The authors propose a novel decision-focused reinforcement learning (DF-RL) framework that integrates a forecaster trained end-to-end with the RL agent's decision-making process. This approach allows the forecaster to improve its predictions based on the quality of the RL agent's actions, rather than solely on accuracy. The DF-RL framework is shown to enhance the overall performance of EV charging decisions, achieving significant improvements in total rewards and reductions in unsupplied energy compared to conventional RL methods that do not incorporate departure time forecasting.
Methodology
The authors developed a decision-focused reinforcement learning (DF-RL) framework that jointly trains a forecaster and an RL agent. The forecaster predicts unknown features, such as departure times, while receiving feedback from the RL agent's actions to improve its predictions. This end-to-end training approach allows for better decision-making in EV charging scenarios.
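A toy sketch can show why forecast accuracy and decision quality diverge. It is not the paper's model: the charging rule, the function name `unsupplied_energy`, and all numbers are assumptions made for illustration. Two forecasts with the same one-hour absolute error lead to very different amounts of unsupplied energy.

```python
def unsupplied_energy(demand_kwh, max_kw, predicted_hours, actual_hours):
    """Energy still missing at departure under a spread-evenly charging rule."""
    # Hypothetical rule: charge at the lowest constant rate that would finish
    # exactly at the forecast departure, capped by the charger limit.
    rate_kw = min(max_kw, demand_kwh / predicted_hours)
    # Charging stops at the real departure, or earlier once the demand is met.
    hours_charged = min(actual_hours, demand_kwh / rate_kw)
    delivered_kwh = rate_kw * hours_charged
    return max(0.0, demand_kwh - delivered_kwh)


actual = 4.0  # the vehicle really leaves after 4 hours
# Forecast one hour too late: charging is too slow and energy is left unsupplied.
late_forecast = unsupplied_energy(20.0, 11.0, predicted_hours=5.0, actual_hours=actual)
# Forecast one hour too early: charging finishes ahead of the real departure.
early_forecast = unsupplied_energy(20.0, 11.0, predicted_hours=3.0, actual_hours=actual)
print(late_forecast, early_forecast)
```

A forecaster trained only on accuracy treats both errors alike, whereas a decision-focused loss would penalize the late forecast more heavily.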
Results
The DF-RL method outperformed baseline RL approaches, achieving up to a 14% increase in total reward and a 55% decrease in unsupplied energy. These results indicate that the integration of forecasting and decision-making significantly enhances the effectiveness of EV charging strategies.
Implications
The proposed DF-RL framework has potential applications in smart grid management and EV charging infrastructure, enabling more efficient energy use and improved user satisfaction. It also opens avenues for further research in decision-focused learning approaches in other domains where forecasting and decision-making are intertwined.
From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning
NLP
Large Language Models
Reinforcement Learning
- Introduces a hierarchical latent selection model for reasoning in LLMs.
- Demonstrates the complementary roles of supervised fine-tuning and reinforcement learning.
- Shows that RL can extract reusable atomic modules from compound reasoning traces.
- Finds that training on compound traces enhances generalization capabilities.
Summary
This paper explores the compositional generalization capabilities of large language models (LLMs) through a novel hierarchical latent selection model. The authors argue that the success of combining supervised fine-tuning (SFT) with reinforcement learning (RL) in enhancing reasoning performance is fundamentally driven by compositional generalization. They formalize this concept by modeling reasoning traces as outputs of a hierarchical latent selection process, where discrete selection variables correspond to reusable atomic modules, including skills (local operations) and routing mechanisms (how information is selected and composed). The paper presents a theoretical framework demonstrating the complementary roles of SFT and RL: SFT provides the foundational module materials, while RL identifies and decomposes these modules to enable generalization. Controlled experiments validate the theory, showing that RL can extract and recombine atomic modules from compound traces generated by SFT, leading to improved performance on novel configurations. The findings indicate that training on compound traces yields better generalization than isolated atomic modules, and an effective training protocol is proposed where SFT ensures coverage of all atomic modules while RL focuses on exploring novel compositions. This work contributes to understanding how LLMs can achieve robust reasoning through compositionality.
Methodology
The authors develop a hierarchical latent selection model to formalize compositional generalization in reasoning. They conduct controlled experiments on synthetic reasoning tasks to validate their theoretical claims, focusing on how RL can decompose compound reasoning traces into reusable atomic modules.
Results
The experiments reveal that RL effectively extracts atomic modules from compound traces, achieving accuracy comparable to direct supervision on atomic tasks while generalizing to unseen compositions. The strongest out-of-distribution performance is observed when SFT covers the atomic inventory and RL targets novel compositions.
Implications
The findings suggest that enhancing LLMs' reasoning capabilities through compositional generalization can lead to more robust AI systems. This work has implications for developing more effective training protocols for LLMs, potentially improving their performance in complex reasoning tasks.
TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning
Reinforcement Learning
Robotics
Theory
- Introduces TRIDENT, the first MARL framework co-designing hybrid-action, safety, and physics modules.
- Establishes a coupling lemma that formalizes the interdependencies of hybrid actions, safety, and physics in MARL.
- Achieves a 95.5% reduction in training-time violations relative to MADDPG and a 76.3% reduction relative to MACPO.
- Demonstrates a 13.5% improvement in reward over the strongest unconstrained baseline.
Summary
The paper presents TRIDENT, a novel framework for Multi-Agent Reinforcement Learning (MARL) that addresses the challenges posed by hybrid discrete-continuous actions, hard safety constraints during training, and physics-governed dynamics in networked cyber-physical systems. The authors identify a three-way coupling of biases that arises when these features are naively combined, leading to instability and inefficiency. To overcome this, TRIDENT integrates three co-designed components: a Richardson-Romberg gradient correction to mitigate Gumbel-Softmax bias, a Lyapunov-constrained sequential trust-region update to ensure feasibility at each iteration, and a physics-informed residual critic that decomposes value rather than reward. The framework guarantees convergence to a constrained Nash equilibrium and provides a cumulative violation bound, demonstrating significant improvements in safety and performance in various applications, including UAV mobile-edge computing and autonomous intersection management.
Methodology
TRIDENT employs a co-design approach that integrates three key components: a Richardson-Romberg gradient correction to reduce bias from Gumbel-Softmax estimators, a Lyapunov-constrained sequential trust-region update to maintain safety constraints, and a physics-informed residual critic that focuses on value decomposition. This design ensures that errors from one component do not propagate to others, allowing for a unified convergence and safety analysis.
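The summary does not say how TRIDENT applies the Richardson-Romberg correction to Gumbel-Softmax gradients. The sketch below shows only the generic extrapolation idea, assuming the estimator's bias is a smooth function of a temperature-like parameter `tau`. The estimator `biased_estimate` is a hypothetical stand-in with a known bias.

```python
def richardson(estimate, tau):
    """Combine estimates at tau and tau/2 to cancel the first-order bias term."""
    return 2.0 * estimate(tau / 2.0) - estimate(tau)


def biased_estimate(tau):
    """Hypothetical estimator: true value 1.0 plus bias 0.5*tau + 0.2*tau**2."""
    return 1.0 + 0.5 * tau + 0.2 * tau ** 2


tau = 0.5
plain_error = abs(biased_estimate(tau) - 1.0)
corrected_error = abs(richardson(biased_estimate, tau) - 1.0)
print(plain_error, corrected_error)
```

The linear bias term cancels exactly, leaving only a second-order remainder. With a stochastic estimator the combination also changes the variance, which a real implementation would have to account for.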
Results
TRIDENT reduces training-time violations by 95.5% relative to MADDPG and by 76.3% relative to MACPO, while achieving a 13.5% higher reward than the best unconstrained baseline. The framework scales to scenarios with up to 32 agents.
Implications
TRIDENT's approach can be applied to various domains requiring safe coordination in multi-agent systems, such as autonomous vehicles, drone fleets, and robotic warehouses, where safety and efficiency are critical. The framework's ability to handle complex action spaces and ensure safety during training could lead to more reliable and effective deployment of autonomous systems in real-world scenarios.
Online Distributional Prediction via Latent Cluster Geometry Under Drift and Corruption
Theory
Optimization
Time Series
- Introduces a latent clustering-configuration view for online distributional prediction.
- Derives high-probability regret bounds that account for drift and corruption effects.
- Demonstrates that temporal localization of memory can mitigate stale-geometry issues.
- Achieves sublinear cumulative Wasserstein regret without requiring a parametric model.
Summary
This paper addresses the challenge of online distributional prediction in non-stationary environments where data streams may experience drift and adversarial corruption. Traditional approaches often focus on point estimates, but this work emphasizes the need to predict the full data-generating distribution. The authors propose a novel framework that utilizes a latent cluster geometry to represent candidate distributions through variable-size configurations of cluster centers. This approach allows for a structured representation of probability mass, enabling the derivation of a Gibbs quasi-posterior for online prediction via posterior averaging. The methodology incorporates a reversible-jump MCMC sampling technique to handle the variable-dimensional posterior without requiring a parametric model for the streaming law. The paper evaluates the performance of the proposed method using cumulative Wasserstein-1 regret, separating the effects of corruption and drift on prediction accuracy. A restarted variant of the algorithm is introduced to address the challenges posed by drift, allowing for temporal localization of memory and improved prediction performance. The authors derive high-probability bounds that highlight the impact of drift, corruption, and data dimensionality on prediction accuracy, ultimately demonstrating that the restarted predictor achieves sublinear cumulative Wasserstein regret under certain conditions.
Methodology
The authors develop a latent clustering framework where candidate distributions are represented by configurations of cluster centers. A Gibbs quasi-posterior is used for online prediction, allowing for posterior averaging. The methodology employs reversible-jump MCMC for sampling variable-dimensional posteriors and incorporates a restarted version of the predictor to manage drift in data streams.
Results
The proposed method achieves sublinear cumulative Wasserstein regret under bounded support, stable latent geometry, and controlled drift and corruption. The analysis reveals that the regret bounds decompose into terms related to PAC-Bayesian complexity, corruption sensitivity, and dynamic optimal transport, providing a comprehensive understanding of the factors influencing prediction performance.
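For intuition about the evaluation metric, here is a minimal sketch of cumulative Wasserstein-1 regret in one dimension. It assumes equal-size empirical samples, where W1 reduces to the mean gap between sorted values, and it assumes the comparator is a single fixed sample. The paper's setting and comparator may differ.

```python
def wasserstein1_1d(xs, ys):
    """W1 between two equal-size 1D empirical samples: mean gap of sorted values."""
    if len(xs) != len(ys):
        raise ValueError("this shortcut needs equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def cumulative_regret(predicted, observed, comparator):
    """Sum over rounds of (predictor W1 loss - comparator W1 loss)."""
    total = 0.0
    for pred, obs in zip(predicted, observed):
        total += wasserstein1_1d(pred, obs) - wasserstein1_1d(comparator, obs)
    return total


# Hypothetical stream of two rounds, two points per round.
predicted = [[0.0, 1.0], [1.0, 2.0]]
observed = [[1.0, 2.0], [1.0, 2.0]]
regret = cumulative_regret(predicted, observed, comparator=[1.0, 2.0])
print(regret)
```

Sublinear regret means this running total grows more slowly than the number of rounds, so the average per-round gap to the comparator shrinks.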
Implications
This work has significant implications for applications requiring real-time statistical estimation in dynamic environments, such as anomaly detection, risk control, and online recommendation systems. The ability to predict full distributions rather than point estimates enhances decision-making processes in uncertain and evolving contexts.
P-K-GCN: Physics-augmented Koopman-enhanced Graph Convolutional Network for Deep Spatiotemporal Super-resolution
Graph Learning
Time Series
Theory
- Introduction of P-K-GCN for spatiotemporal super-resolution.
- Incorporation of physics-based constraints to enhance model fidelity.
- Utilization of Koopman operator theory for linearizing nonlinear dynamics.
- Theoretical guarantees on error reduction through Rademacher complexity.
Summary
The paper introduces the Physics-augmented Koopman-enhanced Graph Convolutional Network (P-K-GCN), a novel framework designed to address the challenges of reconstructing high-resolution (HR) spatiotemporal dynamics from low-resolution (LR) data, particularly in complex geometries. Traditional data-driven methods often fail to incorporate physical constraints, leading to inaccuracies in dynamic systems characterized by nonlinear behaviors. The proposed P-K-GCN utilizes a continuous spline-based graph convolutional network to effectively capture spatial dependencies from coarse graphs while employing Koopman operator theory to linearize the temporal dynamics in a compact latent space. Additionally, the framework integrates a physics-based loss function to ensure that the reconstructions adhere to physical laws, thereby enhancing predictive fidelity and robustness. The authors provide a theoretical analysis demonstrating that the combination of physics augmentation and Koopman regularization reduces super-resolution error by minimizing Rademacher complexity and tightening generalization bounds. The effectiveness of the P-K-GCN is validated through numerical experiments on cardiac electrodynamics, showcasing superior accuracy compared to baseline models.
Methodology
The P-K-GCN framework employs a continuous spline-based graph convolutional network to extract spatial dependencies from coarse graphs. It integrates Koopman operator theory to project nonlinear dynamics into a compact latent space, where temporal progression is linearized. A physics-based loss function is added to the optimization objective to ensure adherence to physical laws during reconstruction.
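One way to write down the three ingredients described here is given below. The notation is an assumption, not the paper's: $\phi$ is the spline-based graph encoder, $\psi$ a decoder, $K$ the latent Koopman matrix, $\mathcal{N}$ the residual of the governing physical equations, and $\lambda_K$, $\lambda_P$ are weighting hyperparameters.

```latex
\begin{aligned}
z_t &= \phi\big(x_t^{\mathrm{LR}}\big), \qquad
z_{t+1} \approx K z_t, \qquad
\hat{x}_t^{\mathrm{HR}} = \psi(z_t), \\
\mathcal{L} &=
\underbrace{\sum_t \big\lVert \hat{x}_t^{\mathrm{HR}} - x_t^{\mathrm{HR}} \big\rVert^2}_{\text{reconstruction}}
+ \lambda_K \underbrace{\sum_t \big\lVert z_{t+1} - K z_t \big\rVert^2}_{\text{Koopman linearity}}
+ \lambda_P \underbrace{\sum_t \big\lVert \mathcal{N}\big[\hat{x}_t^{\mathrm{HR}}\big] \big\rVert^2}_{\text{physics residual}}
\end{aligned}
```

The second term forces the latent trajectory to evolve linearly, and the third penalizes reconstructions that violate the physical model.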
Results
Numerical experiments indicate that the P-K-GCN achieves significantly higher accuracy in reconstructing high-resolution cardiac electrodynamics from sparse low-resolution measurements compared to existing baseline models.
Implications
The proposed framework has potential applications in various fields requiring high-fidelity simulations of spatiotemporal dynamics, such as cardiac electrophysiology, environmental monitoring, and other scientific domains where accurate modeling of complex systems is critical.
Correcting Sensor-Induced Distribution Drift with Wasserstein Adversarial Learning
Generative Models
Interpretability
- Introduces a Wasserstein GAN-inspired method for unsupervised calibration of sensor systems.
- Demonstrates the ability to recover interpretable degradation parameters from distribution shifts.
- Validates the approach on both a toy model and real-world high-energy physics data.
- Shows improved calibration accuracy and correlation with ground truth aging coefficients.
Summary
This paper addresses the challenge of sensor-induced distribution drift in data acquisition systems, particularly in high-energy physics experiments where sensor performance can degrade due to aging and environmental factors. The authors propose a novel calibration approach inspired by Wasserstein Generative Adversarial Networks (Wasserstein GANs) to infer transformation parameters that can map altered detector response distributions back to a nominal reference distribution. The method employs a generator as a learnable calibration transformation, with its weights representing the sought parameters, while a critic provides a distributional distance signal through the Wasserstein objective. The approach is validated using a tracking-detector toy model and high-granularity Geant4-simulated calorimeter data, demonstrating its effectiveness in recovering aging coefficients and improving the agreement between calibrated and reference energy-sum distributions. The results indicate that this adversarial distribution matching technique can enhance calibration strategies in scenarios lacking direct labels for degradation parameters.
Methodology
The authors developed a Wasserstein-inspired adversarial learning framework that treats the calibration process as a distribution-matching problem. The generator learns to transform the altered sensor data back to a reference distribution, while the critic measures the distance between distributions using the Wasserstein objective. This unsupervised approach allows for the inference of physically meaningful parameters without requiring labeled data.
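The distribution-matching idea can be sketched without neural networks. The example below swaps the learned critic for the closed-form one-dimensional Wasserstein-1 distance and swaps gradient training for a grid search over a single multiplicative gain. The data, the `aging` value, and the function names are all hypothetical.

```python
def wasserstein1_1d(xs, ys):
    """W1 between two equal-size 1D samples: mean gap of sorted values."""
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def fit_gain(degraded, reference, candidates):
    """Pick the gain whose calibrated sample is closest to the reference in W1."""
    return min(
        candidates,
        key=lambda g: wasserstein1_1d([g * x for x in degraded], reference),
    )


reference = [1.0, 2.0, 3.0, 4.0, 5.0]  # nominal detector response (toy values)
aging = 0.8                            # hidden degradation factor to recover
# Reversed order: the two samples are unpaired, only their distributions match.
degraded = [aging * x for x in reversed(reference)]
candidates = [1.0 + 0.01 * k for k in range(101)]  # gains from 1.00 to 2.00
gain = fit_gain(degraded, reference, candidates)
print(gain, 1.0 / gain)  # 1 / gain estimates the aging coefficient
```

No labels for the degradation parameter are used: the gain is identified purely from the mismatch between the two distributions, which is the property the summary highlights.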
Results
The proposed method successfully recovered aging coefficients for individual cells in the calorimeter data, showing a strong correlation with ground truth values. It also improved the alignment between calibrated and reference energy-sum distributions, although performance degraded with increased channel-to-channel noise levels. These findings suggest that the method is effective for addressing sensor degradation in high-energy physics applications.
Implications
The findings of this research have significant implications for the calibration of sensor systems in experimental physics, where direct measurement of degradation parameters is often not feasible. The proposed method could be adapted for real-time calibration, enhancing the reliability of data-driven analyses in environments with changing sensor conditions.
QueryMarket: Cost-Aware Online Active Learning in Data Markets
Efficient ML
Optimization
Time Series
- Introduces QueryMarket, a novel framework for online active learning that incorporates cost and budget constraints.
- Develops OVBAL, an active learning strategy that estimates label utility and adapts to nonstationary environments.
- Demonstrates the effectiveness of OVBAL in both synthetic and real-world scenarios, particularly in managing costs.
- Addresses the limitations of existing online active learning methods by integrating economic considerations.
Summary
The paper addresses the challenge of data acquisition in real-time machine learning environments, where analysts must make immediate decisions on which labels to purchase while adhering to a rolling budget. The authors introduce QueryMarket, a framework that integrates pricing, information gain, and budget constraints in the context of online active learning. They propose OVBAL (online variance-based active learning), which estimates the marginal utility of each data point using a D-optimality criterion and incorporates exponential forgetting to adapt to nonstationary data streams. This approach allows for cost-aware label purchases while managing a cumulative budget. The experiments conducted on both synthetic datasets and a real-world solar power generation forecasting task demonstrate that OVBAL is particularly effective under seller-centric pricing, achieving a favorable long-run error-cost trade-off compared to existing methods.
Methodology
The authors developed QueryMarket, which allows data sellers to provide a willingness-to-sell (WTS) for labels, while analysts form a willingness-to-pay (WTP) based on estimated utility. OVBAL employs a D-optimality criterion for utility estimation and uses a Newton-Raphson scheme with exponential forgetting to perform sequential updates, enabling real-time adaptation to changing data distributions.
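A minimal sketch of the purchase rule follows, assuming a linear model. It replaces the paper's Newton-Raphson scheme with a Sherman-Morrison update of the inverse information matrix, and the conversion from utility to willingness-to-pay (`value_per_nat`) is a hypothetical choice.

```python
import math


def matvec(P, x):
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]


def d_optimal_gain(P, x):
    """Log-det information gain of adding x; P is the inverse information matrix."""
    Px = matvec(P, x)
    return math.log(1.0 + sum(a * b for a, b in zip(x, Px)))


def update_inverse(P, x, forgetting=0.98):
    """Sherman-Morrison update of P after A <- forgetting * A + x x^T."""
    n = len(x)
    Pf = [[v / forgetting for v in row] for row in P]  # inverse of forgetting * A
    Px = matvec(Pf, x)
    denom = 1.0 + sum(a * b for a, b in zip(x, Px))
    return [[Pf[i][j] - Px[i] * Px[j] / denom for j in range(n)] for i in range(n)]


def should_buy(P, x, ask_price, value_per_nat):
    """Buy when willingness-to-pay covers the seller's willingness-to-sell."""
    return d_optimal_gain(P, x) * value_per_nat >= ask_price


P0 = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 0.0]
first_offer = should_buy(P0, x, ask_price=0.5, value_per_nat=1.0)
P1 = update_inverse(P0, x)  # the label was bought and the model updated
repeat_offer = should_buy(P1, x, ask_price=0.5, value_per_nat=1.0)
print(first_offer, repeat_offer)
```

A second label in the same direction is worth less than the first, while forgetting slowly restores the value of directions that have not been observed recently.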
Results
Experiments reveal that OVBAL effectively balances the trade-off between label acquisition costs and model accuracy, particularly under seller-centric pricing. In the real-world solar power forecasting task, OVBAL achieved a more favorable long-run error-cost trade-off compared to existing online active learning methods.
Implications
The findings suggest that QueryMarket and OVBAL can significantly enhance the efficiency of data acquisition in real-time machine learning applications, particularly in domains where labeling costs are a critical factor. This approach can be applied to various fields, including finance, healthcare, and environmental monitoring, where timely and cost-effective data labeling is essential.
Explaining Attention with Program Synthesis
NLP
Large Language Models
Interpretability
- Introduces program synthesis as a method for interpreting attention mechanisms in transformer models.
- Demonstrates that a substantial fraction of attention heads can be approximated by executable programs.
- Shows that replacing attention heads with synthesized programs incurs minimal performance loss.
- Highlights the potential for causal validation and model editing using symbolic representations.
Summary
This paper addresses the challenge of interpreting deep neural networks, particularly focusing on transformer models and their attention mechanisms. The authors propose a novel approach that utilizes program synthesis to generate executable code that approximates the computations performed by attention heads in language models. By extracting attention maps from the models, they prompt a language model to synthesize candidate Python programs that can reproduce these attention patterns. The synthesized programs are then evaluated and ranked based on their ability to match the original attention maps using Jensen-Shannon distance. The results indicate that a significant proportion of attention heads can be effectively approximated with these symbolic programs, allowing for causal validation and potential model editing. The findings suggest that even complex language models can be understood in symbolic terms, enabling the replacement of neural components with symbolic surrogates without significantly affecting model performance.
Methodology
The authors extract attention maps from transformer models and use a language model to generate candidate Python programs that approximate these maps. They evaluate the generated programs using Jensen-Shannon distance to determine their accuracy in reproducing the attention patterns. The best candidates are refined and tested for their ability to replace the original attention heads in the models.
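The scoring step can be sketched as follows. The "previous token" program is a hypothetical example of what a synthesized candidate might look like, and averaging the distance over query positions is an assumption about how rows are aggregated.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))


def previous_token_program(seq_len):
    """Hypothetical synthesized program: each position attends to its predecessor."""
    rows = []
    for i in range(seq_len):
        row = [0.0] * seq_len
        row[max(i - 1, 0)] = 1.0  # the first position attends to itself
        rows.append(row)
    return rows


def mean_js(attention, program_output):
    """Average row-wise distance between a head's map and a program's map."""
    pairs = zip(attention, program_output)
    return sum(js_distance(a, b) for a, b in pairs) / len(attention)


head_map = previous_token_program(4)  # toy head that matches the program exactly
score = mean_js(head_map, previous_token_program(4))
print(score)
```

With base-2 logarithms the distance lies between 0 (identical rows) and 1 (disjoint support), which makes candidate programs straightforward to rank.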
Results
The study found that approximately 25% of attention heads in various transformer models could be replaced with synthesized programs, resulting in only a 16% increase in perplexity and no significant impact on downstream question answering performance across multiple benchmarks.
Implications
This work opens new avenues for understanding and interpreting deep learning models, particularly in NLP. The ability to replace neural components with symbolic programs could lead to more interpretable AI systems and facilitate model modifications based on symbolic reasoning.
The Illusion of Improvement: Reject Inference Strategies in Credit Scoring
Theory
Interpretability
- Identification of a structural failure mode in credit scoring models where accuracy improvement masks recall deterioration.
- Proposal of a controlled exploration strategy to mitigate survival bias without statistical assumptions.
- Demonstration that standard evaluation metrics can mislead practitioners regarding model performance.
- Minimal exploration rates (2-5%) can effectively assess the feedback loop's severity at low cost.
Summary
This paper investigates the effectiveness of reject inference methods used in credit scoring to address survival bias. The authors identify a critical failure mode where models may show improved accuracy while their recall deteriorates, creating a misleading perception of improvement. They propose a controlled exploration strategy that allows lenders to approve a small fraction of rejected applicants to observe their true outcomes, thereby breaking the feedback loop caused by survival bias. The study demonstrates that traditional evaluation metrics can be misleading under selection bias, as accuracy and rejection quality often provide conflicting recommendations regarding the need for exploration. The authors find that even minimal exploration rates (2-5%) can effectively diagnose the severity of the feedback loop at minimal cost. Their findings are consistent across multiple machine learning methods and real-world datasets, highlighting the inadequacy of standard evaluation protocols for models trained under survival bias.
Methodology
The authors simulate a bank lending scenario using real data to measure the impact of survival bias on model performance. They compare the performance of credit scoring models against an Oracle model that has access to complete data. The controlled exploration strategy involves deliberately approving a fraction of rejected applicants to observe their outcomes, allowing for an assessment of the feedback loop's severity.
Results
The study reveals that models can exhibit improved accuracy while their ability to reject defaulters (recall) declines, creating an illusion of improvement. The controlled exploration method shows that even small exploration rates can effectively diagnose the feedback loop's severity, confirming that traditional evaluation metrics are inadequate under survival bias.
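The failure mode is simple arithmetic. The confusion counts below are hypothetical and are not taken from the paper. They only illustrate how accuracy can rise while recall on defaulters falls.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all applicants classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Share of true defaulters that the model flags (rejects)."""
    return tp / (tp + fn)


# Hypothetical counts for 1,000 applicants, 100 of whom default.
# Positives are defaulters. These numbers are illustrative only.
before = dict(tp=60, fp=140, tn=760, fn=40)
after = dict(tp=45, fp=80, tn=820, fn=55)

for name, c in (("before", before), ("after", after)):
    print(name, accuracy(**c), recall(c["tp"], c["fn"]))
```

In this toy case accuracy climbs from 0.82 to 0.865 while recall drops from 0.60 to 0.45, because the gain comes entirely from the majority class of good borrowers.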
Implications
The findings suggest that credit scoring models and other predictive tools should incorporate controlled exploration strategies to better understand and mitigate the effects of survival bias. This approach can lead to more accurate assessments of model performance and potentially reduce discriminatory impacts in high-stakes decision-making contexts.
Functional Equivalence in Attention: A Comprehensive Study with Applications to Linear Mode Connectivity
Theory
- Functional equivalence in attention mechanisms is more complex than in traditional architectures.
- Sinusoidal positional encodings preserve the symmetry of vanilla attention, while RoPE enhances expressivity by reducing symmetry.
- Positional encodings significantly affect linear mode connectivity in Transformers.
- An alignment algorithm demonstrates the dependency of connectivity on the choice of positional encoding.
Summary
This paper investigates the concept of functional equivalence in attention-based neural networks, particularly Transformers, and its implications for linear mode connectivity (LMC). The authors highlight that while functional equivalence is well understood in traditional neural architectures, it becomes complex in modern attention mechanisms due to the introduction of positional encodings. The study focuses on two prevalent types of positional encodings: sinusoidal and rotary positional encodings (RoPE). The authors demonstrate that sinusoidal encodings maintain the symmetry structure of vanilla attention, while RoPE reduces this symmetry group, enhancing the model's expressivity. This finding provides a theoretical basis for the increasing adoption of RoPE in practical applications. Furthermore, the paper explores how different positional encodings influence linear mode connectivity, revealing that the connectivity between models trained independently is significantly affected by the choice of positional encoding. An alignment algorithm is employed to empirically validate these claims, showing that the variability of connectivity across Transformer configurations is closely tied to the positional encoding used.
Methodology
The authors conducted a formal analysis of functional equivalence in Transformers, focusing on sinusoidal and rotary positional encodings. They employed an alignment algorithm to empirically investigate the effects of positional encodings on linear mode connectivity, analyzing how different configurations impact the connectivity between independently trained models.
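A small numerical check illustrates the kind of symmetry at stake. For vanilla attention without positional terms, multiplying the query weights by any invertible matrix M and the key weights by the inverse transpose of M leaves the attention logits unchanged. The matrices below are arbitrary toy values, and this is a sketch of the well-known query-key symmetry, not the paper's full characterization.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def attention_logits(X, Wq, Wk):
    """Pre-softmax scores (X Wq)(X Wk)^T for attention without positional terms."""
    return matmul(matmul(X, Wq), transpose(matmul(X, Wk)))


X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]  # three tokens, two features
Wq = [[1.0, 0.0], [2.0, 1.0]]
Wk = [[0.5, 1.0], [-1.0, 2.0]]
M = [[2.0, 1.0], [1.0, 1.0]]               # any invertible matrix

Wq_new = matmul(Wq, M)
Wk_new = matmul(Wk, transpose(inverse_2x2(M)))

original = attention_logits(X, Wq, Wk)
transformed = attention_logits(X, Wq_new, Wk_new)
max_gap = max(abs(a - b) for ra, rb in zip(original, transformed) for a, b in zip(ra, rb))
print(max_gap)
```

Two parameter settings related in this way compute the same function, which is why weights must be aligned before testing linear mode connectivity. According to the summary, RoPE shrinks the set of transformations for which this holds.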
Results
The study found that sinusoidal positional encodings maintain the equivalence structure of attention mechanisms, while RoPE significantly reduces this structure, leading to greater expressivity. The empirical results indicated that the presence and variability of linear mode connectivity across different Transformer settings are critically influenced by the type of positional encoding used.
Implications
The findings suggest that the choice of positional encoding can have profound effects on the performance and generalization capabilities of Transformer models. This has practical implications for model design in natural language processing and other applications utilizing attention mechanisms.
ResAware: Cross-Environment Website Fingerprinting via Resource-Privileged Distillation
Theory
- ResAware improves the robustness of Website Fingerprinting models in real-world environments.
- The framework utilizes a training-rich/inference-poor approach to leverage resource-level features.
- Significant performance improvements were observed under temporal, spatial, and browser variations.
- The method strengthens existing WF models without requiring any additional attacker capability at inference time.
Summary
The paper addresses the limitations of Website Fingerprinting (WF) attacks, which perform well in controlled environments but struggle in real-world scenarios due to various factors like spatio-temporal drift and browser heterogeneity. The authors propose ResAware, a novel framework that utilizes resource-level features to enhance the robustness of WF models across different environments. ResAware operates under a training-rich/inference-poor paradigm, where a teacher model is trained on resource-level data, and its knowledge is distilled into a student model that only uses encrypted traffic during inference. This approach allows for improved performance without increasing the attack surface. The framework was evaluated on a large dataset collected over five months from multiple locations, demonstrating significant improvements in the robustness of WF models against various distribution shifts. The results indicate that ResAware can effectively enhance the cross-environment generalization of WF baselines, achieving better accuracy without additional inference costs.
Methodology
ResAware employs a two-step process involving a teacher model trained on resource-level features and a student model that distills this knowledge for use with encrypted traffic. The training phase utilizes paired traffic-resource samples, while the inference phase relies solely on encrypted traffic, maintaining a standard passive eavesdropper model.
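The summary does not state the exact distillation objective. As a minimal sketch, the standard soft-label formulation is assumed here: the student, which sees only encrypted traffic, is trained to match the temperature-softened class probabilities of the teacher, which saw resource-level features. The logits and the temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class probabilities."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


teacher = [3.0, 1.0, 0.2]        # toy teacher logits over three websites
good_student = [3.0, 1.0, 0.2]
poor_student = [0.2, 1.0, 3.0]
print(distillation_loss(teacher, good_student), distillation_loss(teacher, poor_student))
```

Because the teacher is needed only during training, the deployed student requires nothing beyond the encrypted trace, consistent with the passive eavesdropper model described above.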
Results
ResAware raises the F1-score of the Var-CNN model from 72.77% to 81.49% and improves the open-world true positive rate at a 1% false positive rate from 22.40% to 27.20% under a 150-day temporal drift. These results highlight the effectiveness of resource-level supervision in improving WF robustness.
Implications
The findings suggest that incorporating resource-level information can lead to more reliable website fingerprinting techniques, which could have implications for enhancing privacy and security in encrypted web communications. Additionally, the framework could be adapted for other applications requiring robust performance across varying environments.
Counterfactual Optimization of Baseball Pitch Sequences and Estimation of Its Impact on Season-Level Statistics
Optimization
- Optimization of both final and setup pitches can significantly influence season-level performance metrics.
- A Transformer-based model was developed to predict pitch outcomes based on contextual information.
- Counterfactual analyses revealed that altering pitch sequences can lead to substantial improvements in K/9 statistics.
- Insights on effective pitch locations and the role of pitch command were identified, enhancing strategic decision-making.
Summary
This study investigates the optimization of baseball pitch sequences and their impact on season-level performance metrics. Previous research has primarily focused on optimizing the final pitch of a plate appearance, neglecting the influence of preceding setup pitches. To address this gap, the authors employed counterfactual analyses using MLB Statcast data and developed a Transformer-based machine learning model to predict the likelihood of a pitch resulting in an in-play outcome or a swing-out. By generating counterfactual pitch sequences that replace either the final pitch or the preceding setup pitch, the study aimed to minimize the predicted in-play probability. The results indicate that optimizing both final and setup pitches can significantly enhance season-level statistics, with improvements exceeding 1.0 in K/9 (strikeouts per nine innings). Additionally, the analysis revealed practical insights regarding effective pitch locations, the importance of pitch command, and the potential benefits of incorporating middle-velocity pitches into pitch selection strategies. Overall, the findings underscore the strategic significance of pitch sequencing in baseball analytics and provide quantitative support for optimizing pitch strategies to improve performance.
Methodology
The authors utilized MLB Statcast data to train a Transformer-based machine learning model for predicting pitch outcomes. Counterfactual analyses were conducted by generating alternative pitch sequences, and regression models were employed to estimate the expected effects on pitchers' seasonal statistics.
Results
The study found that optimizing pitch sequences, including both final and setup pitches, could lead to significant improvements in season-level statistics, particularly an increase of over 1.0 in K/9. The analysis also provided insights into effective pitch locations and the importance of pitch command.
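For scale, K/9 is nine times strikeouts divided by innings pitched, so a gain of 1.0 translates directly into extra strikeouts over a season. The snippet below is a small worked example; the 180-inning workload is a hypothetical figure, not one taken from the paper.

```python
def k_per_nine(strikeouts, innings_pitched):
    """Strikeouts per nine innings."""
    return 9.0 * strikeouts / innings_pitched

def extra_strikeouts_for_gain(k9_gain, innings_pitched):
    """Additional strikeouts needed to raise K/9 by k9_gain over a workload."""
    return k9_gain * innings_pitched / 9.0

# Hypothetical pitcher: 180 strikeouts over 180 innings.
baseline = k_per_nine(180, 180)                # 9.0
needed = extra_strikeouts_for_gain(1.0, 180)   # 20.0 extra strikeouts
```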
Implications
The findings suggest that teams can enhance pitching strategies by focusing on pitch sequencing, potentially leading to improved performance metrics. This research provides a quantitative framework for evaluating and optimizing pitch strategies in baseball analytics.
Complementary Attention Head Pruning for Efficient Transformers
NLP
Efficient ML
Graph Learning
- CAHP introduces a global graph-based approach to attention head selection for Transformers.
- The framework eliminates the need for predefined pruning ratios by automatically determining optimal head retention based on performance metrics.
- CAHP outperforms existing methods, particularly in high-compression settings, while maintaining model accuracy.
- The method avoids the 'proximity bias' seen in gradient-based pruning, ensuring diverse functional head retention across layers.
Summary
This paper presents Complementary Attention Head Pruning (CAHP), a novel framework aimed at enhancing the efficiency of Transformer models by optimizing the selection of attention heads. The authors identify that existing structured pruning methods often rely on gradient-based importance ranking, which can lead to instability and require extensive manual tuning. CAHP approaches head selection as a global graph-theoretical problem, utilizing graph-based clustering and information-theoretic distance measures to preserve a diverse set of complementary attention heads. The framework automatically determines the optimal number of heads to retain based on a diminishing marginal performance curve, eliminating the need for predefined sparsity levels. Evaluations on the SST-5 and MNLI benchmarks demonstrate that CAHP consistently outperforms competitive baselines, particularly in high-compression scenarios. The structural analysis reveals that CAHP effectively avoids the 'proximity bias' of gradient-based methods, ensuring critical heads in intermediate layers are retained. Overall, CAHP provides a principled and automated solution for model compression, making it suitable for deployment in resource-constrained environments.
Methodology
CAHP employs a graph-theoretical framework for attention head selection, utilizing graph-based clustering and information-theoretic measures to identify complementary heads. It automatically detects the optimal number of heads to retain based on a diminishing marginal performance curve, allowing for cross-layer redistribution without manual tuning.
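One simple way to read "diminishing marginal performance curve" is to stop adding heads once the next head improves the validation score by less than some tolerance. The sketch below implements that stopping rule only; the `min_gain` threshold and the example scores are illustrative assumptions, and CAHP's actual criterion may differ.

```python
def select_head_count(scores, min_gain=0.005):
    """Pick how many heads to keep from a performance curve.

    scores[i] is the validation score with i + 1 heads retained, in selection
    order. Returns the smallest count after which the next head adds less
    than min_gain.
    """
    for k in range(1, len(scores)):
        if scores[k] - scores[k - 1] < min_gain:
            return k
    return len(scores)

# Hypothetical curve: gains shrink from 0.10 to 0.002.
curve = [0.60, 0.70, 0.76, 0.79, 0.792, 0.793]
heads_to_keep = select_head_count(curve)
```

With this hypothetical curve, four heads are kept because the fifth adds only about 0.002.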
Results
Extensive evaluations on the SST-5 and MNLI benchmarks show that CAHP consistently outperforms competitive baselines, especially in high-compression scenarios. The framework effectively retains functionally critical heads in intermediate layers, leading to superior performance compared to traditional gradient-based pruning methods.
Implications
The findings suggest that CAHP can significantly enhance the deployment of Transformer models in resource-constrained environments by reducing their size while maintaining performance. This approach could facilitate broader adoption of Transformer architectures in applications where computational resources are limited.
Discrete Autoregressive Transformer for Generative Mechanism Synthesis
Generative Models
Robotics
Optimization
- Introduces a generative approach to mechanism synthesis using a discrete autoregressive transformer.
- Addresses the limitations of traditional optimization methods by generating multiple mechanisms for a given coupler curve.
- Utilizes a large dataset of over one million mechanisms to train the model effectively.
- Achieves competitive performance metrics compared to existing methods, demonstrating the efficacy of the proposed approach.
Summary
This paper addresses the challenge of mechanism synthesis for planar path generation, where the goal is to design mechanisms that trace prescribed coupler curves. Traditional optimization methods often yield a single mechanism and struggle with the one-to-many nature of the problem. The authors propose a conditional autoregressive sequence modeling framework that combines a decoder-only transformer with a variational autoencoder (VAE) to generate diverse mechanisms. The synthesis process quantizes joint coordinates into tokens and trains the model with a combination of token cross-entropy and a Gaussian-smoothed auxiliary loss. During inference, a bounded latent-noise schedule allows the generation of multiple mechanism types, and the top candidates are retained based on geometric error. The authors report improvements over traditional methods, with a mean Chamfer distance of 0.0132 and a mean dynamic time warping of 0.153; a latent k-nearest-neighbor baseline nevertheless achieves lower error on both metrics. This work highlights the potential of generative models in mechanical design, offering a more efficient and diverse synthesis process.
Methodology
The authors formulated the synthesis problem as conditional autoregressive sequence modeling. They quantized joint coordinates into tokens and employed a decoder-only transformer architecture, integrating a variational autoencoder (VAE) latent representation of the target curve and an explicit mechanism-type token. The training process combined token cross-entropy loss with a Gaussian-smoothed auxiliary loss to respect ordinal structures among bins. During inference, a latent-noise schedule was used to decode various mechanism types, allowing for the selection of the best candidates based on geometric accuracy.
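The tokenization and the Gaussian-smoothed target can be sketched as follows. The coordinate range, the bin count, and `sigma` are assumed values chosen for illustration; the summary does not state the paper's settings.

```python
import math

def quantize(x, low=-1.0, high=1.0, num_bins=64):
    """Map a coordinate in [low, high] to an integer bin index (token)."""
    x = min(max(x, low), high)
    idx = int((x - low) / (high - low) * num_bins)
    return min(idx, num_bins - 1)

def gaussian_soft_target(true_bin, num_bins=64, sigma=1.5):
    """Target distribution that gives neighbouring bins partial credit."""
    weights = [math.exp(-0.5 * ((b - true_bin) / sigma) ** 2)
               for b in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]

def soft_cross_entropy(log_probs, target):
    """Cross-entropy between a soft target and predicted log-probabilities."""
    return -sum(t * lp for t, lp in zip(target, log_probs))
```

Unlike a one-hot target, the smoothed target penalizes a prediction one bin away less than a prediction far away, which is how the loss respects the ordering of the bins.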
Results
The proposed method achieved an aggregate mean Chamfer distance of 0.0132 and a mean dynamic time warping of 0.153 on held-out tests. A baseline using latent k-nearest neighbors in VAE space achieved lower error on both metrics, with a mean Chamfer distance of 0.0071 and a mean dynamic time warping of 0.117. On these metrics the generative approach is therefore competitive with retrieval but does not surpass it, while producing diverse mechanisms.
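For readers unfamiliar with the two metrics, here is a minimal sketch of each on 2-D point sequences. Definitions vary (squared versus unsquared distances, sums versus means, length normalization), so this is one common variant and not necessarily the one used in the paper.

```python
import math

def chamfer_distance(curve_a, curve_b):
    """Symmetric mean nearest-neighbour distance between two point sets."""
    def mean_nearest(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return mean_nearest(curve_a, curve_b) + mean_nearest(curve_b, curve_a)

def dtw_distance(curve_a, curve_b):
    """Unnormalized dynamic time warping cost between two ordered curves."""
    inf = float("inf")
    n, m = len(curve_a), len(curve_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(curve_a[i - 1], curve_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Chamfer distance ignores point order, whereas dynamic time warping respects it, which is why the two are reported together for traced curves.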
Implications
This research has significant implications for the field of mechanical engineering, particularly in the design of robotic systems and other applications requiring complex motion generation. By leveraging machine learning for generative design, engineers can explore a wider array of feasible solutions, accommodating various constraints and enhancing the efficiency of the design process.
Dual-Channel Grounded World Modeling (DCGWM): Structural Prevention of Objective Interference Collapse via Heterogeneous External Grounding with Inward-Only Gradient Flow
Theory
Robotics
Multimodal
- Formalizes Objective Interference Collapse (OIC) as a failure mode in joint latent world modeling.
- Proposes DCGWM architecture with partitioned latent space to prevent OIC.
- Introduces Asymmetric Grounding Adherence Loss for managing rollout drift.
- Establishes theoretical results supporting the architecture's structural properties.
Summary
This paper introduces Dual-Channel Grounded World Modeling (DCGWM), a novel architecture aimed at addressing a critical failure mode in Joint Embedding Predictive Architectures (JEPAs) known as Objective Interference Collapse (OIC). OIC occurs when a world model attempts to ground its latent space using two distinct external signals: physical dynamics and social-behavioral dynamics, which have incompatible gradient structures. The proposed DCGWM architecture prevents OIC by partitioning the latent space into two separate channels—one for physical grounding and another for social-behavioral grounding—allowing for inward-only gradient flow. This design ensures that updates to one channel do not interfere with the other, thereby preserving the integrity of both representations. Additionally, the paper introduces the Asymmetric Grounding Adherence Loss, which addresses rollout drift by applying different penalties for physical and behavioral discrepancies. The manuscript also presents theoretical results that support the structural properties of the DCGWM architecture, including the removal of gradient interference pathways and the necessity of generative isolation. While experimental validation is still ongoing, the paper lays a foundational framework for future research in multi-objective latent modeling.
Methodology
The DCGWM architecture employs a partitioned latent space consisting of two channels: one for physical grounding and another for social-behavioral grounding. Each channel updates independently using inward-only gradient flow, preventing interference. The Asymmetric Grounding Adherence Loss is applied to control drift in predictions, with distinct penalties for physical and behavioral violations.
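The idea of a partitioned latent with different penalties per channel can be sketched as below. The summary does not give the loss's functional form, so the squared-error terms, the weights `w_phys` and `w_social`, and the slice layout are all assumptions made purely for illustration.

```python
def squared_error(pred, ref):
    """Sum of squared differences between two equal-length vectors."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref))

def asymmetric_adherence_loss(z_pred, z_ref, phys_dim,
                              w_phys=10.0, w_social=1.0):
    """Penalize drift per channel with distinct weights.

    Assumption: the first phys_dim entries form the physical channel and the
    rest form the social-behavioral channel. Each term reads only its own
    slice, so a change in one channel never alters the other channel's term.
    """
    phys = squared_error(z_pred[:phys_dim], z_ref[:phys_dim])
    social = squared_error(z_pred[phys_dim:], z_ref[phys_dim:])
    return w_phys * phys + w_social * social
```

Under these assumed weights, a unit of drift in the physical channel costs ten times as much as the same drift in the behavioral channel.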
Results
The paper presents theoretical results demonstrating that the partitioned architecture effectively removes the gradient interference pathway associated with OIC. It also shows that each grounded subspace maintains anti-collapse guarantees due to its alignment objectives, and that generative isolation is crucial under certain assumptions regarding the generative objective's geometry.
Implications
The findings suggest that DCGWM could significantly enhance the performance of world models in environments requiring simultaneous representation of physical and social dynamics. This has potential applications in robotics, autonomous systems, and multi-agent simulations where accurate modeling of complex interactions is critical.
Compute Efficiency and Serial Runtime Tradeoffs for Stochastic Momentum Methods
Optimization
Theory
Efficient ML
- Stochastic momentum methods like HB and ASGD show distinct tradeoffs between compute efficiency and serial runtime.
- HB maintains SGD-level compute efficiency over a larger batch-size window, allowing for reduced serial runtime.
- ASGD outperforms HB in small-batch scenarios but trades off compute efficiency for improved serial runtime at larger batch sizes.
- The paper provides a theoretical framework for understanding the performance of these methods under varying spectral conditions.
Summary
This paper investigates the tradeoffs between compute efficiency (CE) and serial runtime in stochastic momentum methods, specifically focusing on heavy ball (HB) and accelerated stochastic gradient descent (ASGD) for linear regression with Gaussian covariates. The authors establish lower bounds on batch-size tradeoffs, demonstrating that while HB does not enhance the CE frontier compared to stochastic gradient descent (SGD), it allows for larger batch sizes to reduce serial runtime until it reaches a deterministic accelerated scale. The study reveals that for rapidly decaying power-law spectra, ASGD can improve small-batch CE over HB/SGD, but as batch size increases, it sacrifices this advantage for better serial runtime. The findings are supported by synthetic linear regression experiments that validate the theoretical predictions regarding the CE-serial tradeoff across different spectral regimes.
Methodology
The authors analyze the performance of stochastic momentum methods by deriving finite-dimensional, discrete-time lower bounds on batch-size tradeoffs for HB and ASGD. They conduct synthetic linear regression experiments to validate their theoretical predictions regarding compute efficiency and serial runtime across different spectral regimes.
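The setting can be made concrete with a toy minibatch heavy-ball run on linear regression with Gaussian covariates. This is only a sketch of the algorithm being analyzed: the identity covariance, noiseless labels, step size, and momentum value are assumptions, and it does not reproduce the paper's experiments.

```python
import random

def heavy_ball_linear_regression(w_true, steps=1000, batch_size=8,
                                 lr=0.02, momentum=0.9, seed=0):
    """Minibatch SGD with heavy-ball momentum on noiseless linear regression."""
    rng = random.Random(seed)
    d = len(w_true)
    w = [0.0] * d
    velocity = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for _ in range(batch_size):
            x = [rng.gauss(0.0, 1.0) for _ in range(d)]  # Gaussian covariates
            y = sum(wi * xi for wi, xi in zip(w_true, x))
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(d):
                grad[j] += err * x[j] / batch_size
        # Heavy ball: keep a decayed running direction, then step along it.
        velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
        w = [wi + vi for wi, vi in zip(w, velocity)]
    return w

w_hat = heavy_ball_linear_regression([1.0, -2.0, 0.5])
```

In this sketch, total compute scales with `steps * batch_size` while serial runtime scales with `steps` alone, which is the tradeoff the paper studies as batch size grows.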
Results
The study finds that HB does not improve the compute efficiency frontier over SGD but allows for a larger batch-size window to reduce serial runtime. For ASGD, the results indicate that while it improves compute efficiency for small batches, it eventually sacrifices this advantage for better serial runtime as batch sizes increase. The empirical experiments confirm the theoretical predictions regarding the CE-serial tradeoff.
Implications
The findings suggest that practitioners should consider the specific spectral properties of their optimization problems when choosing between stochastic momentum methods. Understanding the tradeoffs can lead to more efficient training strategies in large-scale machine learning applications, particularly in deep learning contexts.
Quantum Annealing Enhanced Reinforcement Learning for Accurate Remaining Useful Lifetime Prediction
Reinforcement Learning
Optimization
Time Series
- Introduction of Quantum Annealing Enhanced Q-learning (QAQL) for RUL prediction.
- Q-value updates are reformulated as QUBO problems solved on a quantum processor.
- Stochastic action selection from quantum annealer enhances exploration in reinforcement learning.
- QAQL outperforms 14 classical and quantum baselines across multiple error metrics.
Summary
This paper presents a novel approach to Remaining Useful Life (RUL) prediction by integrating quantum annealing with reinforcement learning, specifically Q-learning. Traditional methods for RUL estimation often struggle with the nonlinear behaviors of real systems, and while data-driven machine learning models have improved accuracy, they can converge to suboptimal solutions in complex search spaces. The proposed Quantum Annealing Enhanced Q-Learning (QAQL) framework leverages the sampling capabilities of quantum annealing to enhance the exploration aspect of Q-learning. Each Q-value update is formulated as a Quadratic Unconstrained Binary Optimization (QUBO) problem, which is solved using the D-Wave Advantage quantum processor. This method allows for a distribution of near-optimal actions, reducing the risk of premature convergence on nonlinear degradation paths. The framework was validated on two public datasets: the NASA C-MAPSS turbofan engine dataset and a predictive maintenance dataset, demonstrating significant improvements in prediction accuracy over classical and quantum baselines.
Methodology
The study employs a quantum annealing enhanced Q-learning framework where each Q-value update is encoded as a QUBO problem. The D-Wave Advantage quantum processor is utilized to solve these problems, allowing for sampling of near-optimal actions rather than deterministic solutions. This stochastic approach facilitates better exploration of the action space, addressing the convergence issues typically faced in classical reinforcement learning methods.
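To show what encoding action selection as a QUBO can look like, the sketch below builds a one-hot QUBO from a list of Q-values and samples low-energy states. The encoding, the penalty weight, and the Boltzmann sampler are assumptions; the brute-force enumeration is a classical stand-in for the annealer, and no D-Wave API is used.

```python
import itertools
import math
import random

def build_action_qubo(q_values, penalty):
    """QUBO for: minimize -sum(q_a x_a) + penalty * (sum(x_a) - 1)^2."""
    n = len(q_values)
    qubo = {}
    for a in range(n):
        qubo[(a, a)] = -q_values[a] - penalty  # linear terms (x^2 == x)
        for b in range(a + 1, n):
            qubo[(a, b)] = 2.0 * penalty       # discourages picking two actions
    return qubo

def qubo_energy(qubo, x):
    """Energy of binary assignment x, up to a dropped constant."""
    return sum(c * x[i] * x[j] for (i, j), c in qubo.items())

def sample_action(q_values, penalty=10.0, temperature=0.5, rng=None):
    """Return (sampled_state, ground_state) over all binary assignments."""
    rng = rng or random.Random(0)
    qubo = build_action_qubo(q_values, penalty)
    states = list(itertools.product([0, 1], repeat=len(q_values)))
    energies = [qubo_energy(qubo, s) for s in states]
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    sampled = rng.choices(states, weights=weights, k=1)[0]
    return sampled, states[energies.index(e_min)]
```

The ground state selects the highest-valued action, while sampling occasionally returns other low-energy states. This mirrors the exploration benefit the summary attributes to drawing near-optimal actions rather than a single deterministic one.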
Results
The QAQL framework achieved mean squared errors (MSE) of 435.28 ± 12.4, 593.69 ± 50.22, 549.54 ± 14.24, and 880.59 ± 260.68 on the C-MAPSS turbofan engine datasets (FD001, FD002, FD003, FD004 respectively) and an MSE of 126.28 ± 4.1 on the predictive maintenance dataset. These results indicate a significant improvement over classical and quantum baselines evaluated under the same conditions, with statistical significance (p < 0.01) across six error metrics.
Implications
The findings suggest that quantum annealing can be effectively integrated into reinforcement learning frameworks for industrial applications, particularly in predictive maintenance. This approach not only enhances the accuracy of RUL predictions but also demonstrates the practical utility of quantum computing in solving complex optimization problems in real-world scenarios.
Diagnosing and Repairing Shape-Prior Shortcuts in Long-Range Single-Shot Fringe Projection Profilometry
Computer Vision
Robotics
Interpretability
- Identifies limitations of existing single-shot FPP methods in long-range settings.
- Introduces a novel architecture, PhiCalNet, that improves depth reconstruction accuracy.
- Demonstrates the effectiveness of mechanistic interpretability and uncertainty quantification in diagnosing and repairing model errors.
- Achieves a significant reduction in mean absolute error for 3D reconstruction tasks.
Summary
This paper addresses the challenges of long-range single-shot fringe projection profilometry (FPP), a technique for 3D reconstruction that has primarily been studied in close-range settings. The authors identify key issues such as the degradation of fringe signal-to-noise ratio and the ill-posed nature of single-shot reconstruction at distances greater than 1 meter. They propose a unified approach that combines mechanistic interpretability (MI) and conformal uncertainty quantification (UQ) to diagnose and repair architectural shortcomings in existing models. The study systematically evaluates a photorealistic synthetic benchmark consisting of 15,600 fringe images across 50 objects at varying distances. The authors find that existing architectures rely on shape priors rather than effective fringe-phase decoding. To address this, they introduce PhiCalNet, which outputs a wrapped-phase representation instead of depth, effectively removing shape-prior shortcuts. The results show a significant reduction in mean absolute error (MAE) from 14.54 mm to 4.46 mm, with a minimal residual error attributed to structural limits. The application of MI and UQ confirms the effectiveness of the architectural changes, demonstrating that rejecting certain pixels can further reduce error. This work lays the groundwork for more reliable long-range FPP applications.
Methodology
The authors employed a diagnose-repair-verify framework that integrates mechanistic interpretability (MI) and conformal uncertainty quantification (UQ) to identify and address architectural failures in existing FPP models. They conducted systematic ablation studies on a synthetic dataset to evaluate model performance and introduced PhiCalNet to enhance depth reconstruction capabilities.
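A wrapped phase is an angle folded into one 2π period, so any loss on it must treat values near +π and -π as close. The helper below is a generic sketch of that wrapping and of a wrap-aware error; it is not PhiCalNet's output head, whose exact parameterization the summary does not specify.

```python
import math

def wrap_phase(phi):
    """Fold any angle into one 2*pi-wide period around zero."""
    return math.atan2(math.sin(phi), math.cos(phi))

def phase_error(pred, target):
    """Absolute angular difference that respects the wrap-around at +/- pi."""
    return abs(wrap_phase(pred - target))

# Two phases on opposite sides of the wrap are only 0.2 rad apart.
err = phase_error(math.pi - 0.1, -math.pi + 0.1)
```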
Results
The study established a baseline UNet model with a mean absolute error (MAE) of 14.54 mm, which was improved to 4.46 mm using PhiCalNet. The residual error was primarily due to pixel discontinuities. Additionally, applying MI and UQ led to a 64% reduction in root-mean-square error (RMSE) when rejecting the top 5% of object pixels based on snapshot disagreement.
Implications
The findings suggest that the proposed methods can significantly enhance the accuracy of long-range 3D reconstruction techniques, making them more applicable in real-world scenarios such as industrial inspection and robotic scanning. The framework established for diagnosing and repairing model errors can also be applied to other machine learning tasks in computer vision.
Uncertainty Quantification of Engineering Structures by Polynomial Chaos Expansion and Multivariate Active Learning
Theory
Optimization
Efficient ML
- Introduction of a normalized variance aggregation for evaluating multi-output PCE models.
- Development of a sequential adaptive sampling method that balances variance contribution and spatial exploration.
- Demonstration of improved accuracy and stability in surrogate modeling for engineering structures.
- Comparison with traditional sampling methods, highlighting the advantages of the proposed approach.
Summary
This paper presents a novel approach for uncertainty quantification (UQ) in engineering structures using Polynomial Chaos Expansion (PCE) combined with multivariate active learning techniques. The authors address the challenge of efficiently sampling multiple quantities of interest (QoIs) produced by high-fidelity models, which often exhibit varying sensitivities to input parameters. Traditional sampling methods, such as Latin Hypercube Sampling, may not adequately capture the correlations among outputs, leading to inefficiencies and increased computational costs. The proposed method employs a sequential adaptive sampling strategy that selects new samples based on their contribution to output variance while balancing exploration of the input space. This approach is designed to improve the accuracy and stability of surrogate models, particularly in multi-output scenarios. The effectiveness of the method is demonstrated through numerical examples, showcasing its ability to provide reliable estimates of second-order statistics and enhance the overall performance of UQ in engineering applications.
Methodology
The authors generalize a multi-output sequential adaptive sampling method for constructing polynomial chaos expansion surrogate models. This method selects new samples based on their local contribution to output variance, while also considering the spatial exploration of the input space. The performance of this adaptive sampling approach is compared against non-sequential Latin Hypercube Sampling through various numerical examples.
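A key convenience of PCE is that second-order statistics follow directly from the coefficients. The sketch below shows this for a one-dimensional expansion in probabilists' Hermite polynomials with a standard normal input. It covers only this textbook identity; the paper's multi-output aggregation and sampling criterion are not reproduced.

```python
import math

def hermite_he(k, x):
    """Probabilists' Hermite polynomial He_k(x) via the three-term recurrence."""
    if k == 0:
        return 1.0
    h_prev, h = 1.0, x
    for n in range(1, k):
        h_prev, h = h, x * h - n * h_prev
    return h

def pce_eval(coeffs, x):
    """Evaluate the surrogate sum_k c_k He_k(x)."""
    return sum(c * hermite_he(k, x) for k, c in enumerate(coeffs))

def pce_mean_variance(coeffs):
    """Mean and variance for a standard normal input, using E[He_k^2] = k!."""
    mean = coeffs[0]
    variance = sum(c * c * math.factorial(k)
                   for k, c in enumerate(coeffs) if k >= 1)
    return mean, variance

# Example surrogate: 1 + 2*x + 0.5*(x^2 - 1)
mean, variance = pce_mean_variance([1.0, 2.0, 0.5])
```

For this example the mean is 1.0 and the variance is 4.5, both obtained without any sampling.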
Results
The numerical results indicate that the proposed adaptive sampling strategy significantly enhances the accuracy and stability of surrogate models compared to traditional methods. It provides a more reliable estimation of second-order statistics, demonstrating its effectiveness in handling multi-output problems in engineering applications.
Implications
This research has significant implications for the field of engineering, particularly in the design and analysis of structures where uncertainty plays a critical role. The proposed method can lead to more efficient and accurate assessments of structural performance, safety, and reliability, ultimately informing better engineering decisions and practices.
Enhanced Graph Neural Networks using K-Hop Gaussian Diffusion
Graph Learning
- Introduction of the K-Hop Gaussian (KHG) diffusion kernel to enhance GNNs.
- KHG allows for multi-hop diffusion with Gaussian weighting, balancing local and global information.
- Demonstrated superiority of KHG over traditional GNNs and existing diffusion kernels in noisy and complex graphs.
- KHG serves as a modular, plug-and-play component for existing GNN architectures.
Summary
This paper addresses the limitations of traditional Graph Neural Networks (GNNs) that primarily rely on single-hop message passing, which can lead to issues such as over-smoothing and poor information propagation in the presence of noisy or complex graph structures. The authors propose a novel K-Hop Gaussian (KHG) diffusion kernel that serves as a preprocessing module to enhance GNN performance. KHG enables multi-hop diffusion with Gaussian weighting, effectively balancing local and global information propagation. The method is designed to suppress noise from distant nodes while improving robustness compared to existing diffusion kernels like Personalized PageRank (PPR) and Heat Kernel. The authors conduct extensive experiments across various benchmark datasets, demonstrating that KHG significantly outperforms traditional message-passing GNNs and existing diffusion methods, particularly in challenging scenarios characterized by noise and complex structures.
Methodology
The K-Hop Gaussian (KHG) diffusion method involves constructing a diffusion matrix from the adjacency matrix of the graph, normalizing it to ensure stability, and applying Gaussian weights to facilitate multi-hop information propagation. This approach allows for controlled information decay with distance, addressing over-smoothing and noise issues in GNNs.
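One plausible reading of this construction is a weighted sum of powers of the normalized adjacency matrix, with weights that decay as a Gaussian in the hop count. The sketch below assumes symmetric normalization and weights proportional to exp(-k^2 / (2 sigma^2)); the paper's exact kernel may differ.

```python
import math

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2} of a dense adjacency matrix."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Dense square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def khg_diffusion(adj, num_hops=3, sigma=1.0):
    """Sum of T^k for k = 0..num_hops with normalized Gaussian hop weights."""
    n = len(adj)
    t = normalize_adjacency(adj)
    weights = [math.exp(-(k * k) / (2.0 * sigma ** 2))
               for k in range(num_hops + 1)]
    total = sum(weights)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    out = [[0.0] * n for _ in range(n)]
    for w in weights:
        for i in range(n):
            for j in range(n):
                out[i][j] += (w / total) * power[i][j]
        power = matmul(power, t)  # advance from T^k to T^(k+1)
    return out

path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # nodes 0 - 1 - 2
S = khg_diffusion(path_graph)
```

On this three-node path, node 0 receives a nonzero weight from node 2 (two hops away) that is smaller than its weight from node 1, which is the controlled decay with distance described above.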
Results
The experimental results indicate that the KHG diffusion kernel significantly outperforms traditional GNNs and existing diffusion methods like PPR and Heat Kernel across multiple benchmark datasets, particularly in scenarios with noisy or structurally complex graphs.
Implications
The KHG diffusion kernel can enhance the performance of GNNs in various applications, including social networks, transportation systems, and biological networks, where graph structures may be noisy or complex. Its modular design allows for easy integration into existing GNN frameworks, potentially leading to broader adoption and improved outcomes in graph-based learning tasks.
Amortized Probabilistic Retrieval of Atmospheric CO2 from OCO-2 Spectra Using Deep Learning with Laplace Approximations and Normalizing Flows
Theory
Efficient ML
Interpretability
- Introduces a deep learning framework for CO2 retrieval that significantly reduces computational time.
- Utilizes high-fidelity simulation data to account for model errors and improve accuracy.
- Implements Laplace approximations and normalizing flows for enhanced uncertainty quantification.
- Demonstrates superior predictive accuracy compared to existing operational methods.
Summary
This paper presents a novel deep learning framework for the retrieval of atmospheric carbon dioxide (CO2) from NASA's Orbiting Carbon Observatory-2 (OCO-2) spectra, addressing the computational inefficiencies and uncertainty quantification issues of current operational algorithms. The authors propose an amortized probabilistic inference approach that utilizes a high-fidelity simulation dataset to train the model, incorporating realistic forward model errors. The architecture employs a multi-branch neural network to encode spectral bands and estimates posterior distributions of CO2 concentrations using Laplace approximations and normalizing flows. The framework offers significant advantages over traditional methods, including faster inference times, robustness to model errors, improved predictive accuracy, better uncertainty quantification, and the ability to model complex non-Gaussian posteriors. The results indicate that this simulation-based deep learning approach could pave the way for next-generation operational processing systems for atmospheric monitoring.
Methodology
The authors developed a multi-branch neural network architecture that processes spectral bands from OCO-2 data. They employed two scalable uncertainty quantification methods—Laplace approximations and normalizing flows—to estimate the posterior distributions of CO2 concentrations. The model was trained on a high-fidelity simulation dataset that included realistic forward model errors, allowing for robust inference.
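The Laplace approximation replaces a posterior with a Gaussian centred at its mode, with variance given by the inverse of the negative curvature there. The sketch below does this for a scalar toy posterior using finite differences. The toy density and the Newton iteration are illustrative assumptions, whereas the paper applies the idea at neural-network scale.

```python
import math

def laplace_approximation(log_posterior, x0, steps=50, h=1e-3):
    """Return (mode, variance) of the Gaussian fitted at the posterior mode."""
    def curvature(x):
        return (log_posterior(x + h) - 2 * log_posterior(x)
                + log_posterior(x - h)) / (h * h)

    x = x0
    for _ in range(steps):
        grad = (log_posterior(x + h) - log_posterior(x - h)) / (2 * h)
        x -= grad / curvature(x)  # Newton step toward the mode
    return x, -1.0 / curvature(x)

def toy_log_posterior(x):
    """Unnormalized Gamma(shape=5, rate=2) log-density (illustrative only)."""
    return 4.0 * math.log(x) - 2.0 * x

mode, variance = laplace_approximation(toy_log_posterior, x0=1.0)
```

Because the fitted Gaussian is symmetric, it cannot represent the skew of this toy density. That limitation is the motivation the summary gives for also using normalizing flows, which can capture asymmetric posteriors.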
Results
The proposed framework achieved orders of magnitude faster inference times compared to traditional retrieval methods, while also providing improved accuracy in point estimates and better-calibrated uncertainty estimates. The use of normalizing flows allowed the model to effectively capture complex, asymmetric posterior distributions.
Implications
This research suggests that deep learning can be effectively integrated into atmospheric monitoring systems, potentially leading to real-time processing capabilities and more reliable estimates of CO2 concentrations. The findings may influence future satellite missions and climate monitoring efforts.
Learning to Refine Hidden States for Reliable LLM Reasoning
Large Language Models
Reinforcement Learning
NLP
- ReLAR introduces an iterative hidden-state refinement framework for LLMs.
- The framework allows direct control over internal reasoning trajectories before generating outputs.
- Reinforcement-learning-based controllers dynamically adjust the refinement process based on task complexity.
- Experiments show improved coherence and reliability in reasoning tasks with lower inference costs.
Summary
This paper addresses the instability of reasoning in large language models (LLMs) during complex multi-step tasks, where early errors in hidden states can lead to incorrect predictions. The authors propose ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations before decoding. ReLAR maintains a compact latent reasoning state and employs learned depth and action controllers to adaptively determine the number and direction of refinement steps. These controllers are trained using a policy-gradient objective focused on step-wise likelihood improvement, allowing for efficient reasoning that does not rely on explicit chain-of-thought generation. The experiments conducted on various benchmarks, including medical, mathematical, multi-hop reasoning, and open-ended generation tasks, demonstrate that ReLAR significantly enhances accuracy, generation quality, and reasoning stability while reducing inference overhead compared to traditional explicit reasoning methods.
Methodology
The proposed ReLAR framework utilizes reinforcement learning to guide the iterative refinement of hidden states in LLMs. It employs learned controllers that determine the number and direction of refinement steps based on the complexity of the task, allowing for adaptive reasoning depth. This approach contrasts with traditional methods that rely on fixed hidden states and explicit reasoning outputs.
Results
The results indicate that ReLAR outperforms baseline models in terms of accuracy and reasoning stability across various benchmarks. It achieves better step-level coherence and overall reliability while incurring significantly lower inference overhead compared to explicit reasoning methods, such as chain-of-thought prompting.
Implications
The findings suggest that ReLAR could enhance the deployment of LLMs in high-stakes environments, such as clinical decision-support systems, by providing more reliable and interpretable reasoning capabilities. This approach may lead to broader applications in areas requiring complex reasoning under uncertainty.
Fair Cognitive Impairment Detection Through Unlearning
Multimodal
Audio & Speech
- Introduction of FMD, a fair MCI detection framework that combines multiple modalities.
- Utilization of cross-attention fusion for better interaction between speech, text, and image data.
- Implementation of an unlearning mechanism to mitigate demographic biases in model predictions.
- Demonstrated improved performance on multilingual benchmarks while reducing subgroup disparities.
Summary
This paper addresses the challenge of detecting Mild Cognitive Impairment (MCI) from spontaneous speech, which is promising for scalable screening but often suffers from biases related to demographic factors. The authors propose a multi-modal framework called FMD that integrates speech, text, and image modalities through cross-attention fusion. A key innovation is the incorporation of an unlearning mechanism that uses gradient reversal to prevent the model from encoding demographic attributes that do not contribute to the task of MCI detection. The framework is evaluated on two multilingual benchmarks, TAUKADIAL and PREPARE, demonstrating superior performance compared to existing state-of-the-art methods while significantly reducing performance disparities across demographic subgroups. The results indicate that the unlearning approach not only enhances the robustness of the model but also facilitates better generalization across different datasets, thereby addressing fairness in cognitive assessment.
Methodology
The FMD framework employs a multi-modal MCI classifier that utilizes cross-attention fusion to integrate features from speech, text, and images. An unlearning module is incorporated to remove demographic information from the learned representations, using an auxiliary demographic classifier to identify and reverse the influence of spurious demographic features.
Results
FMD outperformed state-of-the-art multilingual and multi-modal baselines in MCI classification, achieving higher average F1 scores on both the TAUKADIAL and PREPARE datasets. The framework also significantly reduced performance gaps across demographic subgroups, particularly for sex and language, indicating improved fairness and robustness.
Implications
The findings suggest that integrating unlearning techniques in machine learning models can enhance fairness and reliability in cognitive assessments, particularly in diverse populations. This approach may be applicable in other domains where demographic biases affect model performance.
PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation
NLP
Large Language Models
Efficient ML
- PowerOPD addresses severe training pathologies in standard on-policy distillation.
- The method employs a Box-Cox power transformation to create bounded rewards.
- PowerOPD achieves significant accuracy gains and sample efficiency improvements.
- The approach reduces wall-clock time and peak GPU memory usage compared to traditional methods.
Summary
The paper introduces PowerOPD, a novel approach to on-policy distillation (OPD) for large language models that addresses significant training challenges associated with standard OPD methods. The authors identify that the conventional OPD suffers from severe pathologies, including sample inefficiency, unstable generation dynamics, and a notable performance gap when compared to full-vocabulary OPD. These issues stem from the unbounded nature of the log-ratio reward used in OPD, which leads to high-variance gradients and instability during training. To mitigate these problems, PowerOPD employs a Box-Cox power transformation to create bounded and sign-consistent rewards. This transformation ensures that the reward remains stable and manageable throughout the training process. The authors evaluate PowerOPD across six mathematical reasoning benchmarks and four teacher-student pairs from the Qwen3 family, demonstrating significant improvements in accuracy and efficiency. PowerOPD not only enhances performance metrics but also reduces training time and memory usage, making it a promising advancement in the field of on-policy distillation for large language models.
Methodology
PowerOPD reformulates the reward structure in on-policy distillation using the Box-Cox power transformation, which allows for bounded and sign-consistent rewards. This approach prevents extreme reward values from dominating the training process, thus stabilizing the optimization dynamics. The method was evaluated using various benchmarks and teacher-student model pairs to assess its effectiveness.
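To make the bounding idea concrete, here is a minimal sketch. It is not the paper's reward: the summary does not give the formula, so the sketch assumes the reward compares teacher and student token probabilities and applies the Box-Cox transform to each probability with a positive exponent `lam`. The function names and the default `lam` are hypothetical.

```python
import math


def box_cox(x: float, lam: float) -> float:
    """Box-Cox power transform; it tends to log(x) as lam approaches 0."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if lam == 0.0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def log_ratio_reward(p_teacher: float, p_student: float) -> float:
    """Standard log-ratio reward: unbounded when either probability nears 0."""
    return math.log(p_teacher) - math.log(p_student)


def power_reward(p_teacher: float, p_student: float, lam: float = 0.5) -> float:
    """Assumed bounded variant: difference of Box-Cox transformed probabilities.

    For probabilities in (0, 1] and lam > 0 the value stays inside
    (-1/lam, 1/lam), and it has the same sign as the log-ratio reward
    because box_cox is strictly increasing.
    """
    return box_cox(p_teacher, lam) - box_cox(p_student, lam)
```

Under these assumptions, a token that the teacher finds almost impossible produces a very large negative log-ratio reward, while the transformed reward cannot exceed `1/lam` in magnitude.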
Results
PowerOPD demonstrated benchmark-averaged gains of up to +6.37 in Avg@8 and +5.71 in Pass@8 over vanilla OPD, and outperformed post-hoc stabilization and full-vocabulary OPD by +3.01/+3.54 and +2.59/+8.90, respectively. Additionally, it reduced wall-clock time by 59.2% and peak GPU memory by 23.1%. The method also achieved a +9.6 accuracy gain over vanilla OPD with 10 times fewer training steps.
Implications
The findings suggest that PowerOPD could significantly enhance the efficiency and effectiveness of training large language models, making it a valuable tool for researchers and practitioners in the field. Its ability to stabilize training dynamics and improve performance metrics could lead to broader applications in natural language processing and other domains reliant on large-scale model training.
ASTEROID: A Spatiotemporal Information Transformer for Forecasting Multi-Step Time Series of Molecular Dynamics
Time Series
- ASTEROID predicts multi-step atomic coordinates directly, bypassing traditional iterative integration methods.
- The framework integrates Spatiotemporal Information Transformation into a Transformer architecture to model complex dependencies.
- ASTEROID demonstrates superior accuracy and reduced computational costs compared to existing forecasting methods.
- The model supports iterative multi-step forecasting over extended time scales, enhancing its practical applicability.
Summary
The paper presents ASTEROID, a novel data-driven framework designed to predict multi-step atomic coordinates in molecular dynamics (MD) simulations, addressing the computational challenges associated with traditional methods. ASTEROID reformulates MD trajectories into high-dimensional spatiotemporal sequences and employs a Transformer architecture integrated with a Spatiotemporal Information (STI) Transformation equation. This innovative approach captures both local and global spatial dependencies through a self-attention mechanism, while an encoder-decoder structure manages temporal dependencies, enabling autoregressive forecasting. The authors evaluated ASTEROID on several quantum-mechanics derived molecular datasets, demonstrating that it significantly outperforms existing methods in terms of accuracy and computational efficiency. The framework not only enhances the predictive capabilities for long-term forecasts but also establishes a robust paradigm for accelerating MD simulations, making it a valuable tool for researchers in the field.
Methodology
ASTEROID utilizes a Transformer architecture that incorporates the Spatiotemporal Information (STI) Transformation equation to reformulate molecular dynamics trajectories as high-dimensional spatiotemporal sequences. It employs a local-global self-attention mechanism for spatial dependencies and an encoder-decoder structure for temporal dependencies, allowing for effective multi-step forecasting.
Results
ASTEROID outperformed various baseline and deep learning methods in multi-step forecasting tasks, achieving higher accuracy and robustness across multiple real-world datasets derived from quantum mechanics. The framework also significantly reduced the computational costs associated with traditional MD simulations.
Implications
The development of ASTEROID has significant implications for the field of molecular dynamics, providing a more efficient and accurate method for predicting molecular behavior over time. This could accelerate research in biomolecular studies and drug discovery by enabling faster simulations and analyses.
Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts
Theory
Efficient ML
NLP
- Discontinuities in SMoE architectures are classified by order, with lower-order discontinuities being more prevalent.
- The authors establish that random perturbations in input space will almost surely encounter discontinuities, particularly order-1 ones.
- A novel smoothing mechanism is proposed to mitigate the effects of discontinuities, enhancing model performance.
- The analysis provides theoretical insights into the structure and behavior of discontinuities in SMoEs.
Summary
This paper investigates the inherent discontinuities in Sparse Mixture-of-Experts (SMoE) architectures, which are prevalent in modern language and vision models. The authors provide a rigorous geometric and stochastic analysis of these discontinuities, classifying them by order based on the number of tied experts during switching events. Using measure-theoretic slicing arguments, they establish that lower-order discontinuities dominate the input space, while higher-order ones occupy a negligible volume. The authors model random perturbations in the input space as a diffusion process, proving that trajectories almost surely encounter a discontinuity, with the first hit likely occurring on an order-1 discontinuity. They also derive bounds on the occupation time near discontinuities, indicating that inputs are more frequently located near lower-order discontinuities. To address the challenges posed by these discontinuities, the authors propose a simple smoothing mechanism that can be integrated into existing SMoEs, which not only enforces continuity but also enhances empirical performance in various tasks. Experiments across language and vision tasks show that the proposed smoothing method is effective, improving model stability with minimal computational overhead.
Methodology
The authors employ geometric and stochastic analysis techniques, including measure-theoretic slicing arguments to classify discontinuities and diffusion processes to model random perturbations in the input space. They derive asymptotic volume estimates and occupation-time bounds to quantify the behavior of inputs near discontinuities. A smoothing mechanism is proposed based on these insights, which is then tested empirically.
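A toy illustration of where such a discontinuity comes from, under simplifying assumptions: one scalar input, two linear experts, and top-1 routing with a gate weight of 1. The router and expert functions are made up for the example and are not taken from the paper.

```python
def router_logits(x: float) -> list[float]:
    """Hypothetical router: expert 0 wins for x > 0, expert 1 for x < 0."""
    return [x, -x]


def expert_outputs(x: float) -> list[float]:
    """Two hypothetical linear experts that disagree at x = 0."""
    return [2.0 * x + 1.0, -1.0 * x - 1.0]


def top1_moe(x: float) -> float:
    """Top-1 sparse mixture: return the output of the highest-scoring expert."""
    logits = router_logits(x)
    winner = max(range(len(logits)), key=lambda i: logits[i])
    return expert_outputs(x)[winner]


eps = 1e-6
# The two router logits tie at x = 0, so the selected expert switches there.
jump = top1_moe(eps) - top1_moe(-eps)
```

Here `jump` stays close to 2 however small `eps` is, because the experts take different values at the point where the router logits tie. The summary does not specify the paper's smoothing mechanism, so none is sketched.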
Results
The study finds that lower-order discontinuities dominate the input space, while higher-order discontinuities have a vanishingly small relative volume. The proposed smoothing mechanism effectively enforces continuity in the SMoE map and improves empirical performance across various tasks, demonstrating that the added computational overhead remains minimal.
Implications
The findings have significant implications for the design and implementation of SMoE architectures in large-scale models, particularly in language and vision applications. The proposed smoothing mechanism can enhance model stability and performance, making it a valuable addition to existing frameworks.
A Hybrid LSTM-Vision Transformer Architecture for Predicting HRRR Forecast Errors
Time Series
Multimodal
- The hybrid LSTM-ViT framework improves forecast-error prediction skill compared to baseline LSTM models.
- Incorporating vertically resolved atmospheric profiles enhances the model's ability to capture complex PBL processes.
- The largest improvements in predictive skill are observed for precipitation forecast errors, achieving a twofold increase over the baseline.
- The model is particularly effective during periods of enhanced PBL activity and complex atmospheric evolution.
Summary
This paper presents a novel hybrid architecture combining Long Short-Term Memory (LSTM) networks and Vision Transformers (ViT) to enhance the prediction of forecast errors in high-resolution numerical weather prediction (NWP) systems, specifically the High-Resolution Rapid Refresh (HRRR) model. The authors argue that traditional LSTM models struggle with complex vertical atmospheric processes, which can lead to performance degradation. To overcome this limitation, the proposed LSTM-ViT framework integrates temporal sequence learning from surface observations with vertically resolved atmospheric profiles obtained from the New York State Mesonet profiler network. The model is trained to predict hourly forecast errors for precipitation, wind speed, and temperature at individual mesonet stations. Results indicate that the hybrid architecture significantly improves prediction skill across all three variables, particularly during periods of active planetary boundary layer (PBL) activity, with the most notable enhancement observed in precipitation forecast errors. The findings suggest that incorporating vertically informed attention mechanisms can provide a more accurate representation of atmospheric phenomena, thereby improving operational forecasting capabilities.
Methodology
The study employs a hybrid LSTM-ViT architecture that combines temporal sequence learning from surface observations with vertically resolved atmospheric data from microwave radiometers. The model is trained on historical forecast errors from the HRRR model, focusing on predicting errors in hourly precipitation, wind speed, and temperature at mesonet stations.
Results
The LSTM-ViT framework demonstrated significant improvements in forecast-error prediction skill, particularly for precipitation, where it achieved approximately a twofold increase in predictive skill compared to the baseline LSTM model. The enhancements were most pronounced at shorter forecast lead times and during periods of active PBL processes.
Implications
The findings suggest that integrating vertically informed attention mechanisms into forecasting models can lead to more accurate predictions of weather forecast errors. This has practical implications for operational NWP systems, providing forecasters with improved tools for assessing model biases and enhancing forecast confidence.
KANLib -- A Modular, Extensible and Fast Kolmogorov-Arnold Network Implementation
Theory
Efficient ML
Interpretability
- KANLib is a modular and extensible framework for Kolmogorov-Arnold Networks.
- It integrates features from existing KAN implementations to enhance usability and performance.
- The framework supports various configurations and maintains compatibility with PyTorch.
- Experimental results demonstrate KANLib's efficiency and predictive accuracy on benchmark datasets.
Summary
This paper presents KANLib, a new framework designed for the development and evaluation of Kolmogorov-Arnold Networks (KANs), which replace linear weights in traditional multilayer perceptrons with learnable univariate functions. Despite their theoretical advantages in interpretability and expressiveness, KANs have faced challenges in practical applications due to high computational costs and inconsistent feature support across existing frameworks. KANLib addresses these issues by providing a modular and extensible architecture that integrates concepts from previous implementations like PyKAN, EfficientKAN, and FastKAN. The framework supports various basis function types, adaptive grid rescaling, and fine-grained architectural customization while ensuring compatibility with PyTorch. Experimental results on the California Housing benchmark indicate that KANLib not only reproduces the predictive performance of established KAN implementations but also achieves competitive computational efficiency. The framework allows for the exploration of architectural variations with minimal impact on predictive performance, establishing a robust foundation for future research in scalable KAN architectures.
Methodology
The authors developed KANLib by consolidating core concepts from existing KAN implementations into a unified framework. They implemented features such as adaptive grid rescaling and customizable architectures, ensuring compatibility with PyTorch workflows. The framework was evaluated through experiments on the California Housing benchmark to assess its predictive performance and computational efficiency.
Results
KANLib successfully reproduced the predictive behavior of established KAN implementations while achieving competitive computational efficiency. The framework demonstrated that it could explore architectural variations beyond standard KAN formulations with only minor impacts on predictive performance.
Implications
KANLib provides a robust platform for researchers to explore and develop KAN architectures, potentially leading to advancements in interpretable machine learning and applications in scientific discovery tasks. Its modular design may accelerate the development of new KAN experiments and facilitate rigorous benchmarking.
Self-CTRL: Self-Consistency Training with Reinforcement Learning
NLP
Large Language Models
Reinforcement Learning
- Self-CTRL optimizes for consistency between language models' self-explanations and their behavior.
- The method includes two training directions: explanation training and behavior training.
- In probabilistic reasoning tasks, consistency training improved bias reporting significantly.
- In constitutional AI, Self-CTRL greatly enhanced refusal prediction accuracy.
Summary
This paper introduces Self-Consistency Training with Reinforcement Learning (Self-CTRL), a novel method aimed at enhancing the consistency between language models' self-explanations and their actual behavior. The authors argue that language models (LMs) that accurately describe their behavior are more trustworthy and understandable. Self-CTRL operates by optimizing LMs to align their explanations with their actions through two main training directions: explanation training, which adjusts explanations to better predict behavior, and behavior training, which modifies behavior to align with stated explanations. The method is evaluated in two domains: a probabilistic reasoning task and a constitutional AI domain. In the first domain, consistency training significantly improved the correlation between self-reported biases and behaviorally measured biases, raising R² from 0.24 to 0.64. In the second domain, Self-CTRL enhanced the accuracy of refusal predictions from 36% to 92% and reduced the HarmBench failure rate from 15.0% to 0.5% without increasing refusals on harmless prompts. The findings suggest that aligning explanations and behavior can lead to safer, more transparent, and controllable AI models.
Methodology
Self-CTRL employs a reinforcement learning framework that optimizes language models for consistency between their explanations and behaviors. It involves training on paired meta-level (explanation-eliciting) and object-level (behavior-eliciting) inputs, using external judges or simulators to score consistency and guide the model updates.
Results
The application of Self-CTRL led to a substantial increase in the correlation of self-reported biases with actual behavior in probabilistic reasoning tasks (R² from 0.24 to 0.64). In the constitutional AI domain, refusal prediction accuracy improved from 36% to 92%, and the HarmBench failure rate decreased from 15.0% to 0.5% without increasing unnecessary refusals.
Implications
The findings suggest that Self-CTRL can enhance the trustworthiness and interpretability of language models, making them safer and more controllable for users. This method could be applied in various AI applications requiring reliable self-explanations and behavior alignment.
MorphStrata: Layer-Specific Perturbations for Generating Morphence Students in Time-Series Moving Target Defense
Time Series
- MorphStrata introduces a layer-specific perturbation strategy for time-series forecasting models.
- The method enhances adversarial robustness while maintaining low computational overhead.
- Empirical results show significant improvements in adversarial RMSE, especially in high entropy datasets.
- The approach demonstrates a positive correlation between student model diversity and defense effectiveness.
Summary
This paper presents MorphStrata, a novel approach to enhance the robustness of time-series forecasting models against adversarial attacks by employing layer-specific perturbations. Traditional defenses often compromise on either robustness or computational efficiency, particularly in Moving Target Defense (MTD) scenarios where multiple randomized model instances are used. MorphStrata builds upon the Morphence defense strategy by introducing selective, layer-specific stochastic noise injection, utilizing a Transformer backbone as the teacher model. By perturbing specific architectural blocks, MorphStrata creates a diverse pool of student models tailored to different data distributions and threat models. The authors evaluate MorphStrata against baseline models, including vanilla Transformers and Morphence, across various datasets and adversarial attack methods (FGSM, BIM, PGD). The results indicate that MorphStrata maintains competitive adversarial RMSE while achieving significant improvements in robustness, particularly in high entropy datasets like Appliances Energy Prediction (AEP). The approach incurs minimal additional training time, making it a practical enhancement for existing MTD frameworks.
Methodology
MorphStrata employs a layer-specific student generation strategy that selectively applies stochastic noise to distinct components of a Transformer architecture. This is achieved through binary masking, allowing for structured heterogeneity among the generated student models. The evaluation includes comparisons against baseline models using various adversarial attack techniques and metrics, focusing on adversarial RMSE and computational efficiency.
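The masking idea can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the teacher is a plain dictionary of weight lists, the block names and the noise scale `sigma` are invented, and Gaussian noise stands in for whatever stochastic noise the paper uses.

```python
import math
import random


def make_student(teacher, mask, sigma, rng):
    """Copy the teacher, adding Gaussian noise only to blocks whose mask is 1."""
    student = {}
    for name, weights in teacher.items():
        if mask.get(name, 0) == 1:
            student[name] = [w + rng.gauss(0.0, sigma) for w in weights]
        else:
            student[name] = list(weights)  # untouched block, copied as-is
    return student


def l2_distance(a, b):
    """L2 distance between two models, taken over all of their weights."""
    return math.sqrt(
        sum((x - y) ** 2 for name in a for x, y in zip(a[name], b[name]))
    )


rng = random.Random(0)
# Hypothetical teacher with three named blocks.
teacher = {
    "embedding": [0.1, -0.2],
    "attention": [0.5, 0.3, -0.4],
    "feed_forward": [0.7, -0.6],
}
# One binary mask per student: each student perturbs a different block.
masks = [
    {"embedding": 1, "attention": 0, "feed_forward": 0},
    {"embedding": 0, "attention": 1, "feed_forward": 0},
    {"embedding": 0, "attention": 0, "feed_forward": 1},
]
students = [make_student(teacher, m, sigma=0.05, rng=rng) for m in masks]
pairwise = [
    l2_distance(students[i], students[j])
    for i in range(len(students))
    for j in range(i + 1, len(students))
]
```

The `pairwise` list corresponds to the pairwise L2 distances among student models that the authors relate to defense effectiveness.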
Results
MorphStrata consistently outperforms static baselines, achieving up to 24.11% and 97.97% reductions in adversarial RMSE under FGSM and BIM attacks, respectively, at an epsilon value of 0.5. The method incurs less than 1.1% additional training time compared to the Morphence baseline, indicating its efficiency. A positive correlation was observed between higher pairwise L2 distances among student models and overall defense effectiveness.
Implications
The findings suggest that MorphStrata can be effectively integrated into existing time-series forecasting systems to enhance their resilience against adversarial attacks, making it a valuable tool for applications in energy forecasting and industrial monitoring where accuracy and reliability are critical.
EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning
Reinforcement Learning
Large Language Models
- EnvRL enhances agentic RL by utilizing environment dynamics as implicit supervision signals.
- The framework introduces two auxiliary objectives: state prediction and inverse dynamics.
- Joint optimization of these objectives with the primary RL goal leads to better decision-making.
- Empirical results show significant improvements in success rates on long-horizon tasks.
Summary
The paper introduces EnvRL, a novel framework designed to enhance agentic reinforcement learning (RL) by leveraging environment dynamics during training. Traditional RL methods often struggle with sparse rewards in long-horizon tasks, which limits their effectiveness. The authors argue that the rich information contained in interaction trajectories can serve as implicit supervision signals, helping agents to better understand environmental transitions. EnvRL incorporates two auxiliary objectives—state prediction and inverse dynamics—into the RL process. By jointly optimizing these objectives with the primary RL goal, the framework encourages agents to internalize environment dynamics, leading to improved decision-making. The authors validate EnvRL through extensive experiments on two long-horizon benchmarks, ALFWorld and WebShop, demonstrating significant improvements in success rates over standard RL baselines, thereby showcasing the effectiveness of learning from environment dynamics.
Methodology
EnvRL employs a framework that integrates environment dynamics learning into agentic RL through two main strategies: state prediction, which models the next state based on the current state and action, and inverse dynamics, which infers the action that caused a state transition. These strategies are optimized alongside the primary RL objective, allowing agents to learn from their interactions with the environment effectively.
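One way to picture the joint optimization is a weighted sum of the three losses. The sketch below assumes the auxiliary objectives are token-level negative log-likelihoods, a natural choice for LLM agents whose states and actions are text, and it introduces hypothetical weights `w_state` and `w_inverse`. The summary states neither the loss forms nor the weights.

```python
def mean_nll(token_logprobs):
    """Average negative log-likelihood over a sequence of token log-probs."""
    return -sum(token_logprobs) / len(token_logprobs)


def joint_loss(rl_loss, next_state_logprobs, action_logprobs,
               w_state=0.1, w_inverse=0.1):
    """Combine the RL loss with two auxiliary dynamics losses.

    next_state_logprobs: log-probs the model assigns to the observed next
        state given the current state and action (state prediction).
    action_logprobs: log-probs the model assigns to the action actually
        taken, given the states before and after it (inverse dynamics).
    """
    state_prediction_loss = mean_nll(next_state_logprobs)
    inverse_dynamics_loss = mean_nll(action_logprobs)
    return (rl_loss
            + w_state * state_prediction_loss
            + w_inverse * inverse_dynamics_loss)
```

Setting both weights to zero recovers the plain RL objective, which gives a simple way to compare against the baseline.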
Results
The experiments conducted on ALFWorld and WebShop demonstrate that EnvRL significantly enhances success rates compared to traditional RL methods. For instance, when trained with GRPO, the success rate for Qwen-2.5-1.5B-Instruct improved from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop. These results indicate that EnvRL provides substantial benefits in policy optimization across various model scales.
Implications
The findings suggest that incorporating environment dynamics into RL can lead to more robust and effective learning strategies for agents, particularly in complex, long-horizon tasks. This approach could have significant implications for the development of autonomous agents in various applications, including robotics and interactive AI systems.
TuneAhead: Predicting Fine-tuning Performance Before Full Training Begins
NLP
Large Language Models
Efficient ML
- TUNEAHEAD predicts fine-tuning performance before full training, reducing wasted computational resources.
- The framework combines static dataset descriptors with dynamic probe features for accurate predictions.
- SHAP-based attributions provide interpretable diagnostics, helping practitioners understand prediction drivers.
- TUNEAHEAD outperforms strong baselines in extensive experiments, demonstrating its effectiveness.
Summary
The paper introduces TUNEAHEAD, a framework designed to predict the performance of fine-tuning large language models (LLMs) before full training begins. Fine-tuning LLMs is often resource-intensive and can lead to suboptimal performance due to various factors such as data quality and hyperparameter settings. TUNEAHEAD addresses this issue by encoding each fine-tuning candidate as a meta-feature vector that combines static dataset descriptors and dynamic probe features derived from a short standardized probe run. This allows the framework to provide performance estimates through a lightweight predictor, while SHAP-based attributions offer insights into the features influencing these predictions. The authors conducted over 1,300 fine-tuning runs on the Qwen2.5-7B-Instruct model, demonstrating that TUNEAHEAD outperforms existing methods like Early-Stop Extrapolation and ProxyLM. The framework achieved an RMSE of 1.47 percentage points on a held-out test set, with 95.1% of predictions falling within ±3 percentage points of actual scores. This predictive capability enables practitioners to make informed go/no-go decisions, significantly reducing unnecessary computational costs and improving resource efficiency in model training.
Methodology
TUNEAHEAD employs a hybrid feature space that integrates static dataset descriptors (e.g., lexical diversity, data size) with dynamic probe features (e.g., early loss decay, gradient stability) from a short probe run. A lightweight gradient-boosting predictor maps these features to expected performance outcomes, while SHAP-based attributions explain the predictions by identifying influential dataset properties.
Results
In experiments involving over 1,300 fine-tuning runs, TUNEAHEAD achieved an RMSE of 1.47 percentage points on a held-out test set of 370 runs. It successfully placed 95.1% of its predictions within ±3 percentage points of the true performance scores, outperforming established baselines such as Early-Stop Extrapolation and ProxyLM.
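For readers unfamiliar with the two reported metrics, the snippet below shows how they are typically computed. The scores are invented for illustration and are unrelated to the paper's data.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and true scores."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def fraction_within(predicted, actual, tolerance=3.0):
    """Share of predictions whose absolute error is at most `tolerance`."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(actual)


# Made-up accuracy scores in percentage points.
predicted_scores = [70.0, 65.0, 80.0, 55.0]
true_scores = [71.0, 65.0, 76.0, 55.0]

error = rmse(predicted_scores, true_scores)
coverage = fraction_within(predicted_scores, true_scores, tolerance=3.0)
```

On this toy data three of the four predictions fall within 3 points, so `coverage` is 0.75. The paper's figure of 95.1% is the same quantity computed over its 370 held-out runs.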
Implications
TUNEAHEAD's ability to predict fine-tuning success before full training can lead to significant cost savings in computational resources and time. It allows practitioners to refine datasets and hyperparameters proactively, minimizing the risk of failed runs and optimizing the fine-tuning process for LLMs.
Hierarchical Attention via Domain Decomposition
Theory
Efficient ML
NLP
- Introduction of a hierarchical attention mechanism based on domain decomposition.
- Demonstrated efficiency in approximating solution operators for a one-dimensional diffusion problem.
- Outperformed a global low-rank attention baseline in terms of training speed and accuracy.
- Utilized significantly fewer parameters compared to traditional methods.
Summary
This paper introduces a novel hierarchical attention mechanism inspired by two-level overlapping Schwarz domain decomposition. The authors argue that this method effectively combines local corrections with global information, making it suitable for finite-dimensional operator learning. The proposed approach is tested on a one-dimensional diffusion problem with homogeneous Dirichlet boundary conditions, where the exact solution operator is known. The authors replace a global softmax-free low-rank attention operator with a two-level structure that integrates local low-rank attention blocks on overlapping subdomains with a coarse attention block. This new operator is shown to train faster and yield more accurate approximations than the baseline, while also using significantly fewer parameters. The findings suggest that domain decomposition concepts can enhance the efficiency of attention mechanisms, particularly in contexts like operator learning and natural language processing.
Methodology
The authors propose a two-level attention mechanism that consists of local low-rank attention blocks on overlapping subdomains combined with a coarse attention block. The method is evaluated through numerical experiments on a discretized one-dimensional Poisson problem, where the goal is to approximate the inverse of a symmetric positive definite matrix representing the solution operator.
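The target of the approximation can be written out explicitly in the simplest setting. The sketch assumes a uniform grid on (0, 1), a constant diffusion coefficient of 1, and the standard three-point finite-difference stencil; the paper's discretization may differ. In that setting the inverse matrix has a well-known closed form, the discrete Green's function, which the code checks by direct multiplication.

```python
def poisson_matrix(n):
    """Finite-difference matrix for -u'' on (0, 1) with u(0) = u(1) = 0."""
    h = 1.0 / (n + 1)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 2.0 / h ** 2
        if i > 0:
            A[i][i - 1] = -1.0 / h ** 2
        if i < n - 1:
            A[i][i + 1] = -1.0 / h ** 2
    return A


def exact_inverse(n):
    """Closed-form inverse: entry (i, j) is h * x_lo * (1 - x_hi)."""
    h = 1.0 / (n + 1)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            x_lo = (min(i, j) + 1) * h  # smaller of the two grid points
            x_hi = (max(i, j) + 1) * h  # larger of the two grid points
            G[i][j] = h * x_lo * (1.0 - x_hi)
    return G


def matmul(A, B):
    """Plain dense matrix product for small square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


n = 5
product = matmul(poisson_matrix(n), exact_inverse(n))  # should be the identity
```

The inverse is dense even though the matrix itself is tridiagonal, which is why the learned operator needs both local and global components.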
Results
The numerical experiments indicate that the hierarchical attention mechanism can train faster and achieve more accurate approximations than the baseline global low-rank attention operator. The proposed method also requires significantly fewer parameters, demonstrating its efficiency and effectiveness in the context of operator learning.
Implications
The findings suggest that hierarchical attention mechanisms inspired by domain decomposition can be beneficial for various applications, including operator learning and potentially in natural language processing tasks where local and long-range dependencies are critical.
Monotonic Kolmogorov-Arnold Networks: A Theoretical and Empirical Study of Monotonicity as an Inductive Bias
Theory
Interpretability
Efficient ML
- MKAN guarantees hard monotonicity for all parameter values through exponential reparameterization and positive edge weights.
- The representation-cost theorem provides a framework for understanding the dimensionality of monotone realizations of feature extractors.
- Empirical results indicate that MKAN is competitive with existing monotone neural networks while offering enhanced interpretability.
- The model successfully recovers ground-truth factors in controlled settings, outperforming traditional methods in terms of Spearman alignment.
Summary
This paper presents Monotonic Kolmogorov-Arnold Networks (MKAN), a novel architecture that enforces hard monotonicity in neural networks, addressing a gap in existing models that only partially enforce this constraint. The authors introduce a representation-cost theorem that establishes a principled sizing rule for monotone encoders, demonstrating that any feature extractor can be realized monotonically with a bounded increase in output dimensions. MKAN is designed to maintain per-edge functional transparency, allowing for clear interpretability of input-output relationships. Empirical evaluations show that MKAN performs competitively against state-of-the-art monotone neural networks across various benchmarks, while also validating the theoretical predictions regarding representation costs through self-supervised experiments on multiple datasets.
Methodology
The MKAN architecture employs exponential reparameterization of B-spline coefficients, positive edge weights, and a monotone base activation function to ensure monotonicity. The training process utilizes standard unconstrained gradient descent, simplifying the optimization compared to previous models. The representation-cost theorem is derived theoretically to establish the relationship between non-monotone and monotone feature extractors.
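One way to see why unconstrained gradient descent can still yield a hard monotonicity guarantee is the following minimal sketch. It assumes that the exponential reparameterization is applied to the increments between consecutive spline coefficients, and it uses a degree-1 (piecewise-linear) spline for brevity; both choices are illustrative assumptions, not details confirmed by the summary. Positive edge weights and the monotone base activation are not modeled here.

```python
import math
import random


def monotone_coefficients(theta, offset=0.0):
    """Map unconstrained parameters to strictly increasing spline coefficients.

    Each increment is exp(theta_k) > 0, so the running sum increases
    for every real-valued theta.
    """
    coeffs = [offset]
    for t in theta:
        coeffs.append(coeffs[-1] + math.exp(t))
    return coeffs


def linear_spline(x, coeffs, lo=0.0, hi=1.0):
    """Evaluate a piecewise-linear spline with uniform knots on [lo, hi]."""
    x = min(max(x, lo), hi)
    pos = (x - lo) / (hi - lo) * (len(coeffs) - 1)
    k = min(int(pos), len(coeffs) - 2)
    frac = pos - k
    return (1.0 - frac) * coeffs[k] + frac * coeffs[k + 1]


random.seed(0)
theta = [random.gauss(0.0, 2.0) for _ in range(6)]  # arbitrary unconstrained values
coeffs = monotone_coefficients(theta)
values = [linear_spline(i / 200, coeffs) for i in range(201)]
is_monotone = all(b >= a for a, b in zip(values, values[1:]))
print(is_monotone)
```

Because a B-spline with non-decreasing coefficients is itself non-decreasing, the same construction carries over to higher spline degrees.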
Results
MKAN demonstrated competitive performance on the SMM/ICML-2024 benchmark suite, excelling in both classification and regression tasks. It validated the theoretical prediction of the representation-cost theorem through a self-supervised feature-size sweep across four datasets, achieving significantly higher Spearman alignment with ground-truth factors compared to KAN, MLP, and linear baselines.
Implications
The findings suggest that MKAN can be effectively utilized in applications requiring monotonic responses, such as in scientific, economic, and tabular data contexts. The model's interpretability and performance make it a valuable tool for practitioners needing reliable and transparent predictions.
Towards Anomaly Detection on Relational Data
Graph Learning
- RelAD effectively addresses the challenges of feature redundancy and complex relational dependencies in relational anomaly detection.
- The framework integrates both attribute and relational edge reconstruction to enhance anomaly detection accuracy.
- Extensive experiments on benchmark datasets show that RelAD outperforms existing anomaly detection methods.
Summary
This paper addresses the challenge of detecting anomalies in relational databases, which are commonly used to manage structured data in various applications. The authors identify two primary challenges in relational anomaly detection (RAD): feature redundancy and signal dilution due to high-dimensional and heterogeneous attributes, and the complexity of relational dependencies among entities. To tackle these challenges, they propose a novel framework called RelAD, which employs a reconstruction-based approach to capture anomalies through both attribute and relational edge reconstruction. RelAD consists of two main components: a conditional sparse-gated attribute reconstruction module that focuses on significant abnormal semantic blocks while minimizing redundant features, and a dual-view multi-relational edge reconstruction module that identifies relation-specific abnormal connections. The integration of these components through a lightweight fusion module results in a final anomaly score. The authors also construct six benchmark datasets with systematic anomalies and conduct extensive experiments, demonstrating that RelAD consistently outperforms existing baseline methods while maintaining competitive efficiency.
Methodology
The proposed RelAD framework utilizes a reconstruction-based approach with two core modules: conditional sparse-gated attribute reconstruction to filter out redundant features, and dual-view multi-relational edge reconstruction to capture relation-specific anomalies. These components are combined in a lightweight fusion module to produce anomaly scores.
Results
The experiments conducted on six benchmark datasets reveal that RelAD consistently outperforms traditional tabular and graph anomaly detection methods, demonstrating its effectiveness in accurately identifying anomalies in relational data while achieving competitive computational efficiency.
Implications
The findings suggest that RelAD can be applied in various domains such as fraud detection, risk management, and abnormal behavior identification in relational databases, enhancing the ability to uncover hidden anomalies in complex data structures.
Target-confidence Recourse Using tSeTlin machines: TRUST
Interpretability
Optimization
Theory
- TRUST allows users to specify desired prediction confidence levels for counterfactual explanations.
- The framework uses a Probabilistic Tsetlin Machine (PTM) to enhance the interpretability and robustness of recourse options.
- Counterfactuals generated by TRUST are more stable and less fragile compared to traditional boundary-based approaches.
- The methodology provides clause-level attribution, explaining the reliability of different counterfactuals.
Summary
The paper introduces Target-confidence Recourse Using tSeTlin machines (TRUST), a novel framework for generating counterfactual explanations (CEs) in high-stakes decision-making systems. Traditional methods typically focus on minimal changes to inputs that flip model decisions, often leading to fragile and unstable counterfactuals. TRUST allows users to specify a desired prediction confidence level, enabling the generation of more robust and interpretable recourse options. By employing a Probabilistic Tsetlin Machine (PTM) combined with Bayesian optimization, the framework directly searches for minimal input changes that meet user-defined confidence targets. This approach not only enhances the stability of counterfactuals but also provides clause-level explanations that clarify the reliability of different recourse options. Experimental results on synthetic and real-world datasets demonstrate that TRUST achieves perfect robustness while maintaining low recourse costs, thus offering a significant advancement in the field of algorithmic recourse.
Methodology
The authors propose a new formulation for counterfactual explanations that incorporates user-defined confidence levels as optimization constraints. The framework utilizes a Probabilistic Tsetlin Machine (PTM) to generate counterfactuals and employs Bayesian optimization to search for minimal changes that satisfy these constraints. This approach allows for the assessment of counterfactuals based on their stability and reliability, providing deeper insights into the decision-making process.
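The effect of a confidence target on recourse cost can be illustrated with a deliberately simplified stand-in. The sketch below swaps the Probabilistic Tsetlin Machine for a logistic model and replaces Bayesian optimization with the closed-form minimal L2 shift that exists for linear logits; the weights and the instance are made up. It is not the TRUST procedure, only an instance of the "smallest change that reaches a chosen confidence" formulation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def min_l2_recourse(x, w, b, target_conf):
    """Smallest L2 change to x that lifts a logistic model's confidence to target_conf."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    needed = math.log(target_conf / (1.0 - target_conf)) - z  # gap in logit space
    if needed <= 0:
        return list(x), 0.0  # already confident enough
    norm_sq = sum(wi * wi for wi in w)
    x_cf = [xi + needed * wi / norm_sq for xi, wi in zip(x, w)]
    return x_cf, needed / math.sqrt(norm_sq)


w, b = [1.5, -2.0], -0.5   # hypothetical model weights
x = [0.2, 0.6]             # hypothetical instance that is currently rejected
for target in (0.5, 0.9):
    x_cf, cost = min_l2_recourse(x, w, b, target)
    conf = sigmoid(sum(wi * xi for wi, xi in zip(w, x_cf)) + b)
    print(f"target={target}: cost={cost:.3f}, confidence={conf:.3f}")
```

With a target of 0.5 the counterfactual sits exactly on the decision boundary, which is the fragile case; a target of 0.9 costs more but leaves a margin.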
Results
The experiments conducted on both synthetic and real-world datasets indicate that the counterfactuals produced by TRUST are significantly more robust and interpretable than those generated by traditional methods. For instance, on the Haberman dataset, TRUST achieved a low recourse cost (L2 distance of 0.10) while maintaining a high confidence level (0.92). The results demonstrate that the framework can effectively balance cost, confidence, and robustness in generating actionable decision support.
Implications
The TRUST framework has significant implications for high-stakes decision-making systems, such as credit approval and clinical risk assessment, where understanding the rationale behind decisions is crucial. By providing stable and interpretable counterfactuals, TRUST can enhance transparency and trust in algorithmic decision-making processes, aligning with regulatory requirements for explainability.
On the Residual Scaling of Looped Transformers: Stability and Transferability
Theory
Optimization
Large Language Models
- The standard scaling ε = 1/√N is insufficient for looped Transformers due to weight sharing.
- A new scaling ε = 1/N is proposed to stabilize training and control residual growth.
- The derived parameterization ε = λ/(N√L) allows for effective hyperparameter transfer across different loop counts and depths.
- Experiments validate that linear residual scaling enhances trainability and maintains optimal learning rates across various configurations.
Summary
This paper investigates the residual scaling in looped Transformers, which utilize a shared residual block across multiple iterations to increase effective depth without adding parameters. The authors identify that the conventional square-root scaling factor ε = 1/√N (where N is the loop count), which is effective for deep residual networks, is inadequate for looped architectures due to the correlation of residual updates caused by weight sharing. They propose a new scaling factor ε = 1/N, which effectively stabilizes the training process by controlling the growth of the residual-stream norm. Furthermore, they derive a factored parameterization ε = λ/(N√L) for multi-layer blocks, allowing for independent control of within-layer and across-layer variance. This leads to a significant finding that the optimal learning rate is determined solely by the number of unique layers L, facilitating hyperparameter transfer from smaller to larger loop counts without the need for retuning. Experimental results demonstrate that the proposed scaling improves trainability and maintains consistent learning rates across varying loop counts, confirming the theoretical predictions.
Methodology
The authors conducted a theoretical analysis of the residual scaling in looped Transformers, deriving new scaling rules based on the correlation of updates due to weight sharing. They also performed experiments on decoder-only Transformers trained on the FineWeb-Edu dataset to validate their theoretical findings and assess the impact of different scaling factors on training stability and learning rate transferability.
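A scalar toy model gives some intuition for why square-root scaling fails when the same block is reused. The sketch assumes the shared block acts as a fixed linear map `w`, so every loop applies a perfectly correlated update; this is an illustrative simplification, not the paper's analysis.

```python
import math


def looped_gain(num_loops, eps, w=1.0):
    """Growth factor of x after applying x <- x + eps * w * x num_loops times with shared w."""
    return (1.0 + eps * w) ** num_loops


for n in (4, 16, 64, 256):
    sqrt_rule = looped_gain(n, 1.0 / math.sqrt(n))
    linear_rule = looped_gain(n, 1.0 / n)
    print(f"N={n:4d}  eps=1/sqrt(N): {sqrt_rule:12.4g}   eps=1/N: {linear_rule:.4f}")
```

Under ε = 1/N the gain stays below e for every loop count, while under ε = 1/√N it grows roughly like exp(√N).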
Results
The experiments confirmed that using the proposed scaling ε = 1/N significantly improved the trainability of looped Transformers compared to the traditional ε = 1/√N scaling. The optimal learning rate remained stable across different loop counts and depths, demonstrating the effectiveness of the new parameterization in maintaining performance without requiring retuning.
Implications
The findings suggest that the proposed scaling rules can enhance the design and training of looped Transformers, making them more robust and easier to tune. This could lead to improved performance in various applications of Transformers, particularly in scenarios requiring iterative processing or deep architectures.
No-Free-Fairness: Fundamental Limits and Trade-offs in Learning Systems
Theory
- Unfairness in learning systems arises from intrinsic structural properties rather than just biased data.
- There exists a fundamental fairness-cost trade-off that limits the ability to achieve both high performance and fairness.
- Finite sample sizes can lead to unavoidable disparities, even in ideal conditions where fair solutions exist.
- The choice of model can create inherent limitations in achieving fairness, independent of data quality.
Summary
This paper presents the No-Free-Fairness theorems, which establish fundamental limits and trade-offs in achieving fairness within learning systems. The author identifies three primary sources of disparity: (1) Task-inherent limits where subgroup performance is coupled with overall performance, leading to a fairness-cost trade-off; (2) Algorithmic and statistical limits that arise from finite sample sizes, which can exacerbate disparities even in ideal conditions; and (3) Structural limits imposed by the choice of model, where certain models may not adequately represent subgroup patterns, resulting in persistent unfairness. The findings indicate that unfairness is not merely a result of biased data or suboptimal algorithms but is deeply rooted in the intrinsic properties of decision problems, data constraints, and model expressivity. The paper emphasizes that achieving fairness requires explicit trade-offs and should be a core consideration in the design of learning systems.
Methodology
The author employs theoretical analysis to derive impossibility results regarding fairness in learning systems, focusing on the interplay between task structure, statistical learning, and model expressivity. The study uses a risk ratio framework to assess disparities and establishes a unified set of No-Free-Fairness results.
Results
The paper concludes that achieving fairness in learning systems is fundamentally constrained by task-inherent limits, algorithmic and statistical limitations, and structural model constraints. These results demonstrate that unfairness is an inevitable aspect of learning systems that cannot be eliminated solely through improved data or algorithmic design.
Implications
The findings suggest that practitioners and researchers must acknowledge the inherent trade-offs between fairness and performance in machine learning applications. This understanding can guide the development of more equitable algorithms and inform policy decisions in high-stakes domains such as finance and healthcare.
Delta-Based Target Reformulation for Short-Term Electricity Load Forecasting Using LSTM and Transformer Models
Time Series
- Delta-based target reformulation reduces hour-ahead MAPE by over 50% compared to absolute formulations.
- LSTM and Transformer models benefit from delta targets, especially for short-term predictions.
- LightGBM shows competitive performance under absolute formulations but struggles with error accumulation in multi-step delta reconstruction.
- The study is validated on eight years of real-world data, ensuring operational relevance.
Summary
This paper addresses the challenge of accurate short-term electricity load forecasting, which is essential for the efficient operation of power systems affected by non-stationarity due to weather, calendar effects, and changing consumption patterns. The study introduces a delta-based target reformulation approach, inspired by classical time-series differencing techniques, to improve forecasting accuracy. Instead of predicting absolute load values, the proposed method focuses on forecasting the change in load between consecutive time steps, which is then used to reconstruct absolute forecasts. The research employs multi-year, hourly-resolution electricity load data from India, enhanced with meteorological and calendar features, to evaluate the performance of Long Short-Term Memory (LSTM) and Transformer models against state-of-the-art Gradient Boosting Decision Trees (LightGBM). The experiments cover short forecasting horizons, specifically hour-ahead and day-ahead predictions, and utilize Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) for performance assessment. The findings reveal that the delta-based reformulation significantly enhances hour-ahead forecasting accuracy, achieving over 50% MAPE reduction compared to absolute formulations. For day-ahead forecasting, delta targets particularly benefit deep learning models, while tree-based models perform competitively under absolute formulations. The study concludes that delta reformulation serves as a beneficial inductive bias for neural networks, though its effectiveness varies with the model and forecasting horizon.
Methodology
The study employs a delta-based target reformulation approach for short-term electricity load forecasting, training LSTM and Transformer models to predict changes in load rather than absolute values. The methodology includes data augmentation with meteorological and calendar features, rigorous benchmarking against LightGBM, and performance evaluation using MAE and MAPE metrics.
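The reformulation itself is a small transformation around the model. The sketch below, using made-up load values and illustrative function names, shows the forward conversion, the reconstruction of absolute forecasts, and how a small constant bias in predicted deltas accumulates over a multi-step horizon.

```python
def to_deltas(load):
    """Convert an absolute load series into step-to-step changes."""
    return [b - a for a, b in zip(load, load[1:])]


def reconstruct(last_observed, predicted_deltas):
    """Rebuild absolute forecasts by accumulating predicted deltas onto the last known load."""
    forecasts, level = [], last_observed
    for delta in predicted_deltas:
        level += delta
        forecasts.append(level)
    return forecasts


load = [100.0, 104.0, 109.0, 107.0, 103.0]   # made-up hourly load values
deltas = to_deltas(load)                      # [4.0, 5.0, -2.0, -4.0]
assert reconstruct(load[0], deltas) == load[1:]

# A constant bias of +0.5 per predicted delta compounds over a multi-step horizon.
biased = reconstruct(load[0], [d + 0.5 for d in deltas])
errors = [f - y for f, y in zip(biased, load[1:])]
print(errors)  # [0.5, 1.0, 1.5, 2.0]
```

For an hour-ahead forecast only one delta is added to an observed value, so no accumulation occurs; the growing error appears only when deltas are chained.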
Results
The results indicate that delta-based reformulation consistently enhances forecasting accuracy for hour-ahead predictions across all models, with MAPE reductions exceeding 50%. For day-ahead forecasting, delta targets improve performance for deep learning models, while LightGBM performs well under absolute formulations but faces challenges with multi-step delta reconstruction.
Implications
The findings suggest that delta-based target reformulation can be a powerful strategy for improving forecasting accuracy in short-term electricity load predictions, with potential applications in power system operations and energy management.
Attention as Frustrated Synchronization
Theory
Efficient ML
NLP
- Introduces the Frustrated Synchronization Network (FSN) as a new attention mechanism.
- FSN replaces consensus in traditional attention with frustrated synchronization, enhancing predictive capabilities.
- Achieves lower validation loss compared to tuned transformers, especially in long-range copying tasks.
- Utilizes a coupling kernel that is directly interpretable through synchronization concepts.
Summary
This paper introduces the Frustrated Synchronization Network (FSN), a novel attention mechanism that modifies the traditional self-attention approach by replacing consensus with frustrated synchronization. The FSN is inspired by the behavior of coupled oscillators, where each token in a sequence is coupled to the successors of the tokens it attends to, rather than synchronizing towards them. This method allows the FSN to retrieve context while also continuing it, addressing limitations in prediction tasks associated with standard attention mechanisms. The FSN employs a coupling kernel derived from synchronization literature, enabling interpretability of the learned parameters. Experimental results demonstrate that the FSN outperforms a tuned transformer model in terms of validation loss, particularly excelling in long-range copying tasks. The FSN achieves lower validation loss at matched parameters on character-level text and code datasets, indicating its potential for efficient learning in sequential data.
Methodology
The FSN processes sequences of tokens through residual layers, using a score map for relevance and a coupling kernel for updates. It employs Kuramoto–Sakaguchi coupling to adjust token states based on successors rather than direct agreement, allowing for a data-driven frustration angle. The attention scores are calculated using a gated sum of phase coherences, and the update direction is determined by a learned complex kernel over harmonics.
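For readers unfamiliar with the coupling named here, the classical Kuramoto–Sakaguchi model can be simulated in a few lines. The sketch covers only that textbook model, with a fixed phase lag `alpha` playing the role of the frustration angle; the FSN's successor coupling, gated phase coherences, and learned complex kernel are not reproduced, and all parameter values are arbitrary.

```python
import math


def kuramoto_sakaguchi_step(phases, omegas, coupling, alpha, dt):
    """One explicit-Euler step of d(theta_i)/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i - alpha)."""
    n = len(phases)
    new_phases = []
    for theta_i, omega_i in zip(phases, omegas):
        pull = sum(math.sin(theta_j - theta_i - alpha) for theta_j in phases)
        new_phases.append(theta_i + dt * (omega_i + coupling / n * pull))
    return new_phases


def order_parameter(phases):
    """Magnitude of the mean phasor; 1.0 means full synchrony."""
    re = sum(math.cos(t) for t in phases) / len(phases)
    im = sum(math.sin(t) for t in phases) / len(phases)
    return math.hypot(re, im)


phases = [0.0, 1.0, 2.0, 2.5]
omegas = [0.0] * 4   # identical natural frequencies
for _ in range(2000):
    phases = kuramoto_sakaguchi_step(phases, omegas, coupling=1.0, alpha=0.3, dt=0.01)
print(round(order_parameter(phases), 3))
```

With a nonzero `alpha` the oscillators still lock together, but the locked group drifts at a shifted common frequency rather than settling at the mean natural frequency.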
Results
The FSN achieved a validation loss of 1.6050 bits per character on the enwik8 dataset, outperforming a tuned transformer at 1.6258 bits per character. It demonstrated superior performance in long-range copying tasks, reversing a previous deficit of Kuramoto attention. The FSN's advantage persisted across various parameter settings, indicating its robustness and efficiency.
Implications
The FSN's approach to attention mechanisms could lead to advancements in models requiring long-range dependencies, such as language models and sequence prediction tasks. Its interpretability through synchronization concepts may also enhance understanding and trust in model decisions.
Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning
NLP
Large Language Models
Reinforcement Learning
- MAST effectively reduces collateral damage during unlearning compared to full-parameter updates.
- Mechanism separation is crucial, as SFT and RLVR updates differ significantly in their impact on model behavior.
- MAST achieves meaningful forgetting while preserving performance on non-target tasks, demonstrating its robustness across different models.
- Standard evaluation metrics for unlearning may be insufficient, as they can overlook the complexities of reasoning updates.
Summary
This paper introduces Mechanism-Aligned Selective Targeting (MAST), a novel method for unlearning reasoning behavior induced by Reinforcement Learning with Verifiable Rewards (RLVR). The authors highlight the limitations of standard full-parameter updates, which often cause collateral damage to non-target capabilities. MAST operates by ranking attention-projection tensors based on off-principal energy, update magnitude, and forget-gradient coupling, allowing for targeted updates that minimize negative impacts on retained knowledge. The study evaluates MAST on two model families, demonstrating its effectiveness in achieving statistically significant forgetting of target behaviors while preserving performance on non-target tasks. The findings suggest that traditional unlearning methods may not adequately address the unique challenges posed by RLVR-induced reasoning, emphasizing the need for mechanism-guided approaches.
Methodology
The authors developed MAST, which ranks attention-projection tensors based on a mechanism score and restricts updates to the top-ranked subset. They conducted experiments using two model families, Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, to evaluate the effectiveness of MAST against standard full-parameter updates. The methodology involved constructing matched SFT and RLVR checkpoints and assessing forgetting on specific task categories while measuring performance on retained tasks.
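The "rank, then restrict updates" step can be sketched as follows. The mechanism scores are taken as given (how the three signals are combined is not stated in the summary), and the tensor names, values, and the plain gradient step are hypothetical placeholders.

```python
def select_top_tensors(mechanism_scores, k):
    """Names of the k highest-scoring tensors; only these will be updated."""
    ranked = sorted(mechanism_scores, key=mechanism_scores.get, reverse=True)
    return set(ranked[:k])


def apply_selective_update(params, grads, selected, lr):
    """Gradient step on selected tensors only; every other tensor is copied unchanged."""
    return {
        name: [p - lr * g for p, g in zip(values, grads[name])] if name in selected else list(values)
        for name, values in params.items()
    }


# Hypothetical tensors, flattened to short lists for illustration.
params = {"q_proj": [1.0, 2.0], "k_proj": [0.5, 0.5], "o_proj": [3.0]}
grads = {"q_proj": [0.5, -1.0], "k_proj": [9.0, 9.0], "o_proj": [9.0]}
scores = {"q_proj": 0.9, "k_proj": 0.1, "o_proj": 0.4}

selected = select_top_tensors(scores, k=1)
updated = apply_selective_update(params, grads, selected, lr=0.1)
print(selected, updated)
```

Tensors outside the selected set keep their values even when their gradients are large, which is the mechanism by which collateral changes are limited.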
Results
MAST induced statistically significant forgetting of target behaviors (accuracy on the MATH forget set dropped from 45/150 to 37/150, p = 0.0078) while preserving performance on GSM8K (+0.8 percentage points) and largely maintaining MATH retention (-0.5 percentage points). The advantages of MAST were consistent across different seeds and model objectives, highlighting its effectiveness in minimizing collateral damage during unlearning.
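The summary does not say which significance test produced p = 0.0078. One reading that reproduces the figure, offered here purely as an assumption, is an exact paired (sign or McNemar-type) test in which 8 forget-set items flipped from correct to incorrect and none flipped the other way, matching the net change from 45 to 37 correct answers.

```python
from math import comb


def exact_sign_test_p(wins, losses):
    """Two-sided exact binomial p-value for discordant pairs under a fair-coin null."""
    n = wins + losses
    tail = sum(comb(n, k) for k in range(max(wins, losses), n + 1)) / 2**n
    return min(1.0, 2.0 * tail)


# Assumed split of discordant pairs: 8 items forgotten, 0 newly solved.
print(exact_sign_test_p(8, 0))  # 0.0078125
```

Other splits with the same net change of 8, such as 10 versus 2, give larger p-values under this test, so the match holds only for the all-one-direction case.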
Implications
The findings suggest that MAST could be applied in scenarios where selective unlearning is necessary, particularly in large language models and other AI systems where retaining certain capabilities while suppressing others is critical. This work may influence future research on model editing and unlearning strategies, particularly in reinforcement learning contexts.
Performance-Driven Environment Abstraction with Multi-Timescale Learning
Reinforcement Learning
Robotics
Theory
- Introduces performance-driven environment abstraction optimizing decision quality in MDPs.
- Establishes performance guarantees separating value-function approximation error from action-sharing error.
- Develops a multi-timescale reinforcement learning algorithm that adapts policy and abstraction jointly.
- Empirical results show improved sample efficiency and faster replanning compared to existing methods.
Summary
This paper addresses the challenge of decision-making in large Markov decision processes (MDPs) by proposing a performance-driven environment abstraction approach. Unlike traditional methods that focus on preserving geometric or topological structures, the authors advocate for abstractions that optimize decision quality directly. They model the abstraction as a controlled approximation achieved through state space aggregation and shared action distributions within aggregated states. The paper establishes performance guarantees that differentiate between value-function approximation errors and errors from action sharing. To implement this, a multi-timescale reinforcement learning framework is developed, which adapts both the policy and a tree-structured environment abstraction. This framework refines and coarsens the state space based on Q-value discrepancies, effectively balancing performance with abstraction complexity. Empirical evaluations show that the proposed method achieves significant state compression, enhances sample efficiency, and accelerates replanning compared to actor-critic baselines, demonstrating its effectiveness in improving decision-making in complex environments.
Methodology
The authors propose a controlled state space aggregation approach for environment abstraction, focusing on optimizing decision quality rather than preserving geometric structures. They derive performance guarantees that identify sources of suboptimality and develop a multi-timescale reinforcement learning framework that adapts both the policy and the abstraction based on Q-value discrepancies.
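A refinement step driven by Q-value discrepancies might look like the sketch below. The spread measure, the threshold, and the rule of splitting a cluster in half by greedy value are illustrative assumptions; the paper's tree-structured abstraction and its coarsening step are not reproduced.

```python
def q_spread(cluster, q_values):
    """Largest gap, over actions, between Q-values of states that share one abstract state."""
    num_actions = len(q_values[cluster[0]])
    return max(
        max(q_values[s][a] for s in cluster) - min(q_values[s][a] for s in cluster)
        for a in range(num_actions)
    )


def refine(partition, q_values, threshold):
    """Split every cluster whose Q-spread exceeds the threshold into two halves."""
    refined = []
    for cluster in partition:
        if len(cluster) > 1 and q_spread(cluster, q_values) > threshold:
            ordered = sorted(cluster, key=lambda s: max(q_values[s]))
            mid = len(ordered) // 2
            refined += [ordered[:mid], ordered[mid:]]
        else:
            refined.append(list(cluster))
    return refined


# Made-up Q-table: state -> [Q(s, a0), Q(s, a1)].
q_values = {0: [1.0, 0.0], 1: [1.1, 0.1], 2: [5.0, 4.0], 3: [5.2, 4.1]}
partition = refine([[0, 1, 2, 3]], q_values, threshold=0.5)
print(partition)  # [[0, 1], [2, 3]]
```

States whose Q-values nearly agree can share one action distribution at little cost, so a second call to `refine` on this partition leaves it unchanged.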
Results
The proposed method achieved substantial state compression, improved sample efficiency, and faster replanning compared to actor-critic baselines. The empirical results validate the effectiveness of the performance-driven abstraction in enhancing decision-making capabilities in complex environments.
Implications
This work has significant implications for autonomous systems and robotics, where efficient decision-making in large state spaces is crucial. The ability to adaptively construct hierarchical representations can lead to more intelligent and responsive agents capable of handling complex tasks in dynamic environments.
MM++: Unsupervised Scale-Invariant Multilayer OOD Detection via Top-K Gated Feature Fusion
Computer Vision
- MM++ is a fully unsupervised and strictly post-hoc framework for OOD detection.
- It employs a Top-K gated feature fusion mechanism to select the most informative layers based on entropy density drops.
- The framework utilizes a Ledoit–Wolf regularized tied covariance matrix for stable distance estimation.
- MM++ demonstrates superior performance across diverse architectures and challenging datasets, including long-tailed distributions.
Summary
The paper introduces MM++ (Multilayer Mahalanobis++), a novel framework for Out-of-Distribution (OOD) detection that is fully unsupervised, strictly post-hoc, and scale-invariant. MM++ addresses the challenges of scale sensitivity and hierarchical expressivity by constructing a joint feature space that captures cross-layer correlations while reducing noise from early layers. The framework identifies discriminative intermediate layers through entropy density drops, fusing these with terminal representations to enhance detection capabilities. A Ledoit–Wolf regularized tied covariance matrix stabilizes the unified feature space, allowing for reliable distance estimation without requiring auxiliary OOD data or classifier fine-tuning. The empirical validation shows that MM++ performs robustly across various architectures, including global attention, hierarchical attention, and convolutional networks, outperforming existing methods, particularly in challenging near-OOD scenarios.
Methodology
MM++ constructs a joint feature space by identifying discriminative intermediate layers using covariance entropy and entropy density drops. It employs a Top-K selection mechanism to focus on the most informative layers, concatenating ℓ2-normalized features into a unified representation. A single joint Mahalanobis++ distance is computed using a shared precision matrix estimated via Ledoit–Wolf shrinkage, which captures cross-layer dependencies effectively.
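The scale-invariance claim follows from normalizing each layer before concatenation, which the short sketch below demonstrates. Layer names and feature values are made up, the layer selection is hard-coded rather than derived from entropy density drops, and the Mahalanobis distance with Ledoit–Wolf shrinkage is not included.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit Euclidean norm (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse_layers(layer_features, selected):
    """Concatenate the l2-normalized features of the selected layers into one vector."""
    fused = []
    for name in selected:
        fused.extend(l2_normalize(layer_features[name]))
    return fused


features = {"block4": [3.0, 4.0], "block9": [0.5, 0.0, 1.2], "final": [10.0, -10.0]}
selected = ["block4", "final"]   # stand-in for the Top-K choice
fused = fuse_layers(features, selected)

# Rescaling one layer's activations leaves the fused vector unchanged.
rescaled = dict(features, block4=[300.0, 400.0])
fused_rescaled = fuse_layers(rescaled, selected)
print(fused, fused_rescaled)
```

Without the per-layer normalization, a layer with large activation magnitudes would dominate any distance computed on the concatenated vector.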
Results
The empirical results indicate that MM++ consistently outperforms state-of-the-art methods across various architectures, including ViTs, Swin, and ConvNeXt, particularly in near-OOD detection tasks. It shows higher resilience to architectural variance and class imbalance, achieving significant improvements in OOD detection performance on datasets like ImageNet-V2, -C, -ES, and -R.
Implications
The MM++ framework has significant implications for deploying deep learning models in safety-critical applications, such as medical diagnosis and autonomous driving, where reliable OOD detection is crucial. Its unsupervised nature and independence from auxiliary data make it a practical solution for real-world applications.
Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis
Multimodal
- Foundation models can effectively extract representations from multimodal cancer data.
- Image and omics modalities provide complementary information for predictive tasks.
- Multimodal fusion strategies can enhance performance, particularly when modalities are balanced.
- Conformal prediction offers a method for assessing model trustworthiness and uncertainty.
Summary
This paper investigates the effectiveness of foundation models (FMs) as representation extractors for multimodal cancer analysis, specifically focusing on whole-slide images and transcriptomic profiles. The study evaluates FM representations across two real-world datasets, IH-BC and IH-NSCLC, using a systematic approach that includes unimodal probing, multimodal fusion, and trustworthiness assessment through conformal prediction. The authors benchmark the performance of five FMs on eight classification tasks, revealing that image and omics representations provide complementary predictive signals. They explore three fusion strategies to combine these modalities, finding that multimodal fusion enhances performance primarily when no single modality dominates. Additionally, the study assesses the trustworthiness of the models, demonstrating that conformal prediction can effectively quantify uncertainty and maintain reliable diagnostic performance across demographic subgroups. The findings suggest that FM representations are competitive on out-of-distribution data, highlighting the importance of uncertainty-aware inference in clinical applications.
Methodology
The study employs a systematic evaluation framework that includes unimodal probing of image and omics representations, multimodal learning through fusion strategies, and trustworthiness analysis using conformal prediction. The evaluation is conducted on two in-house cancer cohorts with paired whole-slide images and transcriptomic data, focusing on eight classification tasks.
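How conformal prediction turns class probabilities into prediction sets can be shown with the common split-conformal recipe. The summary does not specify which variant or nonconformity score the study uses, so the score (one minus the probability of the true class) and all numbers below are assumptions for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrated score threshold targeting 1 - alpha coverage on exchangeable data."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    return scores[rank - 1] if rank <= n else math.inf


def prediction_set(probs, threshold):
    """All classes whose nonconformity score does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


# Made-up calibration set: predicted class probabilities and true labels.
cal_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
cal_labels = [0, 1, 2, 1]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
print(prediction_set([0.5, 0.35, 0.15], threshold))  # [0, 1]
```

Here the point prediction is class 0, but class 1 also enters the set, so a case where class 1 is the true diagnosis would be recoverable from the set even though the point prediction misses it.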
Results
The results indicate that FM representations achieve competitive performance on out-of-distribution data. Unimodal probing reveals complementary predictive signals from image and omics data. Multimodal fusion strategies improve performance on certain tasks, while conformal prediction demonstrates that most failures in point predictions can still yield recoverable diagnoses within the prediction set, underscoring the importance of uncertainty quantification.
Implications
The findings have significant implications for clinical decision-making in oncology, suggesting that leveraging multimodal data through foundation models can enhance diagnostic accuracy and reliability. The study also emphasizes the need for uncertainty-aware models in high-stakes medical applications, potentially improving patient outcomes.
LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents
Reinforcement Learning
Large Language Models
Optimization
- LLMZERO discovers adaptive training strategies that significantly outperform traditional fixed schedules.
- Capacity parameters accumulate monotonically, while regularization parameters oscillate, necessitating adaptive strategies.
- The system uses LLM agents to analyze training dynamics and propose coordinated hyperparameter transitions.
- Improvements range from 9% to 140% over baseline models and 6% to 15% over grid search methods.
Summary
The paper introduces LLMZERO, a novel system that leverages large language model (LLM) agents to discover adaptive training strategies for reinforcement learning (RL) post-training. The authors identify that traditional fixed training schedules are suboptimal due to their inability to adapt to the non-stationary dynamics of training. They observe that capacity parameters accumulate monotonically while regularization parameters oscillate in response to training dynamics. LLMZERO employs a tree search mechanism where LLM agents analyze training trajectories, diagnose issues, and propose coordinated multi-parameter transitions. The system was tested on four diverse RL tasks, demonstrating significant improvements over baseline models and grid search methods. The findings reveal that optimal strategies are dataset-dependent and exhibit non-monotonic regularization trajectories, providing actionable design principles for the RL community.
Methodology
LLMZERO utilizes a tree search approach where LLM agents evaluate training dynamics through textual metrics and visualizations. The agents propose hyperparameter transitions based on observed training states, employing an early stopping mechanism to terminate unpromising branches. The system balances exploration and exploitation using Upper Confidence bounds applied to Trees (UCT) search, allowing for multi-stage strategy composition.
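The exploration/exploitation trade-off in UCT can be sketched with the standard UCB1-style selection rule. This is a generic illustration, not LLMZERO's implementation. The branch names, reward values, and exploration constant `c` are hypothetical.

```python
import math

def uct_score(value_sum, visits, parent_visits, c=1.4):
    """Standard UCT score: mean reward plus an exploration bonus."""
    if visits == 0:
        # Unvisited branches are tried first.
        return math.inf
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits, c=1.4):
    """Pick the child branch with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["value_sum"], ch["visits"], parent_visits, c),
    )

# Hypothetical candidate hyperparameter transitions at one tree node.
children = [
    {"name": "raise_learning_rate", "value_sum": 4.5, "visits": 5},
    {"name": "lower_kl_penalty", "value_sum": 0.5, "visits": 1},
]

# With c = 0 the search is purely greedy; a larger c favors the rarely
# visited branch even though its mean reward is lower.
print(select_child(children, parent_visits=6, c=0.0)["name"])
print(select_child(children, parent_visits=6, c=1.4)["name"])
```

Early stopping of unpromising branches, as described above, would sit on top of this rule by removing a child from `children` once its training run is terminated.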
Results
The results indicate that LLMZERO's adaptive strategies lead to improvements of 9% to 140% relative to base models and 6% to 15% relative to grid search across four RL tasks. The system demonstrates high iteration efficiency, finding optimal strategies within the first 12 iterations on three out of four tasks.
Implications
The findings suggest that adaptive training strategies can enhance RL post-training processes, leading to better performance across various tasks. The principles derived from the study can inform future research and practical applications in RL, particularly in developing more responsive training methodologies.
Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring
Graph Learning
Generative Models
Time Series
- GGATN integrates graph neural networks with Transformer architectures for event sequence generation.
- The model generates full event sequences in a single pass, addressing limitations of autoregressive methods.
- Experiments show GGATN achieves superior generation quality compared to existing baselines.
- The architecture preserves global process structure while modeling position-specific dependencies.
Summary
This paper addresses the challenge of structurally constrained event sequence generation in predictive process monitoring (PPM), where generated sequences must adhere to specific constraints such as transition feasibility, temporal order, and attribute consistency. The authors propose a novel architecture called the Graph Grounded Cross Attention Transformer Neural Network (GGATN) that integrates a global process graph with Transformer-based self-attention mechanisms. GGATN generates full event sequences in a single pass, avoiding the limitations of autoregressive models that can lead to error accumulation. The architecture consists of three main stages: global process graph learning, sequence-level contextual modeling, and graph grounded cross attention interaction. This design allows for the preservation of global process regularities while capturing position-specific dependencies. The authors conducted experiments on six benchmark event logs, demonstrating that GGATN outperforms local instruction-prompted large language model baselines in terms of sequence similarity, control flow similarity, and duration distribution, while maintaining zero hallucinated activities and attribute inconsistencies. The results indicate that the global graph encoder serves as a stable structural prior, and interpretability analyses reveal how various components of the model contribute to the generation process.
Methodology
The GGATN architecture is structured into three stages: 1) global process graph learning to capture the overall process structure, 2) sequence-level contextual modeling using Transformer self-attention to capture position-specific dependencies, and 3) graph grounded cross attention to integrate the process topology with sequence dynamics. The model applies Viterbi-style graph-constrained decoding to ensure that generated paths are feasible.
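The idea behind graph-constrained decoding can be sketched as a Viterbi search in which only transitions present in the process graph are allowed. This is a simplified illustration under assumed inputs (per-position log scores and a successor map). The summary does not describe GGATN's decoder in this detail, and the activity names are hypothetical.

```python
def constrained_viterbi(log_scores, allowed, start):
    """Highest-scoring activity path that only uses feasible transitions.

    log_scores: list over positions of {activity: log score}.
    allowed: {activity: set of feasible successor activities}.
    start: set of activities permitted at the first position.
    Returns the best feasible path, or None if no feasible path exists.
    """
    # best[a] = (total score, path) of the best feasible path ending in a.
    best = {a: (s, [a]) for a, s in log_scores[0].items() if a in start}
    for scores in log_scores[1:]:
        nxt = {}
        for b, s_b in scores.items():
            # Only extend paths whose last activity may be followed by b.
            cands = [
                (s_prev + s_b, path + [b])
                for a, (s_prev, path) in best.items()
                if b in allowed.get(a, set())
            ]
            if cands:
                nxt[b] = max(cands, key=lambda t: t[0])
        best = nxt
    if not best:
        return None
    return max(best.values(), key=lambda t: t[0])[1]

# Hypothetical process graph: register -> review -> (review | approve).
allowed = {"register": {"review"}, "review": {"review", "approve"}, "approve": set()}
log_scores = [
    {"register": -0.1, "approve": -2.5},
    {"approve": -0.2, "review": -1.8},  # per-position argmax would pick "approve"
    {"approve": -0.3, "review": -1.5},
]
print(constrained_viterbi(log_scores, allowed, start={"register"}))
# -> ['register', 'review', 'approve']
```

Picking the top activity at each position independently would give register, approve, approve, which the graph forbids. The constrained search returns the best path that respects the transitions.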
Results
GGATN demonstrated significant improvements in sequence generation quality across multiple metrics, including sequence similarity, Damerau-Levenshtein similarity, and control flow similarity. The model maintained zero hallucinated activities and zero inconsistencies in sequence-level attributes, outperforming local instruction-prompted LLM baselines.
Implications
The proposed GGATN model has the potential to enhance predictive process monitoring by providing a more reliable and interpretable method for full event sequence generation. This can facilitate better decision-making in various domains, such as healthcare and business process management, where understanding complete future trajectories is crucial.
Informative Missingness to Generate Irregular Clinical Time Series
Generative Models
Time Series
- Introduces a diffusion-based framework for generating clinical time series that captures both lab values and their observation patterns.
- Highlights the importance of modeling informative missingness in clinical data rather than treating it as a preprocessing artifact.
- Demonstrates that the generated synthetic data closely matches real patient trajectories, indicating the model's effectiveness.
- Provides a preprocessing protocol that maintains MNAR-like structure while being compatible with diffusion training.
Summary
This paper addresses the challenge of generating realistic clinical time series data that reflects the irregular collection of laboratory tests in electronic health records (EHRs). The authors propose a diffusion-based approach that models both laboratory values and their observation patterns, emphasizing that the absence of a test can be as informative as the measurement itself. By utilizing the Data Analytics Challenge on Missing data Imputation (DACMI) benchmark derived from MIMIC-III, the authors align chart times into 4-hour intervals and segment admissions into 7-day windows, creating trajectories that pair lab values with corresponding observation indicators. The method extends the TimeDiff framework to learn continuous lab values alongside discrete missingness patterns through complementary diffusion objectives. Experimental results demonstrate that the generated data closely resembles real patient trajectories, capturing clinically relevant dependencies between patient physiology and clinician testing behavior under missing-not-at-random (MNAR) conditions. The findings suggest that this model can serve as a foundational component for developing clinical foundation models, paving the way for future work on Prior-Data Fitted Networks that leverage informative missingness.
Methodology
The authors developed a generative framework using diffusion models to produce synthetic clinical time series data. They aligned chart times into 4-hour intervals and segmented admissions into 7-day windows, creating a paired representation of continuous lab values and binary observation indicators. The model was trained using a modified TimeDiff framework that incorporates both continuous and discrete objectives, with preprocessing steps to ensure realistic sampling and representation of missingness.
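The paired representation described above can be sketched as follows: a 7-day window split into 4-hour bins gives 42 time steps, each holding a lab value and a binary observation indicator. How the authors handle several measurements in one bin, or what they store for unobserved bins, is not stated in the summary. The choices below (keep the latest value, store 0.0 as a placeholder) are assumptions, and the function name is hypothetical.

```python
BIN_HOURS = 4
WINDOW_HOURS = 7 * 24
N_BINS = WINDOW_HOURS // BIN_HOURS  # 42 time steps per window

def grid_lab_series(events):
    """Turn irregular lab events into (values, mask) on a 4-hour grid.

    events: list of (hours_since_window_start, value) for one lab test.
    values[i] holds the lab value for bin i (0.0 placeholder if unobserved).
    mask[i] is 1 if the test was observed in bin i, else 0.
    """
    values = [0.0] * N_BINS
    mask = [0] * N_BINS
    for t, v in sorted(events):
        if 0 <= t < WINDOW_HOURS:
            b = int(t // BIN_HOURS)
            # Assumption: a later measurement in the same bin overwrites earlier ones.
            values[b] = v
            mask[b] = 1
    return values, mask

# Made-up potassium measurements; the last one falls outside the 7-day window.
events = [(1.5, 4.1), (3.0, 4.3), (9.0, 3.9), (200.0, 5.0)]
values, mask = grid_lab_series(events)
print(len(values), sum(mask))  # -> 42 2
```

Keeping `mask` as a separate channel, rather than imputing and discarding it, is what lets a generative model learn the testing pattern itself.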
Results
The experiments showed that the generated synthetic data closely matched the distributions of real patient trajectories across individual lab values and joint value-missingness embeddings. The model effectively captured the dependencies between patient physiology and clinician testing behavior, demonstrating its potential in generating high-fidelity synthetic clinical data.
Implications
This work has significant implications for the development of clinical foundation models and synthetic data generation in healthcare. By accurately modeling informative missingness, the proposed framework can enhance the training of machine learning models that rely on clinical data, ultimately improving patient care and research outcomes.
Perron–Frobenius Operator Matching for Generative Modeling
Generative Models
Theory
Optimization
- PFOM generalizes density evolution matching beyond first-order descriptions.
- Only Kullback–Leibler divergence preserves equality between density-level and sample-conditioned objectives.
- Nesterov-accelerated training enhances convergence and reduces discretization errors.
- PFOM demonstrates improved efficiency in generative modeling tasks.
Summary
This paper introduces the Perron–Frobenius Operator Matching (PFOM) framework, which aims to enhance generative modeling by matching density evolution through the integral Perron–Frobenius operator. PFOM generalizes existing paradigms such as flow and diffusion matching by extending the matching process beyond first-order descriptions to encompass infinitely many orders of density evolution. A significant contribution is the demonstration that among Bregman divergences, only the Kullback–Leibler divergence maintains equality between density-level and sample-conditioned objectives, leading to a practical loss function that aligns with Koopman path matching. The authors also propose a Nesterov-accelerated training and sampling method that stabilizes discretization and accelerates convergence, resulting in improved efficiency as measured by KL divergence, 2-Wasserstein distance (W2), and maximum mean discrepancy (MMD). Empirical validation shows that PFOM achieves faster convergence and better wall-clock efficiency compared to traditional methods, thereby unifying operator-theoretic identification with contemporary generative modeling approaches. This work opens avenues for adaptive dictionaries and applications in high-dimensional spaces.
Methodology
The authors develop the PFOM framework by leveraging the Perron–Frobenius operator to match density evolution. They establish a connection between density-level objectives and sample-conditioned criteria using Kullback–Leibler divergence. The methodology includes a Nesterov-accelerated optimization approach that alternates between extrapolated evaluations of the PF loss and corrective updates, enhancing the stability and speed of the training process.
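The "extrapolate, then correct" pattern of Nesterov acceleration can be illustrated on a toy problem. The sketch below applies the standard Nesterov momentum update to a simple quadratic, which stands in for the PF loss. It is not the authors' training procedure, and the step size, momentum value, and function names are assumptions.

```python
def nesterov_minimize(grad, x0, lr=0.1, momentum=0.9, steps=200):
    """Minimize a function with Nesterov accelerated gradient descent.

    grad: function mapping a parameter list to its gradient list.
    x0: initial parameters.
    """
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(steps):
        # Extrapolation: look ahead along the current velocity.
        look = [xi + momentum * vi for xi, vi in zip(x, v)]
        # Evaluate the gradient at the extrapolated point, not at x.
        g = grad(look)
        # Correction: update the velocity, then the parameters.
        v = [momentum * vi - lr * gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in loss: 0.5 * (x0 - 3)^2 + 2.5 * (x1 + 1)^2, minimized at (3, -1).
def toy_grad(x):
    return [x[0] - 3.0, 5.0 * (x[1] + 1.0)]

print(nesterov_minimize(toy_grad, [0.0, 0.0]))
```

Evaluating the gradient at the look-ahead point is what separates this from classical momentum, where the gradient is taken at the current parameters.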
Results
Empirical results indicate that PFOM achieves faster decreases in KL, W2, and MMD metrics compared to existing generative modeling methods. The Nesterov-accelerated approach leads to improved wall-clock efficiency and reduced discretization errors, demonstrating the effectiveness of the proposed framework in generative tasks.
Implications
The PFOM framework has significant implications for generative modeling, particularly in applications requiring accurate density estimation and sample generation. Its integration of operator-theoretic principles with modern generative techniques could enhance the modeling of complex systems in various fields, including robotics, finance, and automated control algorithms.