AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
48
Papers today
8h
Update frequency
7
Days of history
MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment
NLP
Large Language Models
Optimization
- MGDA-DECOUPLED offers a geometry-based approach to multi-objective optimization for LLM alignment.
- The method addresses procedural unfairness by considering the convergence dynamics of each objective.
- It operates within the lightweight DPO paradigm, avoiding the complexities of reinforcement learning.
- Experiments show that MGDA-DECOUPLED achieves superior performance in aligning LLMs with human values.
Summary
The paper introduces MGDA-DECOUPLED, a novel geometry-aware multi-objective optimization algorithm designed to enhance the alignment of large language models (LLMs) with human values. Traditional alignment methods often rely on fixed scalarization of multiple objectives, which can lead to procedural unfairness by under-representing minority objectives. MGDA-DECOUPLED addresses this issue by finding a shared descent direction while considering the convergence dynamics of each objective. Unlike prior methods that depend on reinforcement learning or explicit reward models, MGDA-DECOUPLED operates within the Direct Preference Optimization (DPO) framework, making it more lightweight and efficient. The authors conducted experiments using the UltraFeedback dataset, demonstrating that MGDA-DECOUPLED outperforms existing methods in achieving higher win rates against golden responses across multiple objectives. This approach not only improves the alignment of LLMs but also promotes fairness in the representation of diverse human values during the training process.
Methodology
The authors propose a multi-objective optimization framework that utilizes multiple pairwise preference datasets for distinct objectives. MGDA-DECOUPLED computes independent gradients and losses for each dataset and dynamically combines them to create a unified update vector for model training. This method leverages the geometry of the gradient landscape to find a common descent direction that improves all objectives simultaneously, thus automating the selection of scalarization coefficients.
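The core MGDA building block the method extends can be sketched in a few lines: for two objectives, the minimum-norm convex combination of the gradients has a closed form, and stepping against it decreases both objectives to first order. This is the classic MGDA step (Désidéri's formulation), not the authors' decoupled variant, and the function name is illustrative:

```python
import numpy as np

def mgda_two_objective_direction(g1, g2):
    """Min-norm element of the convex hull of {g1, g2} (classic MGDA).
    Stepping along the negative of this vector is a common descent
    direction: it has non-negative inner product with both gradients."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:          # gradients agree; either one works
        return g1.copy()
    alpha = np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0)
    return alpha * g1 + (1.0 - alpha) * g2

# Two conflicting objective gradients
g1 = np.array([1.0, 0.0])
g2 = np.array([0.0, 1.0])
d = mgda_two_objective_direction(g1, g2)
# Stepping along -d decreases both objectives (to first order)
assert d @ g1 >= 0 and d @ g2 >= 0
```

MGDA-DECOUPLED additionally accounts for each objective's convergence dynamics when choosing the combination, which this two-gradient sketch does not capture.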
Results
The experiments conducted on the UltraFeedback dataset reveal that MGDA-DECOUPLED achieves the highest win rates against golden responses, both overall and for individual objectives, compared to other existing methods. This indicates that the geometry-aware approach effectively balances the trade-offs among conflicting objectives in LLM alignment.
Implications
The findings suggest that MGDA-DECOUPLED can significantly enhance the alignment of LLMs with human values, making it a valuable tool for developers aiming to create more equitable AI systems. The methodology also highlights the importance of considering diverse human needs in model training, which could lead to more fair and responsible AI applications.
mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code
Large Language Models
NLP
Generative Models
- The mcdok system was developed for detecting machine-generated code in a multi-domain and multi-language context.
- The system was adapted from the existing mdok approach, focusing on code understanding through appropriate model selection.
- Three subtasks were addressed: binary detection, authorship attribution, and hybrid code detection.
- Results indicate competitive performance, but significant margins remain compared to leading systems, suggesting potential for further enhancements.
Summary
The paper presents the mcdok system, developed for SemEval-2026 Task 13, which focuses on the detection of machine-generated code across multiple programming languages. The task is divided into three subtasks: binary detection of machine-generated code, authorship attribution to identify the generator's family, and hybrid code detection that distinguishes between human-written, machine-generated, hybrid, and adversarial code. The authors adapted their previous mdok approach, which was designed for text detection, to better suit the complexities of code detection by selecting more appropriate base models for finetuning. The results demonstrate that the mcdok systems are competitive in all subtasks, although there is room for improvement compared to top-performing systems. The paper emphasizes the challenges posed by advancements in large language models (LLMs) in generating code, making it increasingly difficult to differentiate between human and machine-generated outputs.
Methodology
The mcdok system utilizes a modified version of the mdok approach, focusing on finetuning large language models (LLMs) specifically for code detection tasks. The finetuning process employs QLoRA for parameter-efficient training with 4-bit quantization. The authors selected suitable base models for each subtask: Gemma-3-27B-PT for binary detection, CodeGemma-7B for authorship detection, and Qwen2.5-Coder-14B for hybrid code detection. The training data was carefully balanced and subsampled to ensure effective learning across classes.
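As a rough illustration, a QLoRA setup with 4-bit quantization using Hugging Face `transformers` and `peft` might look as follows; the model id and every LoRA hyperparameter here are assumptions for illustration, not the authors' settings:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (QLoRA's memory saving)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Hypothetical base model id; the paper names Gemma-3-27B-PT for this subtask
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-27b-pt", quantization_config=bnb_config
)

# Illustrative LoRA hyperparameters -- only the adapters are trained
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

This is a configuration sketch only; the paper does not report its exact ranks, target modules, or dropout.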
Results
The mcdok systems achieved competitive results across all three subtasks of SemEval-2026 Task 13. However, the authors noted that the gaps to the top-performing systems remain significant, indicating room for further improvements in detection capability.
Implications
The findings suggest that while current models can effectively detect machine-generated code, there is a need for ongoing research and development to keep pace with advancements in code generation technologies. The methodologies developed could be applied to enhance code security and integrity in software development, as well as in educational contexts where understanding the origin of code is crucial.
Differentially Private Model Merging
Theory
Efficient ML
Federated Learning
- Introduces two data-independent algorithms for merging private models: random selection and linear combination.
- Provides tailored privacy accounting using Rényi differential privacy and privacy loss distributions.
- Demonstrates the superiority of linear combination over random selection in a case study on mean estimation.
- Empirical results validate the effectiveness of the proposed methods on various datasets.
Summary
This paper addresses the challenge of adapting machine learning models to varying differential privacy (DP) requirements during inference or deployment. The authors propose a novel approach to merge existing models trained on the same dataset with different privacy and utility trade-offs without requiring additional training. They introduce two post-processing techniques: random selection (RS) and linear combination (LC), which allow for the efficient generation of models that meet specific privacy constraints. The paper provides a thorough privacy accounting framework based on Rényi differential privacy and privacy loss distributions, demonstrating the theoretical superiority of the linear combination method over random selection in terms of privacy and utility. The authors validate their theoretical findings through empirical evaluations on both synthetic and real-world datasets, showing that their methods improve the privacy/utility trade-offs compared to naive approaches.
Methodology
The authors propose two algorithms for model merging: random selection (RS), which randomly outputs one of the models based on a probability distribution, and linear combination (LC), which deterministically combines models using weighted averages. They also develop methods for optimizing the mixing coefficients to enhance utility while adhering to privacy constraints. Privacy accounting is performed using Rényi differential privacy and privacy loss distributions.
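The two merging operators themselves are simple post-processing steps. Treating each model as a flat parameter vector, a minimal sketch (illustrative names, not the paper's code) looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_selection(models, probs, rng):
    """RS: output one of the pre-trained models, chosen at random
    according to the given probability distribution."""
    idx = rng.choice(len(models), p=probs)
    return models[idx]

def linear_combination(models, coeffs):
    """LC: deterministic weighted average of the parameter vectors."""
    return sum(c * m for c, m in zip(coeffs, models))

# Two toy "models" with different privacy/utility trade-offs
models = [np.array([1.0, 2.0]), np.array([3.0, 6.0])]
theta_lc = linear_combination(models, [0.5, 0.5])
theta_rs = random_selection(models, [0.5, 0.5], rng)
```

Because both operators only post-process already-private models, the DP post-processing property applies; the paper's contribution is the tighter tailored accounting for the resulting guarantee, which this sketch does not reproduce.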
Results
The study finds that the linear combination method consistently outperforms random selection in terms of privacy and utility trade-offs. Empirical evaluations demonstrate that both RS and LC improve privacy/utility outcomes compared to naive privacy accounting methods across synthetic and real-world datasets.
Implications
The proposed methods enable service providers to efficiently adapt to changing privacy requirements without the need for retraining models, making them suitable for real-world applications where privacy regulations may vary. This approach can enhance the deployment of machine learning models in sensitive environments, ensuring compliance with privacy standards.
Task-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation
Reinforcement Learning
Robotics
Interpretability
- MTRL effectively utilizes shared knowledge across tasks, indicating successful knowledge sharing.
- Only a small fraction of network weights are task-specific, suggesting minimal specialization is needed.
- Context variables play a crucial role in differentiating tasks within MTRL.
Summary
This paper addresses the challenges faced by autonomous underwater vehicles (AUVs) in performing multiple navigation tasks under dynamic and uncertain conditions. Traditional control methods struggle with these complexities, necessitating robust and interpretable control policies. The authors explore multi-task reinforcement learning (MTRL) as a solution, which leverages shared representations to enhance adaptability across tasks. However, the interpretability of these policies remains limited, hindering their real-world application. The study investigates the internal structure of a pretrained MTRL network in the HoloOcean simulator, focusing on task-specific subnetworks that facilitate navigation towards different marine species. The findings reveal that only about 1.5% of the network's weights are dedicated to task differentiation, with a significant emphasis on context variables. This research contributes to understanding the shared and specialized components of MTRL, offering insights for efficient model editing, transfer learning, and continual learning in underwater monitoring.
Methodology
The authors employed a pretrained Double DQN value network for underwater navigation, analyzing task-specific subnetworks by pruning the network for each navigation task. They compared these subnetworks to identify shared and specialized components, conducting initial experiments in MiniGrid before extending to the HoloOcean simulator.
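The subnetwork comparison can be sketched with magnitude pruning as a stand-in for the authors' pruning criterion: per-task binary masks are intersected to find shared weights and XOR-ed to find task-differentiating ones. The weights below are toy values:

```python
import numpy as np

def magnitude_mask(weights, keep_frac):
    """Keep the largest-magnitude weights; a simple stand-in for the
    per-task pruning used to expose task-specific subnetworks."""
    k = int(np.ceil(keep_frac * weights.size))
    thresh = np.sort(np.abs(weights).ravel())[-k]
    return np.abs(weights) >= thresh

w_task_a = np.array([0.9, 0.05, -0.8, 0.02, 0.7, -0.01])
w_task_b = np.array([0.85, 0.04, -0.75, 0.6, 0.03, -0.02])

mask_a = magnitude_mask(w_task_a, 0.5)
mask_b = magnitude_mask(w_task_b, 0.5)

shared = mask_a & mask_b        # weights both tasks rely on
specific = mask_a ^ mask_b      # task-differentiating weights
frac_specific = specific.mean()
```

In the paper the analogous fraction is about 1.5% of the full network; here it is large only because the toy example has six weights.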
Results
The analysis demonstrated that MTRL networks primarily rely on shared weights for task differentiation, with a small percentage dedicated to task-specific functions. The importance of context variables was highlighted, showing their effectiveness in guiding the network's decision-making across related tasks.
Implications
The findings suggest that enhancing interpretability in MTRL can lead to safer and more reliable AUV navigation policies. The insights gained from task-specific subnetworks can inform future research on model editing, transfer learning, and continual learning, particularly in complex underwater environments.
Transferable SCF-Acceleration through Solver-Aligned Initialization Learning
Optimization
Efficient ML
Theory
- SAIL addresses the supervision problem in ML models for SCF initialization, improving convergence rates.
- The Effective Relative Iteration Count (ERIC) is introduced as a more accurate performance metric for SCF calculations.
- SAIL achieves significant reductions in ERIC across various molecular sizes, outperforming previous methods.
- The method extends the applicability of ML acceleration techniques to larger molecules, enhancing computational efficiency.
Summary
This paper addresses the challenge of accelerating self-consistent field (SCF) calculations in Kohn-Sham density functional theory (KS-DFT) by improving the quality of initial guesses for molecular geometries. The authors identify that existing machine learning methods, which predict initial guesses based on molecular geometry, often fail when extrapolating to larger molecules due to a supervision problem rather than an extrapolation problem. To overcome this, they propose Solver-Aligned Initialization Learning (SAIL), which trains models by differentiating through the SCF solver end-to-end, focusing on solver dynamics instead of ground-state targets. The paper introduces the Effective Relative Iteration Count (ERIC) as a new metric for evaluating SCF performance, which accounts for hidden computational overhead. The results demonstrate that SAIL significantly reduces ERIC across various molecular sizes, achieving a 37% reduction for PBE, 33% for SCAN, and 27% for B3LYP on the QM40 dataset, and provides a 1.25x wall-time speedup on larger QMugs molecules, thus extending the applicability of ML SCF acceleration to larger drug-like molecules.
Methodology
The authors developed Solver-Aligned Initialization Learning (SAIL), which involves backpropagating through the SCF algorithm to optimize the initial guess for density matrices. This method is trained on solver dynamics rather than traditional ground-state references, allowing for improved performance on larger molecules.
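The supervision shift can be illustrated with a toy problem: optimize an initial guess against the solver's own dynamics rather than against a ground-state target. Here a Newton-style square-root iteration stands in for the SCF loop, and finite differences stand in for end-to-end backpropagation through the solver; everything is illustrative:

```python
def solver_residual(x0, a, n_steps=3):
    """Run a few fixed-point steps of x <- (x + a/x)/2 (a toy 'solver'
    for sqrt(a)) and return the final residual |x^2 - a|."""
    x = x0
    for _ in range(n_steps):
        x = 0.5 * (x + a / x)
    return abs(x * x - a)

def train_init(a_train, x0=1.0, lr=0.1, iters=200, eps=1e-4):
    """Optimize the shared initial guess against solver dynamics;
    central finite differences stand in for true backprop-through-solver."""
    for _ in range(iters):
        grad = sum(
            (solver_residual(x0 + eps, a) - solver_residual(x0 - eps, a))
            / (2 * eps)
            for a in a_train
        )
        x0 -= lr * grad
    return x0

a_train = [4.0, 4.41, 3.61]   # toy "molecules" with solutions near 2
x0_star = train_init(a_train)
# The learned init leaves a smaller residual than the naive guess x0 = 1
assert solver_residual(x0_star, 4.0) < solver_residual(1.0, 4.0)
```

The point of the sketch is the training signal: the loss is defined by what the solver does after a fixed budget of iterations, not by how close the guess is to the converged answer.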
Results
SAIL reduced the Effective Relative Iteration Count (ERIC) by 37% for PBE, 33% for SCAN, and 27% for B3LYP on the QM40 dataset. Additionally, it provided a 1.25x wall-time speedup for larger QMugs molecules, demonstrating its effectiveness in accelerating SCF calculations.
Implications
The findings suggest that SAIL can significantly enhance the efficiency of SCF calculations in computational chemistry, potentially leading to faster scientific discoveries and more efficient resource utilization in high-performance computing environments. This method could be particularly beneficial in drug discovery and materials science where large molecular systems are common.
Unsupervised Learning of Inter-Object Relationships via Group Homomorphism
Computer Vision
Theory
Robotics
- Introduces an unsupervised learning method based on group homomorphism to model inter-object relationships.
- Demonstrates the ability to segment multiple objects and extract motion laws without ground-truth labels.
- Establishes a one-dimensional additive latent space for mapping relative movements between objects.
- Highlights the importance of algebraic geometric constraints in achieving disentangled representations.
Summary
This paper presents an unsupervised representation learning method that models inter-object relationships through group homomorphism, aiming to mimic the cognitive development of preverbal infants. Unlike traditional deep learning approaches that rely on statistical correlations, the proposed method leverages hierarchical relationships in group operations to autonomously acquire the underlying structure of dynamic scenes. The model integrates object segmentation and motion law extraction, allowing it to decompose pixel-level changes into meaningful transformation components such as translation and deformation. Using interaction scenes derived from developmental science, the authors demonstrate that the model can segment multiple objects without ground-truth labels and accurately map their relative movements into a structured latent space. This approach highlights the potential of algebraic geometric constraints in achieving physically interpretable representations, suggesting a pathway for developing artificial systems with human-like flexibility and intelligence.
Methodology
The methodology involves three main steps: (1) Object segmentation to isolate individual objects from image sequences, (2) Separation of object motion using group homomorphism constraints to distinguish between different motion components, and (3) Extraction of multi-object interactions by relativizing the motion of each object from the perspective of others.
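The homomorphism constraint at the heart of the method says that encoding a composed motion must equal composing (here: adding) the encodings in the one-dimensional additive latent space. A toy 1-D version, where exact search stands in for the learned encoder:

```python
def translate(seq, shift):
    """Cyclically shift a 1-D 'scene' (stand-in for an image translation)."""
    n = len(seq)
    return [seq[(i - shift) % n] for i in range(n)]

def encode_shift(before, after):
    """Recover the latent coordinate of the transformation between two
    frames; exact search stands in for a learned encoder constrained
    to be a group homomorphism."""
    for s in range(len(before)):
        if translate(before, s) == after:
            return s
    raise ValueError("not a pure translation")

scene = [0, 0, 1, 2, 0, 0]
a, b = 1, 2
# Homomorphism: encoding the composed motion equals summing the encodings
lhs = encode_shift(scene, translate(translate(scene, a), b))
rhs = encode_shift(scene, translate(scene, a)) + encode_shift(scene, translate(scene, b))
assert lhs == rhs
```

The paper enforces this additive structure as a training constraint on pixel-level transformations (translation, deformation), which is what yields the disentangled latent space without labels.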
Results
The model successfully segments multiple objects and accurately structures their interactions in a latent space, demonstrating that algebraic constraints can lead to meaningful disentangled representations. The experiments validate the model's capability to operate without labeled data, showcasing its potential for unsupervised learning.
Implications
This research contributes to the understanding of how infants internalize environmental laws and suggests new approaches for constructing artificial systems that exhibit developmental intelligence. The findings may have applications in robotics, cognitive modeling, and advanced AI systems that require flexibility and adaptability in dynamic environments.
Improving Performance in Classification Tasks with LCEN and the Weighted Focal Differentiable MCC Loss
Interpretability
- The modified LCEN algorithm is effective for classification tasks, maintaining interpretability and sparsity.
- LCEN consistently outperforms ten other models in terms of macro F1 score and MCC across multiple datasets.
- The diffMCC loss function leads to better performance compared to traditional weighted cross-entropy loss.
- LCEN achieves an average feature elimination of 56%, enhancing model interpretability.
Summary
This paper presents a modified version of the LASSO-Clip-EN (LCEN) algorithm tailored for classification tasks, addressing its previous limitation of being applicable only to regression. The authors demonstrate that the modified LCEN maintains the desirable properties of interpretability and sparsity while achieving high performance across various binary and multiclass classification datasets. The algorithm is evaluated against ten other models, consistently achieving superior macro F1 scores and Matthews correlation coefficients (MCC). Notably, LCEN models remain sparse, eliminating an average of 56% of input features. Additionally, the paper introduces a weighted focal differentiable MCC (diffMCC) loss function, which outperforms traditional weighted cross-entropy loss in training classification models, yielding higher macro F1 scores and MCCs. The findings underscore the effectiveness of LCEN for feature selection in classification and the advantages of using diffMCC for model training, suggesting significant improvements in model performance without sacrificing interpretability.
Methodology
The study involves modifying the LCEN algorithm for classification tasks and evaluating its performance on four widely used binary and multiclass datasets. The modified LCEN is compared against ten other machine learning models, and its feature selection capabilities are assessed by retraining models using only the features selected by LCEN. The diffMCC loss function is also evaluated for its effectiveness in training classification models.
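The differentiable-MCC idea can be sketched by accumulating the confusion matrix from predicted probabilities rather than hard labels, so the Matthews correlation becomes a smooth function of the model output. This minimal binary version omits the class weighting and focal term of the paper's weighted focal diffMCC:

```python
import numpy as np

def soft_mcc_loss(p, y, eps=1e-8):
    """1 - 'soft' Matthews correlation: confusion-matrix cells are
    accumulated from probabilities p instead of thresholded labels,
    making the loss differentiable in p."""
    tp = np.sum(p * y)
    tn = np.sum((1 - p) * (1 - y))
    fp = np.sum(p * (1 - y))
    fn = np.sum((1 - p) * y)
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps
    return 1.0 - (tp * tn - fp * fn) / denom

y = np.array([1.0, 1.0, 0.0, 0.0])
good = np.array([0.9, 0.8, 0.1, 0.2])   # confident and correct
bad = np.array([0.6, 0.5, 0.5, 0.4])    # near-chance
assert soft_mcc_loss(good, y) < soft_mcc_loss(bad, y)
```

Unlike cross-entropy, this loss directly rewards balanced performance across all four confusion-matrix cells, which is why MCC-based training can help on imbalanced data.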
Results
The modified LCEN algorithm achieved higher macro F1 scores and MCCs than most other models tested, with performance only slightly lower than the best-performing models in each dataset. Models trained with the diffMCC loss function consistently outperformed those using weighted cross-entropy loss, with average improvements of 4.9% in macro F1 scores and 8.5% in MCC.
Implications
The findings suggest that LCEN can be a powerful tool for feature selection in classification tasks, particularly in fields requiring interpretable models. The use of the diffMCC loss function can enhance model accuracy, making it a viable alternative to traditional loss functions in classification settings.
The Origin of Edge of Stability
Optimization
Theory
- Introduces the concept of edge coupling to explain the Edge of Stability in gradient descent.
- Derives a recurrence relation and loss-change formula that forces Hessian eigenvalues towards 2/η.
- Classifies fixed points and period-two orbits, providing insights into the dynamics of convergence.
- Extends findings to mini-batch SGD and continuous time, indicating broader implications.
Summary
This paper addresses the phenomenon known as the Edge of Stability (EoS) in full-batch gradient descent for neural networks, where the largest Hessian eigenvalue approaches the threshold of 2/η, with η being the learning rate. Previous explanations have focused on local dynamics near this edge but have not clarified why trajectories converge to this threshold from arbitrary initializations. The author introduces the concept of 'edge coupling', a functional that captures the relationship between consecutive iterate pairs and is governed by the gradient descent update. By analyzing the criticality conditions of this edge coupling, the paper derives a recurrence relation with a stability boundary at 2/η and a loss-change formula that forces the curvature towards this threshold. The results show that the edge coupling framework can classify fixed points and period-two orbits, revealing the dynamics of convergence and stability at the edge. The findings extend to mini-batch stochastic gradient descent and continuous time, suggesting a broader applicability of the edge coupling concept in understanding optimization dynamics in neural networks.
Methodology
The paper employs a theoretical approach, introducing edge coupling as a functional on consecutive iterate pairs. It derives criticality conditions and uses mathematical analysis to establish recurrence relations and loss-change formulas that elucidate the dynamics of gradient descent.
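The 2/η stability boundary itself is easy to see on a quadratic: gradient descent on (λ/2)x² applies the map x ← (1 − ηλ)x, which contracts exactly when λ < 2/η. A minimal check (this illustrates the boundary, not the paper's edge-coupling mechanism for why training is driven towards it):

```python
def run_gd(lam, eta, x0=1.0, steps=100):
    """Gradient descent on the quadratic loss (lam/2) x^2."""
    x = x0
    for _ in range(steps):
        x -= eta * lam * x   # x <- (1 - eta*lam) x
    return abs(x)

eta = 1.0
assert run_gd(1.9, eta) < 1e-3   # curvature below 2/eta: converges
assert run_gd(2.1, eta) > 1e3    # curvature above 2/eta: diverges
```

The paper's contribution is explaining why, in neural networks, the largest Hessian eigenvalue is driven to hover at this boundary rather than staying safely below it.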
Results
The analysis reveals that the edge coupling framework successfully explains why the largest Hessian eigenvalue converges to the threshold 2/η from arbitrary initializations. It provides a comprehensive classification of fixed points and period-two orbits, demonstrating the stability of the system at the edge.
Implications
The findings could enhance the understanding of optimization dynamics in neural networks, potentially leading to improved training strategies and learning rate selection. The edge coupling concept may also inform future research in optimization theory and its applications in machine learning.
Even More Guarantees for Variational Inference in the Presence of Symmetries
Theory
Optimization
- The paper extends previous results on variational inference under symmetries to include FKL and α-divergences.
- New sufficient conditions are derived for the exact recovery of the mean using FKL and α-divergences.
- The authors provide insights into the practical implications of their theoretical findings for choosing variational families.
- The study discusses potential optimization failures when sufficient conditions are not satisfied.
Summary
This paper addresses the challenges of variational inference (VI) when the variational family is misspecified, particularly in the presence of symmetries in the target distribution. The authors build upon previous work that provided sufficient conditions for the exact recovery of the mean and correlation matrix using location-scale families under certain divergences. They extend these results to include the forward Kullback-Leibler (FKL) divergence and α-divergences, establishing new sufficient conditions for the exact recovery of the target mean. The paper highlights how optimization can fail to recover the target mean when these conditions are not met, offering practical guidelines for selecting the variational family and α-value. The findings contribute to a deeper understanding of how to effectively utilize variational inference in scenarios where the target distribution is not perfectly represented by the chosen variational family.
Methodology
The authors analyze variational inference using f-divergences, particularly focusing on the forward Kullback-Leibler divergence and α-divergences. They derive sufficient conditions for the exact recovery of the mean in the context of location-scale families, leveraging the properties of even symmetry in the target distribution.
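Why mild assumptions suffice for the FKL can be seen from the standard moment-matching calculation for a Gaussian variational family, a simpler setting than the paper's general location-scale analysis:

```latex
\mathrm{KL}(p \,\|\, q_{\mu,\Sigma})
  = \mathbb{E}_{p}[\log p(x)]
  + \tfrac{1}{2}\,\mathbb{E}_{p}\!\left[(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right]
  + \tfrac{1}{2}\log\det(2\pi\Sigma)
```

```latex
\nabla_{\mu}\,\mathrm{KL}(p \,\|\, q_{\mu,\Sigma})
  = \Sigma^{-1}\bigl(\mu - \mathbb{E}_{p}[x]\bigr) = 0
\;\Longrightarrow\;
\mu^{\star} = \mathbb{E}_{p}[x]
```

In this Gaussian case any stationary point in μ matches the target mean with no symmetry assumption at all; the paper's conditions address general location-scale families, and the α-divergence case is more delicate because its stationarity condition reweights the target, which is where the α-dependent criteria come in.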
Results
The paper establishes that exact recovery of the mean can be guaranteed under mild assumptions for the FKL and provides a more nuanced criterion for the α-divergence that depends on the value of α. These results enhance the understanding of variational inference in the presence of misspecification and symmetries.
Implications
The findings have significant implications for practitioners using variational inference in machine learning and statistics, particularly in cases where the target distribution may not be well-represented by the chosen variational family. The guidelines provided can help improve the robustness and accuracy of inference in various applications.
Fairness under uncertainty in sequential decisions
Reinforcement Learning
Theory
Optimization
- Introduces a taxonomy of uncertainty in sequential decision-making: model, feedback, and prediction uncertainty.
- Formalizes uncertainties using counterfactual logic and reinforcement learning techniques.
- Demonstrates the potential harms of naive decision-making policies that ignore unobserved outcomes.
- Shows that uncertainty-aware exploration can improve fairness metrics in sequential decision systems.
Summary
This paper addresses the challenge of ensuring fairness in machine learning (ML) algorithms used in sequential decision-making contexts, where decisions are made under uncertainty and feedback from previous decisions influences future actions. The authors introduce a taxonomy of uncertainty in sequential decision-making, categorizing it into model uncertainty, feedback uncertainty, and prediction uncertainty. They formalize these uncertainties using counterfactual logic and reinforcement learning techniques, highlighting the potential harms of naive policies that overlook unobserved outcomes. The paper illustrates how unequal uncertainty and selective feedback can lead to disparities in decision-making processes, particularly affecting historically marginalized groups. Through algorithmic examples and experiments on simulated data, the authors demonstrate that it is possible to reduce outcome variance for disadvantaged groups while maintaining institutional objectives. The framework provided equips researchers and practitioners with tools to better diagnose and govern fairness risks in sequential decision systems, emphasizing the need to explicitly account for uncertainty in fair decision-making.
Methodology
The authors utilize counterfactual logic and reinforcement learning techniques to formalize model and feedback uncertainty. They conduct experiments on simulated data to illustrate how varying degrees of bias affect fairness in sequential decision-making.
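The harm of selective feedback can be seen in a deterministic toy example: estimating a group's outcome mean only from accepted applicants biases the estimate upward, while an exploring policy that also observes some rejected cases recovers the truth. All values are illustrative:

```python
def mean(xs):
    return sum(xs) / len(xs)

# Outcomes for one group; the institution only observes the outcomes of
# accepted applicants (score >= threshold) -- selective feedback.
outcomes = [0.2, 0.4, 0.5, 0.7, 0.9]
threshold = 0.5

greedy_feedback = [o for o in outcomes if o >= threshold]
exploring_feedback = outcomes   # an uncertainty-aware policy also accepts
                                # some below-threshold cases, so it sees all

true_mean = mean(outcomes)
greedy_est = mean(greedy_feedback)      # biased upward by censoring
explore_est = mean(exploring_feedback)  # unbiased
assert abs(explore_est - true_mean) < abs(greedy_est - true_mean)
```

When groups face unequal thresholds or unequal historical data, this censoring bias differs across groups, which is how naive policies compound disparities over time.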
Results
The experiments reveal that unequal uncertainty and selective feedback can create disparities in outcomes for different groups. The framework allows for the simultaneous reduction of variance in outcomes for historically disadvantaged groups while achieving the decision maker's objectives, demonstrating the importance of accounting for uncertainty in fair decision-making.
Implications
This work has significant implications for high-stakes decision-making in fields such as finance, healthcare, and criminal justice, where ensuring fairness is crucial. The framework can guide the development of more equitable ML systems that consider the complexities of uncertainty and feedback in sequential decisions.
Causal-Transformer with Adaptive Mutation-Locking for Early Prediction of Acute Kidney Injury
Time Series
Interpretability
- CT-Former effectively models irregular clinical intervals without biased data imputation.
- The Causal-Attention module provides transparent causal pathways linking historical physiological events to current predictions.
- CT-Former significantly outperforms existing models in predicting AKI, as validated by extensive experiments.
- The model enhances clinical interpretability, addressing the black-box nature of traditional deep learning approaches.
Summary
The paper presents CT-Former, a novel model designed for the early prediction of Acute Kidney Injury (AKI) by addressing the challenges of irregularly sampled electronic health record (EHR) data and the black-box nature of existing deep learning models. CT-Former integrates continuous-time modeling with a Causal-Transformer architecture, allowing it to track patient temporal trajectories without biased data imputation. The model features a Causal-Attention module that generates a directed structural causal matrix, providing clear causal pathways between historical physiological anomalies and current risk predictions. This transparency enhances clinical interpretability, which is crucial for trust in predictive models in healthcare. The authors conducted extensive experiments on the MIMIC-IV cohort, demonstrating that CT-Former significantly outperforms state-of-the-art baselines, confirming its effectiveness and reliability for clinical decision-making.
Methodology
CT-Former employs a continuous-time state evolution mechanism to handle irregular data sampling, eliminating the need for artificial imputation. It utilizes a Causal-Attention module to create a directed causal matrix, which traces the historical onset of physiological shocks and establishes causal relationships between past anomalies and current risk predictions. The training process follows a decoupled two-stage protocol to optimize the causal-fusion process independently.
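One standard continuous-time mechanism for irregular sampling (used, e.g., in GRU-D-style models) decays the hidden state towards a rest value by the elapsed time between observations, so no imputation grid is needed. This is a generic sketch of the idea, not CT-Former's actual state evolution:

```python
import math

def evolve_state(h, dt, tau=1.0, h_rest=0.0):
    """Decay the hidden state towards a rest value over an irregular
    gap dt; tau controls how quickly old evidence fades."""
    gamma = math.exp(-dt / tau)
    return h_rest + gamma * (h - h_rest)

# Irregularly sampled observations: (elapsed time since last event, value)
events = [(0.0, 1.0), (0.3, None), (2.0, None)]
h = 0.0
for dt, obs in events:
    h = evolve_state(h, dt)
    if obs is not None:
        h = obs   # stand-in for an update with the new measurement
# After a long unobserved gap the state has decayed most of the way back
assert abs(h) < 0.2
```

The appeal over imputation is that the elapsed time itself carries information: a lab value measured two hours ago and one measured two days ago are weighted differently by construction.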
Results
The experiments conducted on the MIMIC-IV dataset, which included 18,419 patients, showed that CT-Former significantly outperformed established baseline models in terms of predictive accuracy. Additionally, independent validation using TimeSHAP confirmed the model's strong clinical interpretability, making it a trustworthy tool for healthcare professionals.
Implications
The development of CT-Former has significant implications for clinical practice, particularly in the early detection and management of AKI. By providing interpretable predictions, it can enhance clinician trust and facilitate timely interventions, ultimately improving patient outcomes in critical care settings.
A Hybridizable Neural Time Integrator for Stable Autoregressive Forecasting
Time Series
Theory
Efficient ML
- Introduces a hybrid autoregressive model combining transformers with mixed finite element methods for stability.
- Proves preservation of discrete energies and uniform gradient bounds, addressing the exploding gradient problem.
- Achieves a 65× reduction in model parameters while outperforming state-of-the-art models in chaotic system forecasting.
- Demonstrates a significant speedup in real-time simulations, enabling efficient design iterations.
Summary
This paper addresses the challenges of stability in autoregressive modeling of chaotic dynamical systems over long time horizons. The authors propose a hybrid approach that integrates an autoregressive transformer within a novel shooting-based mixed finite element scheme. This method exposes topological structures that ensure provable stability, preserving discrete energies during forward problems and maintaining uniform bounds on gradients to avoid the exploding gradient problem. The architecture couples this scheme with a vision transformer for latent dynamics, achieving significant performance improvements with a 65× reduction in model parameters and effective long-horizon forecasting of chaotic systems. The authors demonstrate the efficacy of their approach through a 'mini-foundation' model of a fusion component, which shows that only 12 simulations are needed to train a real-time surrogate, resulting in a 9,000× speedup over traditional particle-in-cell simulations. The results indicate that incorporating physical structure into the model can substitute for large data scales, enabling stable and accurate forecasting even in sparse data scenarios.
Methodology
The authors embed learned neural dynamics within a mixed finite element framework, utilizing finite element exterior calculus (FEEC) to ensure stability. They employ a shooting method to couple trajectory segments and a vision transformer for end-to-end latent dynamics, allowing for the extraction of complex nonlinear dynamics from data.
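The shooting idea in the methodology above can be illustrated with a minimal sketch: a trajectory is split into segments, and a coupling term drives the mismatch between consecutive segment endpoints to zero. This is a generic shooting-method defect, not the paper's FEEC-based scheme; all names are illustrative.

```python
import numpy as np

def shooting_defects(segments):
    """Mismatch between the end of each segment and the start of the next.

    segments: list of arrays of shape (T_i, d). In a shooting method these
    defects are driven to zero so the pieces join into one trajectory.
    """
    return [nxt[0] - seg[-1] for seg, nxt in zip(segments, segments[1:])]

def shooting_penalty(segments):
    # Sum of squared endpoint mismatches -- the coupling term a solver
    # (or a training loss) would minimise.
    return sum(float(np.sum(d ** 2)) for d in shooting_defects(segments))

# Two segments that join exactly give zero penalty.
a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[2.0], [3.0]])
print(shooting_penalty([a, b]))  # 0.0
```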
Results
The proposed method successfully forecasts chaotic systems over 10,000 Lyapunov times, reproducing invariant measures beyond the capabilities of neural ODEs. It matches state-of-the-art accuracy on shear flow benchmarks with significantly fewer parameters and achieves a 9,000× speedup in real-time simulations for a fusion component.
Implications
This work has significant implications for scientific computing and real-time simulation in chaotic systems, suggesting that models can be developed with fewer data while maintaining stability and accuracy. It opens avenues for efficient design iterations in engineering applications, particularly in fields requiring rapid simulations like plasma physics and fluid dynamics.
Maximum Entropy Semi-Supervised Inverse Reinforcement Learning
Reinforcement Learning
Robotics
Theory
- Introduction of MESSI, a new algorithm combining MaxEnt-IRL with semi-supervised learning principles.
- Effective integration of unsupervised trajectories into the MaxEnt-IRL framework to resolve policy ambiguity.
- Empirical results indicate significant performance improvements over traditional MaxEnt-IRL.
- Addresses limitations of previous semi-supervised apprenticeship learning methods.
Maximum Entropy Semi-Supervised Inverse Reinforcement Learning
Summary
This paper presents a novel approach to apprenticeship learning (AL) by integrating maximum entropy principles into inverse reinforcement learning (IRL) within a semi-supervised framework. The authors introduce an algorithm named MESSI (MaxEnt Semi-Supervised IRL) that effectively utilizes both expert and unsupervised trajectories to enhance learning performance. The key innovation lies in addressing the ambiguity of multiple policies that can match expert behavior by incorporating unsupervised data through a pairwise penalty on trajectories. The paper builds on previous work in IRL and semi-supervised learning, particularly addressing the limitations of existing methods like SSIRL, which struggled with the ill-defined nature of IRL and the labeling of generated policies. Empirical evaluations demonstrate that MESSI outperforms traditional MaxEnt-IRL in tasks such as highway driving and grid-world scenarios, showcasing the benefits of leveraging unsupervised data in learning tasks traditionally reliant on expert demonstrations.
Methodology
The authors developed MESSI by integrating unsupervised trajectory data into the MaxEnt-IRL framework using a pairwise penalty approach. This method allows for a principled incorporation of unsupervised data, improving the learning process by addressing the ambiguity of multiple policies that can match expert behavior. The algorithm was evaluated against traditional MaxEnt-IRL in various environments.
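The pairwise penalty described above can be sketched for a linear reward model: trajectories with similar features (including unsupervised ones) are pushed toward similar rewards. This is a hedged illustration of the general idea; the feature map, weights, and similarity matrix below are placeholders, not MESSI's actual formulation.

```python
import numpy as np

def pairwise_reward_penalty(w, features, similarity):
    """Penalty encouraging similar (possibly unlabeled) trajectories to receive
    similar rewards under a linear reward r_i = w . phi_i.

    features:   (n, d) trajectory feature vectors phi_i
    similarity: (n, n) nonnegative symmetric weights s_ij
    Returns sum_{i<j} s_ij * (r_i - r_j)**2.
    """
    r = features @ w
    diff = r[:, None] - r[None, :]
    return 0.5 * float(np.sum(similarity * diff ** 2))

w = np.array([1.0, -0.5])
phi = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
s = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
# Trajectories 0 and 1 have identical features, so the penalty is zero.
print(pairwise_reward_penalty(w, phi, s))  # 0.0
```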
Results
Empirical evaluations showed that MESSI significantly outperformed traditional MaxEnt-IRL in both highway driving and grid-world problems, demonstrating the effectiveness of leveraging unsupervised trajectories to enhance learning outcomes.
Implications
The findings suggest that incorporating unsupervised data can substantially improve the performance of inverse reinforcement learning algorithms, making them more robust and applicable in real-world scenarios where expert demonstrations are limited. This approach could be particularly beneficial in fields such as robotics and autonomous driving, where learning from both expert and non-expert behaviors is crucial.
HARBOR: Automated Harness Optimization
Large Language Models
Optimization
- Harness design is a significant factor in the performance of long-horizon language-model agents.
- Automated configuration search is more effective than manual tuning as the flag space increases.
- HARBOR formalizes harness optimization as constrained noisy Bayesian optimization.
- The case study demonstrates the limitations of manual tuning, with only one successful tuning round out of four.
HARBOR: Automated Harness Optimization
Summary
The paper introduces HARBOR, a framework for Automated Harness Optimization (AHO), which addresses the complexities involved in the harness design of long-horizon language-model agents. The authors argue that harness design is a critical machine-learning problem, particularly as the configuration space expands. They formalize AHO as a constrained noisy Bayesian optimization problem over a mixed-variable, cost-heterogeneous configuration space, incorporating cold-start-corrected rewards and a posterior chance-constrained safety check. The HARBOR algorithm employs a block-additive surrogate model, multi-fidelity cost-aware acquisition, and TuRBO trust regions. The authors validate their approach through a case study involving a production coding agent, comparing manual tuning across four rounds against a fixed task suite. The results reveal that only one tuning round outperformed the baseline, highlighting the challenges of harness optimization and the inadequacies of manual tuning methods. The paper concludes that harness tuning should be treated with the same rigor as hyper-parameter optimization, emphasizing the need for automated approaches in this domain.
Methodology
The authors formalize Automated Harness Optimization (AHO) as a constrained noisy Bayesian optimization problem. They develop the HARBOR algorithm, which utilizes a block-additive surrogate model, multi-fidelity cost-aware acquisition strategies, and TuRBO trust regions to optimize the configuration of a harness in a production coding agent.
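The two ingredients named above, cost-aware acquisition and a chance-constrained safety check, can be sketched with a much simpler selection rule than HARBOR's surrogate machinery: filter out configurations whose estimated failure probability exceeds a threshold, then maximise an upper-confidence-bound score per unit cost. Everything below (config names, numbers, thresholds) is illustrative.

```python
def select_candidate(stats, max_fail_prob=0.2, kappa=1.0):
    """Pick the next harness configuration to evaluate.

    stats: dict name -> dict with keys
        mean, std  -- current reward estimate and its uncertainty
        cost       -- expected evaluation cost (e.g. tokens or dollars)
        fail_prob  -- estimated probability of violating a safety constraint
    Rule: among configs passing the chance constraint, maximise
    (mean + kappa * std) / cost, a UCB-per-unit-cost score.
    """
    safe = {k: v for k, v in stats.items() if v["fail_prob"] <= max_fail_prob}
    if not safe:
        return None
    return max(safe, key=lambda k: (safe[k]["mean"] + kappa * safe[k]["std"]) / safe[k]["cost"])

stats = {
    "flags_A": {"mean": 0.20, "std": 0.05, "cost": 1.0, "fail_prob": 0.05},
    "flags_B": {"mean": 0.30, "std": 0.02, "cost": 4.0, "fail_prob": 0.10},
    "flags_C": {"mean": 0.50, "std": 0.01, "cost": 1.0, "fail_prob": 0.50},  # unsafe
}
print(select_candidate(stats))  # flags_A
```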
Results
In a controlled case study, only one of the four tuning rounds (Round B) achieved a statistically significant improvement over the baseline, scoring 17 out of 89, compared to an all-flags-off baseline of 15 out of 89. Subsequent rounds resulted in decreased performance, underscoring the challenges of harness optimization.
Implications
The findings indicate that harness optimization is a complex problem that requires automated solutions rather than manual tuning. This has implications for the development of more efficient and effective language-model agents, as well as the broader field of machine learning where harness design plays a critical role.
Physics-Guided Dimension Reduction for Simulation-Free Operator Learning of Stiff Differential–Algebraic Systems
Theory
Optimization
Efficient ML
- Introduces an extended Newton implicit layer for enforcing algebraic constraints and quasi-steady-state conditions.
- Achieves significant dimension reduction by focusing only on slow states, improving computational efficiency.
- Demonstrates superior performance on stiff DAE problems compared to traditional soft and hard constraint methods.
- Extends the methodology to multi-component systems with provable convergence.
Physics-Guided Dimension Reduction for Simulation-Free Operator Learning of Stiff Differential–Algebraic Systems
Summary
This paper addresses the challenges faced by neural surrogates in learning solutions for stiff differential-algebraic equations (DAEs). Traditional methods either rely on soft constraints that lead to significant errors due to stiffness or hard constraints that necessitate trajectory data from stiff integrators. The authors propose an innovative extended Newton implicit layer that enforces algebraic constraints and quasi-steady-state conditions in a single differentiable solve, allowing for the recovery of all fast and algebraic states from slow-state predictions provided by a physics-informed DeepONet. This approach not only eliminates the amplification of stiffness errors but also reduces the output dimension to only the slow states, enhancing efficiency. The methodology is extended to multi-component systems through cascaded implicit layers, ensuring scalability and provable convergence. The proposed method demonstrates superior performance on a grid-forming inverter DAE, significantly outperforming existing baseline methods in terms of accuracy and error rates, while also confirming robust out-of-distribution detection capabilities.
Methodology
The authors develop an extended Newton implicit layer that integrates algebraic constraint enforcement and quasi-steady-state reduction within a single differentiable framework. This layer operates on slow-state predictions from a physics-informed DeepONet, allowing for the exact recovery of fast and algebraic states without requiring trajectory data. The method is further enhanced by cascading implicit layers to handle multi-component systems, ensuring scalability and convergence.
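The core of an implicit constraint layer like the one described above is a differentiable Newton solve that recovers the fast/algebraic states from the slow-state prediction. A minimal sketch, with a toy scalar quasi-steady-state condition in place of the paper's DAE:

```python
import numpy as np

def newton_recover(g, jac, z0, x_slow, tol=1e-10, max_iter=50):
    """Recover fast/algebraic states z from slow states x by solving
    g(z, x) = 0 with Newton's method -- the core of an implicit layer."""
    z = np.asarray(z0, dtype=float)
    for _ in range(max_iter):
        r = g(z, x_slow)
        if np.linalg.norm(r) < tol:
            break
        z = z - np.linalg.solve(jac(z, x_slow), r)
    return z

# Toy quasi-steady-state condition: z**2 - x = 0
# (the fast state sits on the manifold defined by the slow state).
g = lambda z, x: np.array([z[0] ** 2 - x])
jac = lambda z, x: np.array([[2.0 * z[0]]])
z = newton_recover(g, jac, z0=[1.0], x_slow=4.0)
print(round(float(z[0]), 6))  # 2.0
```

In the actual method this solve is wrapped as a differentiable layer downstream of the DeepONet, so gradients flow through the constraint enforcement during training.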
Results
The proposed method was tested on a grid-forming inverter DAE with 21 states and a stiffness ratio of approximately 4,712. The extended Newton layer achieved an error rate of 1.42%, significantly lower than the 39.3% and 57.0% errors from penalty and standard Newton methods, respectively. Additionally, the method allowed for the composition of two independently trained models into a 44-state system with minimal error (0.72%–1.16%). Conformal prediction analysis indicated 90% in-distribution coverage with effective out-of-distribution detection.
Implications
This work has significant implications for the simulation and control of complex systems modeled by stiff DAEs, such as power grids and chemical reaction networks. The ability to learn operator solutions without extensive trajectory data can lead to more efficient and scalable computational methods in engineering and applied sciences.
Transparent Screening for LLM Inference and Training Impacts
Large Language Models
- Introduces a transparent screening framework for estimating LLM impacts.
- Develops a bounded multi-factor proxy methodology for inference and training estimates.
- Provides an operational implementation through the ImpactLLM Observatory covering 41 models.
- Emphasizes the importance of transparency and reproducibility in environmental impact assessments.
Transparent Screening for LLM Inference and Training Impacts
Summary
This paper introduces a transparent screening framework aimed at estimating the inference and training impacts of large language models (LLMs) under conditions of limited observability. The framework translates natural-language application descriptions into bounded environmental estimates, facilitating a comparative online observatory of existing market models. The authors emphasize the importance of transparency and reproducibility in the evaluation of LLMs, particularly in the absence of direct telemetry data from providers. The framework is operationalized through the ImpactLLM Observatory, which currently covers 41 models. The paper details the methodology for extracting scenarios from application descriptions, employing a bounded multi-factor proxy methodology that separates inference and training estimates while making assumptions explicit. The authors argue that this approach is methodologically superior to unverifiable claims and provides a fast, auditable, and interpretable means of estimating environmental impacts. The paper also discusses the design of the inference estimator and training proxy, highlighting the limitations of current assumptions and the need for further research in this area.
Methodology
The methodology involves converting natural-language descriptions of LLM applications into structured scenarios, which are then processed through a bounded multi-factor proxy framework. This framework separates inference and training estimates, making assumptions explicit and providing a comparative analysis of different models. The inference estimator uses a natural-language parser to derive parameters, while the training proxy combines parameter counts with literature-derived training energy estimates.
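The bounded multi-factor proxy reduces to simple interval arithmetic: token volume times a (low, high) energy-intensity range for inference, and parameter count times a literature-derived range for training. The numeric factors below are placeholders, not the observatory's calibrated anchors.

```python
def inference_bounds(requests_per_day, tokens_per_request, wh_per_1k_tokens=(0.5, 5.0)):
    """Bounded daily inference energy estimate in kWh.

    wh_per_1k_tokens is a (low, high) energy-intensity assumption per
    thousand tokens -- illustrative values only.
    """
    k_tokens = requests_per_day * tokens_per_request / 1000
    lo, hi = wh_per_1k_tokens
    return (k_tokens * lo / 1000, k_tokens * hi / 1000)  # Wh -> kWh

def training_proxy(param_count_b, kwh_per_b_params=(5_000, 50_000)):
    """Training energy proxy: parameter count (billions) times a
    literature-derived kWh-per-billion-parameters range (placeholders)."""
    lo, hi = kwh_per_b_params
    return (param_count_b * lo, param_count_b * hi)

lo, hi = inference_bounds(10_000, 500)  # 5M tokens/day
print(lo, hi)  # 2.5 25.0
```

Keeping both ends of the interval explicit is what makes the estimate auditable: a reader can substitute their own anchors and recompute the bounds.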
Results
The framework successfully generates estimates for LLM inference and training impacts, allowing for comparative assessments across various models. The observatory provides a transparent interface for users to inspect and challenge inferred parameters, enhancing the credibility of the estimates. The current implementation includes operational estimates based on observed literature anchors and contextual assumptions.
Implications
The findings suggest that the proposed framework can significantly improve the transparency and comparability of environmental impact assessments for LLMs. This has implications for researchers, developers, and policymakers in understanding the ecological footprint of AI technologies and making informed decisions regarding their deployment and regulation.
Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression
NLP
Large Language Models
Efficient ML
- Introduces sub-token routing for finer control in transformer efficiency.
- Presents a query-independent design that improves language modeling quality.
- Develops a query-aware design that preserves downstream performance with reduced KV budgets.
- Demonstrates the complementary nature of token-level and sub-token-level routing.
Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression
Summary
This paper introduces a novel approach to improve transformer efficiency through sub-token routing within LoRA-adapted transformers. The authors argue that traditional methods of routing at the token level are insufficient, as they overlook the non-uniform distribution of value groups within tokens. By implementing a fine-grained routing mechanism, the study explores two designs: a query-independent approach for compression-aware language modeling and a query-aware design for downstream-task-preserving KV compression. The query-independent design enhances the quality-compression tradeoff, while the query-aware design maintains downstream performance under reduced KV budgets. The findings reveal that token-level and sub-token-level routing are complementary, allowing for deeper KV compression without sacrificing task accuracy.
Methodology
The authors investigate sub-token routing in LoRA-adapted transformers, focusing on two main designs: a query-independent routing that learns from token representations for language modeling, and a query-aware routing that uses a predictor-based selector to allocate retention budgets based on query-conditioned relevance. Experiments are conducted to evaluate the performance of these designs in terms of quality-compression tradeoffs and downstream task preservation.
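The routing mechanics can be sketched as a top-k selection over value groups within each token. This is a generic sketch of the query-independent variant; a query-aware variant would condition the router scores on the query vector. Shapes and scores below are made up for illustration.

```python
import numpy as np

def route_value_groups(values, router_scores, k):
    """Keep the top-k value groups per token (sub-token routing).

    values:        (T, G, d) -- T tokens, G value groups per token
    router_scores: (T, G)    -- relevance score per group
    Returns kept values (T, k, d) and the kept group indices (T, k).
    """
    idx = np.argsort(-router_scores, axis=1)[:, :k]           # top-k groups per token
    kept = np.take_along_axis(values, idx[:, :, None], axis=1)
    return kept, idx

T, G, d = 2, 4, 3
values = np.arange(T * G * d, dtype=float).reshape(T, G, d)
scores = np.array([[0.1, 0.9, 0.3, 0.2],
                   [0.7, 0.1, 0.2, 0.8]])
kept, idx = route_value_groups(values, scores, k=2)
print(idx.tolist())  # [[1, 2], [3, 0]]
```

Dropping unselected groups is what yields the KV-cache savings: only the retained (T, k, d) slices need to be stored.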
Results
The experimental results indicate that the query-independent design significantly improves the quality-compression tradeoff for language modeling tasks. Additionally, the query-aware design successfully preserves downstream task performance even when operating under reduced KV budgets. The analysis shows that combining token-level and sub-token-level routing methods leads to deeper KV compression while maintaining task accuracy.
Implications
This research has significant implications for enhancing the efficiency of transformer models, particularly in scenarios where memory and computational resources are limited. The proposed methods could be applied to various NLP tasks, improving model performance without increasing resource consumption.
Domain-Aware Hierarchical Contrastive Learning for Semi-Supervised Generalization Fault Diagnosis
Time Series
- Introduces DAHCL framework to improve fault diagnosis under unseen conditions.
- Addresses cross-domain pseudo-label bias by incorporating domain-specific geometric characteristics.
- Utilizes uncertain samples effectively through fuzzy contrastive supervision.
- Evaluated under realistic noisy conditions to reflect practical industrial scenarios.
Domain-Aware Hierarchical Contrastive Learning for Semi-Supervised Generalization Fault Diagnosis
Summary
The paper addresses the challenges of fault diagnosis in mechanical systems under unseen operating conditions, particularly when labeled data is scarce. It introduces a novel framework called Domain-Aware Hierarchical Contrastive Learning (DAHCL) for Semi-Supervised Domain Generalization Fault Diagnosis (SSDGFD). The framework tackles two main issues: the generation of biased pseudo-labels due to neglecting domain-specific geometric discrepancies, and the inefficient use of unlabeled samples caused by a rigid accept-or-discard strategy. DAHCL incorporates a Domain-Aware Learning (DAL) module that captures geometric characteristics of source domains to calibrate pseudo-label predictions, thus reducing cross-domain bias. Additionally, it features a Hierarchical Contrastive Learning (HCL) module that employs dynamic confidence stratification and fuzzy contrastive supervision, allowing uncertain samples to contribute to representation learning without relying on hard labels. The framework is evaluated under realistic conditions with engineering noise, demonstrating superior robustness and domain generalization capabilities compared to existing SSDGFD methods across three benchmark datasets.
Methodology
The proposed DAHCL framework consists of two main components: a Domain-Aware Learning (DAL) module that calibrates pseudo-labels based on geometric characteristics of source domains, and a Hierarchical Contrastive Learning (HCL) module that combines dynamic confidence stratification with fuzzy contrastive supervision to leverage uncertain samples for representation learning.
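The dynamic confidence stratification step can be sketched as a simple split on maximum softmax probability: confident samples get hard pseudo-labels, uncertain ones are routed to the fuzzy contrastive branch, and the rest are held out. The thresholds below are illustrative, not the paper's (which adapts them dynamically).

```python
import numpy as np

def stratify(probs, hi=0.9, lo=0.6):
    """Split unlabeled samples by prediction confidence (max softmax prob).

    Returns index arrays: confident samples receive hard pseudo-labels,
    uncertain ones would receive soft/fuzzy contrastive supervision,
    and the remainder are held out.
    """
    conf = probs.max(axis=1)
    confident = np.where(conf >= hi)[0]
    uncertain = np.where((conf >= lo) & (conf < hi))[0]
    discarded = np.where(conf < lo)[0]
    return confident, uncertain, discarded

probs = np.array([[0.95, 0.05],
                  [0.70, 0.30],
                  [0.55, 0.45]])
c, u, d = stratify(probs)
print(c.tolist(), u.tolist(), d.tolist())  # [0] [1] [2]
```

The key difference from a rigid accept-or-discard scheme is the middle bucket: uncertain samples still contribute to representation learning instead of being thrown away.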
Results
Extensive experiments on three benchmark datasets demonstrate that DAHCL consistently outperforms advanced SSDGFD baselines, showing improved robustness and domain generalization capabilities, particularly under severe noise and substantial domain shifts.
Implications
The findings suggest that DAHCL can significantly enhance fault diagnosis in industrial applications, particularly in scenarios with limited labeled data and varying operating conditions. This approach may lead to more reliable and efficient diagnostic systems in real-world settings.
Early Detection of Latent Microstructure Regimes in Limit Order Books
Time Series
Theory
- Introduces a causal regime model for limit order books with identifiable latent build-up phases.
- Derives theoretical guarantees for detection lead-time and probability of early detection.
- Proposes a novel trigger-based detection method that outperforms traditional reactive signals.
- Demonstrates empirical effectiveness through extensive simulations and preliminary real-data applications.
Early Detection of Latent Microstructure Regimes in Limit Order Books
Summary
This paper addresses the challenge of early detection of stress in limit order books (LOBs), which can transition rapidly from stable to stressed conditions. Traditional early-warning signals are reactive, responding only after stress has begun. The authors propose a three-regime causal data-generating process (DGP) that includes stable, latent build-up, and stress phases, allowing for the identification of a latent build-up phase that precedes stress. They derive two theoretical guarantees regarding detection lead-time and probability of detection before stress onset. A novel trigger-based detector is introduced, which utilizes MAX aggregation of uncertainty and drift channels, a rising-edge condition, and adaptive thresholds. The method is rigorously evaluated through simulations, achieving significant lead-times and high precision compared to existing methods. Preliminary real-data applications on BTC/USDT order book data demonstrate the detector's effectiveness, although performance is noted to degrade in low-SNR and short build-up scenarios. The findings suggest a promising approach to proactive stress detection in high-frequency trading environments.
Methodology
The authors formalize a causal regime model with three phases (stable, latent build-up, stress) and derive theoretical conditions for detection. They develop a trigger-based detector that aggregates uncertainty and drift signals, applying a rising-edge condition and adaptive thresholds. The methodology is evaluated through 200 simulation runs and a preliminary analysis of real BTC/USDT order book data.
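The detector's three ingredients, MAX aggregation, a rising-edge condition, and an adaptive threshold, can be sketched as follows. This is an illustrative re-implementation of the ingredients (with a rolling mean + k·std threshold), not the paper's detector or its parameter choices.

```python
import numpy as np

def trigger_detector(uncertainty, drift, window=20, k=3.0):
    """MAX-aggregate two channels and fire on a rising edge over an
    adaptive threshold (rolling mean + k * rolling std)."""
    s = np.maximum(uncertainty, drift)              # MAX aggregation
    alarms, above_prev = [], False
    for t in range(window, len(s)):
        hist = s[t - window:t]
        thresh = hist.mean() + k * hist.std()
        above = s[t] > thresh
        if above and not above_prev:                # rising-edge condition
            alarms.append(t)
        above_prev = above
    return alarms

u = np.zeros(100)
u[60:] = 2.0                                        # latent build-up begins at t=60
d = np.zeros(100)
print(trigger_detector(u, d))  # [60]
```

The rising-edge condition is what keeps the detector from re-firing on every timestep once the aggregated signal sits above threshold.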
Results
The proposed method achieves a mean lead-time of +18.6±3.2 timesteps in simulations with a precision of 1.00 and coverage of 0.54. In real data applications, it achieves a mean lead-time of +38±21 seconds with a precision of 1.00 and coverage of 0.80, outperforming traditional methods. The analysis of missed detections reveals that they are concentrated in specific parameter settings, consistent with theoretical predictions.
Implications
The findings suggest that proactive detection of stress in limit order books is feasible, which could enhance trading strategies and risk management in high-frequency trading environments. The method's theoretical foundations and empirical validation pave the way for further research and potential real-world applications.
Robustness of Spatio-temporal Graph Neural Networks for Fault Location in Partially Observable Distribution Grids
Graph Learning
Time Series
- Introduces a new graph-forming strategy for GNNs that utilizes only measured buses.
- Develops STGNN models based on GraphSAGE and improved GATv2 for fault location.
- Demonstrates significant performance improvements over traditional RNN baselines.
- Shows that the measured-only topology reduces training time and enhances model robustness.
Robustness of Spatio-temporal Graph Neural Networks for Fault Location in Partially Observable Distribution Grids
Summary
This paper addresses the challenge of fault location in distribution grids, which is critical for maintaining reliability and minimizing outage durations. The authors highlight the limitations of existing methods due to partial observability caused by sparse measurement infrastructure. They propose a novel approach that combines Spatio-temporal Graph Neural Networks (STGNNs) with a new graph-forming strategy that uses only measured buses, contrasting it with traditional full-topology approaches. The study introduces STGNN models based on GraphSAGE and an improved Graph Attention (GATv2) architecture, benchmarking them against state-of-the-art models on the IEEE 123-bus feeder dataset. The results demonstrate that STGNN variants consistently outperform a pure RNN baseline, achieving improvements of up to 11 percentage points in F1 score. Additionally, the proposed measured-only graph topology significantly reduces model training time and enhances performance, suggesting a more efficient and robust framework for fault location in partially observable distribution grids.
Methodology
The authors systematically compare a newly proposed graph construction algorithm that reflects partial observability by using only measured buses against the traditional full-topology approach. They introduce STGNN architectures, specifically RGSAGE and RGATv2, and conduct a quantitative benchmarking study using a dataset of RMS voltage measurements collected from a sparse placement of measurement devices.
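A measured-only graph can be formed by contracting out the unmeasured buses: two measured buses become neighbours whenever the full topology joins them through a path of unmeasured buses only. A minimal sketch of that construction (the bus labels and adjacency are made up; this may differ from the paper's exact graph-forming rule):

```python
from collections import deque

def measured_only_edges(adj, measured):
    """Connect two measured buses if a path of unmeasured buses links them.

    adj:      dict bus -> set of neighbouring buses (full topology)
    measured: set of buses with measurement devices
    """
    edges = set()
    for src in measured:
        seen, queue = {src}, deque([src])
        while queue:
            node = queue.popleft()
            for v in adj[node]:
                if v in seen:
                    continue
                seen.add(v)
                if v in measured:
                    edges.add(frozenset((src, v)))   # stop at a measured bus
                else:
                    queue.append(v)                   # pass through unmeasured bus
    return {tuple(sorted(e)) for e in edges}

# 1 -- 2 -- 3 -- 4, with only buses 1 and 4 measured: they become neighbours.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(measured_only_edges(adj, {1, 4}))  # {(1, 4)}
```

Because the resulting graph has only as many nodes as there are sensors, message passing is far cheaper, which is consistent with the reported six-fold reduction in training time.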
Results
All evaluated STGNN variants achieved high performance, consistently outperforming the pure RNN baseline by up to 11 percentage points in F1 score. The STGNN models demonstrated superior stability with tighter confidence intervals compared to the RNN baseline. The measured-only GNN topology resulted in a six-fold reduction in model training time and improved performance.
Implications
The findings suggest that using a measured-only graph topology can provide a more practical and efficient framework for fault location in distribution grids, which is particularly beneficial given the increasing complexity of modern power systems. This approach could lead to faster fault isolation and service restoration, enhancing the reliability of electrical distribution networks.
Clinically Interpretable Sepsis Early Warning via LLM-Guided Simulation of Temporal Physiological Dynamics
Large Language Models
Time Series
Interpretability
- Introduces a novel LLM-guided framework for sepsis early warning.
- Combines spatiotemporal feature extraction with clinical reasoning prompts.
- Achieves superior predictive performance compared to traditional models.
- Provides interpretable predictions that align with clinical judgment.
Clinically Interpretable Sepsis Early Warning via LLM-Guided Simulation of Temporal Physiological Dynamics
Summary
This paper addresses the critical challenge of timely and interpretable early warning for sepsis, a life-threatening condition characterized by complex physiological dynamics. Traditional models often yield accurate predictions but lack interpretability, which can undermine clinical confidence. The authors propose a novel framework that utilizes a Large Language Model (LLM) to simulate physiological trajectories leading up to sepsis onset. This framework includes a spatiotemporal feature extraction module to capture dynamic dependencies among vital signs, a Medical Prompt-as-Prefix module to embed clinical reasoning into LLMs, and an agent-based post-processing component to ensure predictions remain within physiologically plausible ranges. By simulating key physiological indicators and classifying sepsis onset, the model provides transparent predictions that align with clinical judgment. Evaluated on the MIMIC-IV and eICU databases, the proposed method achieves superior AUC scores (0.861–0.903) across various pre-onset prediction tasks, outperforming conventional deep learning and rule-based approaches. The model not only enhances prediction accuracy but also offers interpretable trajectories and risk trends, aiding clinicians in early intervention and personalized decision-making in intensive care settings.
Methodology
The proposed framework consists of three main components: a spatiotemporal feature extraction module that captures dynamic dependencies among multivariate vital signs, a Medical Prompt-as-Prefix module that integrates clinical reasoning into LLMs, and an agent-based post-processing component that constrains predictions to physiologically plausible ranges. The model first simulates the evolution of key physiological indicators and then classifies sepsis onset based on these trajectories.
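The agent-based post-processing constraint amounts to clipping each simulated vital-sign trajectory to a plausible range. A minimal sketch; the ranges below are illustrative placeholders, not clinical reference values from the paper.

```python
def constrain_vitals(trajectory, ranges):
    """Clip simulated vital-sign trajectories to physiologically
    plausible (lo, hi) ranges -- the post-processing guardrail."""
    out = {}
    for name, values in trajectory.items():
        lo, hi = ranges[name]
        out[name] = [min(max(v, lo), hi) for v in values]
    return out

ranges = {"heart_rate": (20, 250), "spo2": (50, 100)}   # illustrative bounds
sim = {"heart_rate": [72, 310, 65], "spo2": [97, 101, 95]}
print(constrain_vitals(sim, ranges))
# {'heart_rate': [72, 250, 65], 'spo2': [97, 100, 95]}
```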
Results
The framework was evaluated using the MIMIC-IV and eICU databases, achieving AUC scores ranging from 0.861 to 0.903 across prediction tasks spanning 24 to 4 hours before sepsis onset. This performance surpasses that of conventional deep learning and rule-based approaches, demonstrating both accuracy and interpretability in predictions.
Implications
The findings suggest that the LLM-guided simulation framework can significantly improve early sepsis detection and intervention strategies in clinical settings. By providing interpretable predictions, it enhances clinician confidence and supports personalized decision-making, potentially reducing mortality rates associated with sepsis.
FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels
Federated Learning
- FedSIR addresses noisy labels in federated learning through spectral analysis of client feature representations.
- The framework includes a mechanism for identifying clean and noisy clients with minimal communication overhead.
- It employs a relabeling scheme that allows noisy clients to correct their labels based on spectral references from clean clients.
- The integration of noise-aware optimization techniques enhances the stability of training under label noise.
FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels
Summary
The paper introduces FedSIR, a novel multi-stage framework designed to enhance the robustness of federated learning (FL) in the presence of noisy labels. Traditional methods often rely on loss dynamics or prediction-based strategies, which can be unreliable in FL due to data heterogeneity and client participation challenges. FedSIR leverages the spectral structure of client feature representations to identify and mitigate label noise effectively. The framework consists of three main components: (1) a spectral client identification mechanism that analyzes class-wise feature subspaces to distinguish clean from noisy clients with minimal communication overhead; (2) a spectral relabeling scheme that allows noisy clients to correct their labels using references from clean clients; and (3) a noise-aware training strategy that integrates logit-adjusted loss, knowledge distillation, and distance-aware aggregation to stabilize the optimization process. Extensive experiments demonstrate that FedSIR consistently outperforms existing state-of-the-art methods for FL with noisy labels across various benchmarks.
Methodology
FedSIR employs a multi-stage approach that includes spectral client identification to categorize clients based on the consistency of their feature representations, a relabeling strategy that uses spectral references from clean clients to correct labels, and a noise-aware optimization strategy that combines logit-adjusted loss, knowledge distillation, and distance-aware aggregation to improve training stability.
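The spectral identification step rests on comparing class-wise feature subspaces across clients. A sketch of one plausible primitive: extract the top singular subspace of each client's feature matrix and score how well the two subspaces align via the principal angles. The rank and data below are illustrative, not FedSIR's actual procedure.

```python
import numpy as np

def subspace_alignment(feats_a, feats_b, rank=2):
    """Compare two clients' class-wise feature subspaces.

    Takes the top-`rank` right singular vectors of each (n_samples x dim)
    feature matrix and returns the mean squared cosine of the principal
    angles between the subspaces: 1.0 = identical, 0.0 = orthogonal.
    """
    def basis(f):
        _, _, vt = np.linalg.svd(f - f.mean(axis=0), full_matrices=False)
        return vt[:rank].T                        # dim x rank orthonormal basis
    s = np.linalg.svd(basis(feats_a).T @ basis(feats_b), compute_uv=False)
    return float(np.mean(s ** 2))

rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 4)) @ np.diag([3.0, 2.0, 0.1, 0.1])
same = clean + 0.01 * rng.normal(size=(100, 4))   # a near-identical "clean" client
print(subspace_alignment(clean, same) > 0.99)  # True
```

Only the small basis matrices (not raw features) would need to be exchanged, which is consistent with the claimed minimal communication overhead.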
Results
The experiments conducted on federated CIFAR-10 datasets under various label noise rates and client heterogeneity levels show that FedSIR consistently outperforms strong baselines and recent methods for federated learning with noisy labels, demonstrating its effectiveness in real-world scenarios.
Implications
The findings suggest that leveraging structural information from client feature representations can significantly improve the robustness of federated learning systems, particularly in environments where data quality is uncertain. This approach could be beneficial in applications such as healthcare, finance, and any domain where data privacy and label reliability are critical.
ACT: Anti-Crosstalk Learning for Cross-Sectional Stock Ranking via Temporal Disentanglement and Structural Purification
Graph Learning
Time Series
- Identification of crosstalk as a critical issue in graph-based stock ranking.
- Introduction of the ACT framework to systematically address temporal-scale and structural crosstalk.
- Utilization of Temporal Component Decomposition (TCD) for disentangling stock sequences.
- Implementation of a Progressive Structural Purification Encoder for structural crosstalk mitigation.
ACT: Anti-Crosstalk Learning for Cross-Sectional Stock Ranking via Temporal Disentanglement and Structural Purification
Summary
The paper introduces the Anti-CrossTalk (ACT) framework to address the challenges of crosstalk in cross-sectional stock ranking, which is crucial for quantitative investment. Crosstalk refers to unintended information interference across predictive factors, identified in two forms: temporal-scale crosstalk and structural crosstalk. Temporal-scale crosstalk arises when trends, fluctuations, and shocks are entangled in a shared representation, while structural crosstalk occurs when heterogeneous relationships among stocks are fused indiscriminately, obscuring relation-specific signals. The ACT framework mitigates these issues through temporal disentanglement and structural purification. It decomposes stock sequences into trend, fluctuation, and shock components, utilizing dedicated branches to extract component-specific information. A Progressive Structural Purification Encoder is employed to purify structural crosstalk on the trend component, followed by an adaptive fusion module that integrates all branch representations for ranking. Experiments on the CSI 300 and CSI 500 datasets demonstrate that ACT achieves state-of-the-art ranking accuracy and portfolio performance, with significant improvements over existing methods.
Methodology
The ACT framework employs Temporal Component Decomposition (TCD) to separate stock sequences into trend, fluctuation, and shock components. It uses dedicated branches (FCI for fluctuations and SCI for shocks) to extract specific information and mitigate non-transferable dynamics. A Progressive Structural Purification Encoder (PSPE) is introduced to purify structural crosstalk on the trend component, followed by an adaptive fusion module to combine the representations for final ranking.
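The decomposition step can be sketched in a simplified form. The version below uses a moving average for the trend and z-score thresholding of the residual to split shocks from ordinary fluctuations; this is an illustrative stand-in, not the paper's learned TCD, and the window size and threshold are assumptions:

```python
import numpy as np

def temporal_component_decomposition(prices, trend_window=20, shock_z=3.0):
    """Split a stock series into trend, fluctuation, and shock components.

    Illustrative stand-in for ACT's TCD: moving average -> trend;
    large-z residuals -> shocks; remainder -> fluctuations.
    """
    pad = trend_window // 2
    padded = np.pad(prices, pad, mode="edge")
    kernel = np.ones(trend_window) / trend_window
    trend = np.convolve(padded, kernel, mode="same")[pad : pad + len(prices)]
    residual = prices - trend
    sigma = residual.std() + 1e-12
    # residuals beyond shock_z standard deviations are treated as shocks
    shock = np.where(np.abs(residual) > shock_z * sigma, residual, 0.0)
    fluctuation = residual - shock
    return trend, fluctuation, shock
```

By construction the three components sum back to the original series, which is the property that lets dedicated branches process each component without losing information.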
Results
ACT outperforms 16 competitive baselines in cross-sectional stock ranking accuracy and portfolio returns, achieving improvements of up to 74.25% on the CSI 300 dataset. The extensive experiments validate the effectiveness of the anti-crosstalk designs through comprehensive ablation studies.
Implications
The findings suggest that addressing crosstalk can significantly enhance the performance of stock ranking models, potentially leading to better investment strategies and improved financial forecasting. The methodologies developed could also be applicable to other domains where interdependencies and temporal dynamics are present.
Generative Augmentation of Imbalanced Flight Records for Flight Diversion Prediction: A Multi-objective Optimisation Framework
Generative Models
Optimization
- Introduces a multi-objective optimisation framework for hyperparameter tuning of generative models in the context of rare flight diversion events.
- Demonstrates the necessity of a comprehensive evaluation framework for assessing synthetic data quality beyond single metrics.
- Shows that models trained on a combination of real and synthetic data significantly outperform those trained only on real data.
- Explores the impact of different augmentation sizes on the predictive quality of rare event predictions.
Read more
Generative Augmentation of Imbalanced Flight Records for Flight Diversion Prediction: A Multi-objective Optimisation Framework
Summary
This paper addresses the challenge of predicting flight diversions, which are rare but significant events in aviation. The authors propose a novel approach that utilizes generative models to augment historical flight data with synthetic diversion records, thereby enhancing the training of machine learning models. Given the extreme class imbalance in flight records (only 127 diversions out of 61,000 flights), the study introduces a multi-objective optimisation framework to fine-tune hyperparameters for three deep generative models: Tabular Variational Autoencoder (TVAE), Conditional Tabular Generative Adversarial Network (CTGAN), and CopulaGAN, with a Gaussian Copula model as a baseline. The quality of the synthetic data is evaluated through a comprehensive six-stage framework that assesses realism, diversity, operational validity, statistical similarity, fidelity, and predictive utility. The results indicate that the optimised generative models significantly outperform non-optimised versions, and that the inclusion of synthetic data markedly improves the predictive accuracy of diversion events compared to models trained solely on real data. This research highlights the potential of synthetic data augmentation in improving predictive modelling for rare events in aviation, ultimately contributing to enhanced safety and operational efficiency.
Methodology
The authors trained and optimised multiple generative models on a subset of flight diversion data, generating synthetic records to augment the training dataset. They employed a six-stage evaluation framework to assess the quality of the synthetic data and its impact on predictive performance.
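The selection step of such a multi-objective search can be sketched as a Pareto filter over per-configuration metric scores. This is an assumed form of the comparison logic, not the paper's exact optimiser; the six-stage evaluation framework would supply the actual objective values:

```python
import numpy as np

def pareto_front(scores):
    """Indices of non-dominated hyperparameter configurations.

    All objectives are maximised: a configuration is kept unless some
    other configuration is at least as good on every metric and strictly
    better on at least one.
    """
    scores = np.asarray(scores, dtype=float)
    keep = []
    for i, s in enumerate(scores):
        dominated = any(np.all(t >= s) and np.any(t > s) for t in scores)
        if not dominated:
            keep.append(i)
    return keep
```

For example, with scores `[[1, 1], [2, 2], [0, 3]]` the first configuration is dominated by the second, so only the last two survive.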
Results
The optimised generative models demonstrated significant improvements in predictive accuracy for flight diversions compared to non-optimised models. The inclusion of synthetic data led to better generalisation and detection of diversion events, even in the absence of strongly correlated features.
Implications
The findings suggest that synthetic data augmentation can be a powerful tool for improving predictive modelling of rare events in aviation, which can enhance safety and operational efficiency. This approach may also be applicable in other domains facing similar challenges with rare event prediction.
Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach
Time Series
Efficient ML
Audio & Speech
- R-DCNN allows for denoising of periodic signals with varying frequencies using a single training observation.
- The method significantly reduces computational complexity compared to traditional deep learning and classical autoregressive methods.
- R-DCNN is optimized for low-power applications, making it suitable for IoT and edge devices.
- The approach maintains high accuracy in signal denoising without the need for retraining on new observations.
Read more
Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach
Summary
This paper presents a novel method for denoising periodic signals using a Dilated Convolutional Neural Network (DCNN) combined with a resampling technique, termed R-DCNN. Traditional deep learning approaches for signal processing often require substantial computational resources and are trained separately for each signal observation. The proposed R-DCNN method is designed to operate under strict power and resource constraints, making it suitable for deployment in low-power environments such as IoT devices. The method allows for the processing of signals with varying fundamental frequencies using a single observation for training, and it employs a lightweight resampling step to align time scales across different signals. The results demonstrate that R-DCNN achieves performance comparable to state-of-the-art classical methods, such as autoregressive techniques, while significantly reducing computational complexity. This efficiency makes R-DCNN particularly advantageous for applications requiring real-time processing and low resource consumption.
Methodology
The methodology involves using a one-dimensional Dilated Convolutional Neural Network (DCNN) that incorporates a resampling technique to adjust the time axis of periodic signals. The model is trained on a single observation, and during inference, it applies the learned weights to re-sampled observations, allowing for effective denoising without retraining.
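The two ingredients, resampling and dilated convolution, can be sketched without a deep learning framework. The interpolation factor and filter form below are illustrative assumptions; the point is that after resampling, one set of weights learned on a single observation transfers across fundamental frequencies:

```python
import numpy as np

def resample_to_reference(signal, f0, f_ref):
    """Stretch the time axis so the fundamental period matches a reference."""
    n_out = int(round(len(signal) * f0 / f_ref))
    t_out = np.linspace(0.0, len(signal) - 1, n_out)
    return np.interp(t_out, np.arange(len(signal)), signal)

def dilated_conv1d(x, weights, dilation):
    """Causal 1-D convolution with dilation, written framework-free."""
    y = np.zeros(len(x))
    for k, w in enumerate(weights):
        shift = k * dilation  # dilation spaces the taps apart in time
        if shift == 0:
            y += w * x
        else:
            y[shift:] += w * x[:-shift]
    return y
```

Dilation lets a small filter cover a full signal period, which is where the low complexity claim comes from.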
Results
The experiments show that R-DCNN provides high accuracy in denoising periodic signals while achieving lower computational complexity than existing methods, including both deep learning DCNNs and classical autoregressive techniques. The results indicate that the proposed method can effectively handle signals with varying fundamental frequencies.
Implications
The findings suggest that R-DCNN can be effectively utilized in various applications requiring periodic signal processing, such as speech recognition, medical diagnostics, and sonar systems, particularly in environments with limited computational resources.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
NLP
Large Language Models
Efficient ML
- DLE is a deterministic method that replaces stochastic sampling in inference tasks.
- It systematically explores previously unvisited high-probability branches, improving coverage of the search space.
- DLE reduces redundant token generation, leading to more efficient use of computational resources.
- Empirical results show that DLE achieves better performance on math, coding, and general reasoning tasks compared to self-consistency.
Read more
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Summary
This paper introduces Distinct Leaf Enumeration (DLE), a novel deterministic decoding method aimed at enhancing inference efficiency in constrained domains such as math and coding. Traditional self-consistency methods, which sample multiple reasoning traces and vote on the best outcome, are found to be compute-inefficient due to repetitive sampling of high-probability prefixes. DLE addresses this inefficiency by systematically exploring a pruned decoding tree to enumerate distinct leaves, thereby avoiding redundant token generation and maximizing coverage of the truncated search space under a fixed compute budget. The authors demonstrate that DLE outperforms stochastic self-consistency by generating higher-quality reasoning traces, leading to improved performance on various tasks. The method is shown to be particularly effective in scenarios where repeated sampling is wasteful, making it a suitable replacement for self-consistency in code synthesis and tightly specified math problems. Overall, DLE not only enhances the quality of generated outputs but also reduces inference time, especially in memory-constrained environments.
Methodology
The authors propose DLE as a deterministic enumeration method that explores a pruned decoding tree to generate distinct reasoning chains without duplicates. This approach contrasts with traditional stochastic sampling methods, which often revisit the same prefixes and generate redundant outputs. DLE leverages a fixed compute budget to maximize the exploration of high-probability branches, thereby enhancing the efficiency of the inference process.
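The core idea can be sketched as best-first enumeration over a truncated tree. The toy next-token model below is an assumption for illustration; the structure (pop the highest-probability prefix, expand its top-k children, emit complete sequences) is what makes every returned leaf distinct:

```python
import heapq, math

def enumerate_distinct_leaves(next_probs, max_len, n_leaves, top_k=2):
    """Deterministically enumerate distinct high-probability sequences.

    Unlike stochastic sampling, which revisits popular prefixes, this
    pops the best partial sequence, expands only its top-k continuations,
    and yields complete sequences in decreasing probability order.
    """
    heap = [(0.0, ())]  # entries: (negative log-prob, token tuple)
    leaves = []
    while heap and len(leaves) < n_leaves:
        neg_lp, seq = heapq.heappop(heap)
        if len(seq) == max_len:  # complete sequence: a distinct leaf
            leaves.append((math.exp(-neg_lp), seq))
            continue
        probs = next_probs(seq)  # model's next-token distribution
        best = sorted(probs.items(), key=lambda kv: -kv[1])[:top_k]
        for tok, p in best:      # truncate the tree to top-k branches
            heapq.heappush(heap, (neg_lp - math.log(p), seq + (tok,)))
    return leaves
```

With a constant two-token distribution `{0: 0.7, 1: 0.3}`, the first leaf is `(0, 0, 0)` with probability 0.343, and no sequence is ever generated twice.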
Results
Empirical evaluations demonstrate that DLE provides higher coverage of the truncated search space compared to stochastic self-consistency, resulting in improved performance on tasks related to math, coding, and general reasoning. Additionally, DLE is shown to be more token-efficient, generating more complete sequences under the same token budget and leading to practical latency improvements in inference systems.
Implications
The introduction of DLE has significant implications for the efficiency of large language models, particularly in applications requiring high precision and low redundancy, such as code generation and mathematical problem-solving. By reducing computational waste and improving output quality, DLE can enhance the practical deployment of language models in real-world scenarios.
An effective variant of the Hartigan $k$-means algorithm
Optimization
Theory
Efficient ML
- Smartigan builds on Hartigan's algorithm, yielding an additional 2-5% improvement in clustering performance.
- The algorithm encourages exploration in the initial stages of clustering, leading to better convergence.
- Smartigan maintains theoretical stability guarantees similar to Hartigan's method.
- Empirical results show that Smartigan consistently outperforms both Hartigan and Lloyd's algorithms.
Read more
An effective variant of the Hartigan $k$-means algorithm
Summary
This paper presents a novel variant of Hartigan's k-means algorithm, referred to as 'Smartigan', which aims to improve clustering performance over traditional methods like Lloyd's algorithm. The authors highlight that while Hartigan's algorithm already provides a 5-10% improvement over Lloyd's method, their proposed variation can yield an additional 2-5% improvement, particularly as the dimensionality of the data or the number of clusters increases. The Smartigan algorithm modifies the original Hartigan method by introducing a mechanism that encourages exploration during the clustering process, especially in the initial stages when the cluster structure is not well-defined. The paper provides a detailed description of the algorithm, including its initialization, iteration process, and the conditions under which points are reassigned to different clusters. The authors also discuss the theoretical guarantees of the Smartigan algorithm, establishing that it maintains stability in cluster assignments similar to Hartigan's method while enhancing performance. Through empirical evaluations, the paper demonstrates that Smartigan consistently outperforms Hartigan, thus providing a more effective approach to the k-means clustering problem.
Methodology
The Smartigan algorithm is a variation of Hartigan's k-means method that incorporates a random permutation of points for evaluation, enhancing the exploration of cluster assignments. The algorithm iteratively checks whether moving a point to a different cluster decreases the k-means functional, updating centroids accordingly while ensuring stability in assignments.
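A minimal sketch of the underlying Hartigan move rule with a randomly permuted sweep order is shown below. The details here are illustrative, not the paper's exact Smartigan procedure: a point leaves cluster i for cluster j only when the exact change in the k-means objective is negative.

```python
import numpy as np

def hartigan_kmeans(X, k, n_passes=20, seed=0):
    """Hartigan-style k-means sweeping points in a random permutation.

    Move criterion (exact objective change for single-point moves):
        n_i/(n_i-1) * ||x - c_i||^2  >  n_j/(n_j+1) * ||x - c_j||^2
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    assign = rng.integers(0, k, size=n)
    for _ in range(n_passes):
        moved = False
        for idx in rng.permutation(n):  # randomised evaluation order
            x, i = X[idx], assign[idx]
            sizes = np.bincount(assign, minlength=k)
            if sizes[i] <= 1:
                continue  # never empty a cluster
            cents = np.array([X[assign == c].mean(axis=0) if sizes[c]
                              else np.zeros(d) for c in range(k)])
            gain = sizes[i] / (sizes[i] - 1) * np.sum((x - cents[i]) ** 2)
            costs = np.array([sizes[j] / (sizes[j] + 1)
                              * np.sum((x - cents[j]) ** 2) for j in range(k)])
            costs[i] = np.inf
            j = int(np.argmin(costs))
            if costs[j] < gain:  # move strictly decreases the objective
                assign[idx] = j
                moved = True
        if not moved:
            break
    return assign
```

Each accepted move strictly decreases the k-means functional, which is the stability property the paper's theoretical guarantees rest on.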
Results
The empirical results indicate that Smartigan achieves a clear performance improvement over Hartigan's algorithm, particularly in higher dimensions and with more clusters. The authors provide numerical evidence supporting the effectiveness of their method compared to traditional k-means approaches.
Implications
The findings suggest that Smartigan could be applied in various clustering scenarios where traditional k-means methods struggle, particularly in high-dimensional data or complex clustering tasks. This could enhance applications in fields such as data mining, image processing, and any domain requiring effective clustering solutions.
Data-Driven Open-Loop Simulation for Digital-Twin Operator Decision Support in Wastewater Treatment
Time Series
- CCSS-RS effectively simulates WWTP responses under various control scenarios while managing irregular and missing data.
- The model achieves a 40-46% reduction in RMSE compared to Neural CDE baselines, showcasing its superior predictive capabilities.
- Operational case studies highlight the model's practical utility in real-world decision-making for wastewater treatment operators.
- The architecture of CCSS-RS is tailored to the specific data conditions of full-scale WWTP operations, avoiding the need for recalibration of mechanistic models.
Read more
Data-Driven Open-Loop Simulation for Digital-Twin Operator Decision Support in Wastewater Treatment
Summary
This paper addresses the need for effective decision support tools in wastewater treatment plants (WWTPs) through the development of a data-driven open-loop simulation model named CCSS-RS (Controlled Continuous-time State-space Model with Regime Switching). The model is designed to simulate plant responses under various control plans while accommodating irregular and missing sensor data. It separates historical state inference from future control actions, allowing for accurate long-term predictions over planning horizons of 12-36 hours. The CCSS-RS model incorporates advanced techniques such as typed context encoding, gain-weighted forcing of control inputs, and semigroup-consistent rollouts to handle the complexities of real-world WWTP data. The model was evaluated on the Avedøre WWTP dataset, achieving significant improvements in predictive accuracy compared to existing baselines. The results demonstrate the model's potential for operational decision support in industrial wastewater treatment, complementing traditional mechanistic models.
Methodology
The CCSS-RS model employs a controlled continuous-time state-space framework that differentiates between state variables, control inputs, and exogenous variables. It utilizes typed context encoding to handle irregular observations, a gain-weighted forcing mechanism for injecting control inputs, and regime-specific experts to capture phase-dependent dynamics. The model is evaluated against established baselines and simplified variants using a comprehensive dataset from the Avedøre WWTP.
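The semigroup-consistency idea can be illustrated on a deliberately simplified system. The diagonal linear dynamics below are an assumption for exposition, not the paper's learned model: when the one-step transition map is exact, one step of 2·dt equals two steps of dt, so long planning horizons do not accumulate discretisation error.

```python
import numpy as np

def rollout(x0, a_diag, b, controls, dt):
    """Exact rollout of a diagonal controlled system dx/dt = a*x + b*u."""
    xs = [np.asarray(x0, dtype=float)]
    phi = np.exp(a_diag * dt)                       # exact state transition
    safe_a = np.where(a_diag == 0.0, 1.0, a_diag)
    gamma = np.where(a_diag == 0.0, dt, (phi - 1.0) / safe_a)
    for u in controls:                              # zero-order-hold inputs
        xs.append(phi * xs[-1] + gamma * (b * u))
    return np.array(xs)
```

The semigroup property can be checked directly: rolling forward with one 0.2 s step and two 0.1 s steps under the same held control lands on the same state.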
Results
The CCSS-RS model achieved a Root Mean Square Error (RMSE) of 0.696 and a Continuous Ranked Probability Score (CRPS) of 0.349 across 10,000 test windows, significantly outperforming Neural CDE models and simplified variants. The model demonstrated operational value in case studies, accurately predicting ammonium levels and effectively managing sensor outages.
Implications
The development of CCSS-RS provides a robust tool for wastewater treatment operators to conduct scenario screening and decision-making without the need for extensive recalibration of mechanistic models. This approach aligns with Industry 4.0 initiatives in smart-water management, enhancing the operational efficiency and responsiveness of wastewater treatment processes.
JEPAMatch: Geometric Representation Shaping for Semi-Supervised Learning
Computer Vision
Theory
Efficient ML
- Introduces JEPAMatch, a new semi-supervised learning framework that enhances representation learning.
- Addresses class imbalance and convergence speed issues prevalent in existing methods like FixMatch.
- Utilizes a latent-space regularization term to promote isotropic Gaussian structures in the representation space.
- Demonstrates superior performance on benchmark datasets, achieving faster convergence and reduced computational costs.
Read more
JEPAMatch: Geometric Representation Shaping for Semi-Supervised Learning
Summary
The paper presents JEPAMatch, a novel approach for semi-supervised learning (SSL) that addresses the limitations of existing methods, particularly those derived from FixMatch. While FixMatch has achieved state-of-the-art results in image classification, it suffers from issues related to class imbalance and slow convergence due to reliance on fixed confidence thresholds for pseudo-labeling. JEPAMatch introduces a paradigm shift by focusing on the geometric representation of data in latent space, inspired by the Latent-Euclidean Joint-Embedding Predictive Architectures (LeJEPA). This method combines a semi-supervised loss with a latent-space regularization term to encourage isotropic Gaussian structures in the representation space. By decoupling the classification training from the geometric organization of the representation, JEPAMatch improves the model's ability to learn informative features and decision boundaries. Extensive experiments on benchmark datasets such as CIFAR-100, STL-10, and Tiny-ImageNet demonstrate that JEPAMatch consistently outperforms existing baselines, accelerates convergence, and reduces computational costs compared to traditional FixMatch-based approaches.
Methodology
JEPAMatch integrates a new training objective that combines a classical semi-supervised loss with a latent-space regularization term. It operates on two complementary levels: a Curriculum Level for pseudo-label selection and a Representation Level for structuring the learned feature space. This dual-level optimization allows for improved classification performance and mitigates class imbalance effects.
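A minimal sketch of the latent-space regularisation term is given below, assuming a simple Frobenius-norm form; the paper's LeJEPA-inspired objective may differ, but the effect is the same: batches whose embedding covariance deviates from the identity (e.g. collapsed representations) are penalised.

```python
import numpy as np

def isotropy_penalty(z):
    """Penalty pushing a batch of embeddings toward an isotropic Gaussian.

    Centre the batch, form its covariance, and measure the squared
    Frobenius distance to the identity matrix.
    """
    z = z - z.mean(axis=0, keepdims=True)
    cov = z.T @ z / len(z)
    return float(np.sum((cov - np.eye(z.shape[1])) ** 2))
```

A healthy isotropic batch scores near zero, while a rank-collapsed batch (all embeddings on one direction) scores far higher, which is exactly the failure mode the regulariser discourages.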
Results
The experiments conducted on CIFAR-100, STL-10, and Tiny-ImageNet show that JEPAMatch consistently outperforms existing baseline methods, achieving better classification accuracy while significantly accelerating convergence and reducing computational costs compared to standard FixMatch-based pipelines.
Implications
The findings suggest that JEPAMatch could be effectively applied in various domains where labeled data is scarce, such as medical imaging and natural language processing, by improving the efficiency and effectiveness of semi-supervised learning models.
Tokenised Flow Matching for Hierarchical Simulation Based Inference
Efficient ML
Theory
- Introduces Tokenised Flow Matching for Posterior Estimation (TFMPE) to enhance simulation efficiency in hierarchical SBI.
- Utilizes likelihood factorisation to train from single-site simulations, reducing the need for multiple simulator evaluations.
- Validates the proposed method on a new benchmark and real-world models, showing improved calibration and reduced computational costs.
- Addresses the practical bottleneck of simulator evaluations in hierarchical settings with shared parameters.
Read more
Tokenised Flow Matching for Hierarchical Simulation Based Inference
Summary
This paper addresses the computational challenges associated with Simulation Based Inference (SBI), particularly in hierarchical models where simulations can be costly. The authors propose a novel approach called Tokenised Flow Matching for Posterior Estimation (TFMPE), which utilizes likelihood factorisation to improve simulation efficiency. Unlike traditional hierarchical SBI methods that require multiple site simulations per training sample, TFMPE learns a per-site neural surrogate of the simulator, allowing for the assembly of synthetic multi-site observations from single-site simulations. This method is particularly beneficial in scenarios with shared global parameters and exchangeable site-level parameters. The authors introduce a benchmark for hierarchical SBI to facilitate systematic evaluation of their approach. They validate TFMPE on this benchmark as well as on realistic models in infectious disease and computational fluid dynamics, demonstrating that it produces well-calibrated posteriors while significantly reducing computational costs. Overall, the paper contributes to the field by providing a more efficient framework for hierarchical SBI, which can be applied to various scientific domains.
Methodology
The authors employ likelihood factorisation to create a per-site neural surrogate for the simulator, enabling the generation of synthetic multi-site observations from single-site simulations. This approach is integrated into a tokenised framework that allows for flexible handling of function-valued observations.
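The assembly trick can be sketched concretely. Because sites are conditionally independent given the global parameter, a multi-site training example can be stitched together from cheap single-site simulations instead of one expensive joint run. The toy simulator and Gaussian site-parameter prior below are assumptions for illustration:

```python
import numpy as np

def assemble_multisite(theta_global, n_sites, site_sim, rng):
    """Build one synthetic multi-site observation from single-site draws."""
    # exchangeable site-level parameters drawn around the shared global one
    theta_local = rng.normal(theta_global, 1.0, size=n_sites)
    return np.array([site_sim(theta_global, t, rng) for t in theta_local])

# hypothetical single-site simulator: y ~ N(theta_global + theta_local, 0.1^2)
toy_sim = lambda g, l, rng: g + l + 0.1 * rng.standard_normal()
```

Each call to `site_sim` is a single-site evaluation, so the cost of one assembled observation scales with the number of sites rather than with the number of joint simulator runs.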
Results
The TFMPE method was validated against a newly introduced benchmark for hierarchical SBI and demonstrated well-calibrated posterior distributions. It also showed a significant reduction in computational costs when applied to realistic infectious disease and computational fluid dynamics models.
Implications
The findings suggest that TFMPE can be a valuable tool for researchers in fields requiring hierarchical simulation-based inference, such as epidemiology and fluid dynamics, by making the inference process more efficient and accessible.
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
NLP
Large Language Models
Efficient ML
- High-variance activation directions are not indicative of importance for model predictions.
- Block linearity is conditional and can be disrupted by changes in earlier blocks.
- Direct quantization is more effective than weight factorization for reducing errors.
- Linearity in transformer blocks increases with depth, indicating a shift from nonlinear to linear processing.
Read more
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
Summary
This paper presents a comprehensive empirical study on the compression of transformer models, specifically GPT-2 and Mistral 7B, through over 40 experiments. The research identifies five key structural properties that influence the compressibility of transformers: (1) High-variance activation directions do not correlate with predictive importance, as shown by canonical correlation analysis (CCA); (2) The linearity of transformer blocks is conditional on the upstream distribution, meaning that modifications in earlier blocks can degrade performance; (3) Weight factorization amplifies quantization errors, indicating that direct quantization is superior; (4) Linearity increases with depth, revealing a division of labor between early nonlinear feature construction and late linear refinement; and (5) A significant portion of tokens (30%) is computationally easy to process. The study demonstrates that single-block linear replacements can achieve substantial compression with minimal increase in perplexity, while multi-block replacements fail due to error accumulation. The findings suggest that traditional static post-training compression methods face inherent limitations, advocating instead for adaptive computation strategies tailored to individual tokens. The paper concludes with practical guidance for compression practitioners based on the identified structural properties.
Methodology
The study employs a series of experiments on two transformer models, GPT-2 and Mistral 7B, to analyze their compressibility. It utilizes techniques such as canonical correlation analysis (CCA) to assess the relationship between activation variance and predictive importance, ridge regression for measuring block linearity, and PCA for analyzing activation dimensionality. Additionally, the research includes sensitivity analysis and early exit mechanisms to evaluate computational efficiency.
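The ridge-regression linearity probe can be sketched as follows, assuming the probe takes the natural form of fitting a closed-form linear map from block inputs to block outputs and reporting the explained variance:

```python
import numpy as np

def block_linearity(x_in, x_out, lam=1e-3):
    """R^2 of a ridge-regression map from block inputs to outputs.

    Closed-form ridge fit: W = (X'X + lam*I)^-1 X'Y, then report the
    fraction of output variance the linear map explains.
    """
    d = x_in.shape[1]
    W = np.linalg.solve(x_in.T @ x_in + lam * np.eye(d), x_in.T @ x_out)
    resid = x_out - x_in @ W
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((x_out - x_out.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot
```

A block implementing a genuinely linear map scores near 1, while a saturating nonlinearity scores noticeably lower, which is the quantity the paper tracks as it increases with depth.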
Results
The findings reveal that high-variance activation directions do not correlate with predictive accuracy, and that modifying earlier transformer blocks can negatively impact performance. The study also shows that single-block linear replacements can achieve a 34× compression rate with only a slight increase in perplexity, while multi-block replacements are hindered by error accumulation. Overall, the research indicates that static compression methods are limited and highlights the potential of adaptive computation strategies.
Implications
The results have significant implications for the design of compression algorithms for large language models, suggesting a shift towards adaptive computation methods that allocate resources based on token difficulty rather than relying solely on static weight compression techniques. This could lead to more efficient deployment of transformer models in practical applications.
Drug Synergy Prediction via Residual Graph Isomorphism Networks and Attention Mechanisms
Graph Learning
- Introduces ResGIN-Att, a novel model for drug synergy prediction.
- Integrates molecular features and genomic profiles with drug-drug interactions.
- Utilizes residual connections to mitigate over-smoothing in deep layers.
- Employs a cross-attention mechanism for improved interpretability.
Read more
Drug Synergy Prediction via Residual Graph Isomorphism Networks and Attention Mechanisms
Summary
This paper addresses the challenge of predicting drug synergy, which is crucial for developing effective combination therapies in treating complex diseases. Traditional single-drug therapies often lead to limited efficacy and drug resistance, making computational prediction methods essential for exploring drug combinations. The authors propose a novel model called Residual Graph Isomorphism Network integrated with an Attention mechanism (ResGIN-Att) that combines molecular structural features, cell-line genomic profiles, and drug-drug interactions to enhance synergy prediction. The ResGIN-Att model employs a residual graph isomorphism network to extract multi-scale topological features of drug molecules, mitigating over-smoothing in deep layers. An adaptive Long Short-Term Memory (LSTM) module integrates structural information across scales, while a cross-attention module models drug-drug interactions and identifies key chemical substructures. The model is evaluated on five public benchmark datasets, demonstrating competitive performance against existing methods, improved generalization capability, and robustness. This work highlights the potential of deep learning and graph neural networks in advancing drug synergy prediction without relying on predefined biological knowledge.
Methodology
The methodology involves a Residual Graph Isomorphism Network (ResGIN) for extracting topological features of drug molecules, combined with an adaptive LSTM for integrating structural information. A cross-attention mechanism is employed to explicitly model drug-drug interactions and identify significant chemical substructures.
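The cross-attention module can be sketched in a minimal single-head form. This is not the paper's exact parameterisation (no learned projections are shown), but it captures why the mechanism is interpretable: the attention weights expose which substructure pairs across the two drugs drive the interaction.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(drug_a, drug_b):
    """Single-head cross-attention from drug A's substructures to drug B's.

    Queries come from one drug's substructure embeddings, keys and
    values from the other's.
    """
    d = drug_a.shape[1]
    scores = drug_a @ drug_b.T / np.sqrt(d)   # (n_a, n_b) pairwise relevance
    weights = softmax(scores, axis=1)         # each row sums to 1 over B
    return weights @ drug_b, weights          # B-context per A substructure
```

Inspecting the rows of `weights` identifies, for each substructure of drug A, which chemical substructures of drug B it attends to most strongly.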
Results
The ResGIN-Att model achieved competitive performance on five benchmark datasets, outperforming key baseline methods. It demonstrated enhanced generalization capability and robustness, indicating its effectiveness in predicting drug synergy.
Implications
The findings suggest that the ResGIN-Att model can significantly aid in the computational prediction of drug combinations, potentially accelerating the development of effective combination therapies and improving treatment outcomes for complex diseases.
DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data
NLP
Reinforcement Learning
Efficient ML
- DR-Venus is a 4B parameter deep research agent trained entirely on 10K open data.
- The training methodology includes a two-stage process: supervised fine-tuning followed by reinforcement learning.
- DR-Venus outperforms existing models with 9B parameters and narrows the performance gap with larger 30B-class systems.
- The study highlights the importance of data quality and effective utilization in training small models.
Read more
DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data
Summary
The paper presents DR-Venus, a 4B parameter deep research agent designed for edge-scale deployment using only 10K open data. The authors address the challenge of training effective small language models under limited data by enhancing both data quality and utilization. The training process consists of two stages: first, agentic supervised fine-tuning (SFT) is employed to establish basic capabilities through strict data cleaning and resampling of long-horizon trajectories. Second, agentic reinforcement learning (RL) is applied to improve execution reliability in complex tasks. The RL component utilizes IGPO and introduces turn-level rewards based on information gain and format-aware regularization, which enhances supervision density and credit assignment. DR-Venus-4B demonstrates superior performance compared to existing models with 9B parameters across multiple benchmarks and shows promising results that approach larger 30B-class systems. The findings indicate that small models can achieve significant performance with careful data handling and suggest that test-time scaling can further unlock their potential. The authors provide their models, code, and training recipes to promote reproducible research in this area.
Methodology
The methodology involves a two-stage training pipeline: first, agentic supervised fine-tuning (SFT) on cleaned and resampled trajectories to build basic capabilities, followed by agentic reinforcement learning (RL) with turn-level credit assignment to enhance execution reliability in long-horizon tasks.
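One natural reading of a turn-level information-gain reward is entropy reduction over the agent's answer hypotheses; the sketch below is an assumed form, not the paper's exact reward:

```python
import math

def info_gain_reward(before_probs, after_probs):
    """Reward a research turn by how much it sharpens the belief distribution.

    Computed as the drop in Shannon entropy from the pre-turn to the
    post-turn distribution over candidate answers.
    """
    entropy = lambda p: -sum(q * math.log(q) for q in p if q > 0)
    return entropy(before_probs) - entropy(after_probs)
```

A turn that moves the agent from a uniform belief to a peaked one earns a positive reward, while an uninformative turn earns zero, giving denser supervision than a single end-of-trajectory signal.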
Results
DR-Venus-4B-SFT establishes a strong baseline, outperforming prior agentic systems with 9B parameters on most benchmarks. With reinforcement learning, DR-Venus-4B-RL further improves performance, setting a new standard among small agents and demonstrating potential comparable to larger models.
Implications
The findings suggest that small deep research agents can be effectively trained on limited open data, making them suitable for real-world applications where cost, latency, and privacy are critical. This work opens avenues for further research on lightweight, deployable AI systems.
A Delta-Aware Orchestration Framework for Scalable Multi-Agent Edge Computing
Reinforcement Learning
Optimization
Efficient ML
- DAOEF addresses the performance degradation in multi-agent systems when scaling beyond 100 agents.
- The framework integrates three mechanisms that work synergistically to improve efficiency.
- Controlled experiments validate the interdependence of the mechanisms, showing that removing any one increases latency significantly.
- DAOEF achieves a 62% reduction in latency and 62% energy savings in a 200-agent cloud deployment.
Read more
A Delta-Aware Orchestration Framework for Scalable Multi-Agent Edge Computing
Summary
This paper addresses the 'Synergistic Collapse' phenomenon in multi-agent edge computing, where performance degrades significantly when scaling beyond 100 agents. The authors present DAOEF (Delta-Aware Orchestration for Edge Federations), a framework that simultaneously tackles three interrelated challenges: exponential action-space growth, computational redundancy among adjacent agents, and task-agnostic hardware scheduling. The framework incorporates three key mechanisms: Differential Neural Caching, which improves cache hit ratios by storing intermediate layer activations; Criticality-Based Action Space Pruning, which reduces coordination complexity; and Learned Hardware Affinity Matching, which optimally assigns tasks to hardware accelerators. Through controlled experiments, the authors demonstrate that each mechanism is essential for performance gains, achieving a 1.45× improvement over independent implementations. DAOEF shows significant latency reductions and energy savings in various deployments, confirming its effectiveness in enhancing scalability in multi-agent systems.
Methodology
The authors developed DAOEF, which includes Differential Neural Caching for improved cache efficiency, Criticality-Based Action Space Pruning to reduce coordination complexity, and Learned Hardware Affinity Matching for optimal task assignment. They conducted controlled experiments to isolate the effects of each mechanism and validated the framework through simulations and physical deployments.
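The caching mechanism can be pictured with a toy sketch (the class, threshold, and distance metric below are our own illustration, not the paper's implementation): an agent stores a layer's activations and reuses them whenever a nearby agent's input differs by less than a small delta, which is the intuition behind reducing computational redundancy among adjacent agents.

```python
import numpy as np

class DeltaCache:
    """Toy differential cache: reuse stored layer activations when the
    new input is within `delta` (L2 distance) of a cached input."""
    def __init__(self, delta=0.1):
        self.delta = delta
        self.entries = []  # list of (input, activation) pairs
        self.hits = 0
        self.misses = 0

    def lookup(self, x, layer_fn):
        for cached_x, cached_act in self.entries:
            if np.linalg.norm(x - cached_x) < self.delta:
                self.hits += 1
                return cached_act          # cache hit: skip recomputation
        act = layer_fn(x)                  # cache miss: compute and store
        self.entries.append((x, act))
        self.misses += 1
        return act

layer = lambda x: np.tanh(x)               # stand-in for an expensive layer
cache = DeltaCache(delta=0.05)
a1 = cache.lookup(np.array([1.0, 2.0]), layer)
a2 = cache.lookup(np.array([1.0, 2.001]), layer)  # near-duplicate input → hit
```

In this sketch, adjacent agents observing nearly identical scenes would share one forward pass, which is how a higher cache hit ratio translates into the latency and energy savings reported above.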
Results
DAOEF achieved a 1.45× multiplicative gain over independent implementations of the three mechanisms. In a 200-agent cloud deployment, it resulted in a 62% reduction in latency (from 735 ms to 280 ms) and 62% energy savings (44.7 MWh/year). The framework demonstrated sub-linear latency growth up to 250 agents, while traditional methods saturated at 80 agents.
Implications
The findings suggest that DAOEF can significantly enhance the scalability and efficiency of multi-agent edge computing systems, particularly in smart city applications involving large-scale deployments of surveillance cameras and autonomous vehicles. This framework could lead to cost savings and improved performance in real-world scenarios.
Closing the Domain Gap in Biomedical Imaging by In-Context Control Samples
Computer Vision
- Batch effects significantly degrade model performance in biomedical imaging.
- CS-ARM-BN is a novel meta-learning method that utilizes negative control samples for adaptation.
- The proposed method achieves a high accuracy of 0.935 ± 0.018 on new experimental batches.
- Traditional deep learning models fail to generalize across different batches, highlighting the need for effective domain adaptation.
Read more
Closing the Domain Gap in Biomedical Imaging by In-Context Control Samples
Summary
This paper addresses the significant challenge of batch effects in biomedical imaging, which hinder the reproducibility and effectiveness of deep learning models across different experimental batches. The authors introduce a novel method called Control-Stabilized Adaptive Risk Minimization via Batch Normalization (CS-ARM-BN), which leverages negative control samples present in each batch to provide a stable context for model adaptation. The method is validated on the Mechanism-of-Action (MoA) classification task using the large-scale JUMP-CP dataset. Results indicate that traditional ResNet models experience a substantial drop in accuracy when applied to new experimental batches, from 0.939 ± 0.005 to 0.862 ± 0.060. In contrast, the proposed meta-learning approach achieves an accuracy of 0.935 ± 0.018, effectively closing the domain gap. The findings suggest that incorporating control samples can stabilize meta-learning approaches, making them more robust to domain shifts, such as those arising from different laboratories. This work demonstrates that principled in-context adaptation can neutralize batch effects in biomedical imaging, enhancing the practical usability of deep learning models in real-world applications.
Methodology
The authors propose a meta-learning adaptation method called Control-Stabilized Adaptive Risk Minimization via Batch Normalization (CS-ARM-BN), which utilizes negative control samples as stable references for adapting deep learning models to new experimental batches. The method is validated through experiments on the JUMP-CP dataset for MoA classification, comparing its performance against standard ResNet models and other domain adaptation techniques.
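The core trick can be sketched in a few lines (a simplification under our own assumptions, not the authors' code): batch-normalization statistics are re-estimated from the negative control samples of a new experimental batch, so the normalization absorbs the batch's shift before the remaining samples are classified.

```python
import numpy as np

def bn_normalize(x, mean, var, eps=1e-5):
    """Standard batch-norm transform with externally supplied statistics."""
    return (x - mean) / np.sqrt(var + eps)

def adapt_with_controls(features, control_idx):
    """Re-estimate normalization statistics from control samples only,
    then apply them to the whole batch (the control-stabilized intuition)."""
    controls = features[control_idx]
    mean = controls.mean(axis=0)
    var = controls.var(axis=0)
    return bn_normalize(features, mean, var)

rng = np.random.default_rng(0)
# A new experimental batch whose features carry a systematic shift.
batch = rng.normal(loc=3.0, scale=2.0, size=(100, 8))
adapted = adapt_with_controls(batch, control_idx=np.arange(20))
```

Because the controls are biologically identical across batches, any shift they exhibit is attributable to the batch effect, making them a stable anchor for this kind of adaptation.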
Results
The study shows that standard ResNet models experience a significant accuracy drop from 0.939 ± 0.005 to 0.862 ± 0.060 when tested on new experimental batches. In contrast, the CS-ARM-BN method achieves an accuracy of 0.935 ± 0.018, effectively closing the performance gap. This indicates that the proposed method is more effective than traditional approaches, including foundation models and other domain adaptation techniques.
Implications
The findings suggest that incorporating control samples into the training process can enhance the robustness of deep learning models in biomedical imaging, making them more reliable for real-world applications. This approach could facilitate better generalization across different experimental conditions, ultimately improving the reproducibility of biomedical research.
A Hierarchical MARL-Based Approach for Coordinated Retail P2P Trading and Wholesale Market Participation of DERs
Reinforcement Learning
Optimization
- Proposes a hierarchical MARL framework for DER participation in electricity markets.
- Facilitates P2P trading among prosumers to enhance market efficiency.
- Utilizes a Stackelberg game model for coordination of market participation.
- Addresses challenges of integrating DERs into existing electricity market structures.
Read more
A Hierarchical MARL-Based Approach for Coordinated Retail P2P Trading and Wholesale Market Participation of DERs
Summary
This paper addresses the challenges posed by the decentralization of the electric energy sector, particularly the integration of distributed energy resources (DERs) into electricity markets. The authors propose a hierarchical multi-agent reinforcement learning (MARL) framework that enables individual prosumers to engage in peer-to-peer (P2P) retail auctions while also facilitating their participation in wholesale markets through aggregation. The framework aims to enhance market efficiency and operational flexibility by coordinating the actions of prosumers using a Stackelberg game model. The study highlights the importance of local electricity markets (LEMs) and P2P trading as innovative solutions to improve grid stability and reduce costs associated with traditional utility models. By leveraging advanced data analytics and communication networks, the proposed approach seeks to optimize bidding strategies and market transactions, ultimately contributing to a more resilient and efficient energy market.
Methodology
The authors employ a hierarchical multi-agent deep reinforcement learning approach, integrating a Stackelberg game model to coordinate the actions of prosumers in both retail and wholesale markets. This methodology allows for decentralized decision-making while maintaining overall market coherence.
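The Stackelberg structure can be made concrete with a toy numerical game (the payoff and response functions are invented for illustration, not the paper's model): a leader (aggregator) posts a price, each follower (prosumer) best-responds with a supply quantity, and the leader picks the price that maximizes its payoff given those anticipated responses.

```python
import numpy as np

def follower_response(price, willingness=10.0):
    """Prosumer's best response: supply more energy as the price rises,
    saturating at `willingness` units."""
    return max(0.0, willingness - 1.0 / max(price, 1e-6))

def leader_payoff(price, n_followers=5, resale=2.0):
    """Aggregator buys from followers at `price` and resells at `resale`."""
    total_supply = n_followers * follower_response(price)
    return (resale - price) * total_supply

# Stackelberg play: the leader enumerates candidate prices, anticipating
# the followers' best responses, and commits to the best one.
prices = np.linspace(0.1, 2.0, 100)
best_price = max(prices, key=leader_payoff)
```

The MARL framework in the paper learns these leader and follower policies from data rather than solving them in closed form, but the bilevel anticipate-then-commit structure is the same.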
Results
The proposed framework demonstrates improved market performance through enhanced coordination among prosumers, leading to optimized bidding strategies and more effective participation in both retail and wholesale markets. The results indicate that the hierarchical MARL approach can significantly alleviate operational challenges and improve grid efficiency.
Implications
The findings suggest that adopting a hierarchical MARL framework can facilitate the integration of DERs into electricity markets, promoting decentralized energy trading and enhancing grid resilience. This approach could inform future policies and market designs aimed at optimizing energy distribution and consumption.
Replicable Bandits with UCB based Exploration
Theory
- Introduction of replicable algorithms for stochastic multi-armed and linear bandits.
- Development of RepUCB and RepLinUCB algorithms that achieve low regret while ensuring replicability.
- Introduction of RepRidge, a replicable ridge regression estimator with confidence guarantees.
- Significant improvement in regret bounds compared to prior methods, particularly in linear bandits.
Read more
Replicable Bandits with UCB based Exploration
Summary
This paper investigates replicable algorithms for stochastic multi-armed bandits (MAB) and linear bandits utilizing Upper Confidence Bound (UCB) based exploration. A bandit algorithm is defined as ρ-replicable if it produces the same action sequence across two executions with shared internal randomness but independent reward realizations, with a probability of at least 1 - ρ. The authors propose two main algorithms: RepUCB for stochastic MABs and RepLinUCB for stochastic linear bandits, both designed to achieve replicability while minimizing regret. They also introduce RepRidge, a replicable ridge regression estimator that provides confidence and replicability guarantees, which is integral to the RepLinUCB algorithm. The paper highlights the importance of replicability in machine learning, particularly in high-stakes decision-making contexts, and presents optimistic approaches that improve upon previous methods in terms of regret bounds and computational efficiency.
Methodology
The authors develop optimistic algorithms that leverage UCB principles for action selection in bandit settings. They introduce RepUCB for MABs and RepLinUCB for linear bandits, utilizing a replicable mean-estimation oracle and a replicable ridge regression estimator, respectively. The algorithms are designed to maintain replicability while minimizing regret through batched action selection and confidence guarantees.
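The replicability mechanism can be illustrated with a standard randomized-rounding trick (our illustration; the paper's mean-estimation oracle is more involved): round the empirical mean onto a grid whose random offset comes from the shared internal randomness, so two executions on independent reward realizations usually output the identical estimate.

```python
import numpy as np

def replicable_mean(samples, grid_width, shared_rng):
    """Round the empirical mean onto a randomly offset grid. With shared
    internal randomness fixing the offset, two runs on independent samples
    of the same arm land in the same grid cell with high probability."""
    offset = shared_rng.uniform(0, grid_width)
    mean = np.mean(samples)
    return np.floor((mean - offset) / grid_width) * grid_width + offset

# Two executions: same internal seed (42), independent reward realizations.
data_rng1, data_rng2 = np.random.default_rng(1), np.random.default_rng(2)
est1 = replicable_mean(data_rng1.normal(0.5, 0.1, 10_000), 0.2,
                       np.random.default_rng(42))
est2 = replicable_mean(data_rng2.normal(0.5, 0.1, 10_000), 0.2,
                       np.random.default_rng(42))
```

Since both runs share the offset, their outputs can only differ by whole grid cells, and they coincide unless the two empirical means straddle a cell boundary.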
Results
RepUCB achieves a regret bound of O((K² log² T / ρ²) · Σ_{a: ∆a > 0} (∆a + log(KT log T) / ∆a)), while RepLinUCB improves the regret bound to Õ((d + d³/ρ)√T), outperforming previous methods by a factor of O(d/ρ). The results demonstrate that the proposed algorithms can significantly reduce the trade-offs associated with replicability.
Implications
The findings underscore the necessity for replicable algorithms in machine learning, particularly in applications where consistent outcomes are critical, such as healthcare and scientific research. The proposed methods can enhance the reliability of decision-making processes by ensuring that algorithms yield stable results across different executions.
Graph-Theoretic Models for the Prediction of Molecular Measurements
Graph Learning
- The Mukwembi-Nyabadza model was evaluated on five benchmark datasets, confirming its limited transferability.
- A systematic enhancement framework significantly improved the model's predictive performance.
- Enhanced classical models matched or outperformed deep learning approaches in molecular property prediction.
- The proposed framework is computationally efficient and accessible for resource-limited researchers.
Read more
Graph-Theoretic Models for the Prediction of Molecular Measurements
Summary
This paper explores the effectiveness of graph-theoretic models for predicting molecular properties, specifically evaluating the Mukwembi-Nyabadza model, which utilizes external activity D(G) and internal activity ζ(G) indices. The study assesses this baseline model on five diverse datasets from MoleculeNet, including biological activity, lipophilicity, aqueous solubility, and hydration free energy. The baseline model achieved an average R² of 0.24, indicating limited transferability to larger datasets. To enhance performance, the authors proposed a systematic enhancement framework that incorporates Ridge regularization, additional graph descriptors, physico-chemical properties, ensemble learning, Lasso feature selection, and a hybrid approach combining topological indices with Morgan fingerprints. The enhanced models significantly improved performance, raising the average best R² to 0.79, with individual improvements ranging from 165% to 274%. Comparisons with a Graph Convolutional Network (GCN) showed that the enhanced classical models matched or outperformed deep learning methods across all datasets. The framework is efficient, requiring no GPU and training in under five minutes, making it accessible for researchers with limited resources.
Methodology
The study employed a systematic enhancement framework that progressively integrated various techniques, including Ridge regularization, additional graph descriptors, and ensemble learning methods. The baseline Mukwembi-Nyabadza model was evaluated on five datasets, and enhancements were made to improve its predictive accuracy.
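The pipeline's central step, ridge-regularized regression over graph-derived features, can be sketched in closed form (the synthetic descriptors below are generic stand-ins, not the Mukwembi-Nyabadza indices):

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X^T X + alpha * I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic "molecules": rows of graph descriptors, linear target + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))            # e.g. topological indices per molecule
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0, 0.1]) + rng.normal(0, 0.1, 200)
w = ridge_fit(X, y, alpha=0.5)
r2 = r_squared(y, X @ w)
```

The regularization term keeps the fit stable when descriptors are correlated, which is exactly the regime the enhancement framework targets; no GPU is involved, consistent with the paper's efficiency claim.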
Results
The baseline model achieved an average R² of 0.24, while the enhanced models raised the average best R² to 0.79, with statistically significant improvements (p < 0.001). The enhanced models outperformed deep learning methods, including a GCN, across all datasets.
Implications
The findings suggest that classical graph-theoretic models, when systematically enhanced, can achieve competitive performance in molecular property prediction without the high computational costs associated with deep learning methods. This has significant implications for drug discovery and computational chemistry, particularly in resource-limited settings.
Fast Bayesian equipment condition monitoring via simulation based inference: applications to heat exchanger health
Efficient ML
Theory
Time Series
- Introduces a fast Bayesian framework for equipment condition monitoring using Simulation-Based Inference.
- Achieves 82x faster inference times compared to traditional MCMC methods while maintaining diagnostic accuracy.
- Demonstrates applicability to complex failure modes in heat exchangers, particularly fouling and leakage.
- Establishes a scalable workflow for real-time monitoring of industrial systems.
Read more
Fast Bayesian equipment condition monitoring via simulation based inference: applications to heat exchanger health
Summary
This paper presents a novel framework for condition monitoring of industrial equipment, specifically focusing on heat exchangers. The authors address the challenge of inferring latent degradation parameters from indirect sensor measurements, which is crucial for accurate diagnostics in industrial settings. Traditional Bayesian methods, such as Markov Chain Monte Carlo (MCMC), are limited by their computational demands, making them impractical for real-time applications. To overcome this, the authors propose a Simulation-Based Inference (SBI) approach that utilizes amortized neural posterior estimation. This method allows for likelihood-free inference by training neural density estimators on simulated datasets, enabling a direct mapping from thermal-fluid observations to the posterior distribution of degradation parameters. The framework is benchmarked against MCMC across various synthetic scenarios, demonstrating that SBI achieves comparable diagnostic accuracy while significantly reducing inference time by a factor of 82. The results indicate that SBI is a scalable and efficient alternative for real-time fault diagnosis in complex engineering systems, establishing its potential for broader applications in predictive maintenance and condition monitoring.
Methodology
The authors employ Simulation-Based Inference (SBI) powered by amortized neural posterior estimation. They train neural density estimators on a simulated dataset to create a likelihood-free mapping from thermal-fluid observations to the posterior distribution of degradation parameters. A systematic comparison is made between the SBI approach and traditional MCMC sampling methods.
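The amortized workflow can be miniaturized as follows (a toy linear-Gaussian stand-in for the neural density estimator; the simulator and sensor model are invented): simulate (parameter, observation) pairs from the prior and simulator once offline, fit a regressor from observations back to parameters, then reuse it for near-instant inference on new measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, rng):
    """Stand-in thermal-fluid simulator: noisy 3-sensor readout of a
    scalar degradation parameter theta (e.g. a fouling factor)."""
    return theta * np.array([1.0, 0.5, 2.0]) + rng.normal(0, 0.1, 3)

# Offline phase: build a training set by sampling the prior and simulating.
thetas = rng.uniform(0, 1, 5000)
obs = np.stack([simulator(t, rng) for t in thetas])

# "Amortized posterior": here just a least-squares map obs -> theta
# (a real SBI pipeline trains a neural density estimator instead).
w, *_ = np.linalg.lstsq(obs, thetas, rcond=None)

# Online phase: inference on a new measurement is a single dot product,
# which is where the 82x speed-up over per-case MCMC comes from.
true_theta = 0.7
theta_hat = simulator(true_theta, rng) @ w
```

The expensive work (simulation and fitting) happens once; every subsequent diagnosis amortizes that cost, mirroring the paper's contrast with per-observation MCMC.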
Results
The SBI framework demonstrated comparable diagnostic accuracy to MCMC while achieving an 82-fold increase in inference speed. This efficiency allows for near-instantaneous posterior characterization, making it suitable for real-time applications in condition monitoring.
Implications
The proposed SBI framework can significantly enhance the efficiency of condition monitoring in industrial settings, particularly for systems where direct measurement of health parameters is not feasible. Its scalability and speed make it a viable option for predictive maintenance across various industrial processes, including legacy systems.
Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
Reinforcement Learning
Large Language Models
NLP
- Medium-frequency responses are a major source of spurious reward signals in TTRL.
- Group-relative advantage normalization amplifies these spurious signals.
- DDRL framework effectively mitigates spurious signals through sampling, debiasing, and consensus refinement.
- Extensive experiments show DDRL outperforms existing TTRL baselines significantly.
Read more
Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
Summary
This paper addresses the challenges of spurious signal amplification in Test-Time Reinforcement Learning (TTRL) for mathematical reasoning tasks. The authors identify that TTRL, which adapts models at inference time using pseudo-labeling, is susceptible to noise from ambiguous responses, particularly those with medium consistency. They observe that these spurious signals can be amplified through group-relative advantage estimation. To counteract this issue, the authors propose a novel framework called Debiased and Denoised test-time Reinforcement Learning (DDRL). DDRL employs a frequency-based sampling strategy to filter out ambiguous samples while ensuring a balanced representation of positive and negative examples. It also introduces a debiased advantage estimation method that mitigates the bias from group-relative optimization. Finally, DDRL includes a consensus-based off-policy refinement stage that utilizes a rejection-sampled dataset for stable model updates. The effectiveness of DDRL is demonstrated through experiments on various large language models across multiple mathematical reasoning benchmarks, showing significant performance improvements over existing TTRL methods.
Methodology
The authors developed DDRL, which includes three main components: a balanced confidence-aware sampling strategy to exclude ambiguous samples, a debiased advantage estimation method to eliminate bias from group-relative optimization, and a consensus-based off-policy refinement stage that utilizes a rejection-sampled dataset for stable updates.
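The frequency-based sampling step can be sketched as follows (the thresholds are illustrative, not the paper's): sample many answers per question, treat the majority answer's frequency as a confidence score, and keep only questions whose consensus is clearly high or clearly low, discarding the ambiguous medium band that drives spurious rewards.

```python
from collections import Counter

def consensus_filter(rollouts, low=0.3, high=0.7):
    """Return (pseudo_label, frequency, keep) for one question's sampled
    answers. Medium-frequency consensus (the spurious-signal regime) is
    dropped; confident majorities and clear dissensus are kept."""
    counts = Counter(rollouts)
    label, n = counts.most_common(1)[0]
    freq = n / len(rollouts)
    keep = freq >= high or freq <= low   # exclude the ambiguous middle band
    return label, freq, keep

confident = ["42"] * 8 + ["41", "43"]             # 0.8 majority -> keep
ambiguous = ["42"] * 5 + ["41"] * 3 + ["7"] * 2   # 0.5 majority -> drop
lab1, f1, keep1 = consensus_filter(confident)
lab2, f2, keep2 = consensus_filter(ambiguous)
```

Filtering before advantage estimation means the group-relative normalization never gets the chance to amplify the medium-frequency noise identified in the paper.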
Results
Experiments conducted on three large language models across multiple mathematical reasoning benchmarks revealed that DDRL achieved significant performance improvements, with relative gains of 15.3% on Qwen2.5-MATH-1.5B and 12.7% on LLaMA-3.1-8B-Instruct compared to existing TTRL methods.
Implications
The findings suggest that addressing spurious signals in reinforcement learning can lead to more reliable model adaptations in challenging reasoning tasks, potentially enhancing the performance of large language models in various applications.
PrismaDV: Automated Task-Aware Data Unit Test Generation
Theory
Efficient ML
- PrismaDV generates task-aware data unit tests by analyzing downstream task code and dataset profiles.
- The SIFTA framework optimizes prompt generation for improved task adaptation.
- PrismaDV outperforms existing task-agnostic and task-aware data validation frameworks.
- The system addresses common shortcomings in current data unit testing approaches.
Read more
PrismaDV: Automated Task-Aware Data Unit Test Generation
Summary
PrismaDV is introduced as a novel AI system designed to enhance data validation by generating task-aware data unit tests. Unlike existing frameworks that are task-agnostic, PrismaDV analyzes downstream task code alongside dataset profiles to identify data access patterns and infer implicit data assumptions. This allows for the creation of executable data unit tests tailored to specific tasks. Additionally, the authors propose a framework called Selective Informative Feedback for Task Adaptation (SIFTA), which optimizes prompts for the system based on the outcomes of data unit tests and downstream tasks. The evaluation of PrismaDV on two benchmarks covering 60 tasks across five datasets demonstrates its superiority over both task-agnostic and task-aware baselines in generating relevant unit tests. The results indicate that SIFTA can automatically learn effective prompts that outperform manually crafted or generic prompts. The authors also make their benchmarks and prototype implementation publicly available, contributing to the field of automated data validation.
Methodology
PrismaDV employs a compound AI system that integrates analysis of downstream task code with dataset profiling to identify data access patterns and generate specialized data unit tests. The SIFTA framework is utilized to optimize prompts based on execution outcomes, enhancing the adaptability of the generated tests over time.
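The idea of deriving a data unit test from a downstream access pattern can be sketched like this (the column names, pattern labels, and checks are invented for illustration): if task code is seen dividing by a column, the generated test asserts that column is non-zero; if it takes a square root, the test asserts non-negativity.

```python
def generate_unit_tests(access_patterns):
    """Map inferred data-access patterns to executable row-level checks,
    mimicking task-aware test generation."""
    checks = {}
    for column, usage in access_patterns.items():
        if usage == "divisor":
            checks[column] = lambda v: v != 0
        elif usage == "non_negative":
            checks[column] = lambda v: v >= 0
    return checks

def run_tests(rows, checks):
    """Return indices of rows violating any generated check."""
    return [i for i, row in enumerate(rows)
            if any(not check(row[col]) for col, check in checks.items())]

# Suppose code analysis found `total / row["count"]` and `sqrt(row["age"])`.
checks = generate_unit_tests({"count": "divisor", "age": "non_negative"})
rows = [{"count": 3, "age": 41}, {"count": 0, "age": 22}, {"count": 5, "age": -1}]
bad = run_tests(rows, checks)
```

A task-agnostic validator would have no reason to flag a zero in `count`; only knowledge of how the downstream task uses the column makes that row a defect, which is the gap PrismaDV targets.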
Results
PrismaDV was evaluated on two benchmarks consisting of 60 tasks across five datasets, consistently outperforming both task-agnostic and task-aware baselines in generating relevant unit tests. The SIFTA framework demonstrated the ability to learn effective prompts that surpassed manually created prompts, indicating significant advancements in automated data validation.
Implications
The development of PrismaDV has the potential to streamline data validation processes in enterprises, reducing the manual effort required for maintaining data unit tests and improving the reliability of downstream applications. Its task-aware approach could lead to more robust data handling practices and minimize the risks associated with data errors in production systems.
Rethinking Intrinsic Dimension Estimation in Neural Representations
Theory
- Commonly used ID estimators are biased and do not track true IDs in neural representations.
- The discrepancy between theory and practice in ID estimation is significant, particularly in high dimensions.
- The paper characterizes manifolds of LLM embeddings and hidden layer representations.
- Layer-wise ID patterns are influenced by various underlying factors, challenging previous interpretations.
Read more
Rethinking Intrinsic Dimension Estimation in Neural Representations
Summary
This paper critically examines the estimation of intrinsic dimensions (IDs) in neural representations, highlighting significant discrepancies between theoretical expectations and practical outcomes. The authors argue that common ID estimators are biased and do not accurately reflect the true underlying IDs of neural representations. Through both theoretical analysis and empirical investigation, they demonstrate that these estimators fail to track the actual IDs, particularly in high-dimensional spaces and when data lies on multiple manifolds. The paper also characterizes the manifolds of large language model (LLM) embeddings and hidden layer representations, revealing insights into the factors influencing layer-wise ID patterns. The authors propose a new perspective on ID estimation, challenging existing interpretations of ID-related results in the literature.
Methodology
The authors conducted a theoretical analysis of common ID estimators, demonstrating their biases in high-dimensional spaces. They also performed empirical investigations on neural representations across various datasets and architectures, analyzing the behavior of IDs in different layers of neural networks. The study included a characterization of manifolds associated with LLM embeddings and hidden layer representations.
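One estimator in the family the paper scrutinizes is TwoNN; a minimal version (our implementation of the standard method, not the authors' code) estimates the ID from the ratio of each point's first two nearest-neighbour distances:

```python
import numpy as np

def twonn_id(X):
    """TwoNN intrinsic-dimension estimate: ID = n / sum(log(r2 / r1)),
    where r1, r2 are each point's 1st and 2nd nearest-neighbour distances."""
    n = len(X)
    # Pairwise distances; mask self-distances with +inf.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    d.sort(axis=1)
    mu = d[:, 1] / d[:, 0]               # ratio of 2nd to 1st NN distance
    return n / np.sum(np.log(mu))

rng = np.random.default_rng(0)
# Data on a 2-D subspace embedded in 10-D ambient space: true ID = 2.
low_d = rng.normal(size=(500, 2))
X = low_d @ rng.normal(size=(2, 10))
id_hat = twonn_id(X)
```

On this clean single-manifold example the estimate sits near the true value of 2; the paper's point is that such estimators drift away from the truth in high dimensions and on data spread across multiple manifolds.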
Results
The findings reveal that common ID estimators do not accurately reflect the true intrinsic dimensions of neural representations. The authors provide evidence that these estimators are biased and fail to account for the complexity of data lying on multiple manifolds. Additionally, they uncover the driving forces behind observed layer-wise ID patterns, suggesting that previous interpretations of these patterns may be misleading.
Implications
The insights from this paper could lead to improved methodologies for estimating intrinsic dimensions in neural representations, enhancing the understanding of neural network behavior. This could have implications for model interpretability and the development of more effective neural architectures, particularly in high-dimensional data contexts.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
Computer Vision
Reinforcement Learning
Multimodal
- Introduction of a Propose-then-Critic framework for GUI grounding.
- Utilization of a co-evolutionary reinforcement learning strategy to balance prediction accuracy and diversity.
- Significant improvements in grounding accuracy and critic capabilities, with up to 17.2% enhancement.
- Dynamic maturity mechanism to adaptively guide the learning process.
Read more
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
Summary
This paper addresses the challenge of Graphical User Interface (GUI) grounding, which involves mapping natural language instructions to precise pixel coordinates on a screen. Traditional methods often struggle with localization due to visually homogeneous elements and dense layouts. The authors propose a novel Propose-then-Critic framework that replaces static self-consistency strategies with a learnable selection mechanism. This mechanism critiques the model's own proposals rendered on the screenshot, allowing for a more accurate selection of targets. The framework employs a co-evolving reinforcement learning paradigm that dynamically balances the training objectives of the proposer and critic. The proposer generates diverse candidates, while the critic evaluates and ranks these candidates, fostering mutual reinforcement. The proposed method significantly enhances grounding accuracy and critic reliability, achieving a relative improvement of up to 17.2% in grounding capability across multiple benchmarks. The paper emphasizes the importance of a maturity-aware mechanism that adapts the learning focus as the model matures, ensuring robust generalization across complex interface layouts.
Methodology
The authors developed a Propose-then-Critic framework that transforms GUI grounding into a Visual Perception Ranking paradigm. This involves generating diverse candidate outputs, visualizing them for feedback, and using a critic to rank and select the most accurate target. A co-evolutionary reinforcement learning strategy is employed, with a maturity-aware mechanism that adjusts the training focus based on the model's development.
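The Propose-then-Critic selection loop can be sketched with a toy (the distance-based critic below stands in for the learned visual critic, and all numbers are invented): the proposer samples several coordinate candidates, each is scored, and the top-ranked candidate becomes the click target.

```python
import random

def proposer(target, n_candidates=8, noise=30.0, rng=None):
    """Sample diverse click-coordinate candidates around the true target
    (stands in for a policy's sampled predictions)."""
    rng = rng or random.Random(0)
    return [(target[0] + rng.gauss(0, noise),
             target[1] + rng.gauss(0, noise)) for _ in range(n_candidates)]

def critic(candidate, target):
    """Score a candidate; here distance-based, standing in for a learned
    critique of the proposal rendered on the screenshot."""
    return -((candidate[0] - target[0]) ** 2 + (candidate[1] - target[1]) ** 2)

target = (120.0, 340.0)
candidates = proposer(target, rng=random.Random(7))
best = max(candidates, key=lambda c: critic(c, target))
single = candidates[0]                    # what a no-critic policy would click
```

By construction the critic-selected candidate is never worse than a single greedy prediction, which is the "measure twice, click once" advantage the framework trains into both components jointly.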
Results
The proposed method demonstrated a relative improvement of up to 17.2% in grounding capability across six benchmarks, significantly enhancing both the accuracy of generated candidates and the reliability of the critic's evaluations.
Implications
The findings suggest that the proposed framework can improve the performance of autonomous GUI agents, making them more effective in interpreting user instructions and executing commands in complex digital environments. This has potential applications in digital automation and user interface design.
Relative Entropy Estimation in Function Space: Theory and Applications to Trajectory Inference
Theory
Generative Models
Time Series
- Introduces a framework for estimating KL divergence in function space, addressing limitations of traditional snapshot-based evaluations.
- Validates the functional KL estimator against known analytic KL divergences, demonstrating robustness and accuracy.
- Shows that existing snapshot-based metrics can yield inconsistent rankings in trajectory inference methods.
- Establishes functional KL as a coherent criterion for evaluating trajectory inference, particularly under sparse data conditions.
Read more
Relative Entropy Estimation in Function Space: Theory and Applications to Trajectory Inference
Summary
This paper addresses the challenge of trajectory inference (TI) in single-cell genomics, where only snapshot data is available due to destructive measurements. The authors propose a novel framework for estimating the Kullback–Leibler (KL) divergence between probability measures in function space, leading to a scalable and data-driven estimator. The framework is validated against benchmark datasets, demonstrating that the functional KL divergence closely aligns with analytic KL divergence. The authors apply this method to both synthetic and real single-cell RNA sequencing (scRNA-seq) datasets, revealing inconsistencies in existing evaluation metrics and highlighting the advantages of using functional KL for coherent comparisons of trajectory inference methods. The results indicate that functional KL is a principled criterion for evaluating TI under conditions of partial observability, particularly in regions with sparse or missing data.
Methodology
The authors utilize Functional Flow Matching (FFM) to represent trajectory distributions through functional flows. They derive a tractable estimator for KL divergence based on absolute continuity and trace-class noise covariances, allowing for practical application on realistic datasets. The framework is validated through extensive evaluations on synthetic and real datasets, comparing the performance of various TI methods.
Results
The functional KL estimator closely matches analytic KL divergence in benchmark tests. Evaluations of trajectory inference methods reveal that traditional snapshot-based metrics can lead to inconsistent rankings, while the functional KL provides a coherent comparison that highlights discrepancies in inferred dynamics, especially in sparse data regions.
Implications
This work has significant implications for the field of trajectory inference, particularly in single-cell genomics, by providing a more reliable evaluation metric that can lead to better understanding and reconstruction of latent dynamical processes. It encourages the adoption of functional KL divergence in future studies and applications where trajectory data is incomplete or sparse.
Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
Optimization
Computer Vision
NLP
- GEM activation functions achieve ReLU-like performance with improved smoothness.
- Three variants of GEM are introduced: GEM, E-GEM, and SE-GEM, each with unique properties.
- N = 1 is optimal for standard-depth networks, while N = 2 is preferred for transformers.
- GEM surpasses GELU in specific benchmarks, marking a significant advancement in activation function design.
Read more
Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
Summary
This paper introduces a new family of activation functions called Geometric Monomial (GEM), which are designed to be C^(2N)-smooth and provide an alternative to commonly used activation functions like ReLU and GELU. The GEM functions are based on a log-logistic cumulative distribution function (CDF) and are constructed to maintain the advantages of ReLU while improving smoothness for better gradient-based optimization in deep neural networks. The author presents three variants: GEM (the base family), E-GEM (an ε-parameterized generalization), and SE-GEM (a piecewise variant that addresses dead neurons). An N-ablation study identifies N = 1 as optimal for standard-depth networks, significantly reducing the performance gap with GELU on CIFAR-100. The paper also explores the trade-offs between CNNs and transformers regarding the smoothness parameter N. The proposed activation functions are evaluated across various benchmarks, demonstrating competitive performance against existing alternatives.
Methodology
The paper employs a theoretical framework to define the GEM family of activation functions, focusing on their differentiability and performance characteristics. It includes an N-ablation study to determine the optimal smoothness parameter and conducts empirical evaluations across multiple benchmark datasets, comparing GEM with existing activation functions like ReLU, GELU, and Swish.
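One plausible realization consistent with the description above can be sketched as follows. Note the exact parameterization here is our assumption, inferred only from "rational", "log-logistic CDF", and "2N-differentiable": the identity is gated by x^(2N) / (1 + x^(2N)) on the positive half-line, so the output vanishes to order 2N+1 at zero while approaching ReLU for large inputs.

```python
import numpy as np

def gem(x, N=1):
    """Sketch of a GEM-style activation (guessed parameterization, not the
    paper's definition): for x > 0, gate the identity by the log-logistic
    CDF x^(2N) / (1 + x^(2N)); zero for x <= 0. The result is a rational
    function on x > 0 that meets zero smoothly, ReLU-like at large |x|."""
    x = np.asarray(x, dtype=float)
    pos = np.clip(x, 0.0, None)
    return pos ** (2 * N + 1) / (1.0 + pos ** (2 * N))

x = np.linspace(-5, 5, 11)
y = gem(x, N=1)   # e.g. gem(5) = 125/26, close to ReLU's 5
```

Under this form, larger N sharpens the knee toward ReLU while raising the order of differentiability at the origin, which matches the trade-off the N-ablation study explores.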
Results
GEM and its variants demonstrate superior performance in various tasks: E-GEM achieves a tie with the best baseline on MNIST, SE-GEM outperforms GELU on CIFAR-10, and E-GEM reduces the GELU deficit on CIFAR-100 significantly. On GPT-2, GEM achieves the lowest perplexity compared to GELU, and E-GEM achieves the best validation loss on BERT-small.
Implications
The introduction of GEM activation functions could lead to more efficient training of deep neural networks, particularly in applications requiring smooth optimization landscapes. This may enhance performance in both computer vision and natural language processing tasks, making GEM a valuable addition to the toolkit of machine learning practitioners.
The Path Not Taken: Duality in Reasoning about Program Execution
Large Language Models
- Current benchmarks for LLMs focus too narrowly on single execution paths, limiting their evaluation of program understanding.
- The proposed duality framework includes both forward and backward reasoning tasks to better assess LLMs' causal understanding of program execution.
- DEXBENCH, the new benchmark introduced, consists of 445 paired instances for comprehensive evaluation of LLMs.
- Results indicate that dual-path reasoning is a more reliable measure of LLMs' reasoning capabilities compared to traditional single-path evaluations.
Read more
The Path Not Taken: Duality in Reasoning about Program Execution
Summary
This paper addresses the limitations of current benchmarks for evaluating large language models (LLMs) in understanding program execution. Existing benchmarks primarily focus on predicting program properties based on specific inputs, which can lead to a narrow view of dynamic code reasoning and potential data contamination. The authors propose a duality in reasoning about program execution, consisting of two complementary tasks: predicting a program's observed behavior for a given input (forward reasoning) and inferring how the input must be mutated to achieve a specific behavioral objective (backward reasoning). They introduce DEXBENCH, a benchmark with 445 paired instances designed to evaluate these dual reasoning tasks. The evaluation of 13 LLMs reveals that dual-path reasoning serves as a robust proxy for dynamic code understanding, highlighting that strong performance in isolated tasks does not guarantee success in joint evaluations. The findings suggest that a deeper causal understanding of execution flow is necessary for effective program reasoning.
Methodology
The authors developed DEXBENCH, a dual-path benchmark that includes both forward reasoning (predicting observed behavior) and backward reasoning (inferring necessary input mutations). They evaluated 13 LLMs using this benchmark, which consists of real-world programs extracted from existing datasets. The evaluation focused on the models' ability to understand causal relationships in program execution.
Results
The evaluation revealed that models performing well on execution or counterfactual reasoning in isolation did not necessarily excel when the two tasks were evaluated jointly, underscoring the limitations of single-path evaluation benchmarks. The dual-path reasoning approach provided a more reliable proxy for assessing LLMs' causal, state-aware reasoning about execution flow.
Implications
The findings suggest that improving LLMs' understanding of program execution requires a shift towards evaluating their reasoning capabilities in a more nuanced manner. This could lead to better performance in software engineering tasks, such as code generation, debugging, and testing.
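The forward/backward duality described above can be illustrated with a toy example. The program and the mutation search below are invented for illustration (DEXBENCH's actual instances are real-world programs): forward reasoning predicts the program's output for a given input, while backward reasoning infers how the input must be mutated to produce a target behavior.

```python
def program(x):
    """Toy program under analysis."""
    return "even" if x % 2 == 0 else "odd"

# Forward reasoning: predict the observed behavior for a given input.
forward_prediction = program(6)  # -> "even"

# Backward reasoning: infer how the input must be mutated so the
# program exhibits a target behavior ("odd" instead of "even").
def backward_search(base_input, target, deltas=range(-3, 4)):
    for d in deltas:
        if program(base_input + d) == target:
            return base_input + d
    return None

mutated_input = backward_search(6, "odd")  # -> 3 (first mutation that works)
```

A model that answers the forward query correctly but cannot solve the backward query has memorized input-output pairs rather than understood the program's causal structure, which is exactly the gap the joint evaluation exposes.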
R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling
NLP
Large Language Models
Reinforcement Learning
- R2IF introduces a hybrid reward design that optimizes both reasoning quality and function-call correctness.
- The Chain-of-Thought Effectiveness Reward (CER) enhances tool-call stability without relying on subjective evaluations.
- The Specification-Modification-Value (SMV) reward explicitly supervises parameter constraints and transformations.
- R2IF shows significant performance improvements across multiple benchmarks, indicating its robustness and scalability.
Read more
R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling
Summary
The paper presents R2IF, a novel reasoning-aware reinforcement learning (RL) framework designed to enhance the interpretability and accuracy of function calling in large language models (LLMs). Traditional RL approaches often face challenges due to misalignment between reasoning processes and tool-call decisions, leading to unreliable interactions with external tools. R2IF addresses this issue by introducing a composite reward system that integrates format/correctness constraints, a Chain-of-Thought Effectiveness Reward (CER), and a Specification-Modification-Value (SMV) reward. This framework is optimized using Group Relative Policy Optimization (GRPO) to ensure that reasoning aligns closely with the requirements of tool specifications. Experimental results on BFCL and ACEBench benchmarks demonstrate that R2IF significantly outperforms existing methods, achieving up to a 34.62% improvement in function-calling accuracy while maintaining high interpretability. The findings indicate that R2IF not only enhances the stability of tool calls but also improves the overall reasoning quality of LLMs, making it a promising approach for reliable tool-augmented deployments.
Methodology
R2IF employs a composite reward system that combines format/correctness constraints, a distribution-based Chain-of-Thought Effectiveness Reward (CER), and a Specification-Modification-Value (SMV) reward. This system is optimized through Group Relative Policy Optimization (GRPO), providing trajectory-level supervision to enhance the alignment between reasoning and tool-call decisions.
Results
R2IF achieved up to a 34.62% improvement in function-calling accuracy on the BFCL benchmark using the Llama3.2-3B model. It also demonstrated a positive Average CoT Effectiveness score of 0.05 for the same model, outperforming baseline methods. The framework maintained over 96% accuracy in rejecting irrelevant tool calls and showed balanced improvements across various task categories.
Implications
The R2IF framework has significant implications for the deployment of LLMs in real-world applications, particularly in scenarios requiring reliable interactions with external tools. Its focus on aligning reasoning with tool specifications can enhance user trust and system robustness, making it suitable for complex task-solving environments.
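The composite reward can be sketched as a weighted sum of its three components. The weights, function signatures, and scoring stubs below are illustrative assumptions only (the paper's CER is distribution-based and its SMV supervises parameter constraints and transformations; the placeholders here merely show how the signals combine):

```python
def format_reward(call):
    """1.0 if the tool call is a well-formed dict with name and args, else 0."""
    return 1.0 if isinstance(call, dict) and "name" in call and "args" in call else 0.0

def cot_effectiveness_reward(cot, call):
    """Stand-in for CER: reward reasoning that references the chosen tool.
    (The paper's CER is distribution-based; this keyword check is a placeholder.)"""
    return 1.0 if call.get("name", "") in cot else 0.0

def smv_reward(call, spec):
    """Stand-in for SMV: fraction of required parameters the call supplies."""
    required = spec.get("required", [])
    if not required:
        return 1.0
    return sum(p in call.get("args", {}) for p in required) / len(required)

def composite_reward(cot, call, spec, w=(0.4, 0.3, 0.3)):
    """Combine the three signals; the weights are illustrative, not the paper's."""
    return (w[0] * format_reward(call)
            + w[1] * cot_effectiveness_reward(cot, call)
            + w[2] * smv_reward(call, spec))
```

A fully correct call with reasoning that names the tool scores 1.0, while a malformed call whose reasoning still mentions the tool earns only the CER portion, giving trajectory-level gradient signal on both the reasoning and the decision.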
Differentially Private Clustered Federated Learning with Privacy-Preserving Initialization and Normality-Driven Aggregation
Federated Learning
- PINA addresses both data heterogeneity and privacy in federated learning without requiring privileged server data or random restarts.
- The framework utilizes privacy-preserving sketches of client updates for accurate cluster prototype initialization.
- A normality-driven aggregation mechanism is introduced to improve robustness against imbalanced client participation.
- PINA consistently outperforms existing differential privacy federated learning methods in terms of accuracy.
Read more
Differentially Private Clustered Federated Learning with Privacy-Preserving Initialization and Normality-Driven Aggregation
Summary
This paper presents PINA, a novel framework for clustered federated learning (CFL) that integrates differential privacy (DP) while addressing the challenges of data heterogeneity in federated learning environments. Traditional federated learning can leak user information through model updates, and while differential privacy can mitigate this risk, it often results in noisy updates that hinder model performance, especially in non-IID data scenarios. PINA introduces a two-stage approach: first, it allows clients to fine-tune a low-rank adaptation (LoRA) adapter and share a compressed sketch of their updates, which the server uses to create robust cluster centroids. In the second stage, PINA employs a normality-driven aggregation mechanism that enhances convergence and robustness against imbalanced client contributions. The proposed method retains the advantages of clustered federated learning while ensuring formal privacy guarantees, making it suitable for practical applications. Extensive evaluations demonstrate that PINA outperforms existing state-of-the-art DP-FL algorithms by an average of 2.9% in accuracy across various privacy budgets, particularly in realistic non-IID settings.
Methodology
PINA employs a two-stage framework where the first stage involves clients fine-tuning a lightweight LoRA adapter and sharing compressed sketches of their updates. The server uses these sketches to initialize robust cluster centroids. In the second stage, a normality-driven aggregation mechanism is applied to enhance the convergence and robustness of the model against imbalanced contributions from clients.
Results
Extensive evaluations indicate that PINA outperforms state-of-the-art DP-FL algorithms by an average accuracy improvement of 2.9% across various privacy budgets (ε ∈ {2, 8}), particularly in realistic non-IID data scenarios.
Implications
The advancements presented in PINA could facilitate the practical adoption of federated learning in sensitive applications where data privacy is paramount, such as healthcare and finance, by providing robust privacy guarantees while maintaining model performance.
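The data flow of PINA's first stage (clients share compressed sketches of their updates; the server clusters the sketches into centroids) can be sketched as follows. The random-projection compression, the sketch dimension, and the plain k-means clustering are illustrative assumptions about how such a pipeline might look, not PINA's actual mechanism, and the DP noise PINA adds to the sketches is omitted:

```python
import random

def sketch_update(update, proj):
    """Client side: compress a model-update vector with a shared random
    projection. (PINA would additionally add DP noise here.)"""
    return [sum(p * u for p, u in zip(row, update)) for row in proj]

def kmeans(points, k, iters=10):
    """Server side: group the client sketches into k cluster centroids."""
    centroids = points[:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for pt in points:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(pt, centroids[c])))
            groups[idx].append(pt)
        centroids = [[sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
                     for j, g in enumerate(groups)]
    return centroids

# Two synthetic client populations whose updates point in opposite directions.
random.seed(0)
dim, sketch_dim = 8, 3
proj = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(sketch_dim)]
clients = [[1.0] * dim] * 5 + [[-1.0] * dim] * 5
sketches = [sketch_update(u, proj) for u in clients]
centroids = kmeans(sketches, k=2)  # one centroid per client population
```

Because clustering runs on low-dimensional sketches rather than full model updates, the server can initialize robust centroids without ever seeing a client's raw update, which is the property PINA's privacy-preserving initialization relies on.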