AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
67 papers today · updated every 8 hours · 7 days of history
Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis
Reinforcement Learning
Large Language Models
Optimization
- Introduces the Four Quadrant Decomposition framework for analyzing token updates in RLVR.
- Establishes a theoretical upper bound on token credit based on entropy using Conditional Mutual Information.
- Demonstrates that reasoning improvements are primarily driven by high-entropy tokens.
- Proposes Entropy-Aware Policy Optimization (EAPO) to optimize token-level learning signals.
Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis
Summary
This paper addresses the credit assignment problem in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs), where sparse outcome-based rewards lead to challenges in determining which tokens in a sequence contribute to the final outcome. The authors introduce a diagnostic framework called the Four Quadrant Decomposition, which analyzes token updates based on reward polarity (positive or negative) and token entropy (predictive uncertainty). Through this framework, they establish that reasoning improvements are concentrated in high-entropy quadrants, where tokens carry more credit. The authors adapt Conditional Mutual Information (CMI) to the RLVR context, proving that the credit a token can carry is upper-bounded by its entropy. This theoretical foundation leads to the development of Entropy-Aware Policy Optimization (EAPO), which modulates token-level learning signals according to their entropy. Extensive experiments show that EAPO outperforms existing baselines across various benchmarks, demonstrating the effectiveness of their approach in enhancing reasoning capabilities in LLMs.
Methodology
The authors employ the Four Quadrant Decomposition to isolate token updates by reward polarity and entropy. They adapt Conditional Mutual Information to formalize the relationship between tokens and rewards, establishing theoretical predictions about credit assignment. They also conduct gradient analysis to understand the dynamics of credit allocation in existing methods like GRPO. Finally, they implement EAPO, which adjusts learning signals based on token entropy.
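The entropy-modulated signal at the heart of EAPO can be sketched in a few lines. This is a toy illustration under assumed conventions: the function names, the max-normalization, and the uniform sequence-level advantage are ours, not the authors' implementation.

```python
import numpy as np

def token_entropy(probs):
    # Shannon entropy of each token's next-token distribution
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)

def entropy_weighted_advantages(probs, advantages):
    # Scale each token's advantage by its normalized entropy, so
    # high-uncertainty tokens receive the larger learning signal.
    h = token_entropy(probs)             # (seq_len,)
    w = h / (h.max() + 1e-12)            # normalize to [0, 1]
    return w * advantages

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 8))         # 5 tokens, toy vocab of 8
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
adv = np.ones(5)                         # one sequence-level reward, copied per token
weighted = entropy_weighted_advantages(probs, adv)
print(weighted)
```

Tokens whose next-token distribution is nearly deterministic end up with advantages scaled toward zero, mirroring the paper's claim that useful credit concentrates at high-entropy positions.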
Results
The experiments demonstrate that EAPO significantly outperforms strong baselines in mathematical reasoning and out-of-domain benchmarks. The findings confirm that useful credit is concentrated at high-entropy positions, with positive updates enhancing exploration and generalization, while negative updates help prune erroneous branches.
Implications
The insights from this research could lead to more effective training strategies for LLMs, particularly in applications requiring enhanced reasoning and decision-making capabilities. The proposed methods may also be applicable to other areas of reinforcement learning and optimization, where credit assignment is a critical challenge.
Physics-Informed State Space Models for Reliable Solar Irradiance Forecasting in Off-Grid Systems
Time Series
Efficient ML
Theory
- Introduction of the Physics-Informed State Space Model (PISSM) for solar irradiance forecasting.
- Utilization of dynamic Hankel matrix embedding to filter noise from meteorological data.
- Replacement of heavy RNNs and attention mechanisms with a Linear State Space Model for efficiency.
- Implementation of a Physics-Informed Gating mechanism to ensure predictions adhere to physical laws.
Physics-Informed State Space Models for Reliable Solar Irradiance Forecasting in Off-Grid Systems
Summary
This paper addresses the need for accurate and computationally efficient solar irradiance forecasting in off-grid photovoltaic irrigation systems, particularly in semi-arid regions like Omdurman, Sudan. Traditional deep learning methods, such as RNNs and Transformers, often struggle with computational overhead and fail to incorporate atmospheric physics, leading to unrealistic predictions. The author introduces the Physics-Informed State Space Model (PISSM), which utilizes a dynamic Hankel matrix embedding to convert raw meteorological data into a structured state space, effectively filtering out noise. The model replaces complex attention mechanisms with a Linear State Space Model to efficiently capture long-range dependencies through continuous differential equations. A novel Physics-Informed Gating mechanism is also proposed, which uses deterministic variables like the Solar Zenith Angle and Clearness Index to ensure that predictions remain within physical limits. Evaluated on a multi-year NASA POWER dataset, PISSM demonstrates superior long-term memory and physical accuracy while maintaining a lightweight architecture with fewer than 40,000 trainable parameters, setting a new standard for real-time control in resource-constrained microgrids.
Methodology
The methodology involves transforming raw meteorological sequences into a robust state space using a dynamic Hankel matrix embedding. A Linear State Space Model is employed to model temporal dependencies, while a Physics-Informed Gating mechanism incorporates deterministic astronomical variables to constrain outputs within physical limits.
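The Hankel embedding step can be shown directly. The noise filtering below uses truncated SVD, a common choice for Hankel matrices, as an assumption — the paper's exact filtering scheme is not detailed above.

```python
import numpy as np

def hankel_embed(series, window):
    # Delay embedding: row t holds series[t : t + window].
    n = len(series) - window + 1
    return np.stack([series[i:i + window] for i in range(n)])

def low_rank_denoise(H, rank):
    # Truncated SVD keeps the dominant dynamics, discarding noise.
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

x = np.arange(6.0)               # toy "irradiance" sequence
H = hankel_embed(x, window=3)    # anti-diagonals of H are constant
H_clean = low_rank_denoise(H, rank=2)
```

A linear ramp yields a rank-2 Hankel matrix, so the rank-2 truncation here reconstructs it exactly; on noisy meteorological data the discarded components would carry the noise.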
Results
The PISSM architecture shows improved long-term memory and physical accuracy in solar irradiance forecasting compared to traditional models, with a significantly reduced parameter count of under 40,000, making it suitable for edge deployment in off-grid systems.
Implications
The findings suggest that PISSM can enhance the reliability of solar energy management in off-grid systems, potentially improving agricultural practices in semi-arid regions by providing accurate energy forecasts for irrigation systems.
Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
Reinforcement Learning
Large Language Models
Generative Models
- Identifies limitations of traditional discriminative critics in LLM reinforcement learning.
- Introduces Generative Actor-Critic (GenAC) to enhance value modeling through chain-of-thought reasoning.
- Implements In-Context Conditioning for better alignment between critic and actor during training.
- Demonstrates superior performance in mathematical reasoning benchmarks compared to existing methods.
Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
Summary
This paper addresses the challenge of credit assignment in reinforcement learning (RL) for large language models (LLMs), which has been hindered by the limitations of traditional discriminative critics. The authors argue that the difficulties in training value models stem from their limited expressiveness under the one-shot prediction paradigm. To overcome this, they propose a novel approach called Generative Actor-Critic (GenAC), which utilizes a generative critic that engages in chain-of-thought reasoning before estimating values. Additionally, they introduce In-Context Conditioning to ensure the critic remains aligned with the actor throughout training. The proposed method demonstrates improved value approximation, ranking reliability, and out-of-distribution generalization, leading to enhanced performance in downstream RL tasks compared to both value-based and value-free baselines. The findings suggest that stronger value modeling can significantly enhance credit assignment in LLM reinforcement learning, paving the way for more effective training signals in complex applications.
Methodology
The authors analyze the limitations of standard discriminative critics using theoretical and empirical perspectives, proposing GenAC as a solution. GenAC replaces traditional one-shot scalar value predictions with a generative critic that performs reasoning before producing value estimates. In-Context Conditioning is introduced to maintain calibration between the critic and the actor throughout the training process.
Results
GenAC achieves superior sample efficiency and continues to improve in training regimes where baseline methods plateau. It provides more reliable action rankings, better out-of-distribution generalization, and interpretable credit assignments consistent with reasoning quality, outperforming both value-based and value-free methods in mathematical reasoning benchmarks.
Implications
The findings indicate that enhancing value modeling in LLM reinforcement learning can lead to more effective training signals, which is crucial for applications requiring complex reasoning and longer context interactions, such as multi-turn conversations and agentic systems.
INCRT: An Incremental Transformer That Determines Its Own Architecture
NLP
Theory
Efficient ML
- INCRT dynamically adjusts its architecture during training, addressing structural redundancy in Transformers.
- The model starts with a single attention head and adds or prunes heads based on real-time performance metrics.
- Two theorems underpin the architecture's design, ensuring minimal and sufficient configurations.
- Experimental validation shows INCRT can outperform BERT-base on specific tasks with fewer parameters.
INCRT: An Incremental Transformer That Determines Its Own Architecture
Summary
This paper presents INCRT (Incremental Transformer), a novel architecture that dynamically adjusts its structure during training to optimize performance and reduce redundancy. Traditional Transformer models suffer from structural redundancy, where a significant number of attention heads can be pruned without loss of performance, due to fixed hyperparameters set before training. INCRT addresses this issue by starting with a single attention head and incrementally adding heads as needed based on a geometric quantity derived from the task's directional structure. The architecture is guided by two key theorems: homeostatic convergence, which ensures that the model reaches a minimal and sufficient configuration, and a compressed-sensing analogy that bounds the number of heads based on the task's spectral complexity. Experimental results on tasks such as SARS-CoV-2 variant classification and SST-2 sentiment analysis demonstrate that INCRT achieves comparable or superior performance to BERT-base while utilizing significantly fewer parameters and without requiring pre-training. This approach not only enhances efficiency but also aligns the model's architecture more closely with the specific demands of the task.
Methodology
INCRT employs an incremental approach to architecture design, starting with one attention head and adding heads based on a geometric criterion derived from the task's requirements. The growth and pruning decisions are made in real-time during training, guided by two theoretical theorems that ensure the model reaches an optimal configuration.
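As a schematic stand-in for INCRT's growth and pruning decisions (the paper's actual criterion is geometric, derived from the task's directional structure, and is not reproduced here), a minimal controller might look like:

```python
def adapt_head_count(heads, loss_improvement, head_utilization,
                     min_heads=1, max_heads=12,
                     grow_tol=1e-3, prune_tol=0.05):
    # Grow when validation loss has plateaued (capacity seems insufficient);
    # prune when the least-used head contributes almost nothing.
    if loss_improvement < grow_tol and heads < max_heads:
        return heads + 1
    if min(head_utilization) < prune_tol and heads > min_heads:
        return heads - 1
    return heads

print(adapt_head_count(1, 1e-4, [0.5]))                  # plateau -> grow
print(adapt_head_count(4, 0.1, [0.01, 0.5, 0.5, 0.5]))   # idle head -> prune
```

Both signals and the thresholds are invented for illustration; only the grow-when-underfitting, prune-when-idle pattern mirrors the paper's incremental design.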
Results
The experiments conducted on SARS-CoV-2 variant classification and SST-2 sentiment analysis showed that the predicted head counts from INCRT were within 12% of the observed counts. The final architectures achieved performance levels that matched or exceeded BERT-base while using three to seven times fewer parameters and without the need for pre-training.
Implications
The findings suggest that INCRT could lead to more efficient Transformer models that are better tailored to specific tasks, reducing computational costs and improving performance. This approach may also influence future research in neural architecture design and optimization.
Vestibular reservoir computing
Time Series
Efficient ML
Theory
- Introduction of an uncoupled reservoir topology for reservoir computing.
- Derivation of a memory capacity formula for linear reservoirs.
- Demonstration of performance equivalence between uncoupled and fully coupled networks.
- Exploration of the effects of reservoir size on predictive performance.
Vestibular reservoir computing
Summary
This paper introduces a novel physical reservoir computing (RC) scheme inspired by the biological vestibular system, aiming to simplify the hardware complexity associated with traditional reservoir architectures. The authors propose an uncoupled topology for the reservoir that achieves performance comparable to fully coupled networks. They derive a memory capacity formula for linear reservoirs, identifying conditions under which both uncoupled and coupled configurations exhibit equivalent memory. The study also explores the impact of reservoir size on predictive statistics and memory capacity, demonstrating that uncoupled reservoirs can provide a mathematically sound and practically feasible approach for efficient physical reservoir computing. The findings suggest that this new architecture can effectively facilitate the implementation of RC in physical systems, enhancing the potential for real-world applications.
Methodology
The authors theoretically analyze the differences between coupled and uncoupled reservoir topologies, deriving a memory capacity formula for linear reservoirs. They also systematically examine how reservoir size influences predictive statistics and memory capacity, validating their findings through analytical results that extend approximately to nonlinear reservoir systems.
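An uncoupled linear reservoir and the delay-reconstruction notion of memory capacity can be demonstrated numerically. This sketch uses the standard definition of memory capacity (summed R² over delays) with invented sizes and decay rates; it is not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 50, 5000
a = rng.uniform(0.2, 0.9, size=N)    # independent per-node decay: no coupling
w_in = rng.normal(size=N)
u = rng.normal(size=T)

# Uncoupled linear reservoir: x_i(t) = a_i * x_i(t-1) + w_i * u(t)
X = np.zeros((T, N))
x = np.zeros(N)
for t in range(T):
    x = a * x + w_in * u[t]
    X[t] = x

def memory_capacity(X, u, max_delay=20, washout=100):
    # MC = sum over delays k of R^2 between u(t - k) and its best
    # linear readout from the state x(t).
    T = len(u)
    mc = 0.0
    for k in range(1, max_delay + 1):
        states = X[washout:T]
        target = u[washout - k:T - k]
        coef, *_ = np.linalg.lstsq(states, target, rcond=None)
        pred = states @ coef
        mc += np.corrcoef(pred, target)[0, 1] ** 2
    return mc

mc = memory_capacity(X, u)
print(mc)
```

Because each node applies a distinct geometric filter to the input, recent inputs remain linearly recoverable from the state even with zero coupling, which is the intuition behind the uncoupled topology's equivalence result.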
Results
The study finds that uncoupled reservoir architectures can achieve performance levels similar to fully coupled networks while simplifying hardware requirements. The derived memory capacity formula indicates specific conditions under which both configurations yield equivalent memory, and the analysis shows that larger reservoir sizes enhance predictive capabilities.
Implications
The proposed uncoupled reservoir computing framework offers a promising pathway for efficient physical implementations of reservoir computing, potentially leading to advancements in analog and hybrid computational technologies. This could have significant implications for applications in nonlinear time-series forecasting and the development of digital twins for complex dynamical systems.
Are Independently Estimated View Uncertainties Comparable? Unified Routing for Trusted Multi-View Classification
Multimodal
- Identifies the fragility of the assumption that independently estimated view uncertainties are comparable.
- Proposes TMUR, which decouples evidence extraction from fusion arbitration to improve multi-view classification.
- Employs a unified router to generate sample-level expert weights based on global context.
- Demonstrates through experiments that TMUR consistently outperforms existing methods in classification performance and reliability.
Are Independently Estimated View Uncertainties Comparable? Unified Routing for Trusted Multi-View Classification
Summary
This paper addresses the challenges in trusted multi-view classification, where multiple views independently generate class evidence and uncertainty for predictions. The authors highlight a critical assumption in existing methods: that evidence from different views is numerically comparable. They argue that this assumption is often violated due to differences in feature spaces, noise levels, and semantic granularity across views. To tackle this issue, the authors propose a novel framework called Trusted Multi-view learning with Unified Routing (TMUR). TMUR separates the processes of evidence extraction and fusion arbitration, utilizing view-specific experts alongside a global expert to generate sample-level weights based on the overall multi-view context. This approach mitigates biases introduced by scale differences in evidence across views. The paper includes theoretical analyses supporting the need for global routing over branch-local uncertainty and presents extensive experiments across 14 datasets, demonstrating that TMUR significantly enhances both classification performance and reliability compared to 15 recent baselines.
Methodology
The authors developed TMUR, which consists of view-private experts and a collaborative expert. A unified router is employed to observe the global multi-view context and generate sample-level weights for the experts. The methodology emphasizes soft load-balancing and diversity regularization to enhance expert utilization and specialization.
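A minimal version of the routing idea — per-view class evidence fused with sample-level weights from a router that sees the global multi-view context rather than any single view — might look like this. The linear router and all shapes are illustrative assumptions; TMUR's experts and regularizers are not reproduced.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route_and_fuse(view_logits, router_w):
    # view_logits: (batch, n_views, n_classes) evidence from view experts.
    # The router observes the concatenated multi-view context and emits
    # sample-level weights over the views.
    b, v, c = view_logits.shape
    context = view_logits.reshape(b, v * c)
    weights = softmax(context @ router_w)        # (batch, n_views)
    fused = np.einsum('bv,bvc->bc', weights, view_logits)
    return fused, weights

rng = np.random.default_rng(0)
vl = rng.normal(size=(4, 3, 5))      # 4 samples, 3 views, 5 classes
w = 0.1 * rng.normal(size=(15, 3))   # toy linear router
fused, weights = route_and_fuse(vl, w)
```

Since the fused evidence is a convex combination over views, no single view's scale can dominate unless the router, seeing the whole context, decides it should — the bias the paper attributes to branch-local uncertainty.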
Results
The extensive experiments conducted on 14 datasets showed that TMUR outperformed 15 recent baselines in terms of both classification accuracy and reliability. The results confirmed the effectiveness of the proposed unified routing approach in addressing the issues of evidence scale incomparability.
Implications
The findings suggest that TMUR can be applied in various domains requiring multi-view classification, such as image and text analysis, where different modalities may provide complementary information. The framework's ability to dynamically assess view reliability could enhance decision-making processes in real-world applications.
A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning
Reinforcement Learning
Large Language Models
Theory
- Traditional entropy regularization can lead to suboptimal policies due to persistent bias.
- Covariance-based methods selectively regularize high-covariance tokens, achieving better performance.
- The paper establishes a unified framework for understanding entropy dynamics in RL.
- Covariance-based methods maintain stability margins, crucial for reasoning tasks.
A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning
Summary
This paper presents a theoretical analysis of entropy control methods in reinforcement learning (RL) for enhancing reasoning in large language models (LLMs). The authors identify the challenge of rapid policy entropy collapse during training, which leads to premature convergence and performance saturation. They compare two entropy control strategies: traditional entropy regularization and a novel covariance-based mechanism. By establishing a unified framework for entropy dynamics under softmax parameterization, the paper derives expressions for entropy change based on the covariance between log-probabilities and logit updates. The analysis reveals that traditional methods introduce persistent bias, resulting in suboptimal policies, while covariance-based methods selectively regularize high-covariance tokens, achieving asymptotic unbiasedness when the regularization coefficient is adjusted. The covariance-based approach also maintains the stability margin of the base policy gradient, offering significant advantages for reasoning tasks. The findings provide theoretical guidelines for entropy control in LLM post-training, with implications for scaling RL to larger models and complex reasoning tasks.
Methodology
The authors develop a unified mathematical framework for analyzing entropy dynamics under softmax policy parameterization. They derive exact expressions for entropy change based on the covariance between log-probabilities and logit updates. The paper compares the structural, convergence, and stability properties of traditional and covariance-based entropy control methods through theoretical proofs and empirical validation.
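The covariance relationship is easy to check numerically: to first order, a logit update δ changes the entropy of a softmax policy by −Cov_{a∼π}(log π(a), δ(a)). A quick finite-difference verification of that first-order identity (a simplified, single-distribution version of the kind of expression the paper derives):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
z = rng.normal(size=10)                 # logits of a softmax policy
delta = 1e-6 * rng.normal(size=10)      # a small logit update

p = softmax(z)
# First-order identity: dH = -Cov_{a ~ pi}(log pi(a), delta(a))
cov = np.sum(p * np.log(p) * delta) - np.sum(p * np.log(p)) * np.sum(p * delta)
predicted = -cov
actual = entropy(softmax(z + delta)) - entropy(p)
print(predicted, actual)
```

Updates positively correlated with log-probability (reinforcing already-likely tokens) shrink entropy, which is why selectively regularizing high-covariance tokens can arrest entropy collapse.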
Results
The analysis confirms that covariance-based entropy control methods outperform traditional regularization by achieving asymptotic unbiasedness and preserving stability margins. Empirical results validate the theoretical predictions, demonstrating that covariance-based methods significantly mitigate entropy collapse and enhance downstream performance in reasoning tasks.
Implications
The findings suggest that adopting covariance-based entropy control can improve the scalability and effectiveness of reinforcement learning in large language models, particularly for complex reasoning tasks. This could lead to better performance in applications requiring sophisticated reasoning capabilities, such as mathematical problem-solving and coding tasks.
Rethinking the Diffusion Model from a Langevin Perspective
Generative Models
Theory
Optimization
- Introduces a Langevin perspective to simplify the understanding of diffusion models.
- Unifies ODE-based and SDE-based diffusion models under a single framework.
- Demonstrates the theoretical superiority of diffusion models compared to ordinary VAEs.
- Clarifies the equivalence of flow matching, denoising, and score matching under maximum likelihood.
Rethinking the Diffusion Model from a Langevin Perspective
Summary
This paper presents a novel perspective on diffusion models through the lens of Langevin dynamics, aiming to simplify the understanding of these models for both beginners and experienced researchers. The authors systematically organize the theory of diffusion models, addressing key questions about the relationship between forward and reverse processes, the unification of ODE-based and SDE-based models, and the comparative advantages of diffusion models over traditional variational autoencoders (VAEs). By framing diffusion processes as operations that can be split into noising and denoising phases, the paper elucidates the equivalence of various modeling approaches, including flow matching and score matching, under maximum likelihood. The Langevin perspective is shown to provide clear answers to complex theoretical questions, enhancing pedagogical value and bridging existing interpretations of diffusion models.
Methodology
The authors utilize stochastic differential equations (SDEs) to derive the forward and reverse processes of diffusion models, framing Langevin dynamics as an identity operation on distributions. This approach allows for a straightforward derivation of the reverse process and facilitates the conversion between different model types.
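The Langevin step the paper builds its perspective on is compact enough to show directly. For a standard normal target, unadjusted Langevin dynamics approximately preserves the target and pulls arbitrary initializations toward it (step size and chain counts here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    # Score of a standard normal target: grad log p(x) = -x
    return -x

# Unadjusted Langevin dynamics: x <- x + eps * score(x) + sqrt(2*eps) * noise
eps, steps, chains = 0.01, 2000, 5000
x = rng.normal(loc=5.0, size=chains)    # start every chain far from the target
for _ in range(steps):
    x = x + eps * score(x) + np.sqrt(2 * eps) * rng.normal(size=chains)

print(x.mean(), x.std())                # both near the target's 0 and 1
```

Run on samples already drawn from the target, the same update acts (to discretization error) as an identity on the distribution — the framing the paper uses to split diffusion into noising and denoising phases.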
Results
The paper successfully demonstrates that the Langevin perspective offers a clearer understanding of diffusion models, bridging various interpretations and showing how different formulations can be interrelated. It also establishes that flow matching is not fundamentally simpler than denoising or score matching, but rather equivalent under maximum likelihood.
Implications
The findings suggest that the Langevin perspective can serve as a foundational framework for teaching and researching diffusion models, potentially leading to improved methodologies in generative modeling and enhancing the accessibility of complex theoretical concepts.
Silhouette Loss: Differentiable Global Structure Learning for Deep Representations
Computer Vision
Optimization
Theory
- Introduction of Soft Silhouette Loss as a differentiable objective for representation learning.
- The loss encourages intra-class compactness and inter-class separation without increasing computational complexity.
- Combining Soft Silhouette Loss with cross-entropy and supervised contrastive learning yields superior performance.
- Empirical results show consistent improvements across multiple image classification benchmarks.
Silhouette Loss: Differentiable Global Structure Learning for Deep Representations
Summary
This paper addresses the limitations of traditional cross-entropy loss in supervised deep learning, particularly its inability to enforce desirable geometric properties in the embedding space, such as intra-class compactness and inter-class separation. The authors introduce Soft Silhouette Loss, a novel differentiable objective inspired by the silhouette coefficient from clustering analysis. Unlike existing metric learning approaches that rely on pairwise or proxy-based relationships, Soft Silhouette Loss evaluates each sample against all classes in a batch, promoting a global structure in the representation space. The proposed loss can be combined with cross-entropy and is complementary to supervised contrastive learning. The authors present a hybrid objective that integrates both local pairwise consistency and global cluster structure. Extensive experiments on seven diverse datasets demonstrate that augmenting cross-entropy with Soft Silhouette Loss consistently improves performance over traditional methods. The hybrid formulation outperforms supervised contrastive learning alone, achieving a top-1 accuracy of 39.08%, significantly higher than the baseline methods while maintaining lower computational overhead. This work highlights the potential of classical clustering principles as differentiable objectives for deep learning, enabling efficient optimization of both local and global structures in representation spaces.
Methodology
The authors propose a differentiable silhouette-based objective that evaluates each sample against all classes in a batch, promoting a global structure in the embedding space. This objective is integrated with traditional cross-entropy loss and supervised contrastive learning to form a hybrid optimization framework. The methodology involves extensive experiments on various datasets to validate the effectiveness of the proposed loss function.
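A minimal differentiable silhouette-style objective can be written against batch class centroids. This sketch uses a softmin over other-class centroid distances for smoothness; the paper's exact formulation may differ.

```python
import numpy as np

def soft_silhouette_loss(emb, labels, tau=1.0):
    # a: distance to own class centroid; b: softmin distance to the other
    # class centroids (smooth, hence differentiable end to end).
    classes = np.unique(labels)
    cents = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(emb[:, None] - cents[None], axis=-1)  # (batch, n_classes)
    own = labels[:, None] == classes[None]
    a = d[own]                                  # one own-class distance per row
    b = -tau * np.log(np.exp(-np.where(own, np.inf, d) / tau).sum(axis=1))
    s = (b - a) / np.maximum(a, b).clip(min=1e-12)
    return 1.0 - s.mean()                       # minimizing drives s -> 1

emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
loss = soft_silhouette_loss(emb, labels)
print(loss)    # small: classes are tight and well separated
```

Because every sample is scored against all class centroids in the batch, the objective shapes global cluster structure rather than only pairwise relations, at cost linear in batch size times class count.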
Results
The integration of Soft Silhouette Loss with cross-entropy led to a top-1 accuracy improvement from 36.71% (cross-entropy) and 37.85% (supervised contrastive learning) to 39.08%. The hybrid approach outperformed existing metric learning methods, demonstrating significant performance gains while incurring lower computational costs.
Implications
The findings suggest that incorporating classical clustering metrics into deep learning can enhance the quality of learned representations, making it beneficial for tasks such as retrieval, transfer learning, and open-set recognition. This approach may lead to more efficient and effective models in various applications within computer vision and beyond.
WOODELF-HD: Efficient Background SHAP for High-Depth Decision Trees
Efficient ML
Interpretability
- WOODELF-HD improves the computational efficiency of Background SHAP for high-depth decision trees.
- The algorithm reduces the preprocessing bottleneck from O(3^D) to O(2^D) in the tree depth D.
- It enables exact SHAP value computation for decision trees with depths up to 21, overcoming limitations of previous methods.
- Significant speedups (up to 162×) are achieved over existing state-of-the-art algorithms for deep trees.
WOODELF-HD: Efficient Background SHAP for High-Depth Decision Trees
Summary
This paper introduces WOODELF-HD, an extension of the WOODELF algorithm designed to efficiently compute Background SHAP values for high-depth decision trees. Traditional methods for computing SHAP values, particularly Background SHAP, struggle with scalability due to their time complexity, which includes an O(mn) component. Recent advancements like WOODELF and PLTREESHAP improved this to O(m + n) but still faced limitations with deep trees due to a preprocessing bottleneck that scales exponentially with tree depth. WOODELF-HD addresses this issue by reducing that complexity from O(3^D) to O(2^D) in the depth D through a Strassen-like multiplication scheme that optimizes matrix-vector operations. This allows exact Background SHAP computations for trees with depths up to 21, significantly surpassing previous methods, which fail beyond depth 15. The algorithm also merges path nodes with identical features to further reduce memory usage. Experimental results demonstrate that WOODELF-HD achieves speedups of 33× and 162× for ensembles of depths 12 and 15, respectively, compared to state-of-the-art methods.
Methodology
WOODELF-HD employs a Strassen-like multiplication scheme to optimize matrix-vector multiplications, reducing the complexity associated with Background SHAP computations. It also merges path nodes with identical features to minimize memory usage and improve cache efficiency. The algorithm is implemented in a fully vectorized, non-recursive manner, enhancing performance on standard computational environments.
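WOODELF-HD's fast scheme is not reproduced here, but the quantity it computes can be pinned down by brute force on a tiny tree: interventional (Background) SHAP averages the model over background rows for the features outside each coalition. Any fast tree algorithm must match this O(2^m · n) reference; the tree and data below are invented for illustration.

```python
import itertools
import math
import numpy as np

def tree_predict(x):
    # A tiny hand-built depth-2 decision tree over 3 features (invented).
    if x[0] < 0.5:
        return 1.0 if x[1] < 0.5 else 3.0
    return 2.0 if x[2] < 0.5 else 5.0

def background_shap(x, background):
    # Brute-force interventional SHAP: for features outside coalition S,
    # average the model over background rows. Exponential in m.
    m = len(x)
    def value(S):
        total = 0.0
        for b in background:
            z = np.where([i in S for i in range(m)], x, b)
            total += tree_predict(z)
        return total / len(background)
    phi = np.zeros(m)
    for i in range(m):
        rest = [j for j in range(m) if j != i]
        for r in range(m):
            for S in itertools.combinations(rest, r):
                w = math.factorial(r) * math.factorial(m - r - 1) / math.factorial(m)
                phi[i] += w * (value(set(S) | {i}) - value(set(S)))
    return phi

rng = np.random.default_rng(0)
bg = rng.random((16, 3))           # background dataset
x = np.array([0.2, 0.9, 0.7])      # instance to explain
phi = background_shap(x, bg)
print(phi)
```

The attributions satisfy the efficiency axiom exactly: they sum to f(x) minus the mean prediction over the background set, which is a useful correctness check for any faster implementation.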
Results
WOODELF-HD successfully computes Background SHAP values for decision trees with depths up to 21, where previous methods fail due to excessive memory usage. It achieves speed improvements of 33× for depth 12 and 162× for depth 15 compared to existing methods, demonstrating its effectiveness in handling high-depth decision trees.
Implications
The advancements presented in WOODELF-HD have significant implications for the interpretability of complex decision tree models, particularly in applications where deep trees are common. This method can enhance the understanding of model predictions in various domains, including finance, healthcare, and any field relying on predictive modeling.
RTMC: Step-Level Credit Assignment via Rollout Trees
Reinforcement Learning
Large Language Models
Optimization
- RTMC enables fine-grained credit assignment without a critic network.
- The state-action signature system compresses interaction histories for efficient state matching.
- Empirical results show a significant performance improvement over existing methods.
- The approach addresses the limitations of traditional critic-free methods in multi-step RL.
RTMC: Step-Level Credit Assignment via Rollout Trees
Summary
The paper introduces Rollout-Tree Monte Carlo (RTMC) advantage estimation, a novel approach to fine-grained credit assignment in multi-step agentic reinforcement learning (RL). Traditional methods, such as GRPO, assign uniform advantages across trajectories, which fails to differentiate between beneficial and harmful actions within a multi-turn episode. RTMC leverages the observation that group rollouts often traverse overlapping states, forming a tree structure that allows for the aggregation of return statistics to compute per-step Q-values and advantages without requiring a learned critic. The authors propose a state-action signature system to compress interaction histories into compact representations, facilitating effective state matching across rollouts. Empirical results demonstrate that RTMC significantly improves performance on SWE-bench Verified, achieving a 3.2 percentage point increase in pass@1 over GRPO, highlighting its effectiveness in addressing the credit assignment challenge in agentic RL.
Methodology
The authors propose RTMC, which aggregates return statistics from group rollouts over a tree structure to compute per-step Q-values and advantages. They introduce a state-action signature system to create compact representations of interaction histories, allowing for efficient state matching. A prior-based value smoothing mechanism is also implemented to ensure informative advantages throughout the tree.
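The tree aggregation itself is simple once states can be matched. In this sketch the state signature is just the prefix of actions taken (the paper compresses full interaction histories instead, and adds value smoothing we omit), and the advantage at each visited step is Q(s, a) − V(s) from aggregated rollout returns:

```python
from collections import defaultdict

def rtmc_advantages(rollouts):
    # rollouts: list of (action_sequence, final_return). States are matched
    # by signature; here the signature is simply the action prefix.
    q_returns = defaultdict(list)    # (state_sig, action) -> returns seen
    v_returns = defaultdict(list)    # state_sig -> returns seen
    for actions, ret in rollouts:
        for t in range(len(actions)):
            sig = tuple(actions[:t])
            q_returns[(sig, actions[t])].append(ret)
            v_returns[sig].append(ret)
    mean = lambda xs: sum(xs) / len(xs)
    return {(sig, a): mean(rets) - mean(v_returns[sig])
            for (sig, a), rets in q_returns.items()}

# Two rollouts share the root state and diverge at the first step.
rollouts = [(["fix", "test"], 1.0), (["refactor", "test"], 0.0)]
adv = rtmc_advantages(rollouts)
print(adv)
```

Where rollouts diverge, the two branches receive opposite-signed advantages at the branching step, while steps with no sibling alternatives get zero — per-step credit with no learned critic.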
Results
RTMC achieved a pass@1 score of 52.2% on SWE-bench Verified, representing a 3.2 percentage point improvement over the GRPO baseline. The results indicate additive gains at each stage of the empirical validation, showcasing the effectiveness of the proposed method.
Implications
The findings suggest that RTMC can enhance the training of agentic systems, particularly those utilizing large language models, by providing more accurate credit assignment. This could lead to improved performance in complex multi-turn tasks where traditional methods struggle.
EngageTriBoost: Predictive Modeling of User Engagement in Digital Mental Health Intervention Using Explainable Machine Learning
Interpretability
- EngageTriBoost (ETB) is an explainable ensemble ML framework for predicting user engagement in DMHI.
- ETB achieved up to 84% accuracy in predicting message posting, outperforming individual models.
- The study emphasizes interpretability and transparency in ML applications for mental health.
- SHAP was used to identify key behavioral and demographic factors associated with user engagement.
EngageTriBoost: Predictive Modeling of User Engagement in Digital Mental Health Intervention Using Explainable Machine Learning
Summary
This study addresses the rising mental health challenges among young adults by developing EngageTriBoost (ETB), an explainable ensemble machine learning framework aimed at predicting user engagement in digital mental health interventions (DMHI). The research focuses on the eBridge platform, which utilizes motivational interviewing-based online counseling. ETB was trained on data from 1,673 at-risk college students, incorporating 108 baseline features to predict engagement outcomes such as initial logins and message postings. The framework combines XGBoost, LightGBM, and CatBoost as base learners with logistic regression as a meta-learner, emphasizing interpretability over raw predictive performance. ETB achieved an accuracy of up to 84% in predicting message posting, demonstrating improved recall and calibration compared to individual models. The study also utilized Shapley Additive Explanations (SHAP) to analyze behavioral and demographic factors influencing engagement, revealing associations with chronic pain, stigma, and alcohol use. The findings highlight the potential of explainable ML in enhancing DMHI engagement and informing adaptive intervention strategies.
Methodology
The study employed a stacked ensemble approach using XGBoost, LightGBM, and CatBoost as base learners, with logistic regression as a meta-learner. The model was trained on data from 1,673 college students, utilizing 108 baseline features and engagement outcomes. Cross-validation was used for hyperparameter tuning, and SHAP was applied for interpretability.
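The stacking step above can be sketched as follows. This is a minimal illustration on synthetic data: the study's gradient-boosted base learners (XGBoost, LightGBM, CatBoost) are stood in for by simulated out-of-fold predicted probabilities, and the logistic-regression meta-learner is fit with plain gradient descent. All data and numbers are hypothetical, not the study's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)                      # engagement label (e.g. posted a message)
# simulated out-of-fold probabilities from three base learners
base_probs = np.clip(0.2 + 0.6 * y[:, None] + rng.normal(0, 0.15, (n, 3)), 0.01, 0.99)

X = np.column_stack([base_probs, np.ones(n)])  # base predictions plus intercept
w = np.zeros(X.shape[1])
for _ in range(500):                           # gradient descent on the log-loss
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * (X.T @ (p - y)) / n

meta_prob = 1.0 / (1.0 + np.exp(-X @ w))       # meta-learner's combined probability
accuracy = float(((meta_prob > 0.5) == y).mean())
```

In practice the out-of-fold construction matters: the meta-learner must be trained on base-learner predictions for examples those learners did not see, or the stack overfits.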
Results
ETB achieved an accuracy of up to 84% in predicting user message posting, with improved recall and calibration compared to individual models. The framework demonstrated stable discrimination for message posting and conservative performance for predicting initial logins, highlighting the challenges in predicting user uptake.
Implications
The findings suggest that explainable ML can significantly enhance the design and effectiveness of digital mental health interventions by providing insights into user engagement patterns and informing adaptive strategies for improving adherence.
Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning
Optimization
Large Language Models
Efficient ML
- MUON2 improves the spectral properties of the momentum matrix, enhancing convergence speed.
- The introduction of adaptive second-moment preconditioning leads to better optimization dynamics.
- MUON2 and its factorized variant, MUON2-F, consistently outperform previous optimizers with reduced computational costs.
- The method is validated through extensive experiments on large-scale models, showing significant efficiency gains.
Read more
Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning
Summary
The paper introduces MUON2, an enhanced version of the MUON optimizer designed to improve the efficiency of large-scale foundation model pre-training. MUON has shown promise in optimizing neural network updates through iterative orthogonalization but suffers from high computational and communication costs due to multiple Newton–Schulz iterations required per optimization step. MUON2 addresses this limitation by incorporating Adam-style adaptive second-moment preconditioning before the orthogonalization process. This approach significantly improves the conditioning of the momentum matrix, leading to faster convergence and better optimization dynamics. The authors also present MUON2-F, a memory-efficient variant that retains most of the performance benefits of MUON2 while reducing memory overhead. Comprehensive experiments demonstrate that MUON2 and MUON2-F outperform the original MUON and other recent variants, achieving 40% fewer Newton–Schulz iterations across various pre-training tasks on models like GPT and LLaMA.
Methodology
The authors propose MUON2, which applies adaptive second-moment scaling to the momentum matrix prior to orthogonalization. This method enhances the spectral properties of the matrix, leading to improved convergence during Newton–Schulz iterations. Additionally, MUON2-F is introduced as a memory-efficient variant that utilizes a factorized second-moment preconditioner.
Results
Experimental results indicate that MUON2 and MUON2-F outperform the original MUON optimizer and other recent variants, achieving a 40% reduction in the number of Newton–Schulz iterations required for convergence across various pre-training tasks on models ranging from 60M to 1.3B parameters.
Implications
The advancements presented in MUON2 could significantly reduce the resource requirements for training large-scale foundation models, making it more feasible for researchers and practitioners to develop and deploy such models efficiently.
Robust Adversarial Policy Optimization Under Dynamics Uncertainty
Reinforcement Learning
Robotics
Optimization
- Introduces Robust Adversarial Policy Optimization (RAPO) to address dynamics uncertainty in RL.
- Combines trajectory-level robustness through AdvNet with model-level robustness via Boltzmann reweighting.
- Demonstrates improved resilience to uncertainty and generalization to out-of-distribution dynamics.
- Maintains dual tractability while enhancing performance on in-distribution tasks.
Read more
Robust Adversarial Policy Optimization Under Dynamics Uncertainty
Summary
This paper addresses the challenges of reinforcement learning (RL) policies that fail when faced with dynamics differing from those encountered during training. Existing methods, such as domain randomization and adversarial RL, do not fully mitigate the issues arising from dynamics uncertainty. The authors propose a novel framework called Robust Adversarial Policy Optimization (RAPO), which leverages a dual formulation to expose the trade-off between robustness and performance. RAPO incorporates two main components: an adversarial network (AdvNet) that approximates a temperature parameter for trajectory-level robustness, and Boltzmann reweighting over dynamics ensembles for model-level robustness. This dual approach allows for efficient and stable worst-case rollouts while ensuring that the policy is sensitive to adverse dynamics. The framework is designed to improve resilience to uncertainty and generalization to out-of-distribution dynamics without sacrificing performance on in-distribution tasks. The results demonstrate that RAPO outperforms existing robust RL baselines, showcasing its effectiveness in enhancing policy robustness in safety-critical applications.
Methodology
The authors utilize a dual formulation of robust Markov Decision Processes (RMDPs) to manage dynamics uncertainty. They implement an adversarial network (AdvNet) to predict trajectory-level dual parameters, steering rollouts towards adverse outcomes. Additionally, they apply Boltzmann reweighting over an ensemble of dynamics models to ensure structured coverage of adverse scenarios. This two-layer structure allows for independent yet complementary enhancements to robustness at both the trajectory and model levels.
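The model-level reweighting can be sketched as follows; this assumes (our reading, not the paper's exact formulation) that each ensemble member receives a Boltzmann weight favoring dynamics models under which the current policy performs worst. The returns and temperature are illustrative.

```python
import numpy as np

def boltzmann_weights(returns, eta=1.0):
    """Weight ensemble members by exp(-eta * return): lower return, higher weight."""
    logits = -eta * np.asarray(returns, dtype=float)
    logits -= logits.max()                 # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# policy return under each dynamics model in the ensemble (hypothetical)
returns = [12.0, 9.5, 4.0, 11.2]
w = boltzmann_weights(returns, eta=0.5)    # model with return 4.0 dominates
```

The temperature `eta` trades off robustness against performance: `eta -> 0` recovers uniform averaging over the ensemble, while large `eta` concentrates training on the single worst-case model.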
Results
RAPO significantly outperforms existing robust RL baselines, demonstrating enhanced robustness to dynamics uncertainty and improved generalization capabilities. The framework effectively balances the trade-off between robustness and performance, yielding policies that maintain high performance in familiar environments while being resilient to unforeseen dynamics.
Implications
The findings suggest that RAPO can be effectively applied in safety-critical domains such as robotics, autonomous driving, and finance, where reliable performance under uncertain conditions is paramount. The framework's ability to generalize to out-of-distribution dynamics could lead to more robust and adaptable RL systems in real-world applications.
Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents
Large Language Models
Reinforcement Learning
NLP
- Skill-SD introduces dynamic, trajectory-derived natural language skills as a teacher signal for self-distillation.
- An importance-weighted reverse-KL loss is developed to correct gradient biases during training.
- Dynamic synchronization of the teacher model with the student is crucial for maintaining training stability.
- Skill-SD outperforms traditional RL methods, showing significant improvements in task performance.
Read more
Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents
Summary
The paper introduces Skill-SD, a novel framework designed to enhance the training of large language model (LLM) agents in multi-turn interactive tasks by addressing the limitations of reinforcement learning (RL) in terms of sample efficiency and training stability. Traditional RL methods struggle with sparse rewards and long task horizons, leading to inefficient learning. The authors propose using on-policy self-distillation (OPSD) to provide dense supervision from a privileged teacher model. However, fixed privileged information can restrict exploration and lead to training collapse when combined with RL. Skill-SD overcomes these challenges by transforming the agent's own trajectories into dynamic, training-only supervision in the form of natural language skills that encapsulate successful behaviors and workflows. This allows the student model to learn from diverse strategies rather than a single fixed solution. The framework employs an importance-weighted reverse-KL loss to ensure stable training and dynamically synchronizes the teacher model with the student to maintain effective guidance throughout the training process. Experimental results demonstrate that Skill-SD significantly outperforms standard RL baselines, achieving notable improvements in accuracy on agentic benchmarks.
Methodology
Skill-SD utilizes the agent's own trajectory history to create dynamic teacher signals in the form of natural language skills. It employs an importance-weighted reverse-KL loss to address distribution mismatches between the teacher and student models. The teacher is periodically synchronized with the student to ensure that the guidance remains relevant as the student improves.
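One plausible shape for the importance-weighted reverse-KL loss mentioned above, on per-token categorical distributions. This is an illustration of the loss form, not the authors' implementation; the distributions and weights are invented.

```python
import numpy as np

def importance_weighted_reverse_kl(q, p, weights):
    """Mean over tokens of w_t * KL(q_t || p_t): student q pulled toward teacher p."""
    q, p = np.asarray(q), np.asarray(p)
    kl_per_token = np.sum(q * (np.log(q) - np.log(p)), axis=-1)
    return float(np.mean(np.asarray(weights) * kl_per_token))

q = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]   # student next-token distributions
p = [[0.6, 0.3, 0.1], [0.3, 0.5, 0.2]]   # skill-conditioned teacher distributions
loss = importance_weighted_reverse_kl(q, p, weights=[1.0, 0.8])
```

The reverse direction, KL(q || p), is mode-seeking: the student concentrates on behaviors the teacher assigns high probability, while the importance weights correct the gradient bias from sampling tokens under the student rather than the teacher.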
Results
Skill-SD achieved 64.9% accuracy on the AppWorld benchmark and 62.5% on Sokoban, outperforming vanilla Group Relative Policy Optimization (GRPO) by +14.0% and +10.9%, respectively, and surpassing vanilla on-policy distillation (OPD) by significant margins.
Implications
The proposed Skill-SD framework has the potential to improve the efficiency and effectiveness of training LLM agents in various interactive tasks, enabling them to learn from diverse strategies and enhancing their performance in real-world applications.
Temporal Dropout Risk in Learning Analytics: A Harmonized Survival Benchmark Across Dynamic and Early-Window Representations
Time Series
- Introduces a harmonized survival benchmark for dropout risk modeling in Learning Analytics.
- Demonstrates that temporal and behavioral signals are more predictive of dropout risk than static demographic factors.
- Highlights the need for calibration and interpretability in predictive models beyond mere discrimination.
- Finds that Random Survival Forest and Poisson Piecewise-Exponential are top performers in their respective model arms.
Read more
Temporal Dropout Risk in Learning Analytics: A Harmonized Survival Benchmark Across Dynamic and Early-Window Representations
Summary
This paper addresses the challenge of student dropout in Learning Analytics by proposing a survival-oriented benchmark for modeling temporal dropout risk. The authors highlight that existing studies often evaluate predictive models under varying protocols, focusing on discrimination rather than temporal interpretability and calibration. To fill this gap, they utilize the Open University Learning Analytics Dataset (OULAD) to compare two harmonized approaches: a dynamic weekly model and a continuous-time model. The evaluation framework incorporates four analytical layers: predictive performance, ablation, explainability, and calibration. Results indicate that within the continuous-time arm, Random Survival Forest excels in discrimination, while Poisson Piecewise-Exponential leads in the dynamic arm. Notably, the dominant predictive signals identified were temporal and behavioral rather than demographic, suggesting that dropout risk is a dynamic process. The findings advocate for a harmonized benchmarking approach in Learning Analytics, emphasizing the importance of temporal engagement over static attributes in dropout prediction.
Methodology
The study employs a survival-oriented benchmark using the OULAD dataset, comparing two modeling approaches: a dynamic weekly representation and a continuous-time representation. The evaluation includes predictive performance metrics (Integrated Brier Score and C-index), ablation analysis to assess feature importance, explainability analysis to identify predictive drivers, and calibration checks to ensure numerical coherence between predicted risks and observed outcomes.
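Of the metrics above, the C-index has a particularly compact definition: the fraction of comparable subject pairs that the risk score orders correctly. A minimal pure-Python version on hypothetical dropout data:

```python
def c_index(times, events, risks):
    """Concordance index: do higher risks pair with earlier observed dropouts?"""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # pair (i, j) is comparable if i's dropout was observed before j's time
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5      # ties get half credit
    return concordant / comparable

times  = [3, 5, 8, 10, 12]   # weeks until dropout or censoring (hypothetical)
events = [1, 1, 0, 1, 0]     # 1 = dropout observed, 0 = censored
risks  = [0.9, 0.7, 0.4, 0.5, 0.2]
ci = c_index(times, events, risks)
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering; censored subjects contribute only as the later member of a pair, which is why calibration checks are needed alongside it.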
Results
The results reveal that Random Survival Forest achieves the best discrimination in the continuous-time arm, while Poisson Piecewise-Exponential performs best in the dynamic arm. The analysis shows that the primary predictive signals are temporal and behavioral, with calibration issues noted in some models, particularly XGBoost AFT, which exhibited systematic bias.
Implications
The findings suggest that Learning Analytics practices should prioritize dynamic engagement metrics over static demographic data for dropout prediction. This approach can enhance the effectiveness of interventions aimed at improving student retention. Additionally, the study underscores the necessity for comprehensive evaluation protocols that include calibration and interpretability to inform educational support strategies.
Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models
Large Language Models
NLP
Theory
- Escalation behavior in LLMs is critical for effective automation and varies significantly across models.
- Models exhibit miscalibration in their self-assessment of accuracy, affecting their decision-making.
- Interventions such as supervised fine-tuning on chain-of-thought targets can improve escalation decisions.
- The study highlights the need for careful characterization of model-specific escalation behavior prior to deployment.
Read more
Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models
Summary
This paper investigates the decision-making process of large language models (LLMs) regarding when to act on their predictions and when to escalate decisions to human operators. The authors model this as a decision under uncertainty, where an LLM predicts outcomes, estimates its accuracy, and weighs the costs of acting versus escalating. The study evaluates eight LLMs across five decision-making domains, revealing significant differences in their escalation behavior that are not predicted by model architecture or size. The authors find that LLMs often miscalibrate their self-assessments, leading to inconsistent escalation decisions. They propose interventions to improve this behavior, including cost framing and supervised fine-tuning on chain-of-thought targets, which yield robust escalation policies that generalize across various contexts. The findings underscore the importance of understanding and characterizing escalation behavior in LLMs before deployment, emphasizing the need for models to explicitly reason about uncertainty and decision costs.
Methodology
The authors evaluated eight LLMs from four model families across five decision-making tasks derived from human decision data. They analyzed the models' escalation behavior by assessing their self-assessment accuracy and decision thresholds. Various interventions were tested to improve decision-making, including cost ratio variations and supervised fine-tuning.
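The act-versus-escalate decision can be framed as an expected-cost comparison. The rule below is our simplified rendering of that framing, not the paper's exact objective; the costs are hypothetical.

```python
def should_escalate(p_correct, cost_error, cost_escalate):
    """Escalate when the expected cost of acting exceeds the fixed cost of a handoff."""
    expected_cost_act = (1.0 - p_correct) * cost_error
    return expected_cost_act > cost_escalate

# a well-calibrated model acts at high confidence and escalates at low confidence
decision_confident = should_escalate(p_correct=0.95, cost_error=10.0, cost_escalate=1.0)
decision_uncertain = should_escalate(p_correct=0.60, cost_error=10.0, cost_escalate=1.0)
```

The paper's central finding maps directly onto this rule: if `p_correct` is miscalibrated, the threshold is applied to the wrong quantity, so escalation behavior becomes inconsistent even when the cost framing is sound.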
Results
The study found that LLMs are often miscalibrated in their self-assessments, leading to inconsistent escalation behavior. The thresholds for escalation varied significantly across models and were not linked to model architecture or size. Interventions, particularly supervised fine-tuning on chain-of-thought targets, resulted in improved and more consistent escalation decisions across different datasets and scenarios.
Implications
The findings suggest that understanding and characterizing escalation behavior is essential for deploying LLMs in automation tasks. Improved decision-making processes can enhance the reliability of automated systems, reducing the risk of errors and optimizing human workload management.
On the Spectral Geometry of Cross-Modal Representations: A Functional Map Diagnostic for Multimodal Alignment
Multimodal
- First application of functional maps to multimodal neural representation alignment.
- Evidence that independently pretrained vision and language encoders develop similar spectral complexity.
- Identification of the spectral complexity–orientation gap in cross-modal representations.
- Introduction of three new diagnostics for evaluating cross-modal representation compatibility.
Read more
On the Spectral Geometry of Cross-Modal Representations: A Functional Map Diagnostic for Multimodal Alignment
Summary
This paper investigates cross-modal alignment between independently pretrained vision and language encoders using the functional map framework from computational geometry. The study reveals that while the functional map framework underperforms compared to traditional methods like Procrustes alignment, it uncovers a significant structural property of multimodal representations. The authors find that the Laplacian eigenvalue spectra of the two encoders are quantitatively similar, indicating comparable intrinsic complexity. However, the functional map exhibits low diagonal dominance and high orthogonality error, suggesting that the eigenvector bases are unaligned. This phenomenon is termed the spectral complexity–orientation gap, highlighting that models may capture similar structures but organize them differently. The paper introduces three diagnostic metrics for assessing cross-modal representation compatibility: diagonal dominance, orthogonality deviation, and Laplacian commutativity error. The findings emphasize the need for further exploration of spectral alignment methods and their limitations in practical applications.
Methodology
The study employs a functional map framework to analyze the correspondence between vision and language encoders. It involves encoding samples from the Flickr30k dataset using DINOv2 and MiniLM, constructing k-nearest-neighbor graphs, computing normalized graph Laplacians, and extracting spectral bases. The functional map is derived by solving a regularized least-squares problem that penalizes Laplacian commutativity violations, and the results are compared against traditional alignment methods such as Procrustes and CCA.
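The least-squares core of the functional map computation can be illustrated on synthetic spectral coefficients; the Laplacian-commutativity penalty used in the paper is omitted here for brevity, and all dimensions and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_anchors, k = 50, 10
A = rng.normal(size=(n_anchors, k))          # anchor coefficients in basis Phi (vision)
C_true = np.eye(k) + 0.1 * rng.normal(size=(k, k))
B = A @ C_true.T                             # corresponding coefficients in basis Psi (language)

# solve min_C || A C^T - B ||_F  (row-wise anchor correspondence)
C_t, *_ = np.linalg.lstsq(A, B, rcond=None)
C = C_t.T

# diagonal dominance: fraction of C's absolute mass on the diagonal
diag_dom = np.abs(np.diag(C)).sum() / np.abs(C).sum()
```

A near-diagonal C (high `diag_dom`) means the two spectral bases are aligned eigenvector-by-eigenvector; the paper's near-zero diagonal dominance is what signals the complexity-orientation gap despite similar eigenvalue spectra.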
Results
The functional map framework achieved a 2.2% i2t Recall@1 at 100 anchors, significantly lower than Procrustes (12.1%) and relative representations (13.4%). The analysis revealed a normalized spectral distance of 0.043 between the Laplacian eigenvalue spectra of the two encoders, indicating similar intrinsic complexity. However, the functional map exhibited near-zero diagonal dominance and a high orthogonality error of 70.15, indicating misalignment in the eigenvector bases.
Implications
The findings suggest that while independently pretrained models may capture similar structures, their organizational differences pose challenges for cross-modal alignment. The introduced diagnostics could guide future research in improving multimodal representation compatibility and alignment methods, potentially impacting applications in multimedia retrieval and cross-modal learning.
NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
Multimodal
- NeuroFlow is the first unified model for visual encoding and decoding from neural activity.
- It incorporates NeuroVAE for structured latent space modeling and XFM for consistent flow learning.
- The framework achieves superior performance and parameter efficiency compared to isolated methods.
- NeuroFlow captures consistent neural activation patterns, enhancing understanding of visual perception.
Read more
NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
Summary
The paper introduces NeuroFlow, a framework that unifies visual encoding and decoding from neural activity into a single flow model, addressing the inefficiencies of treating these tasks separately. Traditional approaches often require distinct models and training procedures for encoding (predicting brain activity from stimuli) and decoding (reproducing stimuli from brain activity), which can overlook the inherent consistency between these processes. NeuroFlow consists of two main components: NeuroVAE, a variational backbone that models neural variability and establishes a semantically structured latent space for bidirectional modeling, and Cross-modal Flow Matching (XFM), which learns a consistent flow between visual and neural latent distributions. This framework reformulates encoding and decoding as a reversible, time-dependent process within a shared latent space, ensuring strong semantic coherence. Empirical results demonstrate that NeuroFlow outperforms traditional methods in both encoding and decoding tasks while achieving higher computational efficiency. The model captures consistent activation patterns related to neural variability, marking a significant advancement towards unified visual encoding and decoding, with implications for future brain-computer interfaces.
Methodology
NeuroFlow employs a two-component architecture: NeuroVAE for modeling neural variability and creating a semantically organized latent space, and Cross-modal Flow Matching (XFM) to establish a reversible flow between visual and neural distributions. This allows for joint optimization of encoding and decoding processes within a shared latent space.
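The flow-matching half of the pipeline can be made concrete with a toy example: straight-line interpolation paths between two synthetic latent distributions define a target velocity field, and a trivial constant "model" stands in for the learned velocity network. Everything below is illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
z0 = rng.normal(0.0, 1.0, size=(256, d))   # "visual" latents (source distribution)
z1 = rng.normal(3.0, 1.0, size=(256, d))   # "neural" latents (target distribution)

t = rng.uniform(size=(256, 1))             # random times along each path
zt = (1 - t) * z0 + t * z1                 # point on the straight path at time t
target_v = z1 - z0                         # target velocity along that path

# a real velocity network would condition on (zt, t); here a constant stands in,
# fit in closed form as the mean target velocity
v_hat = target_v.mean(axis=0)
fm_loss = np.mean((target_v - v_hat) ** 2)  # flow-matching regression loss
```

Because the learned flow is a reversible ODE, integrating it forward transports visual latents to neural latents (encoding) and integrating it backward recovers the stimulus side (decoding), which is what makes the shared formulation possible.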
Results
NeuroFlow demonstrated superior performance in visual encoding and decoding tasks, achieving comparable or better results than existing methods while using only 25% of the parameters of the leading decoding model, MindEye2. The model also exhibited strong encoding-decoding consistency and captured coherent activation patterns in neural data.
Implications
NeuroFlow's unified approach provides a deeper understanding of the neural mechanisms underlying visual perception and has potential applications in developing advanced brain-computer interfaces that can facilitate bidirectional communication between visual stimuli and neural responses.
Structural Gating and Effect-aligned Lag-resolved Temporal Causal Discovery Framework with Application to Heat-Pollution Extremes
Time Series
Graph Learning
Interpretability
- Introduction of SGED-TCD framework for temporal causal discovery.
- Application to heatwave and air pollution extremes in China reveals significant causal relationships.
- Framework improves interpretability and robustness of causal graphs.
- Demonstrates distinct regional and seasonal heterogeneity in causal pathways.
Read more
Structural Gating and Effect-aligned Lag-resolved Temporal Causal Discovery Framework with Application to Heat-Pollution Extremes
Summary
This paper introduces SGED-TCD, a Structural Gating and Effect-aligned framework for lag-resolved Temporal Causal Discovery in complex multivariate time series. The framework integrates explicit structural gating, stability-oriented learning, perturbation-effect alignment, and unified graph extraction to enhance the interpretability, robustness, and functional consistency of causal graphs. The effectiveness of SGED-TCD is demonstrated through its application to the analysis of heatwave and air pollution extremes in eastern and northern China, utilizing large-scale climate indices and regional variables. The results reveal distinct regional and seasonal patterns in the causal relationships, indicating that warm-season extremes in Eastern China are primarily influenced by low-latitude oceanic variability, while cold-season extremes in Northern China are more affected by high-latitude circulation patterns. SGED-TCD successfully reconstructs weighted causal networks that highlight dominant lags and causal importance, showcasing its ability to recover interpretable and hierarchical causal pathways in complex climate systems. The framework is not limited to this specific application and offers a general approach for temporal causal discovery across various domains.
Methodology
SGED-TCD employs a backbone-decoupled architecture that includes a variable-level temporal encoder, a conditional structural gating tensor, a lag-aware target-node aggregator, and a prediction head. It explicitly parameterizes candidate causal links and maintains structural consistency under perturbations, distinguishing meaningful causal relations from spurious ones.
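A schematic of what a structural gating tensor and unified graph extraction might look like; this is our illustration only, and the variable counts, sigmoid gating form, and threshold are all assumptions rather than the authors' design.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars, max_lag = 3, 4
# learnable logits for every candidate lagged link j --(lag)--> i
gate_logits = rng.normal(size=(n_vars, n_vars, max_lag))
gates = 1.0 / (1.0 + np.exp(-gate_logits))   # gates in (0, 1)

# unified graph extraction: keep link (j -> i) at its dominant lag
# when the strongest gate over lags clears a threshold
threshold = 0.6
strength = gates.max(axis=-1)                # best gate per (target i, source j)
dominant_lag = gates.argmax(axis=-1) + 1     # lags counted from 1
edges = [(j, int(dominant_lag[i, j]), i)
         for i in range(n_vars) for j in range(n_vars)
         if strength[i, j] > threshold]
```

In the full framework the gates multiply the encoder's lagged features, so sparsity pressure on the gates directly prunes links from the extracted causal graph rather than being applied post hoc.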
Results
The application of SGED-TCD resulted in the reconstruction of causal networks that revealed clear regional and seasonal differences in the drivers of heatwave and air pollution extremes. Specifically, it identified that warm-season extremes in Eastern China are linked to low-latitude oceanic variability, while cold-season extremes in Northern China are influenced by high-latitude circulation variability.
Implications
The findings underscore the importance of understanding the complex interactions between climate variables and their impacts on public health and environmental conditions. SGED-TCD provides a robust framework for future research in climate risk assessment and adaptation planning, as well as potential applications in other complex domains requiring temporal causal analysis.
CARE-ECG: Causal Agent-based Reasoning for Explainable and Counterfactual ECG Interpretation
Large Language Models
Interpretability
Time Series
- CARE-ECG integrates causal reasoning into ECG interpretation, improving explainability and counterfactual analysis.
- The framework encodes ECG signals into structured latent biomarkers, enhancing the interpretability of physiological factors.
- CARE-ECG demonstrates significant improvements in diagnostic accuracy and reduces hallucinations in outputs.
- The system supports rigorous evaluation through causal graph inference and counterfactual assessments.
Read more
CARE-ECG: Causal Agent-based Reasoning for Explainable and Counterfactual ECG Interpretation
Summary
The paper introduces CARE-ECG, a novel framework for ECG interpretation that leverages causal reasoning to enhance explainability and counterfactual analysis in clinical decision-making. Traditional ECG-LLM systems often struggle with weak signal-text alignment and lack robust grounding in physiological structures, leading to unreliable outputs. CARE-ECG addresses these issues by encoding multi-lead ECGs into temporally organized latent biomarkers and employing causal graph inference for probabilistic diagnosis. This framework supports counterfactual assessments through structural causal models, allowing for a more nuanced understanding of how different physiological states can affect diagnostic outcomes. The system integrates a modular agentic pipeline that combines historical data, diagnosis, and responses, ensuring that language outputs are grounded in causal evidence. The authors demonstrate that CARE-ECG significantly improves diagnostic accuracy and explanation faithfulness while reducing hallucinations in outputs, achieving an accuracy of 0.84 on Expert-ECG-QA and 0.76 on SCP-mapped PTB-XL using GPT-4. Overall, CARE-ECG provides a traceable reasoning structure that enhances the reliability of ECG interpretations in clinical settings.
Methodology
CARE-ECG employs a modular pipeline that encodes multi-lead ECGs into latent biomarkers, utilizes causal graph inference for diagnosis, and incorporates counterfactual reasoning through structural causal models. The system grounds language outputs in causal evidence and integrates historical data for improved decision-making.
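The counterfactual machinery can be illustrated with a toy structural causal model; the variables, mechanism, and thresholds below are entirely invented for illustration and are not the paper's biomarkers or clinical logic.

```python
def scm(heart_rate):
    """Toy mechanism: heart_rate -> qt_interval -> diagnostic risk flag."""
    qt_interval = 0.44 - 0.001 * (heart_rate - 60)   # deliberately simplistic
    risk = 1.0 if qt_interval > 0.46 else 0.0
    return risk

factual = scm(heart_rate=30)          # observed (bradycardic) state
counterfactual = scm(heart_rate=80)   # query under do(heart_rate := 80)
```

Re-running the same mechanisms under an intervention is what lets the system answer "would the diagnosis change if this biomarker were different?", and grounding the language output in that comparison is what the paper reports reduces hallucination.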
Results
CARE-ECG achieved diagnostic accuracy of 0.84 on Expert-ECG-QA and 0.76 on SCP-mapped PTB-XL, demonstrating improved explanation faithfulness and reduced hallucinations compared to traditional ECG-LLM systems.
Implications
The advancements presented by CARE-ECG could lead to more reliable ECG interpretation tools in clinical settings, enhancing decision support for clinicians and potentially improving patient outcomes through better diagnostic accuracy and explainability.
Structured Exploration and Exploitation of Label Functions for Automated Data Annotation
NLP
Efficient ML
Theory
- Introduces EXPONA, a novel framework for automated data annotation using label functions.
- Balances diversity and reliability in LF generation through a two-phase exploration and exploitation process.
- Achieves near-complete label coverage and significantly improves weak label quality compared to existing methods.
- Demonstrates substantial downstream performance gains across diverse classification tasks.
Read more
Structured Exploration and Exploitation of Label Functions for Automated Data Annotation
Summary
The paper presents EXPONA, an automated framework designed for programmatic labeling that addresses the challenges of generating high-quality labeled data for machine learning. Traditional methods for generating label functions (LFs) often rely on large language models or model-based synthesis, which can lead to limited coverage and unreliable label quality. EXPONA formulates LF generation as a process that balances diversity and reliability by exploring LFs from multiple perspectives: surface, structural, and semantic. The framework employs a two-phase approach: LF exploration, which generates diverse candidate LFs based on task descriptions and data characteristics, and LF exploitation, which evaluates and filters these candidates to retain only the most reliable heuristics. Extensive experiments on eleven classification datasets demonstrate that EXPONA significantly outperforms existing automated LF generation methods, achieving up to 98.9% label coverage and improving weak label quality by up to 87%. The results indicate that EXPONA's approach leads to substantial downstream performance gains, showcasing its effectiveness in producing high-quality pseudo-labels for model training.
Methodology
EXPONA employs a two-phase process: LF exploration generates diverse candidate label functions from multiple perspectives, while LF exploitation evaluates and filters these candidates based on performance indicators to ensure reliability and quality.
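The exploitation phase can be sketched as a coverage-and-accuracy filter over candidate label functions; the thresholds and the toy sentiment LFs below are hypothetical, a simplification of the performance indicators the framework actually uses.

```python
def exploit(label_fns, texts, gold, min_coverage=0.2, min_accuracy=0.7):
    """Keep label functions that fire often enough and are accurate when they fire."""
    kept = []
    for lf in label_fns:
        votes = [lf(t) for t in texts]          # None means the LF abstains
        fired = [(v, g) for v, g in zip(votes, gold) if v is not None]
        coverage = len(fired) / len(texts)
        accuracy = sum(v == g for v, g in fired) / len(fired) if fired else 0.0
        if coverage >= min_coverage and accuracy >= min_accuracy:
            kept.append(lf)
    return kept

# hypothetical surface-level LFs for binary sentiment (1 = positive)
lf_good = lambda t: 1 if "great" in t else None
lf_noisy = lambda t: 1 if "the" in t else None
texts = ["great movie", "great acting", "the plot drags", "boring film"]
gold  = [1, 1, 0, 0]
kept = exploit([lf_good, lf_noisy], texts, gold)   # lf_noisy is filtered out
```

Kept LFs then vote on unlabeled data to produce the weak labels; filtering before voting is what keeps an unreliable heuristic from polluting the pseudo-label pool.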
Results
EXPONA achieved up to 98.9% label coverage and improved weak label quality by up to 87%, leading to downstream performance gains of up to 46% in weighted F1 scores across eleven classification datasets.
Implications
The framework can significantly reduce the manual effort required for data annotation in machine learning, enabling the creation of high-quality labeled datasets at scale. This has potential applications in various domains where labeled data is critical for model training.
Exploring the impact of fairness-aware criteria in AutoML
Optimization
- Integrating fairness metrics into AutoML can significantly improve fairness outcomes.
- A trade-off exists between predictive performance and fairness, with a noted decrease in predictive power when fairness is prioritized.
- The study employs a multi-criteria optimization approach to balance fairness and performance metrics.
- Fairness-aware AutoML can lead to simpler and more efficient ML solutions.
Read more
Exploring the impact of fairness-aware criteria in AutoML
Summary
This paper investigates the integration of fairness-aware criteria into Automated Machine Learning (AutoML) frameworks, which traditionally prioritize predictive performance. The authors highlight the risks of biased data leading to unfair outcomes in ML systems, particularly as AutoML becomes more prevalent. Previous approaches have primarily focused on model selection and hyperparameter tuning, often neglecting other crucial stages of the ML pipeline. This study proposes a novel method that incorporates fairness metrics directly into the optimization component of AutoML, allowing for a more comprehensive approach to fairness across the entire ML pipeline. The authors utilize a multi-criteria single-objective genetic algorithm to balance predictive performance and fairness metrics, specifically integrating three fairness metrics (Demographic Parity, Equalised Odds, and Absolute Between-ROC Area) alongside predictive performance metrics (Matthews Correlation Coefficient and True Positive Rate). The results demonstrate a trade-off, with a 9.4% decrease in predictive power but a 14.5% improvement in fairness and a 35.7% reduction in data usage. The findings suggest that fairness integration can lead to simpler, yet effective ML solutions, emphasizing the importance of optimizing the entire ML workflow for more balanced outcomes.
Methodology
The authors developed a fairness-aware optimization framework within an AutoML system, utilizing a multi-criteria single-objective genetic algorithm. This approach integrates fairness metrics (Demographic Parity, Equalised Odds, Absolute Between-ROC Area) and predictive performance metrics (Matthews Correlation Coefficient, True Positive Rate) to guide the optimization of the entire ML pipeline, from data selection to model tuning.
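As an illustration, the multi-criteria single-objective idea can be sketched as a weighted scalarization of one fairness metric (demographic parity) with one performance metric (MCC) into a single GA fitness. This is a minimal sketch with a hypothetical `fitness` weighting, not the paper's exact aggregation:

```python
import numpy as np

def demographic_parity_diff(y_pred, group):
    """Absolute gap in positive-prediction rates between the two groups."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def matthews_corrcoef(y_true, y_pred):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def fitness(y_true, y_pred, group, w_fair=0.5):
    """Hypothetical GA fitness: reward MCC, penalize the demographic-parity gap."""
    return (1 - w_fair) * matthews_corrcoef(y_true, y_pred) \
        - w_fair * demographic_parity_diff(y_pred, group)
```

A genetic algorithm would then evolve whole pipelines (data selection through model tuning) to maximize this scalar, trading predictive power against fairness as `w_fair` grows.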
Results
The integration of fairness metrics resulted in a 14.5% improvement in fairness metrics, a 9.4% decrease in predictive power, and a 35.7% reduction in data usage compared to a baseline focused solely on predictive performance. The final ML solutions were also simpler, indicating that complexity is not always necessary for achieving fairness.
Implications
This research suggests that fairness-aware AutoML frameworks can lead to more equitable decision-making processes in ML applications, particularly in sensitive areas like healthcare and finance. It emphasizes the need for a holistic approach to fairness in ML systems, potentially influencing future AutoML designs and practices.
Integrated electro-optic attention nonlinearities for transformers
Computer Vision
NLP
Efficient ML
- Proposes analog nonlinearities using TFLN MZMs to replace traditional Softmax in Transformers.
- Demonstrates competitive accuracy in Vision Transformers and Large Language Models with 4-bit quantization.
- Characterizes system performance under high encoding speeds and various noise conditions.
- Addresses the computational bottleneck of Softmax operations in neural networks.
Integrated electro-optic attention nonlinearities for transformers
Summary
This paper addresses the bottleneck in inference latency caused by the Softmax operation in Transformer architectures, which, despite accounting for less than 1% of total operations, significantly impacts performance due to its nonlinear nature. The authors propose the use of thin-film lithium niobate (TFLN) Mach-Zehnder modulators (MZMs) as analog nonlinear computational elements to replace digital Softmax and Sigmoid functions. By implementing these electro-optic alternatives in Vision Transformers and Large Language Models, the authors demonstrate that their system can maintain competitive accuracy even with aggressive 4-bit quantization of the analog units. The study further explores the performance of these systems under various noise conditions and encoding speeds up to 10 GBaud, suggesting that TFLN modulators can effectively serve as nonlinear function units in hybrid co-packaged hardware, enhancing the speed and energy efficiency of nonlinear computations in Transformers.
Methodology
The authors implemented analog nonlinearities using TFLN Mach-Zehnder modulators to replace the Softmax and Sigmoid functions in Transformer architectures. They evaluated the performance of these implementations in Vision Transformers and Large Language Models, focusing on accuracy, latency, and robustness under different noise conditions and quantization levels.
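A minimal sketch of the idea, assuming an ideal sin² MZM intensity transfer biased at quadrature (the paper's device model and calibration will differ), paired with 4-bit uniform quantization of the drive signal:

```python
import numpy as np

def mzm_transfer(v, v_pi=1.0):
    """Ideal MZM intensity transfer, biased at quadrature: at v=0 it outputs
    0.5 and rises sigmoid-like toward 1 over the linear drive range."""
    return np.sin(np.pi * (v + v_pi / 2) / (2 * v_pi)) ** 2

def quantize(x, bits=4, lo=-1.0, hi=1.0):
    """Uniform quantization to 2**bits levels on [lo, hi], modeling a 4-bit DAC."""
    levels = 2 ** bits - 1
    u = np.clip((np.asarray(x) - lo) / (hi - lo), 0.0, 1.0)
    return lo + np.round(u * levels) / levels * (hi - lo)

# Attention-style scores pass through the DAC, then the analog nonlinearity.
scores = np.linspace(-0.5, 0.5, 11)
activations = mzm_transfer(quantize(scores))
```

The point of the sketch is that a sigmoid-like response comes "for free" from the modulator's physics, so the digital nonlinearity can be replaced by a single electro-optic pass even under coarse quantization.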
Results
The results indicate that the proposed electro-optic nonlinearities maintain high accuracy comparable to traditional methods, even with aggressive quantization. The system demonstrated effective performance at encoding speeds of up to 10 GBaud, with a significant reduction in inference latency compared to conventional digital implementations.
Implications
The findings suggest that integrating electro-optic components into Transformer architectures can lead to significant improvements in processing speed and energy efficiency, potentially enabling more scalable and efficient AI systems in natural language processing and computer vision tasks.
A Hybrid Intelligent Framework for Uncertainty-Aware Condition Monitoring of Industrial Systems
Time Series
- Development of a hybrid condition monitoring framework integrating data-driven and physics-based approaches.
- Exploration of two hybrid integration strategies: feature-level fusion and model-level ensemble.
- Demonstrated improvements in diagnostic accuracy and decision reliability through hybridization.
- Application of conformal prediction for effective uncertainty quantification.
A Hybrid Intelligent Framework for Uncertainty-Aware Condition Monitoring of Industrial Systems
Summary
This paper presents a hybrid intelligent framework aimed at enhancing condition monitoring in industrial systems by integrating data-driven learning with physics-based insights. The proposed framework utilizes primary sensor measurements, lagged temporal features, and physics-informed residuals derived from nominal surrogate models. Two integration strategies are explored: a feature-level fusion approach that enriches the input space with residual and temporal data, and a model-level ensemble approach that combines classifiers trained on different feature types at the decision level. The framework is evaluated using a continuous stirred-tank reactor (CSTR) benchmark, employing various machine learning models and ensemble configurations. The results indicate that both hybrid approaches significantly improve diagnostic accuracy compared to single-source baselines, with the model-level ensemble achieving a 2.9% enhancement over the best baseline. Additionally, the use of conformal prediction for uncertainty quantification reveals that hybrid integration not only boosts accuracy but also enhances decision reliability by producing smaller, well-calibrated prediction sets. This work underscores the effectiveness of combining physics-informed residuals, temporal features, and ensemble learning to address the challenges of fault detection in nonlinear industrial systems.
Methodology
The methodology involves a hybrid condition monitoring framework that combines data-driven fault classification, physics-informed residual feature extraction, hybrid and ensemble fusion of information, and uncertainty quantification using conformal prediction. The framework is designed to address fault detection in nonlinear, closed-loop industrial processes.
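The conformal prediction component can be illustrated with a standard split-conformal construction for classification (a generic sketch, not the authors' exact implementation):

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction: score = 1 - p(true class) on a calibration
    set; a test point's prediction set keeps every class whose score falls
    below the calibrated quantile, giving ~(1 - alpha) marginal coverage."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level)
    return [np.where(1.0 - p <= q)[0] for p in test_probs]

cal_probs = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]])
cal_labels = np.array([0, 1, 0, 1])
sets = conformal_sets(cal_probs, cal_labels, np.array([[0.95, 0.05]]))
```

Smaller, well-calibrated sets of this kind are exactly what the paper reports when hybrid features make the classifier's probabilities sharper.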
Results
The evaluation of the hybrid framework on the CSTR benchmark showed that both feature-level and model-level hybridization improved diagnostic accuracy, with the best model-level ensemble achieving a 2.9% improvement over the best baseline. The conformal prediction analysis indicated enhanced uncertainty management, resulting in smaller and well-calibrated prediction sets.
Implications
The findings suggest that integrating physics-informed insights with machine learning can significantly enhance the reliability and accuracy of condition monitoring systems in industrial applications. This approach may lead to improved safety and efficiency in industrial operations by enabling better fault detection and management.
MoEITS: A Green AI approach for simplifying MoE-LLMs
Large Language Models
Efficient ML
Theory
- Introduction of MoEITS, a simplification algorithm for MoE-LLMs based on information theory.
- Utilization of normalized mutual information to detect redundancy among experts.
- Extensive empirical evaluation shows MoEITS outperforms existing pruning methods.
- The method contributes to reducing computational burden and energy consumption in AI systems.
MoEITS: A Green AI approach for simplifying MoE-LLMs
Summary
The paper presents MoEITS, a novel algorithm designed to simplify Mixture-of-Experts (MoE) large language models (LLMs) while addressing the computational and environmental challenges associated with their deployment. MoE-LLMs, which enhance model capabilities through ensemble methods, often require substantial hardware resources, leading to increased energy consumption. MoEITS leverages normalized mutual information (NMI) to assess redundancy among experts in MoE blocks, allowing for effective simplification without significantly compromising model accuracy. The authors conduct a comprehensive theoretical and empirical analysis of MoEITS, including its computational complexity and performance evaluation against state-of-the-art pruning methods on models such as Mixtral 8×7B, Qwen1.5-2.7B, and DeepSeek-V2-Lite. The results indicate that MoEITS not only simplifies these models but also maintains or enhances their effectiveness across various benchmarks, demonstrating its potential as a sustainable approach in the Green AI paradigm.
Methodology
MoEITS employs a simplification process that analyzes expert redundancy using normalized mutual information. The algorithm simplifies MoE blocks by eliminating less relevant experts while preserving the knowledge of the original model through adjusted weight matrices. The effectiveness of MoEITS is validated through a series of experiments on various MoE-LLMs, comparing performance metrics against established benchmarks.
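A toy illustration of redundancy detection via NMI, here estimated from a joint histogram of discretized expert outputs (the paper's estimator and pruning rule may differ):

```python
import numpy as np

def nmi(x, y, bins=8):
    """Normalized mutual information between two activation streams,
    estimated from a joint histogram of their discretized values."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0

# Three toy "experts": 0 and 1 are near-duplicates, 2 is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=2000)
experts = [a, a + 0.01 * rng.normal(size=2000), rng.normal(size=2000)]
scores = {(i, j): nmi(experts[i], experts[j])
          for i in range(3) for j in range(i + 1, 3)}
most_redundant = max(scores, key=scores.get)   # candidate pair to simplify
```

High NMI between two experts signals that one can be removed (with weight adjustments to preserve the original model's knowledge) at little cost to accuracy.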
Results
The empirical evaluation demonstrates that MoEITS achieves significant reductions in model complexity while maintaining high accuracy across multiple benchmarks. It outperforms state-of-the-art pruning techniques, indicating its effectiveness in generating computationally efficient models without sacrificing performance.
Implications
The findings suggest that MoEITS can be a valuable tool for researchers and practitioners seeking to develop more sustainable AI systems. By reducing the computational and environmental impact of large language models, MoEITS aligns with the growing emphasis on Green AI, making it applicable in various domains where resource efficiency is critical.
Virtual Smart Metering in District Heating Networks via Heterogeneous Spatial-Temporal Graph Neural Networks
Graph Learning
Time Series
Optimization
- Introduction of HSTGNN for virtual smart metering in district heating networks.
- Development of a controlled laboratory dataset for benchmarking virtual sensing methods.
- Significant performance improvement over existing data-driven methods.
- Joint modeling of cross-variable and spatial correlations in thermal and hydraulic states.
Virtual Smart Metering in District Heating Networks via Heterogeneous Spatial-Temporal Graph Neural Networks
Summary
This paper addresses the challenges of monitoring district heating networks, which are often sparsely instrumented and affected by sensor faults. The authors propose a novel approach using a Heterogeneous Spatial-Temporal Graph Neural Network (HSTGNN) to create virtual smart heat meters. This model captures the complex relationships between pressure, flow, and temperature in these networks, allowing for improved observability and operational efficiency. The study also introduces a new controlled laboratory dataset that provides high-resolution, synchronized measurements, facilitating further research in this area. The experiments conducted demonstrate that the proposed HSTGNN significantly outperforms existing baseline methods, highlighting the effectiveness of graph-based learning in enhancing the operation of district heating systems.
Methodology
The authors developed a Heterogeneous Spatial-Temporal Graph Neural Network (HSTGNN) that incorporates the functional relationships in district heating networks. The model uses dedicated branches to learn both graph structures and temporal dynamics for flow, temperature, and pressure measurements, enabling the joint modeling of their interdependencies.
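A minimal per-branch aggregation sketch (a hypothetical plain-NumPy stand-in for the learned branches; the actual HSTGNN layers, learned graph structures, and temporal modules are more elaborate):

```python
import numpy as np

def branch_aggregate(A, X, W):
    """One graph-convolution-style pass for a single measurement type:
    average each node over its neighborhood, then mix features."""
    deg = A.sum(axis=1, keepdims=True)
    return (A / deg) @ X @ W

# Tiny 3-node pipe network (self-loops included), one branch per variable.
rng = np.random.default_rng(0)
A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
branches = {k: rng.normal(size=(3, 4)) for k in ("flow", "temperature", "pressure")}
weights = {k: rng.normal(size=(4, 4)) for k in branches}
fused = np.concatenate([branch_aggregate(A, X, weights[k])
                        for k, X in branches.items()], axis=1)
```

Concatenating the per-branch outputs is one simple way to let a downstream head model the cross-variable correlations the paper emphasizes.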
Results
The proposed HSTGNN model demonstrated a significant performance improvement over existing baseline methods in accurately estimating thermal and hydraulic states in district heating networks. The introduction of a new dataset allowed for systematic evaluation and validation of the model's effectiveness.
Implications
The findings suggest that HSTGNN can enhance the observability and operational efficiency of district heating systems, supporting the transition towards intelligent energy networks. This approach could facilitate better predictive control, optimization, and fault detection in thermal energy networks.
Consensus-based Recursive Multi-Output Gaussian Process
Robotics
Efficient ML
Theory
- CRMGP combines recursive updates with distributed information fusion to enhance scalability.
- The framework supports bounded per-step computation, making it suitable for real-time applications.
- It preserves cross-output correlations, allowing for improved performance in multi-output tasks.
- Experiments show that CRMGP outperforms traditional centralized Gaussian process models in predictive accuracy and uncertainty calibration.
Consensus-based Recursive Multi-Output Gaussian Process
Summary
The paper introduces a novel framework called Consensus-based Recursive Multi-Output Gaussian Process (CRMGP) that addresses the challenges of deploying multi-output Gaussian processes (MOGPs) in large-scale, distributed, and streaming environments. Traditional MOGPs, while effective in modeling vector-valued fields with uncertainty, suffer from high computational costs and centralized processing, making them impractical for real-time applications in multi-agent systems. The proposed CRMGP framework integrates recursive inference on shared basis vectors with neighbor-to-neighbor consensus updates, enabling fully distributed learning with bounded computational costs per step. This method preserves inter-output correlations and maintains calibrated uncertainty, which is crucial for applications in robotic systems and environmental monitoring. The authors validate the effectiveness of CRMGP through experiments on synthetic wind fields and real LiDAR data, demonstrating that it achieves competitive predictive performance and reliable uncertainty calibration compared to centralized models.
Methodology
The authors propose a framework that utilizes recursive inference techniques alongside consensus algorithms for distributed learning. Each agent maintains a local model and exchanges compact summaries of information with neighboring agents, allowing for efficient updates and preserving the relationships between multiple outputs. The framework is designed to handle streaming data and operates with bounded computational costs.
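The neighbor-to-neighbor consensus update can be illustrated with plain averaging under a row-stochastic weight matrix (a generic sketch; CRMGP fuses compact GP summaries rather than raw scalars):

```python
import numpy as np

def consensus_step(states, A):
    """One round of neighbor averaging: A is row-stochastic, so each agent
    replaces its summary with a convex combination of its neighbors'."""
    return A @ states

# Four agents on a ring with doubly stochastic (Metropolis-style) weights.
A = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
x = np.array([1.0, 3.0, 5.0, 7.0])   # each agent's local estimate
for _ in range(100):
    x = consensus_step(x, A)
# x converges to the network-wide average (4.0) with no central node.
```

Because every agent only ever touches its neighbors' summaries, the per-step cost stays bounded regardless of network size, which is the property the framework relies on.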
Results
The experiments conducted on synthetic wind fields and real LiDAR data indicate that CRMGP achieves competitive predictive performance compared to both centralized single-output and multi-output Gaussian Process models. Additionally, it demonstrates reliable uncertainty calibration, making it a robust alternative for multi-agent sensing applications.
Implications
The CRMGP framework has significant implications for real-time decision-making and control in multi-robot systems, environmental monitoring, and other applications where distributed data collection and processing are essential. Its ability to maintain uncertainty calibration while scaling to large networks makes it a valuable tool for future research and practical implementations in distributed learning environments.
Layerwise Dynamics for In-Context Classification in Transformers
Theory
Interpretability
Large Language Models
- Enforcing symmetry in transformers enhances interpretability and reveals the underlying algorithmic structure.
- A closed-form layerwise recursion is derived, demonstrating coupled dynamics between feature and label geometries.
- The symmetry-enforced approach predicts behavior across various classification tasks, including semi-supervised learning.
- The study provides a framework for understanding transformer dynamics beyond traditional optimization abstractions.
Layerwise Dynamics for In-Context Classification in Transformers
Summary
This paper investigates the inference-time dynamics of transformers in the context of in-context classification (ICL), revealing that the underlying algorithm is not merely a generic optimizer but a structured, geometry-driven process. By enforcing feature- and label-permutation symmetry at each layer, the authors derive a closed-form layerwise recursion that describes how feature and label geometries co-evolve during classification. This approach not only enhances interpretability by yielding structured weights but also allows for the extraction of a shared algorithmic motif that can be applied across various tasks. The study demonstrates that enforcing symmetry leads to a coupled mean-shift dynamic, which improves class separation and robustness in predictions. The findings suggest that the dynamics of transformers can be understood and predicted through this symmetry-enforced framework, extending its applicability to semi-supervised learning and other classification settings.
Methodology
The authors enforce feature- and label-permutation symmetry layer by layer within the transformer architecture. This is achieved by conjugating the attention block with random permutations over feature and label coordinates, ensuring that each layer implements the same computation across symmetric coordinate systems. The resulting dynamics are analyzed to derive a closed-form layerwise recursion for in-context classification.
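A toy version of the coupled mean-shift dynamic (illustrative only; the paper's closed-form recursion couples feature and label geometries in more detail):

```python
import numpy as np

def mean_shift_layer(X, labels, step=0.5):
    """One layerwise update: pull every token's feature toward the mean of the
    features sharing its label; class means stay fixed while within-class
    spread contracts by (1 - step) per layer."""
    X_new = X.copy()
    for c in np.unique(labels):
        m = labels == c
        X_new[m] += step * (X[m].mean(axis=0) - X[m])
    return X_new

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2)) + np.array([[4, 0]] * 10 + [[-4, 0]] * 10)
labels = np.array([0] * 10 + [1] * 10)
for _ in range(5):
    X = mean_shift_layer(X, labels)   # classes become tightly separated
```

Iterating this contraction is the sense in which layer depth improves class separation and robustness in the symmetry-enforced model.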
Results
The study finds that enforcing symmetry leads to a clear, interpretable layerwise recursion that describes the dynamics of the transformer during inference. This recursion supports the idea that the model's predictions are driven by a coupled mean-shift dynamic, which enhances class separation and robustness. Additionally, retraining transformers on various tasks confirms that the same symmetry-aligned weight structure and dynamics emerge consistently.
Implications
The findings suggest that transformers can be better understood and utilized in various classification tasks by enforcing symmetry, leading to improved interpretability and performance. This approach may also inform future research on semi-supervised learning and other settings where class separation is critical.
Offline Local Search for Online Stochastic Bandits
Optimization
Theory
- Introduces a framework for converting offline local search algorithms into online stochastic bandit algorithms.
- Achieves O(log³ T) regret, improving upon existing frameworks that yield polynomial regret.
- Demonstrates flexibility by applying the framework to various combinatorial optimization problems.
- Establishes conditions for local search neighborhoods to ensure effective online performance.
Offline Local Search for Online Stochastic Bandits
Summary
This paper investigates the conversion of offline local search algorithms into online stochastic bandit algorithms, focusing on combinatorial multi-armed bandits. The authors propose a generic method that allows local search methods, which have been under-explored in the context of bandit algorithms, to be adapted for online decision-making scenarios. The key contribution is the establishment of a framework that guarantees O(log³ T) regret for online algorithms derived from offline local search algorithms that terminate at approximately optimal solutions. The paper demonstrates the applicability of this framework across three combinatorial optimization problems: scheduling to minimize total completion time, finding a minimum cost base of a matroid, and uncertain clustering. The results indicate that local search algorithms can effectively minimize regret in online settings, providing a new avenue for leveraging offline algorithm design in online learning environments.
Methodology
The authors develop a generic local search algorithm that iteratively improves solutions based on a neighborhood structure and a cost function. They establish conditions for (β, γ)-improving moves, allowing the conversion of offline local search into online algorithms with bandit feedback. The methodology includes theoretical analysis to derive regret bounds based on the properties of the local search algorithm and its neighborhood.
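A minimal sketch of the conversion on the scheduling example: adjacent-swap local search for total completion time, driven only by averaged noisy cost evaluations in place of exact costs (the paper's elimination-style sampling and confidence bounds are omitted; the `pulls` budget is an assumption):

```python
import numpy as np

def total_completion_time(order, p):
    """Sum of job completion times when jobs run in the given order."""
    return float(np.cumsum(p[order]).sum())

def noisy_cost(order, p, rng, pulls=50):
    """Bandit feedback: only noisy cost samples are available, so average them."""
    return np.mean([total_completion_time(order, p) + rng.normal(0.0, 1.0)
                    for _ in range(pulls)])

def local_search(p, rng, max_iters=100):
    """Adjacent-swap local search driven by noisy estimates; its local optimum
    for total completion time is the shortest-processing-time order."""
    order = np.arange(len(p))
    for _ in range(max_iters):
        improved = False
        for i in range(len(p) - 1):
            cand = order.copy()
            cand[i], cand[i + 1] = cand[i + 1], cand[i]
            if noisy_cost(cand, p, rng) < noisy_cost(order, p, rng):
                order, improved = cand, True
        if not improved:
            break
    return order

rng = np.random.default_rng(0)
p = np.array([5.0, 1.0, 4.0, 2.0, 3.0])
best = local_search(p, rng)
```

Each comparison plays two "arms" (orderings) and keeps the empirically better one; controlling how many pulls each comparison gets is what yields the logarithmic regret bound.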
Results
The main result shows that local search algorithms with (β, γ)-improving neighborhoods can achieve γ-regret with O(log³ T) dependence on the number of rounds T. This result is significant as it provides a more efficient regret bound compared to existing methods, which typically yield polynomial regret.
Implications
The findings suggest that local search methods can be effectively utilized in online learning scenarios, potentially leading to improved decision-making in various applications such as scheduling, resource allocation, and clustering. This work opens up new avenues for research in combining offline algorithm design with online learning frameworks.
Task2vec Readiness: Diagnostics for Federated Learning from Pre-Training Embeddings
Federated Learning
- Introduction of Task2Vec Readiness as a diagnostic tool for federated learning.
- Utilization of unsupervised metrics derived from Task2Vec embeddings to assess federation alignment.
- Demonstrated strong correlation between readiness indices and final FL performance across various datasets.
- Framework provides actionable guidance for client selection in heterogeneous federations.
Task2vec Readiness: Diagnostics for Federated Learning from Pre-Training Embeddings
Summary
This paper addresses the challenge of predicting the performance of federated learning (FL) systems, which are often affected by client heterogeneity. The authors propose a novel framework called Task2Vec Readiness, which utilizes Task2Vec embeddings to derive readiness indices that quantify the alignment of a federation prior to training. These indices are based on unsupervised metrics such as cohesion, dispersion, and density, calculated from client embeddings. The framework is evaluated across multiple datasets (CIFAR-10, FEMNIST, PathMNIST, BloodMNIST) and varying client counts (10-20) under different levels of Dirichlet heterogeneity. The correlation analyses reveal significant relationships between the readiness indices and final model performance, often exceeding 0.9 in Pearson and Spearman coefficients. This validates the effectiveness of Task2Vec-based readiness as a pre-training diagnostic tool for FL, providing insights for client selection and enhancing the efficiency of federated training.
Methodology
The authors compute Task2Vec embeddings for each client's data distribution using Fisher Information, transforming them into a fixed-dimensional representation. They then derive readiness indices based on unsupervised metrics: cohesion (average cosine similarity), dispersion (average distance from centroid), and density (RBF-kernel similarity). These metrics are evaluated across multiple datasets and client configurations, with correlation analyses performed to assess their predictive validity.
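The three readiness metrics are straightforward to compute from a matrix of client embeddings; this is a direct sketch of the definitions given above, with the RBF bandwidth `gamma` as an assumed hyperparameter:

```python
import numpy as np

def readiness_indices(E, gamma=1.0):
    """Cohesion, dispersion, and density from client embeddings E of shape (n, d)."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    iu = np.triu_indices(len(E), k=1)
    cohesion = (En @ En.T)[iu].mean()            # avg pairwise cosine similarity
    centroid = E.mean(axis=0)
    dispersion = np.linalg.norm(E - centroid, axis=1).mean()
    d2 = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)
    density = np.exp(-gamma * d2)[iu].mean()     # avg RBF-kernel similarity
    return cohesion, dispersion, density
```

High cohesion and density with low dispersion indicate a well-aligned federation; the correlation analyses in the paper relate exactly these scalars to final FL performance.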
Results
The study finds consistent and significant correlations between the Task2Vec-based readiness indices and the final performance of federated learning models, with Pearson and Spearman coefficients often exceeding 0.9. This indicates that the readiness indices can reliably predict FL outcomes across different datasets and levels of client heterogeneity.
Implications
The Task2Vec Readiness framework offers a practical tool for practitioners in federated learning, enabling them to anticipate the success of their federations before training begins. This can lead to more efficient client selection, reduced trial-and-error experimentation, and improved overall performance in federated training scenarios.
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
Computer Vision
NLP
Generative Models
- ECHO achieves efficient one-step report generation for chest X-rays, significantly reducing inference latency.
- The Direct Conditional Distillation (DCD) framework enables coherent outputs by addressing mean-field bias.
- Response-Asymmetric Diffusion (RAD) enhances training efficiency and model effectiveness.
- ECHO surpasses existing autoregressive methods in performance metrics while maintaining clinical accuracy.
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
Summary
The paper presents ECHO, an innovative approach to chest X-ray report generation (CXR-RG) that addresses the high inference latency associated with traditional autoregressive vision-language models (VLMs). While diffusion-based models offer parallel generation capabilities, they typically require multiple denoising iterations, which can compromise output coherence. ECHO introduces a Direct Conditional Distillation (DCD) framework that allows for stable one-step-per-block inference by mitigating the mean-field bias inherent in token-factorized denoisers. Additionally, the Response-Asymmetric Diffusion (RAD) training strategy enhances training efficiency without sacrificing model effectiveness. Experimental results demonstrate that ECHO significantly outperforms state-of-the-art autoregressive methods, achieving a 64.33% improvement in RaTE and a 60.58% increase in SemScore, while also providing an 8× speedup in inference time without compromising clinical accuracy.
Methodology
ECHO employs a novel Direct Conditional Distillation (DCD) framework to construct unfactorized supervision from on-policy diffusion trajectories, allowing for stable one-step-per-block inference. The Response-Asymmetric Diffusion (RAD) adaptation is also introduced to optimize training efficiency. The model is trained on curated data to enhance its performance in generating comprehensive CXR reports.
Results
ECHO outperforms state-of-the-art autoregressive models, achieving a 64.33% improvement in RaTE and a 60.58% increase in SemScore. The model also provides an 8× speedup in inference time, demonstrating its efficiency without compromising clinical accuracy.
Implications
The advancements presented in ECHO could significantly alleviate the workload of radiologists by enabling rapid and accurate chest X-ray report generation, thus improving clinical workflows and patient care.
From Dispersion to Attraction: Spectral Dynamics of Hallucination Across Whisper Model Scales
Audio & Speech
Theory
Interpretability
- Introduction of the Spectral Sensitivity Theorem to explain hallucinations in ASR models.
- Identification of two regimes: Structural Disintegration in smaller models and Compression-Seeking Attractor in larger models.
- Validation of theoretical predictions through eigenspectral analysis of Whisper models under adversarial stress.
- Demonstration that standard performance metrics may not adequately predict hallucination onset.
From Dispersion to Attraction: Spectral Dynamics of Hallucination Across Whisper Model Scales
Summary
This paper addresses the critical issue of hallucinations in large Automatic Speech Recognition (ASR) models, particularly the Whisper models. The authors introduce the Spectral Sensitivity Theorem, which predicts a phase transition in deep networks from a dispersive regime (characterized by signal decay) to an attractor regime (rank-1 collapse). This transition is influenced by layer-wise gain and alignment. The authors validate their theory by analyzing the eigenspectra of activation graphs in Whisper models of varying sizes under adversarial conditions. The findings reveal that intermediate-sized models experience Structural Disintegration, with a 13.4% collapse in Cross-Attention rank, while larger models enter a Compression-Seeking Attractor state, where Self-Attention compresses rank by 2.34% and hardens the spectral slope, leading to a decoupling from acoustic evidence. The study emphasizes the importance of understanding internal representational dynamics to address hallucinations in ASR systems.
Methodology
The authors developed a theoretical framework based on the Spectral Sensitivity Theorem, modeling signal propagation in Transformer networks as a discrete dynamical system. They analyzed the spectral properties of the accumulated context-sensitivity network Jacobian to characterize how acoustic information is preserved or suppressed across different model sizes. The analysis involved eigenspectral examination of activation graphs in Whisper models under adversarial stress.
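The kind of spectral diagnostics used here can be sketched with an entropy-based effective rank and a log-log spectral slope (generic definitions, not necessarily the paper's exact estimators):

```python
import numpy as np

def effective_rank(M):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution (1 for rank-1 collapse)."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def spectral_slope(M):
    """Log-log slope of the singular-value decay; a more negative slope
    means a 'harder', more compressed spectrum."""
    s = np.linalg.svd(M, compute_uv=False)
    k = np.arange(1, len(s) + 1)
    return float(np.polyfit(np.log(k), np.log(s + 1e-12), 1)[0])
```

Tracking these two scalars for attention matrices across layers is one way to detect the rank collapse and slope hardening the paper associates with hallucination onset.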
Results
The analysis confirmed the theoretical predictions: smaller Whisper models exhibited a 13.4% collapse in Cross-Attention rank, indicating Structural Disintegration, while larger models showed a 2.34% rank compression in Self-Attention, entering a Compression-Seeking Attractor state. This transition highlights a significant change in how these models process acoustic information as they scale.
Implications
The findings suggest that as ASR models scale, their internal dynamics change significantly, leading to potential safety risks due to hallucinations. Understanding these dynamics can inform the design of more robust ASR systems and improve the interpretability of model behavior under adversarial conditions.
Toward World Models for Epidemiology
Time Series
- Introduces a conceptual framework for epidemiological world models.
- Reframes epidemic decision-making to incorporate latent states and human behavior.
- Presents three case studies demonstrating the utility of world models in policy analysis.
- Highlights the limitations of traditional epidemiological models in dynamic environments.
Toward World Models for Epidemiology
Summary
This paper proposes a novel framework for integrating world models into computational epidemiology, emphasizing the need for a more dynamic understanding of epidemic processes. The authors argue that traditional epidemiological models often overlook the complexities of latent disease states, noisy observations, and the impact of human behavior on intervention outcomes. By framing epidemics as controlled, partially observed dynamical systems, the paper introduces a conceptual framework that allows for better decision-making under uncertainty. The authors present three case studies that highlight the necessity of world modeling in epidemiology: strategic misreporting in surveillance, delays in reporting signals, and counterfactual analysis of interventions. These case studies demonstrate how world models can enhance policy-relevant reasoning and improve the robustness of decisions in the face of incomplete data and adaptive human responses.
Methodology
The authors develop a formal conceptual framework that treats epidemics as controlled, partially observed dynamical systems. They utilize case studies to illustrate the application of this framework, focusing on strategic misreporting, time-lagged signals, and counterfactual intervention analysis.
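The "controlled, partially observed" framing can be made concrete with a toy SIR model whose infections are only seen through an under-reported, delayed signal (illustrative parameters, not from the paper):

```python
def sir_step(s, i, r, beta=0.3, gamma=0.1):
    """One day of deterministic SIR dynamics on population fractions."""
    new_inf = beta * s * i
    new_rec = gamma * i
    return s - new_inf, i + new_inf - new_rec, r + new_rec

s, i, r = 0.99, 0.01, 0.0
report_rate, delay = 0.4, 3
buf = [0.0] * delay            # reporting pipeline: today's cases surface later
latent, observed = [], []
for _ in range(60):
    s, i, r = sir_step(s, i, r)
    latent.append(i)
    buf.append(report_rate * i)
    observed.append(buf.pop(0))  # the under-reported, delayed signal policy sees
```

The gap between `latent` and `observed` is exactly the state-estimation problem a world model must solve before any counterfactual intervention analysis is trustworthy.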
Results
The case studies reveal that incorporating world models allows for more accurate predictions and better policy decisions by addressing the complexities of human behavior and the dynamic nature of epidemiological data. The findings suggest that traditional models may be insufficient for capturing the nuances of epidemic dynamics.
Implications
The proposed framework has significant implications for public health policy and decision-making, particularly in the context of pandemic response. By leveraging world models, policymakers can better understand the potential outcomes of various interventions and adapt strategies based on real-time data and human behavior.
Bringing Clustering to MLL: Weakly-Supervised Clustering for Partial Multi-Label Learning
Theory
Optimization
- Introduction of a novel membership matrix decomposition that resolves incompatibility between clustering and multi-label scenarios.
- Development of a three-stage weakly-supervised clustering framework that optimizes pseudo-labels and class prototypes.
- Implementation of an adaptive confidence mechanism that adjusts supervision strength based on prototype-distance relationships.
- Demonstration of superior performance over existing methods on multiple datasets.
Bringing Clustering to MLL: Weakly-Supervised Clustering for Partial Multi-Label Learning
Summary
This paper addresses the challenges posed by label noise in multi-label learning (MLL), particularly in the context of partial multi-label learning (PML), where candidate labels may include both relevant and irrelevant labels. The authors propose a novel weakly-supervised clustering approach, termed WSC-PML, which integrates clustering with multi-label learning through a unique membership matrix decomposition. This decomposition separates the clustering membership matrix into two components: one that maintains clustering constraints and another that preserves multi-label characteristics. The WSC-PML framework consists of three stages: initial prototype learning from noisy labels, adaptive confidence-based weak supervision construction, and joint optimization via iterative clustering refinement. The proposed method effectively leverages data structure for noise identification and significantly improves label noise handling in PML scenarios. Experimental results on 24 datasets demonstrate that WSC-PML outperforms six state-of-the-art methods across all evaluation metrics, showcasing its effectiveness in real-world applications where label noise is prevalent.
Methodology
The methodology involves a three-stage process: (1) initial learning of prototypes from noisy candidate labels, (2) construction of adaptive weak supervision based on the distances to prototypes, and (3) joint optimization of pseudo-labels and class prototypes through iterative clustering refinement. The key innovation is the decomposition of the clustering membership matrix into two components to facilitate the integration of clustering with multi-label learning.
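The adaptive confidence step in stage (2) can be sketched concretely. This is a hedged illustration only: the softmax-over-negative-distances weighting, the `temperature` parameter, and the toy prototypes below are assumptions made for the example, not WSC-PML's published formula.

```python
import numpy as np

def adaptive_confidence(x, prototypes, candidate_mask, temperature=1.0):
    """Weight each candidate label by its class prototype's proximity to
    the sample; non-candidate labels get zero weight. Softmax over negative
    distances is an illustrative choice, not the paper's exact function."""
    dists = np.linalg.norm(prototypes - x, axis=1)        # distance to each prototype
    logits = -dists / temperature
    logits = np.where(candidate_mask, logits, -np.inf)    # mask out non-candidates
    w = np.exp(logits - logits[candidate_mask].max())     # stable softmax
    return w / w.sum()

# Hypothetical 2-D prototypes for three classes; labels 0 and 1 are candidates.
prototypes = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
x = np.array([0.1, 0.2])
conf = adaptive_confidence(x, prototypes, np.array([True, True, False]))
```

The sample sits nearest prototype 0, so that candidate receives the largest weight, while the non-candidate class is excluded outright.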
Results
The experimental evaluation on 24 datasets shows that WSC-PML consistently outperforms six state-of-the-art methods across all metrics, indicating its effectiveness in handling label noise and improving the accuracy of multi-label predictions.
Implications
The findings suggest that WSC-PML can be applied in various domains where multi-label learning is relevant, such as image annotation, text categorization, and medical diagnosis, particularly in scenarios where label noise is a significant concern. This approach opens new avenues for structure-aware multi-label learning, enhancing the robustness of models in real-world applications.
Heterogeneous Connectivity in Sparse Networks: Fan-in Profiles, Gradient Hierarchy, and Topological Equilibria
Theory
Efficient ML
Optimization
- Introduction of Profiled Sparse Networks (PSN) for structured heterogeneous sparsity.
- Static connectivity structures do not significantly affect accuracy at matched parameter counts.
- Fan-in coefficient of variation (CV) predicts gradient concentration, indicating structural importance.
- Lognormal initialization based on equilibrium fan-in distribution outperforms standard methods.
Summary
This paper introduces Profiled Sparse Networks (PSN), which replace uniform connectivity in neural networks with deterministic, heterogeneous fan-in profiles defined by continuous, nonlinear functions. This approach allows for the creation of neurons with both dense and sparse receptive fields. The authors benchmark PSN across four classification datasets (MNIST, Fashion-MNIST, EMNIST, and Forest Cover) with varying input dimensions and network depths. The results show that at 90% sparsity, static profiles, including a uniform random baseline, achieve accuracy within 0.2–0.6% of dense baselines across all datasets, indicating that arbitrary hub placement does not confer an accuracy advantage. The study also reveals that structured profiles lead to a 2–5× concentration of gradients at hub neurons compared to uniform distributions. Furthermore, initializing RigL dynamic sparse training with lognormal profiles matched to the equilibrium fan-in distribution consistently outperforms standard ERK initialization, particularly on more challenging tasks. The findings suggest that the placement of hub neurons is more critical than the degree of connectivity variance, emphasizing the importance of task-aligned hub placement over random configurations.
Methodology
The authors developed PSN by assigning per-neuron fan-in according to continuous nonlinear profile functions, allowing for controlled exploration of connectivity designs. They conducted experiments on four classification datasets, analyzing the impact of different sparsity levels and fan-in distributions on model performance and gradient behavior.
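A rough sketch of how a fan-in profile and its coefficient of variation might be computed. The `profiled_fanin` helper, the exponential profile, and the rounding scheme are assumptions for illustration, not the paper's exact construction:

```python
import numpy as np

def profiled_fanin(n_out, n_in, sparsity, profile):
    """Assign per-neuron fan-in from a continuous profile f: [0,1] -> R+,
    rescaled so the layer meets a target sparsity. Illustrative sketch;
    the paper's profile functions and rescaling may differ."""
    pos = np.linspace(0.0, 1.0, n_out)
    raw = profile(pos)
    budget = (1.0 - sparsity) * n_out * n_in            # total connections allowed
    fanin = np.round(raw / raw.sum() * budget).astype(int)
    return np.clip(fanin, 1, n_in)                      # every neuron keeps >= 1 input

# Exponential profile: a few densely connected "hub" neurons at one end.
fanin = profiled_fanin(n_out=100, n_in=784, sparsity=0.9,
                       profile=lambda p: np.exp(3.0 * p))
cv = fanin.std() / fanin.mean()   # fan-in coefficient of variation
```

A uniform profile would give `cv` near zero; heavier-tailed profiles raise it, which is the quantity the paper correlates with gradient concentration.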
Results
The study found that PSN maintains competitive accuracy at high sparsity levels (80–99.9%) without significant performance loss compared to dense networks. The internal gradient analysis showed a strong correlation (r = 0.93) between fan-in CV and gradient concentration. Additionally, initializing RigL with lognormal profiles led to improved performance on harder tasks, with notable accuracy gains over standard initialization methods.
Implications
The findings suggest that designing neural network architectures with structured heterogeneous connectivity can enhance training efficiency and performance. This approach could inform future neural network designs and training methodologies, particularly in scenarios requiring high sparsity.
Below-ground Fungal Biodiversity Can be Monitored Using Self-Supervised Learning Satellite Features
Multimodal
Time Series
Efficient ML
- Self-supervised learning (SSL) can effectively predict below-ground ectomycorrhizal fungal richness using satellite imagery.
- The proposed method achieves a 10,000-fold increase in spatial resolution compared to traditional biodiversity monitoring techniques.
- SSL-derived features are more informative than conventional climate, soil, and land cover datasets for predicting fungal diversity.
- The study enables temporal monitoring of fungal biodiversity, revealing trends in diversity loss in ancient forests.
Summary
This paper addresses the challenge of monitoring below-ground fungal biodiversity, specifically ectomycorrhizal fungi, which are crucial for ecosystem functioning. Traditional methods of assessing fungal diversity are often limited by high costs and logistical challenges, leading to significant gaps in biodiversity protection. The authors propose a novel approach using self-supervised learning (SSL) applied to satellite imagery to predict fungal richness across diverse environments. Their models explain over half the variance in species richness based on approximately 12,000 field samples from Europe and Asia. The study demonstrates that SSL-derived features are the most informative predictors of fungal diversity, outperforming traditional climate, soil, and land cover datasets. The approach achieves a remarkable increase in spatial resolution from 1 km to 10 m, allowing for detailed habitat-scale observations. Additionally, the dynamic nature of satellite observations enables temporal monitoring of biodiversity for the first time. The authors analyze trends in predicted fungal richness in UK National Park woodlands, revealing concerning declines in ectomycorrhizal diversity in ancient forests. This research establishes SSL satellite features as a scalable tool for creating continuous, high-resolution biodiversity maps, significantly enhancing our ability to monitor the hidden components of terrestrial ecosystems.
Methodology
The authors utilized a Barlow Twins-based self-supervised learning model called Tessera to extract features from Sentinel-1 and Sentinel-2 satellite imagery. They predicted ectomycorrhizal fungal richness using a dataset of approximately 12,000 field samples, systematically evaluating the model's performance against traditional environmental predictors and analyzing geographic patterns of prediction errors.
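A minimal stand-in for the prediction step, with synthetic 128-dimensional vectors in place of real Tessera embeddings and plain ridge regression in place of whatever regressor the authors actually used; all data below is simulated, not from the study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for SSL embeddings: one 128-d feature vector per 10 m pixel.
X = rng.normal(size=(500, 128))
true_w = rng.normal(size=128)
y = X @ true_w + rng.normal(scale=0.5, size=500)   # proxy for fungal richness

# Closed-form ridge regression from features to richness.
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(128), X.T @ y)
r2 = 1.0 - np.sum((y - X @ w) ** 2) / np.sum((y - y.mean()) ** 2)
```

The paper's headline result, explaining over half the variance in richness, corresponds to `r2 > 0.5` on held-out field samples rather than on training data as in this toy.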
Results
The models explained over 50% of the variance in ectomycorrhizal fungal richness. SSL-derived features were found to be the most informative predictors, significantly outperforming traditional datasets. The approach allowed for a spatial resolution increase from 1 km to 10 m, facilitating detailed habitat-scale observations and enabling the first temporal monitoring of below-ground biodiversity.
Implications
This research has significant implications for biodiversity conservation and ecological monitoring, providing a scalable and efficient method to map and track below-ground fungal communities. It highlights the potential of remote sensing and machine learning in ecological studies, particularly in addressing the challenges of monitoring hidden biodiversity in terrestrial ecosystems.
QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
Large Language Models
- Introduction of QuanBench+, a multi-framework benchmark for quantum code generation.
- Evaluation of LLMs across Qiskit, PennyLane, and Cirq with 42 aligned tasks.
- Performance metrics include Pass@1, Pass@5, and feedback-based repair outcomes.
- Results show that while models perform better with feedback, they still struggle with cross-framework reliability.
Summary
The paper presents QuanBench+, a comprehensive benchmark designed to evaluate Large Language Models (LLMs) in the context of quantum code generation across multiple frameworks, specifically Qiskit, PennyLane, and Cirq. The authors argue that existing benchmarks primarily focus on single frameworks, which complicates the assessment of quantum reasoning capabilities independent of framework familiarity. QuanBench+ introduces 42 aligned tasks that encompass quantum algorithms, gate decomposition, and state preparation, allowing for a fair comparison of model performance across different quantum programming environments. The evaluation methodology includes executable functional tests, reporting metrics such as Pass@1 and Pass@5, and employing KL-divergence-based acceptance for probabilistic outputs. Additionally, the paper explores the impact of feedback-based repair on model performance, revealing that while initial one-shot scores are promising, significant room for improvement remains. The findings indicate that while there has been progress in quantum code generation, the reliability of models across frameworks is still limited and heavily influenced by framework-specific knowledge.
Methodology
The authors developed a unified benchmark, QuanBench+, which includes 42 tasks aligned across three quantum programming frameworks. They employed executable functional tests to evaluate model outputs, using metrics like Pass@1 and Pass@5, and incorporated KL-divergence for assessing probabilistic outputs. The study also investigated the effects of feedback-based repair on model performance.
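Pass@k is commonly computed with the unbiased estimator of Chen et al. (2021); assuming QuanBench+ follows that convention (the summary does not specify), a sketch of both metrics, with the smoothing `eps` in the KL check an illustrative choice:

```python
import numpy as np
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two measurement distributions, smoothed so
    empty bins do not blow up."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

p1 = pass_at_k(5, 2, 1)   # 2 of 5 generations correct -> Pass@1 = 0.4
p5 = pass_at_k(5, 2, 5)   # k = n: any correct generation guarantees a pass
kl = kl_divergence([0.50, 0.50], [0.52, 0.48])  # small divergence -> accept
```

A probabilistic circuit output would then be accepted when its measured distribution's KL divergence from the reference falls below a chosen threshold.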
Results
The strongest one-shot performance scores achieved were 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane. After implementing feedback-based repair, these scores improved significantly to 83.3%, 76.2%, and 66.7%, respectively. Despite these improvements, the results indicate that reliable multi-framework quantum code generation remains a challenge, heavily reliant on specific framework knowledge.
Implications
The findings suggest that while LLMs show potential in generating quantum code, there is a critical need for advancements in their reasoning capabilities to enhance cross-framework reliability. This benchmark can serve as a foundation for future research aimed at improving quantum programming tools and methodologies.
A Closer Look at the Application of Causal Inference in Graph Representation Learning
Graph Learning
Theory
- Aggregation of graph elements into single causal variables violates causal inference assumptions.
- A new theoretical model is proposed that adheres to the premises of causal inference.
- A synthetic dataset mimicking real-world causal structures is created for empirical validation.
- A plug-and-play causal modeling enhancement module is developed for graph learning pipelines.
Summary
This paper addresses the challenges of modeling causal relationships in graph representation learning, highlighting the limitations of existing approaches that aggregate diverse graph elements into single causal variables. The authors prove that such aggregation compromises causal validity and propose a new theoretical model, built on the smallest indivisible units of graph data, that restores it. They analyze the costs associated with achieving precise causal modeling and identify conditions for simplification. To support their theory, the authors construct a synthetic dataset that reflects real-world causal structures and conduct extensive experiments. Additionally, they develop a causal modeling enhancement module that integrates seamlessly into existing graph learning pipelines, demonstrating its effectiveness through comparative experiments.
Methodology
The authors develop a theoretical model based on indivisible graph data units, conduct analyses and proofs regarding causal modeling, create a synthetic dataset for empirical validation, and implement a causal modeling enhancement module for integration into existing systems.
Results
The proposed model ensures causal validity in graph representation learning, while the synthetic dataset and experiments validate the theoretical findings. The enhancement module shows improved performance in causal relationship modeling within graph learning frameworks.
Implications
This work has significant implications for improving the accuracy of causal inference in graph representation learning, which is crucial for applications in recommendation systems, drug discovery, and social network analysis. The findings can enhance the reliability of AI systems that rely on graph data.
Transformers Learn Latent Mixture Models In-Context via Mirror Descent
NLP
Large Language Models
Theory
- Introduces a framework for in-context learning based on latent variables using Mixture of Transition Distributions.
- Demonstrates that transformers can implement Mirror Descent to learn latent mixture weights from context.
- Proves that the one-step estimator from the transformer is a first-order approximation of the Bayes-optimal predictor.
- Empirical results show that transformers trained from scratch match predictive distributions and attention patterns consistent with the proposed framework.
Summary
This paper investigates how transformers can infer latent structures in sequence modeling through in-context learning (ICL). The authors formalize the task of estimating token importance as an in-context learning problem using a framework based on Mixture of Transition Distributions (MTD). They propose that a latent variable governs the influence of past tokens on the next token, with unobserved mixture weights that transformers learn in-context. The authors demonstrate that transformers can implement Mirror Descent to learn these weights and provide an explicit construction of a three-layer transformer that performs one step of this algorithm. They prove that this estimator approximates the Bayes-optimal predictor and empirically validate that transformers trained from scratch can learn solutions consistent with their theoretical framework. The results show that transformers effectively infer which past tokens are relevant, enhancing their ability to capture higher-order dependencies in language, thus providing a new algorithmic perspective on latent-variable inference in attention-based models.
Methodology
The authors developed a synthetic task based on the Mixture of Transition Distributions model, framing token importance estimation as learning latent mixture weights in-context. They constructed a three-layer transformer that implements one step of Mirror Descent and validated its learnability through empirical experiments.
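Mirror descent on the simplex with the negative-entropy mirror map is the exponentiated-gradient update. A toy sketch of recovering latent MTD mixture weights this way; the two transition matrices, step size, and the use of population expectations in place of an actual sampled context are all invented for the demo, not taken from the paper:

```python
import numpy as np

def mirror_descent_step(w, grad, eta=0.2):
    """Mirror descent on the probability simplex with the negative-entropy
    mirror map, i.e. an exponentiated-gradient update."""
    w_new = w * np.exp(-eta * grad)
    return w_new / w_new.sum()

# Toy Mixture of Transition Distributions over a 3-token vocabulary:
# two transition matrices mixed with hidden weights (illustrative numbers).
P = np.array([[[0.80, 0.10, 0.10], [0.10, 0.80, 0.10], [0.10, 0.10, 0.80]],
              [[0.10, 0.45, 0.45], [0.45, 0.10, 0.45], [0.45, 0.45, 0.10]]])
w_true = np.array([0.7, 0.3])
q_star = np.einsum('k,kst->st', w_true, P)   # population next-token law

# Recover the latent mixture weights by mirror descent on the population NLL.
w = np.array([0.5, 0.5])
for _ in range(200):
    mix = np.einsum('k,kst->st', w, P)
    grad = -np.einsum('st,kst->k', q_star / mix, P) / 3.0  # uniform state prior
    w = mirror_descent_step(w, grad)
```

The iterate stays on the simplex by construction and converges to the true weights, mirroring the role the paper assigns to the transformer's one-step (and multi-step) approximations.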
Results
The empirical validation showed that transformers trained from scratch with Adam optimizer learned solutions that aligned with the theoretical framework, demonstrating predictive distributions and attention patterns that matched the proposed construction. Deeper models achieved performance comparable to multi-step Mirror Descent.
Implications
This work provides insights into how transformers can dynamically infer causal relationships in language, potentially enhancing their application in natural language processing tasks that require understanding complex dependencies. It also offers a new perspective on the interpretability of large language models and their learning mechanisms.
Tracking High-order Evolutions via Cascading Low-rank Fitting
Generative Models
Theory
Efficient ML
- Introduces cascading low-rank fitting for modeling high-order dynamics in generative models.
- Proves that under linear decomposability, the ranks of high-order derivatives are monotonically non-increasing.
- Presents a computationally efficient algorithm for implementing the proposed method.
- Demonstrates applicability to modern attention mechanisms in generative frameworks.
Summary
This paper addresses the challenge of modeling high-order dynamics in generative modeling, particularly in diffusion models used for visual generation. Traditional methods focus on first-order dynamics, but the authors propose a novel approach called cascading low-rank fitting, which allows for the efficient approximation of higher-order derivatives without the need for separate neural networks for each order. By leveraging ordinary differential equations (ODEs), the method utilizes a shared base function augmented with low-rank components to represent successive derivatives. Theoretical analysis demonstrates that under certain conditions, the ranks of these derivatives are guaranteed to be monotonically non-increasing, thus improving parameter efficiency. The authors also provide an algorithm for efficient computation of this approach, which can be adapted to modern attention mechanisms. This work contributes to the understanding of rank dynamics in high-order derivatives and offers a scalable solution for generative modeling tasks.
Methodology
The authors develop a method called cascading low-rank fitting, which approximates higher-order derivatives using a shared base function and sequentially accumulated low-rank components. The approach is grounded in ordinary differential equations and employs polynomial approximation techniques to enhance its applicability to modern neural architectures.
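A minimal numpy sketch of the core idea: represent the next-order term as a shared base plus a low-rank correction. Truncated SVD of the residual is our illustrative fitting choice and the matrices are synthetic; the paper's actual cascading procedure may differ:

```python
import numpy as np

def lowrank_correction(base, target, rank):
    """Fit target ~= base + U @ V.T with a rank-r correction, via truncated
    SVD of the residual (a stand-in for the paper's cascading fit)."""
    U, s, Vt = np.linalg.svd(target - base, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank].T

rng = np.random.default_rng(0)
D1 = rng.normal(size=(64, 64))    # first-order term, used as the shared base
# Second-order term differing from the base by an exactly rank-2 increment.
D2 = D1 + rng.normal(size=(64, 2)) @ rng.normal(size=(2, 64))

U, V = lowrank_correction(D1, D2, rank=2)
err = np.linalg.norm(D1 + U @ V.T - D2)   # ~0: the residual is exactly rank 2
```

Cascading repeats this step order by order, which is cheap precisely when, as the paper's monotonicity result predicts, the required ranks do not grow.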
Results
The theoretical analysis confirms that if the initial matrix difference is linearly decomposable, the ranks of the high-order derivatives will not increase, providing a structured way to manage the complexity of generative models. The proposed algorithm is shown to efficiently compute these derivatives, leading to improved performance in generative tasks.
Implications
This work has significant implications for generative modeling, particularly in video generation, where understanding and efficiently modeling high-order dynamics can enhance the quality and realism of generated outputs. The findings may also influence the design of neural architectures that leverage attention mechanisms.
Structural Consequences of Policy-Based Interventions on the Global Supply Chain Network
Theory
Optimization
Graph Learning
- Friendshoring increases globalization by enhancing supply links among allied countries.
- Country Plus One policy enhances network density through redundant links.
- Reshoring creates challenges in the EV sector due to irreplaceable products.
- The impact of these policies varies across different industries.
Summary
This paper investigates the impact of recent policy-based interventions on the global electric vehicle (EV) supply chain network, particularly in the context of rising geopolitical tensions and the need for supply chain resilience. The authors analyze three key policies: Country Plus One, Friendshoring, and Reshoring. The study reveals that Friendshoring unexpectedly increases globalization by enhancing supply links among allied countries, which may lead to higher transaction costs. Similarly, the Country Plus One policy increases network density through redundant connections, while Reshoring presents challenges in the EV sector due to the prevalence of irreplaceable products. The effects of these policies are found to vary across different industries, with mining goods being less affected by the Country Plus One policy compared to Friendshoring. The research highlights the need for a deeper understanding of the structural and systemic consequences of localized supply chain interventions, as they may inadvertently introduce vulnerabilities or inefficiencies into the global supply chain network.
Methodology
The authors conducted a network analysis of the global EV supply chain using supply chain data to evaluate the structural impacts of the Country Plus One, Reshoring, and Friendshoring policies. They examined how these interventions affect network topology, robustness, and the propagation of disruptions.
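Network density, one of the structural quantities such an analysis tracks, is easy to state concretely. A toy sketch with an invented four-country chain and hypothetical "Plus One" redundant links (none of this data is from the study):

```python
import numpy as np

def density(adj):
    """Directed network density: realized links over possible links."""
    n = adj.shape[0]
    return adj.sum() / (n * (n - 1))

# Toy 4-country supply network; row supplies column (hypothetical data).
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
d0 = density(A)

# "Country Plus One": each buyer adds one redundant alternative supplier.
A_plus = A.copy()
A_plus[3, 1] = 1   # hypothetical redundant links
A_plus[0, 2] = 1
A_plus[1, 3] = 1
d1 = density(A_plus)
```

Adding redundant suppliers raises density, the effect the paper attributes to the Country Plus One policy, while changing nothing about which products remain irreplaceable.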
Results
The analysis showed that Friendshoring and Country Plus One policies lead to increased network density and globalization, while Reshoring poses challenges due to the high number of irreplaceable products in the EV sector. The effects of these policies differ across industries, indicating a complex interplay between localized interventions and global supply chain dynamics.
Implications
The findings suggest that policymakers need to consider the broader structural implications of localized supply chain interventions. Understanding these dynamics is crucial for designing strategies that enhance supply chain resilience without inadvertently creating new vulnerabilities.
Integrating SAINT with Tree-Based Models: A Case Study in Employee Attrition Prediction
Interpretability
- Tree-based models (XGBoost and LightGBM) outperform standalone SAINT and hybrid models in predictive accuracy.
- Hybrid models did not improve performance and sometimes performed worse than standalone SAINT.
- Tree-based models demonstrated strong generalization, while hybrid models showed performance degradation.
- SAINT embeddings may not align well with tree-based classifiers optimized for structured data.
Summary
This paper addresses the challenge of employee attrition prediction, which is crucial for organizations aiming to reduce costs and improve productivity. Traditional machine learning models often struggle with complex feature interactions in tabular HR datasets, particularly when using one-hot encoding for categorical features. The authors propose a hybrid approach that integrates SAINT (Self-Attention and Intersample Attention Transformer) embeddings with tree-based models like XGBoost and LightGBM. The study evaluates both standalone models (SAINT, XGBoost, LightGBM) and hybrid models that utilize SAINT-generated embeddings. Surprisingly, the results indicate that standalone tree-based models outperform both the standalone SAINT model and the hybrid approaches in predictive accuracy and generalization. The hybrid models did not enhance performance and reduced interpretability, suggesting that tree-based classifiers may not effectively utilize high-dimensional embeddings. The findings highlight the need for alternative strategies to combine deep learning with structured data for improved predictive performance.
Methodology
The study employs a comparative analysis of various models, including standalone SAINT, XGBoost, and LightGBM, as well as hybrid models that integrate SAINT-generated embeddings into tree-based classifiers. Performance metrics, generalizability, and interpretability are evaluated through experiments and SHAP analysis.
Results
The experimental results reveal that standalone tree-based models consistently outperform both the standalone SAINT model and the hybrid models. The hybrid models did not enhance predictive accuracy and exhibited reduced interpretability, indicating that the integration of SAINT embeddings may not be beneficial for tree-based classifiers.
Implications
The findings suggest that while transformer-based models like SAINT can capture complex feature relationships, their embeddings may not improve the performance of tree-based models in structured data contexts. This has implications for workforce management, as better attrition prediction can lead to more effective retention strategies and reduced hiring costs.
Joint Interference Detection and Identification via Adversarial Multi-task Learning
Theory
- Introduces a theoretically grounded MTL framework for joint interference detection and identification.
- Derives an upper bound for weighted expected loss linked to task similarity using Wasserstein distance.
- Develops AMTIDIN, which utilizes adversarial training to enhance task correlation modeling.
- Quantitative analysis reveals significant feature overlap between modulation and interference identification tasks.
Summary
This paper addresses the critical need for precise interference detection and identification in non-cooperative wireless environments to enhance communication system survivability. The authors highlight the limitations of existing single-task learning (STL) approaches, which overlook the inherent correlations between tasks. They propose a theoretically grounded multi-task learning (MTL) framework that integrates interference detection, modulation identification, and interference identification. The framework derives an upper bound for the weighted expected loss in MTL, linking performance to task similarity quantified by the Wasserstein distance and learnable task relation coefficients. The proposed adversarial multi-task interference detection and identification network (AMTIDIN) employs adversarial training to minimize distributional discrepancies across tasks and dynamically models task correlations. The authors conduct a quantitative analysis revealing significant feature overlap between modulation and interference identification tasks, distinct from interference detection. Extensive experiments demonstrate that AMTIDIN outperforms both STL baselines and state-of-the-art MTL methods, particularly in challenging scenarios with limited training data and low signal-to-noise ratios (SNRs).
Methodology
The authors establish a multi-task learning framework that integrates interference detection, modulation identification, and interference identification. They derive a theoretical upper bound for the expected loss in MTL, employing the Wasserstein distance to quantify task similarity. The AMTIDIN network is designed to minimize distributional discrepancies through adversarial training and uses adaptive coefficients to model task correlations dynamically.
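For one-dimensional empirical samples of equal size, the Wasserstein-1 distance reduces to the mean absolute difference of sorted values. A sketch with synthetic Gaussian marginals standing in for two tasks' feature distributions (the paper's actual features and any higher-dimensional treatment are not reproduced here):

```python
import numpy as np

def wasserstein_1d(a, b):
    """W1 distance between two equal-size empirical 1-D samples:
    the average absolute difference of their sorted values."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(0)
feat_mod = rng.normal(0.0, 1.0, 1000)   # stand-in: modulation-ID feature marginal
feat_int = rng.normal(0.5, 1.0, 1000)   # stand-in: interference-ID feature marginal
w_dist = wasserstein_1d(feat_mod, feat_int)   # ~0.5, the mean shift
```

In the framework's bound, a smaller distance between task distributions signals higher task similarity, and hence more benefit from sharing representations.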
Results
AMTIDIN significantly outperforms task-specific STL baselines and state-of-the-art MTL approaches in terms of robustness and generalization, especially in scenarios with limited training data, short signal lengths, and low SNRs.
Implications
The proposed framework and network can enhance the reliability of communication systems in congested wireless environments, making it applicable in military communications, unlicensed spectrum sharing, and other critical wireless applications.
Mycelium-Index: A Streaming Approximate Nearest Neighbor Index with Myelial Edge Decay, Traffic-Driven Reinforcement, and Adaptive Living Hierarchy
Graph Learning
Efficient ML
Optimization
- Mycelium-Index adapts its structure dynamically based on query traffic, improving efficiency and memory usage.
- The system achieves high recall rates while significantly reducing RAM usage compared to existing methods.
- A hybrid deletion strategy enhances performance by efficiently managing cold and hub nodes.
- The study reveals that topological mechanisms are more effective than geometric ones for high-dimensional ANN repair.
Summary
The paper introduces the Mycelium-Index, a novel streaming approximate nearest neighbor (ANN) indexing system inspired by the adaptive growth patterns of biological mycelium. This system continuously modifies its topology through mechanisms such as myelial edge decay, traffic-driven reinforcement, and a living hierarchy that adapts based on query traffic. The Mycelium-Index employs a hybrid deletion strategy that combines O(1) bypass for cold nodes with O(k) beam-search repair for hub nodes, allowing it to maintain high performance in dynamic environments. Experimental results on the SIFT-1M dataset show that Mycelium-Index achieves a recall of 0.927 ± 0.028 at k=5, which is comparable to the performance of FreshDiskANN, while utilizing significantly less RAM (88 MB compared to over 500 MB) and achieving a higher queries per second (QPS) rate (2,795 vs. ∼600). The paper also highlights the importance of topological mechanisms in high-dimensional ANN graphs, demonstrating that geometric heuristics are less effective in this context.
Methodology
The Mycelium-Index employs myelial edge decay and reinforcement to adapt its graph topology based on query traffic. It features a living hierarchy that refreshes levels based on accumulated query use counts and utilizes a hybrid storage approach with scalar quantization for memory efficiency. The system supports streaming insertions and deletions through a combination of soft deletes for cold nodes and beam-search repairs for high-traffic hub nodes.
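The decay/reinforcement cycle might look like the following sketch; the decay factor, boost, and pruning threshold are illustrative constants, not the system's tuned values:

```python
import numpy as np

def update_edge_weights(w, traffic, decay=0.95, boost=0.2, prune_below=0.05):
    """Myelial-style tick (illustrative constants): every edge weight decays,
    edges traversed by queries are reinforced, and weak edges are flagged
    for pruning."""
    w = w * decay + boost * traffic    # decay all, reinforce traversed edges
    keep = w >= prune_below            # edges below threshold become prunable
    return w, keep

w = np.array([1.0, 1.0, 0.05])
traffic = np.array([1.0, 0.0, 0.0])    # only edge 0 served queries this tick
w, keep = update_edge_weights(w, traffic)
```

Repeated ticks concentrate weight on query-hot paths while cold edges fade below the pruning threshold, which is what lets the graph's topology track traffic over time.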
Results
The Mycelium-Index demonstrated a recall of 0.927 ± 0.028 at k=5 on the SIFT-1M dataset, comparable to FreshDiskANN's ∼0.95 recall, while using 5.7× less RAM (88 MB) and achieving 4.7× higher QPS (2,795). In static conditions, it matched HNSW's recall at a fraction of the memory usage (163 MB vs. 854 MB). Performance optimizations led to a cumulative 2.7× improvement in QPS.
Implications
The Mycelium-Index has potential applications in various domains requiring efficient and adaptive ANN search capabilities, such as real-time anomaly detection, recommendation systems, and semantic search. Its ability to dynamically adapt to changing data environments makes it suitable for production systems where data is continuously arriving and departing.
Explainable Human Activity Recognition: A Unified Review of Concepts and Mechanisms
Interpretability
Time Series
Multimodal
- Introduces a unified framework for understanding explainability in HAR systems.
- Presents a mechanism-centric taxonomy categorizing XAI-HAR methods.
- Addresses the complexities of temporal, multimodal, and semantic aspects in HAR.
- Identifies key challenges in the deployment of reliable XAI-HAR systems.
Summary
This paper provides a comprehensive review of explainable human activity recognition (XAI-HAR) methods, addressing the need for transparency in HAR systems that utilize deep learning on multivariate sensor data. While deep learning has enhanced HAR performance, it has also introduced opacity, which can hinder trust and deployment in critical applications. The authors propose a unified framework that distinguishes between conceptual dimensions of explainability and algorithmic mechanisms, thereby clarifying previous ambiguities in the literature. They present a mechanism-centric taxonomy that categorizes XAI-HAR methods based on their explanation paradigms, examining how these methods tackle the complexities of temporal, multimodal, and semantic aspects of HAR. The review highlights the interpretability objectives, explanation targets, and limitations of existing methods, while also discussing current evaluation practices and challenges in achieving reliable XAI-HAR systems. The authors conclude by outlining future research directions aimed at developing trustworthy activity recognition systems that enhance human understanding and decision-making.
Methodology
The paper employs a structured review methodology, synthesizing existing literature on XAI-HAR. It categorizes methods based on their conceptual and algorithmic dimensions, providing a comparative analysis of their interpretability objectives and limitations.
Results
The review reveals a fragmented landscape of XAI-HAR literature, with a predominance of attribution-based methods. It highlights the need for a more systematic approach to evaluating explanation faithfulness and stability, particularly in the context of HAR's unique challenges.
Implications
The findings suggest that improving explainability in HAR systems can enhance user trust and facilitate deployment in safety-critical environments, such as healthcare and assistive technologies. The proposed framework and taxonomy can guide future research and development in creating more interpretable and reliable HAR systems.
Hierarchical Flow Decomposition for Turning Movement Prediction at Signalized Intersections
Time Series
- Introduction of HFD-TM, a hierarchical framework for turning movement prediction.
- Utilization of corridor through-flows to improve prediction accuracy for turning movements.
- Implementation of a physics-informed loss function to enforce flow conservation.
- Demonstrated significant performance improvements over existing models.
Summary
This paper presents HFD-TM (Hierarchical Flow-Decomposition for Turning Movement Prediction), a novel hierarchical deep learning framework designed to enhance the accuracy of predicting turning movements at signalized intersections. The authors identify that traditional methods struggle with the volatility of turning movements, which are influenced by various factors such as signal phases and upstream disturbances. HFD-TM addresses this by first predicting corridor through-flows, which are less volatile and account for a significant portion of traffic volume, before expanding these predictions to individual turning streams. This approach is underpinned by a physics-informed loss function that ensures flow conservation, thus maintaining structural consistency in the predictions. The framework was evaluated using six months of LiDAR data from a corridor in Nashville, Tennessee, achieving a mean absolute error (MAE) of 2.49 vehicles per interval. This represents a 5.7% improvement over a Transformer model and a 27.0% improvement over a Gated Recurrent Unit (GRU). The study also highlights that the hierarchical decomposition significantly enhances performance while reducing training time by a factor of 12.8 compared to a Diffusion Convolutional Recurrent Neural Network (DCRNN), making it suitable for real-time traffic applications.
Methodology
The HFD-TM framework employs a hierarchical modeling approach where corridor-level through movements are predicted first. This is followed by a turning movement expansion module that integrates corridor predictions with time-of-day embeddings to generate full turning streams. A refinement stage is included to ensure temporal continuity and geometric feasibility through residual correction and zero-movement masking.
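As a rough illustration of the physics-informed constraint described above, a flow-conservation penalty can be added to the prediction loss so that the turning streams decomposed from a corridor must sum back to its through-flow. The function below is a minimal sketch with illustrative names and weighting, not the authors' implementation.

```python
import numpy as np

def conservation_loss(turn_preds, turn_true, through_pred, lam=1.0):
    """MAE on turning movements plus a penalty when the turning
    streams fail to sum to the corridor through-flow."""
    turn_preds = np.asarray(turn_preds, dtype=float)
    turn_true = np.asarray(turn_true, dtype=float)
    mae = np.mean(np.abs(turn_preds - turn_true))
    residual = abs(turn_preds.sum() - through_pred)  # conservation violation
    return mae + lam * residual

# Perfect predictions that also conserve flow incur zero loss.
loss = conservation_loss([10.0, 5.0], [10.0, 5.0], through_pred=15.0)
```

The weight `lam` trades prediction accuracy against structural consistency; in practice it would be tuned or annealed during training.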
Results
HFD-TM achieved a mean absolute error of 2.49 vehicles per interval, outperforming both Transformer and GRU models by 5.7% and 27.0%, respectively. The hierarchical decomposition approach was found to provide the largest performance gain, and the model's training time was significantly lower than that of DCRNN, indicating its efficiency for real-time traffic applications.
Implications
The findings suggest that HFD-TM can be effectively utilized in adaptive traffic signal control systems, enhancing the efficiency of urban traffic management. Its ability to accurately predict turning movements can lead to improved traffic flow and reduced congestion at signalized intersections.
SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting
Reinforcement Learning
Large Language Models
NLP
- SCOPE introduces a dual-path adaptive framework for on-policy reinforcement learning.
- The framework distinguishes between correct and incorrect trajectories to apply tailored supervision.
- Empirical analysis reveals the importance of signal quality in OPD, leading to improved learning outcomes.
- SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines.
Summary
The paper introduces SCOPE, a novel framework designed to enhance On-Policy Distillation (OPD) in reinforcement learning, particularly for large language models (LLMs). Traditional OPD methods apply uniform token-level supervision from a teacher model, which overlooks the varying quality of guidance across different rollouts. SCOPE addresses this by implementing a dual-path adaptive training approach that routes rollouts based on correctness. Incorrect trajectories receive teacher-perplexity-weighted KL distillation, prioritizing instances where the teacher model exhibits strong corrective capabilities. Conversely, correct trajectories are handled through student-perplexity-weighted Maximum Likelihood Estimation (MLE), focusing reinforcement on low-confidence samples at the capability boundary. This method not only improves the learning efficiency but also adapts the weight distributions based on the intrinsic difficulty of the prompts. The authors validate SCOPE through extensive experiments across six reasoning benchmarks, demonstrating significant improvements over existing methods.
Methodology
SCOPE employs a dual-path adaptive training framework that routes on-policy rollouts based on correctness. It utilizes teacher-perplexity-weighted KL distillation for incorrect trajectories and student-perplexity-weighted MLE for correct trajectories. A normalization mechanism is also implemented to adaptively calibrate weight distributions, addressing the variance in signal quality.
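The dual-path routing can be sketched as follows. All names are assumptions based on the summary, not the authors' code: incorrect rollouts are routed to teacher-perplexity-weighted KL distillation, correct rollouts to student-perplexity-weighted MLE, with perplexity taken as the exponential of the mean negative log-likelihood.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood of a rollout."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def route_rollout(is_correct, student_logprobs, teacher_logprobs):
    """Return (loss_path, weight) for a single on-policy rollout.
    The real method also normalizes weights across the batch."""
    if is_correct:
        # reinforce low-confidence correct samples at the capability boundary
        return "mle", perplexity(student_logprobs)
    # distill on failures, weighted by the teacher's perplexity
    return "kl", perplexity(teacher_logprobs)

path, weight = route_rollout(True, [math.log(0.5)] * 4, [0.0] * 4)
```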
Results
SCOPE demonstrated an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 across six reasoning benchmarks compared to competitive baselines, indicating its effectiveness in enhancing learning outcomes.
Implications
The findings suggest that incorporating signal quality awareness in reinforcement learning can significantly improve model performance, particularly in complex reasoning tasks. This approach may be applicable to various domains where on-policy learning is utilized, enhancing the robustness and efficiency of large language models.
Preventing Latent Rehearsal Decay in Online Continual SSL with SOLAR
Computer Vision
- Introduces the Latent Rehearsal Decay hypothesis to explain performance drops in OCSSL.
- Develops two novel metrics, Overlap and Deviation, to diagnose latent space degradation.
- Proposes SOLAR, a method that combines a Deviation-Aware Buffer and Overlap Loss for adaptive plasticity management.
- Demonstrates SOLAR's effectiveness through extensive experiments, achieving state-of-the-art results.
Summary
This paper investigates Online Continual Self-Supervised Learning (OCSSL), focusing on the challenges posed by continuous streams of unlabeled, non-stationary data. The authors highlight the critical stability-plasticity trade-off in OCSSL, where stable methods like Reservoir sampling converge faster but may lead to performance drops due to a phenomenon termed Latent Rehearsal Decay. This decay occurs when the latent space becomes overspecialized, hindering the model's ability to adapt to new tasks. To diagnose this issue, the authors introduce two metrics: Overlap and Deviation, which correlate with accuracy declines. The proposed method, SOLAR (Self-supervised Online Latent-Aware Replay), employs adaptive regularization to manage plasticity and prevent latent decay by optimizing these metrics through efficient online proxies. Experimental results demonstrate that SOLAR achieves state-of-the-art performance on OCSSL vision benchmarks, exhibiting both rapid convergence and high final accuracy.
Methodology
The authors propose SOLAR, which integrates a Deviation-Aware Buffer and an Overlap Loss to modulate plasticity and maintain latent space quality. The method utilizes efficient online proxies for the metrics to guide buffer management without explicitly constraining network updates.
Results
SOLAR outperforms existing methods on OCSSL vision benchmarks, achieving faster convergence and higher final performance. The experiments validate the effectiveness of the proposed metrics in diagnosing and preventing Latent Rehearsal Decay.
Implications
The findings suggest that managing the stability-plasticity trade-off is crucial for effective online continual learning, especially in applications involving continuous streams of unlabeled data, such as satellite imagery and other real-world scenarios.
Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning
Reinforcement Learning
Generative Models
Robotics
- Introduction of Hierarchical Implicit Flow Q-Learning (HIFQL) for offline GCRL.
- Utilization of mean flow policies to enhance expressiveness and efficiency in hierarchical policy learning.
- Implementation of a LeJEPA loss to improve goal representation and generalization.
- Strong performance demonstrated on OGBench benchmark across various tasks.
Summary
This paper introduces Hierarchical Implicit Flow Q-Learning (HIFQL), a novel approach to offline goal-conditioned reinforcement learning (GCRL) that addresses the challenges of long-horizon control. Traditional methods like Hierarchical Implicit Q-Learning (HIQL) struggle with the expressiveness of Gaussian policies and the generation of effective subgoals. HIFQL enhances the hierarchical policy structure by incorporating a goal-conditioned mean flow policy, which utilizes an average velocity field to model complex target distributions for both high-level and low-level policies. This allows for efficient one-step action generation. Additionally, the authors propose a LeJEPA loss to improve goal representation, encouraging more discriminative embeddings that enhance generalization. Experimental evaluations on the OGBench benchmark demonstrate that HIFQL outperforms existing methods in both state-based and pixel-based tasks, showcasing its effectiveness in long-horizon offline GCRL scenarios.
Methodology
HIFQL extends the HIQL framework by replacing unimodal Gaussian policies with expressive mean flow policies at both the high and low levels. It employs a learned average velocity field to capture complex target distributions, enabling efficient one-step action generation. The LeJEPA loss is integrated to enhance goal representation learning, leading to better subgoal prediction and overall policy performance.
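The one-step generation enabled by a mean (average-velocity) flow can be sketched minimally: instead of integrating an ODE over many steps, a learned average velocity field maps noise to an action in a single application, a = z + u_bar(z, s, g). The linear `u_bar` below is a toy stand-in for the learned network, purely for illustration.

```python
import numpy as np

def one_step_action(z, state, goal, u_bar):
    # average velocity over the whole [0, 1] interval, applied once
    return z + u_bar(z, state, goal)

# toy average-velocity field that pushes samples toward the goal
u_bar = lambda z, s, g: np.asarray(g) - np.asarray(z)
action = one_step_action(np.zeros(2), state=np.zeros(2),
                         goal=np.array([1.0, -1.0]), u_bar=u_bar)
```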
Results
HIFQL achieved superior performance compared to existing methods on the OGBench benchmark, excelling in both state-based and pixel-based tasks. The results indicate that the proposed method effectively addresses the challenges of long-horizon control in offline GCRL.
Implications
The advancements presented in HIFQL could significantly improve the efficiency and effectiveness of offline GCRL applications, particularly in complex environments where long-horizon decision-making is critical. This could have implications for robotics, autonomous systems, and other areas requiring goal-oriented learning from static datasets.
Ranked Activation Shift for Post-Hoc Out-of-Distribution Detection
Computer Vision
- Introduces RAS, a hyperparameter-free method for OoD detection that enhances activation shifts.
- Demonstrates consistent performance across different datasets and model architectures.
- Identifies the limitations of existing scaling-based methods in handling unrectified activations.
- Shows that both inhibitory and excitatory shifts independently improve OoD detection.
Summary
This paper addresses the challenges of post-hoc out-of-distribution (OoD) detection methods, which often show inconsistent performance across different datasets and models. The authors identify that this instability is primarily due to variations in activation distributions and highlight a failure mode in scaling-based methods when penultimate layer activations are not rectified. To overcome these issues, they propose a novel method called Ranked Activation Shift (RAS), which is hyperparameter-free and replaces sorted activation magnitudes with a fixed in-distribution reference profile. RAS demonstrates strong and consistent performance across various datasets and architectures without requiring hyperparameter tuning, while maintaining in-distribution classification accuracy. The authors further analyze the factors contributing to RAS's effectiveness, revealing that both inhibitory and excitatory activation shifts play independent roles in enhancing out-of-distribution discrimination.
Methodology
The methodology involves analyzing the activation distributions of penultimate layer outputs and proposing RAS, which replaces the sorted activation magnitudes with a fixed reference profile derived from in-distribution data. This approach does not require hyperparameter tuning and is applicable across various model architectures, including those with negative activations.
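The core substitution can be sketched in a few lines: sort a sample's penultimate activations, then replace the sorted values with a fixed reference profile computed from in-distribution data while preserving each value's rank. How the reranked activations are then scored is left open here, since the summary does not pin it down.

```python
import numpy as np

def ras_transform(activations, reference_profile):
    """Replace sorted activation values with the ID reference profile,
    keeping each activation's rank within the sample."""
    acts = np.asarray(activations, dtype=float)
    ranks = np.argsort(np.argsort(acts))          # rank of each entry
    ref = np.sort(np.asarray(reference_profile))  # ascending reference
    return ref[ranks]                             # same ranks, reference values

# the middle-ranked value 0.2 receives the middle reference value, etc.
shifted = ras_transform([0.2, -1.0, 3.0], reference_profile=[1.0, 2.0, 3.0])
```

Because the transform depends only on ranks, it applies unchanged to architectures with negative (unrectified) activations, which is where the paper reports scaling-based methods failing.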
Results
The results indicate that RAS significantly improves out-of-distribution detection performance compared to existing methods, achieving robust results without the need for hyperparameter optimization. The analysis reveals that the method effectively enhances the discriminative power of the model by adjusting activation shifts.
Implications
The findings suggest that RAS can be integrated into existing machine learning pipelines for safer AI systems, particularly in high-stakes applications like autonomous driving and medical imaging, where reliable OoD detection is crucial.
ANTIC: Adaptive Neural Temporal In-situ Compressor
Efficient ML
Time Series
Theory
- Introduction of ANTIC, a novel in-situ compression framework for multi-rate/stiff PDE simulations.
- Utilization of physics-aware metrics for selecting salient temporal snapshots.
- Implementation of neural spatial compression through continual fine-tuning of residuals.
- Demonstrated significant storage reductions while preserving physics accuracy.
Summary
The paper introduces ANTIC (Adaptive Neural Temporal In-situ Compressor), an innovative framework designed to address the significant storage challenges posed by high-resolution, spatiotemporally evolving fields governed by large-scale partial differential equations (PDEs). As simulations in fields like computational fluid dynamics and plasma physics generate data volumes that can reach petabytes to exabytes, traditional storage solutions become inadequate. ANTIC combines an adaptive temporal selector that identifies and filters informative snapshots during simulation with a spatial neural compression module that utilizes continual fine-tuning to learn residual updates between adjacent snapshots. This dual approach allows for in situ compression, eliminating the need for extensive on-disk storage of entire time-evolved trajectories. The framework's effectiveness is demonstrated through experiments on turbulent 2D Kolmogorov flows and a 3D binary black hole merger simulation, showcasing substantial reductions in storage requirements while maintaining high fidelity in physics reconstruction.
Methodology
ANTIC employs a two-stage approach: a Physics-aware Temporal Selector that filters snapshots based on their relevance to the underlying physics, and a Spatial Neural Compression module that learns residuals between snapshots using continual fine-tuning of neural fields. This allows for effective in situ compression of both temporal and spatial components during simulation.
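A hedged sketch of the in-situ temporal selector: keep a snapshot only when a physics-aware metric (here, total kinetic energy of a field, chosen purely for illustration) has drifted beyond a tolerance since the last kept snapshot. The real selector's metrics and thresholds are more elaborate.

```python
import numpy as np

def select_snapshots(fields, metric, tol):
    kept = [0]                       # always keep the first snapshot
    last = metric(fields[0])
    for i, f in enumerate(fields[1:], start=1):
        m = metric(f)
        if abs(m - last) > tol:      # informative change -> keep
            kept.append(i)
            last = m
    return kept

energy = lambda u: float(np.sum(np.asarray(u) ** 2))
kept = select_snapshots([[0.0], [0.1], [1.0], [1.05]], energy, tol=0.5)
```

Only the kept snapshots would then be passed to the spatial neural compression stage, which stores residuals between adjacent kept snapshots.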
Results
In experiments, ANTIC achieved a 62% reduction in the number of compressed snapshots and a spatial compression factor of 47× for 2D Kolmogorov turbulence, resulting in a net compression of up to 435×. For the 3D binary black hole merger simulation, it achieved a temporal reduction of approximately 45% and spatial compression ratios up to 3744×, leading to a total spatiotemporal compression of up to 6807×.
Implications
The ANTIC framework has the potential to significantly enhance the efficiency of data storage in high-performance computing environments, particularly for scientific simulations that generate vast amounts of data. By enabling effective in situ compression, it can facilitate broader adoption of advanced scientific workflows and improve the scalability of simulations in various fields such as fluid dynamics, astrophysics, and climate modeling.
A Diffusion-Contrastive Graph Neural Network with Virtual Nodes for Wind Nowcasting in Unobserved Regions
Graph Learning
Time Series
- Introduces a novel framework (ContraVirt) for wind nowcasting in unobserved regions using virtual nodes.
- Reduces the mean absolute error (MAE) of wind predictions by 30% to 46% compared to traditional methods.
- Utilizes contrastive learning strategies to enhance model robustness and representation in data-scarce areas.
- Grounded in geographic principles, the model effectively learns from neighboring observed regions.
Summary
This paper addresses the challenge of accurate wind nowcasting in regions lacking observational data, which is critical for climate resilience and disaster preparedness. The authors propose a novel framework called ContraVirt, a deep graph self-supervised learning model that incorporates 'virtual nodes' into a diffusion and contrastive-based graph neural network. This approach allows the model to infer wind conditions (speed, direction, and gusts) in unobserved areas without the need for new sensors. By leveraging high-temporal resolution weather station data from the Netherlands, the authors demonstrate that ContraVirt significantly reduces the mean absolute error (MAE) of wind predictions in unobserved regions by 30% to 46% compared to traditional interpolation and regression methods. The framework utilizes geographic and atmospheric principles to enhance its predictive capabilities, making it a promising solution for localized nowcasting in data-sparse regions, thereby facilitating renewable energy integration and improving agricultural planning and early-warning systems.
Methodology
The ContraVirt framework employs a graph neural network architecture that integrates virtual nodes representing unobserved regions. It uses diffusion processes to propagate information from observed to unobserved areas and applies contrastive learning strategies to align virtual nodes with real observations over time. This self-supervised learning approach captures the dynamics of wind conditions while maintaining physical consistency with the observed data.
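For intuition, the diffusion from observed to virtual nodes can be reduced to repeated feature averaging over graph neighbours (a row-normalized adjacency), with observed nodes clamped to their measurements. This is a deliberate simplification of ContraVirt's learned diffusion, not its actual architecture.

```python
import numpy as np

def diffuse(adj, x, observed_mask, steps=10):
    adj = np.asarray(adj, dtype=float)
    row_norm = adj / adj.sum(axis=1, keepdims=True)
    x = np.asarray(x, dtype=float).copy()
    obs = x[observed_mask].copy()
    for _ in range(steps):
        x = row_norm @ x          # propagate neighbour information
        x[observed_mask] = obs    # keep real observations fixed
    return x

# 3-node path graph: nodes 0 and 2 observed, node 1 virtual
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = diffuse(adj, [1.0, 0.0, 3.0], observed_mask=np.array([True, False, True]))
```

The virtual node converges to the mean of its observed neighbours; the learned model replaces this fixed averaging with trainable message passing and adds the contrastive alignment over time.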
Results
The implementation of ContraVirt on high-temporal resolution weather data from the Netherlands demonstrated a substantial improvement in wind nowcasting accuracy, with a reduction in MAE of 30% to 46% compared to conventional interpolation and regression techniques. This indicates the model's effectiveness in extending nowcasting capabilities into regions lacking direct measurements.
Implications
The findings suggest that ContraVirt can significantly enhance localized weather predictions in data-sparse regions, which is crucial for renewable energy management, agricultural planning, and disaster preparedness. This approach opens new avenues for improving climate resilience and ensuring timely access to weather information in vulnerable areas.
A Temporally Augmented Graph Attention Network for Affordance Classification
Graph Learning
Time Series
- Introduction of EEG-tGAT, a temporally enhanced GAT for EEG affordance classification.
- Incorporation of temporal attention and dropout to address non-uniform temporal dynamics in EEG data.
- Demonstrated improved classification performance over traditional GATv2 models.
- Findings suggest that temporal modeling aligns better with the neurocognitive processes involved in affordance perception.
Summary
This paper presents the Electroencephalography-temporal Graph Attention Network (EEG-tGAT), an extension of the Graph Attention Network (GATv2) specifically designed for affordance classification from EEG interaction sequences. Traditional GATs primarily handle static graphs and do not adequately address the temporal dynamics inherent in EEG data. The EEG-tGAT model introduces temporal attention mechanisms to emphasize the significance of various time segments and employs temporal dropout to enhance learning robustness across correlated observations. The authors argue that affordance data is temporally non-uniform, necessitating a model that can adaptively learn from these temporal variations. Experimental evaluations on affordance datasets demonstrate that EEG-tGAT outperforms GATv2 in classification tasks, indicating that the incorporation of temporal importance and robustness aligns better with the structure of affordance-driven interactions. This work highlights that even modest modifications to graph attention architectures can yield substantial benefits when temporal relationships are critical to the task at hand.
Methodology
The EEG-tGAT model integrates temporal attention to weigh the importance of different time segments in EEG data and employs temporal dropout to regularize learning. The model treats EEG electrodes as nodes in a graph, allowing for adaptive learning of inter-channel relationships through attention mechanisms. This approach contrasts with traditional methods that rely on fixed temporal aggregation or handcrafted features.
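The temporal attention component can be sketched as a learned scalar importance per time segment, softmaxed over segments and used to pool per-segment features. The scoring vector `w` below stands in for learned parameters; the full model combines this with graph attention over electrodes.

```python
import numpy as np

def temporal_attention_pool(segments, w):
    """segments: (T, d) per-segment features; w: (d,) scoring weights."""
    segments = np.asarray(segments, dtype=float)
    scores = segments @ np.asarray(w, dtype=float)       # (T,)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                          # softmax over time
    return alpha @ segments                              # weighted pooling

# the first segment scores far higher, so it dominates the pooled feature
pooled = temporal_attention_pool([[1.0, 0.0], [0.0, 1.0]], w=[10.0, 0.0])
```

Temporal dropout would additionally zero out whole segments during training, forcing the attention not to over-commit to any single time window.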
Results
Experimental results indicate that EEG-tGAT significantly improves classification accuracy on affordance datasets compared to GATv2. The model's ability to explicitly encode temporal importance and enforce robustness leads to better alignment with the underlying structure of affordance-related interactions.
Implications
The findings suggest that EEG-tGAT can enhance automated classification of affordances from EEG signals, potentially benefiting applications in neurophysiological research, brain-computer interfaces, and cognitive neuroscience. The model's design may also inspire future research in other domains where temporal dynamics are crucial.
FluidFlow: a flow-matching generative model for fluid dynamics surrogates on unstructured meshes
Generative Models
- FluidFlow utilizes conditional flow-matching for scalable fluid dynamics surrogate modeling.
- The model operates directly on unstructured meshes without requiring mesh interpolation.
- FluidFlow outperforms traditional multilayer perceptron models in accuracy and generalization.
- The transformer architecture allows for efficient learning from large datasets.
Summary
The paper introduces FluidFlow, a novel generative model designed for creating fluid dynamics surrogates on unstructured meshes. Traditional computational fluid dynamics (CFD) simulations are often computationally expensive, particularly in applications requiring multiple queries. This work shifts from conventional supervised learning approaches to generative modeling, specifically employing conditional flow-matching techniques. FluidFlow learns deterministic transport maps between noise and data distributions, enabling it to operate directly on CFD data without the need for mesh interpolation, thus preserving geometric fidelity. The model is evaluated using two neural network architectures: a U-Net and a diffusion transformer (DiT), conditioned on key physical parameters such as Mach number and angle of attack. The methodology is validated through two benchmark problems: predicting pressure coefficients on an airfoil and predicting pressure and friction coefficients on a three-dimensional aircraft geometry. FluidFlow demonstrates superior performance compared to multilayer perceptron baselines, achieving lower error metrics and better generalization across varying conditions. The transformer architecture particularly excels in handling large unstructured datasets while maintaining high accuracy, showcasing the potential of flow-matching generative models in fluid dynamics applications.
Methodology
FluidFlow employs a generative modeling approach based on conditional flow-matching, which learns deterministic transport maps from noise to data distributions. The model is implemented using two neural network architectures: a U-Net and a diffusion transformer, with training conditioned on physically relevant parameters.
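The conditional flow-matching objective underlying this approach can be written compactly: with linear interpolants x_t = (1 - t) x0 + t x1, the target velocity is x1 - x0, and the model v(x_t, t, c) is regressed onto it. The closed-form "model" below exists only to keep the example self-contained; in FluidFlow it would be the U-Net or diffusion transformer conditioned on Mach number and angle of attack.

```python
import numpy as np

def cfm_loss(v_model, x0, x1, t, cond):
    x0, x1 = np.asarray(x0, dtype=float), np.asarray(x1, dtype=float)
    xt = (1.0 - t) * x0 + t * x1        # interpolant between noise and data
    target = x1 - x0                    # constant velocity along the path
    pred = v_model(xt, t, cond)
    return float(np.mean((pred - target) ** 2))

# an oracle velocity field gives zero loss by construction
oracle = lambda xt, t, c: np.array([1.0, -2.0])
loss = cfm_loss(oracle, x0=[0.0, 0.0], x1=[1.0, -2.0], t=0.3, cond=None)
```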
Results
FluidFlow achieved significantly lower error metrics compared to multilayer perceptron baselines in both benchmark tests. The transformer-based architecture demonstrated enhanced scalability and predictive accuracy when applied to large unstructured datasets.
Implications
The findings suggest that flow-matching generative models can effectively serve as surrogate models in fluid dynamics, potentially transforming engineering and scientific applications by providing rapid and accurate flow predictions.
Exact Certification of Neural Networks and Partition Aggregation Ensembles against Label Poisoning
Theory
Efficient ML
- Introduces EnsembleCert, the first white-box certification framework for partition-aggregation ensembles.
- Develops ScaLabelCert, enabling exact certification of neural networks against label-flipping attacks.
- Demonstrates significant improvements in certified robustness over existing black-box methods.
- Reduces the number of required partitions for effective certification, enhancing efficiency.
Summary
This paper addresses the vulnerability of supervised learning models to label-flipping attacks, which can corrupt training labels and lead to misclassifications during inference. The authors propose EnsembleCert, a novel certification framework for partition-aggregation ensembles that utilizes white-box knowledge of base classifiers, resulting in tighter robustness guarantees compared to existing black-box approaches. The framework operates in two steps: first, it extracts white-box certificates from each base classifier for specific data partitions, and then aggregates these certificates to produce ensemble-level guarantees efficiently. To facilitate this, the authors introduce ScaLabelCert, a method that leverages the neural tangent kernel to derive exact, polynomial-time calculable certificates for neural networks against label-flipping attacks. The results demonstrate that EnsembleCert can certify up to 26.5% more label flips on the CIFAR-10 dataset compared to existing methods while requiring significantly fewer partitions, challenging the notion that extensive partitioning is necessary for robust certification.
Methodology
The methodology involves a two-step approach: (1) extracting white-box certificates from base classifiers for each partition using ScaLabelCert, and (2) aggregating these certificates to derive ensemble-level guarantees formulated as an Integer Program, which can be solved in polynomial time.
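To give a feel for the aggregation step, here is a greedy simplification of the ensemble-level bound (the paper formulates this as an Integer Program; the names and the simple majority-vote model below are illustrative assumptions). Each base classifier comes with a certified cost: the minimum number of label flips inside its partition needed to change its prediction.

```python
import math

def certified_radius(costs, margin):
    """costs: certified flip cost per base classifier that currently
    votes for the predicted class; margin: votes_for - votes_against.
    Returns the number of label flips the ensemble provably tolerates,
    assuming every converted vote goes to the runner-up class."""
    m = math.ceil(margin / 2)                 # classifiers to overturn
    cheapest = sorted(costs)[:m]
    return sum(cheapest) - 1                  # one below the cheapest attack

# with a margin of 3 votes, an adversary must overturn 2 classifiers
radius = certified_radius([3, 1, 5], margin=3)
```

Because partitions are disjoint, the cheapest attack is to target the classifiers with the smallest certified costs, which is what the sorted prefix captures.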
Results
The proposed EnsembleCert framework outperforms existing black-box certification methods, achieving up to 26.5% more certified label flips on the CIFAR-10 dataset while requiring 100 times fewer partitions, demonstrating enhanced efficiency and robustness.
Implications
The findings suggest that leveraging white-box knowledge can significantly improve the robustness of machine learning models against label poisoning, potentially leading to more reliable AI systems in critical applications where data integrity is paramount.
SLM Finetuning for Natural Language to Domain Specific Code Generation in Production
NLP
Large Language Models
Efficient ML
- Fine-tuning SLMs significantly improves their performance for domain-specific code generation.
- SLMs provide a resource-efficient alternative to LLMs, particularly in production environments with strict latency requirements.
- The study demonstrates successful adaptation of fine-tuned models for customer-specific scenarios without performance loss.
- Load testing and real-world deployment confirm the effectiveness of the proposed approach.
Summary
This paper explores the use of Small Language Models (SLMs) for generating domain-specific code from natural language inputs, addressing the limitations of large language models (LLMs) in production environments. The authors highlight that while LLMs excel in generalization, they face challenges related to latency and computational costs, making them less suitable for resource-constrained applications. In contrast, SLMs, which are significantly smaller, can be fine-tuned to improve their performance on specific tasks. The study builds on a previous implementation of a retrieval-augmented generation (RAG) pipeline and evaluates the fine-tuning of SLMs, particularly variants of Mistral, on a dataset of natural language and code pairs. The results indicate that fine-tuned SLMs outperform larger models in terms of both performance and latency. The authors also demonstrate that the fine-tuned models can be adapted for customer-specific scenarios without degrading general performance. Load testing and deployment in production confirmed optimal performance, suggesting that task-specific fine-tuning of SLMs offers a faster, cost-effective, and adaptive alternative to LLMs for domain-specific language generation.
Methodology
The authors employed parameter-efficient fine-tuning (PEFT) techniques, specifically Low-Rank Adaptation (LoRA), to adapt SLMs for generating domain-specific languages (DSLs) from natural language inputs. They compared the performance of fine-tuned SLMs against larger models and evaluated their effectiveness through load testing and deployment in production environments.
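The LoRA mechanism behind this kind of parameter-efficient fine-tuning is compact enough to sketch directly: the frozen weight W is augmented with a trainable low-rank product B @ A scaled by alpha / r, so only r * (d_in + d_out) parameters are trained. Shapes below are toy values, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 8, 2, 16

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small init
B = np.zeros((d_out, r))               # trainable, zero init

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# with B initialized to zero, LoRA starts as an exact no-op
baseline, adapted = W @ x, lora_forward(x)
```

The zero initialization of B is what lets fine-tuning start exactly at the pretrained model and drift only as the low-rank factors learn.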
Results
The fine-tuned SLMs demonstrated superior performance and lower latency compared to larger models on test datasets. The best-performing SLM was able to be fine-tuned with minimal harmful data while maintaining compliance with Responsible AI principles. Additionally, the models were successfully adapted for specific customer scenarios without degrading overall performance.
Implications
The findings suggest that fine-tuned SLMs can effectively meet the demands of production systems for domain-specific code generation, offering a viable alternative to LLMs. This approach can enhance the efficiency of natural language processing applications in various domains, particularly in low-code platforms and automation workflows.
AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning
Optimization
- AdaCubic adapts the cubic term weight dynamically, enhancing optimization efficiency.
- Utilizes Hutchinson's method for Hessian approximation, reducing computational overhead.
- Demonstrates superior performance compared to existing optimizers in multiple domains.
- Does not require hyperparameter fine-tuning, making it user-friendly.
Summary
The paper introduces AdaCubic, a novel adaptive cubic regularization optimizer designed for deep learning applications. This optimizer dynamically adjusts the weight of the cubic term in Newton's cubic regularized method through an auxiliary optimization problem with cubic constraints. By employing Hutchinson's method for Hessian approximation, AdaCubic significantly reduces computational costs while maintaining the local convergence guarantees of traditional cubic regularization methods. The authors demonstrate that AdaCubic outperforms or competes favorably with several existing optimizers across various tasks in Computer Vision, Natural Language Processing, and Signal Processing. Notably, AdaCubic does not require hyperparameter fine-tuning, making it particularly appealing for scenarios where such tuning is impractical. This work marks the first application of cubic regularization in scalable deep learning contexts, providing a robust alternative for researchers and practitioners.
Methodology
The methodology involves formulating an auxiliary optimization problem to adapt the cubic term's weight in the cubic regularization framework. The authors employ Hutchinson's method to approximate the Hessian matrix, which allows for efficient computation while ensuring convergence properties are preserved. The theoretical foundations are supported by lemmas and theorems that detail the adaptation process and convergence guarantees.
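Hutchinson's estimator itself is easy to sketch. The toy below uses an explicit symmetric matrix where a deep-learning implementation would obtain Hessian-vector products from a double-backward autodiff pass; this is an illustration of the standard estimator, not AdaCubic's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy symmetric matrix standing in for the loss Hessian.
n = 50
A = rng.standard_normal((n, n))
H = (A + A.T) / 2.0

def hutchinson_trace(hvp, dim, num_samples=500):
    """Estimate tr(H) using only Hessian-vector products.

    hvp: callable v -> H @ v. In deep learning this would come from
    autodiff; here it is a plain matmul on an explicit matrix.
    """
    est = 0.0
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe vector
        est += z @ hvp(z)                      # z^T H z, with E[.] = tr(H)
    return est / num_samples

approx = hutchinson_trace(lambda v: H @ v, n)
exact = np.trace(H)
```

The key property is that only matrix-vector products are needed, so the Hessian never has to be formed explicitly.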
Results
Experimental results indicate that AdaCubic consistently outperforms or matches the performance of several widely used optimizers across diverse tasks in Computer Vision, NLP, and Signal Processing. The optimizer's ability to function effectively without the need for hyperparameter tuning is highlighted as a significant advantage.
Implications
AdaCubic's design makes it a practical choice for deep learning practitioners, especially in environments where computational resources are limited or hyperparameter tuning is not feasible. Its successful application of cubic regularization could inspire further research into adaptive optimization techniques in machine learning.
A Mechanistic Analysis of Looped Reasoning Language Models
Large Language Models
Theory
Interpretability
- Looped language models tend toward cyclic fixed-point behavior, stabilizing attention patterns.
- Recurrent blocks learn stages of inference that closely resemble those in feedforward models.
- Architectural choices significantly influence the emergence and stability of cyclic fixed points.
- Empirical evidence shows that stable models maintain consistent inference stages, while unstable models deviate.
Read more
A Mechanistic Analysis of Looped Reasoning Language Models
Summary
This paper investigates the internal dynamics of looped reasoning language models, which enhance reasoning capabilities by applying the same layers repeatedly in latent space. The authors conduct a mechanistic analysis to compare the stages of inference in looped models versus standard feedforward models. They demonstrate that many looped models converge to distinct fixed points, leading to stable attention-head behavior across recurrences. The study reveals that recurrent blocks learn inference stages that mirror those of feedforward models, repeating these stages with each iteration. The authors also explore how architectural choices, such as recurrent block size and input injection, affect the emergence and stability of these cyclic fixed points. Their findings provide insights that can guide architectural design in future LLMs.
Methodology
The authors analyze the latent states of looped language models through empirical experiments and theoretical proofs. They examine the behavior of recurrent blocks and their convergence to fixed points, using a combination of mechanistic analysis and architectural experimentation.
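The fixed-point phenomenon can be illustrated with a toy recurrent block that injects a fixed input embedding at every iteration. The block below is a contraction by construction, so convergence is guaranteed; real looped-LM blocks need not be, which is exactly why the paper finds that architectural choices govern stability:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# Toy recurrent block with "input injection" of a fixed embedding x.
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)           # scale spectral norm to 0.5
U = rng.standard_normal((d, d)) * 0.1
x = rng.standard_normal(d)

def recurrent_block(h):
    # tanh is 1-Lipschitz, so this map contracts with factor <= 0.5.
    return np.tanh(W @ h + U @ x)

h = np.zeros(d)
deltas = []                               # ||h_{t+1} - h_t|| per recurrence
for _ in range(50):
    h_next = recurrent_block(h)
    deltas.append(np.linalg.norm(h_next - h))
    h = h_next
```

Successive latent states contract geometrically toward a fixed point, mirroring the stabilizing attention patterns the paper reports.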
Results
The study finds that looped models exhibit cyclic trajectories in latent space, with attention-head behavior stabilizing as fixed points are reached. The analysis shows that these models self-organize into distinct stages of inference during training, and that stable models maintain these stages effectively.
Implications
The insights from this analysis can inform the design of future LLM architectures, potentially leading to more effective reasoning capabilities in AI systems. Understanding the mechanisms behind looped reasoning can enhance the interpretability and performance of language models.
Gradient-Variation Regret Bounds for Unconstrained Online Learning
Theory
Optimization
- Development of the first fully-adaptive algorithm for gradient-variation online learning in unbounded domains.
- Introduction of a new definition of gradient variation that is effective for arbitrary comparators.
- Algorithms achieve regret bounds that do not require prior knowledge of comparator norms or other parameters.
- Efficient computation with closed-form updates, ensuring linear time complexity per round.
Read more
Gradient-Variation Regret Bounds for Unconstrained Online Learning
Summary
This paper presents a novel approach to unconstrained online learning by developing parameter-free algorithms that achieve regret bounds based on gradient variation. The authors introduce a new definition of gradient variation suitable for unbounded domains, allowing their algorithms to operate without prior knowledge of the comparator norm, Lipschitz constant, or smoothness parameter. The proposed algorithms are fully adaptive and efficient, using a closed-form update that runs in linear time per round. The results extend to dynamic regret scenarios and have significant implications for the stochastically-extended adversarial (SEA) model, outperforming previous methods in terms of computational efficiency and regret bounds.
Methodology
The authors redefine gradient variation for unbounded domains and develop algorithms that adaptively adjust to the unknown parameters of the problem. They utilize a closed-form update mechanism that allows for efficient computation, ensuring that the algorithms can operate effectively without requiring prior knowledge of key parameters.
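To see why gradient-variation bounds can be much tighter than worst-case ones, here is a toy computation of the classical quantity V_T = Σ_t ‖g_t − g_{t−1}‖² for slowly drifting linear losses (illustrative only; the paper's redefinition for unbounded domains and arbitrary comparators differs):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 200, 5

# Slowly drifting linear losses f_t(x) = <g_t, x>. The gradient
# variation V_T tracks the drift between rounds, not the horizon T.
base = rng.standard_normal(d)
grads = np.array([base + 0.01 * t * np.ones(d) for t in range(T)])

V_T = np.sum(np.linalg.norm(np.diff(grads, axis=0), axis=1) ** 2)
worst_case = np.sum(np.linalg.norm(grads, axis=1) ** 2)  # ~ G^2 * T scale
```

When the environment changes slowly, V_T stays tiny even as the worst-case quantity grows with T, which is what a V_T-dependent regret bound exploits.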
Results
The proposed algorithms achieve regret bounds of the form Õ(∥u∥√V_T(u) + L∥u∥² + G⁴), which are competitive with existing methods while being fully adaptive and parameter-free. The results demonstrate significant improvements over previous algorithms, particularly in the context of dynamic regret and the SEA model.
Implications
The findings of this paper have important implications for online learning applications where the feasible domain is unbounded. The algorithms can be applied in various settings, including adaptive learning systems and optimization problems where prior knowledge of parameters is not available, thus broadening the applicability of online learning techniques.
From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity
Federated Learning
- FEAT addresses inter-client heterogeneity and class imbalance in exemplar replay-based FCL.
- The method includes a geometric structure alignment for consistent feature representation across clients.
- An energy-based correction improves model sensitivity to minority classes.
- FEAT shows significant performance improvements over existing state-of-the-art methods.
Read more
From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity
Summary
This paper addresses the challenges of catastrophic forgetting in Federated Continual Learning (FCL) by proposing a novel method called Federated gEometry-Aware correcTion (FEAT). Existing exemplar replay methods focus primarily on selecting important samples but often neglect how to effectively utilize these samples, especially under conditions of continual dynamic heterogeneity. FEAT introduces two main components: the Geometric Structure Alignment module, which aligns feature representations with fixed Equiangular Tight Frame prototypes to ensure geometric consistency across clients, and the Energy-based Geometric Correction module, which mitigates prediction bias towards majority classes by removing task-irrelevant components from feature embeddings. The proposed method enhances the model's robustness and sensitivity to minority classes, improving performance in class-imbalanced scenarios. Extensive experiments demonstrate that FEAT outperforms seven state-of-the-art methods across various datasets, confirming its effectiveness in addressing the identified challenges in exemplar replay-based FCL.
Methodology
The methodology involves two key modules: (1) Geometric Structure Alignment, which distills relational geometry by aligning local feature representations with globally shared prototypes, and (2) Energy-based Geometric Correction, which debiases feature embeddings during inference to enhance sensitivity to minority classes. The approach is designed to harmonize local and global learning objectives in a federated setting.
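The fixed Equiangular Tight Frame prototypes have a standard closed-form construction, the simplex ETF; the summary does not specify the paper's exact variant, so the sketch below shows one common choice:

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """Unit-norm simplex-ETF prototypes (requires num_classes <= dim).

    Pairwise cosine similarity between distinct prototypes is the
    constant -1 / (num_classes - 1): maximal equiangular separation.
    """
    K = num_classes
    rng = np.random.default_rng(seed)
    # Orthonormal columns mapping the K-dim simplex into feature space.
    U, _ = np.linalg.qr(rng.standard_normal((dim, K)))
    M = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)
    return (U @ M).T  # shape (K, dim)

P = simplex_etf(10, 64)
gram = P @ P.T  # diagonal = 1, off-diagonal = -1/9
```

Because the prototypes are fixed and shared, every client aligns its local features to the same geometry, which is the consistency property the Geometric Structure Alignment module relies on.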
Results
Experimental results indicate that FEAT consistently outperforms seven state-of-the-art methods across three datasets with varying levels of heterogeneity, achieving notable gains in Top-1 accuracy. The results validate the effectiveness of the geometric distillation and debiasing techniques in improving representation consistency and robustness under class-imbalanced distributions.
Implications
The findings suggest that FEAT can be applied to enhance federated learning systems in real-world applications where data heterogeneity and class imbalance are prevalent, such as in healthcare, finance, and IoT devices. This method could lead to more reliable and accurate models in dynamic environments.
Uncertainty-Aware Transformers: Conformal Prediction for Language Models
NLP
Large Language Models
Interpretability
- Introduction of CONFIDE, a conformal prediction framework for transformer models.
- Achieves up to 4.09% improvement in test accuracy and greater correct efficiency over existing methods.
- Demonstrates that early and intermediate transformer layers provide better-calibrated representations.
- Offers robustness and interpretability in high-stakes applications where traditional softmax-based uncertainty fails.
Read more
Uncertainty-Aware Transformers: Conformal Prediction for Language Models
Summary
This paper introduces CONFIDE, a novel framework for uncertainty quantification in transformer-based language models, specifically designed to enhance interpretability and reliability in predictions. The authors highlight the limitations of traditional neural networks, particularly their black-box nature, which hinders trust in high-stakes applications. By applying conformal prediction (CP) to the internal embeddings of encoder-only architectures like BERT and RoBERTa, CONFIDE allows for hyper-parameter tuning and the generation of statistically valid prediction sets with instance-level explanations. The framework utilizes class-conditional nonconformity scores derived from either [CLS] token embeddings or flattened hidden states. Empirical results demonstrate that CONFIDE improves test accuracy by up to 4.09% on BERT-tiny and achieves greater correct efficiency compared to existing methods such as NM2 and VanillaNN. The findings suggest that early and intermediate transformer layers yield better-calibrated representations for conformal prediction, making CONFIDE a robust and interpretable solution for resource-constrained models and high-stakes tasks with ambiguous labels.
Methodology
The methodology involves applying conformal prediction to the embeddings of transformer models, specifically focusing on encoder-only architectures. The framework constructs class-conditional nonconformity scores using [CLS] token embeddings or flattened hidden states, allowing for hyper-parameter tuning and generating statistically valid prediction sets.
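As a rough illustration of split conformal prediction with class-conditional nonconformity scores, the sketch below uses synthetic vectors in place of real [CLS] embeddings and a centroid-distance score as one plausible choice; it is not CONFIDE's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim, alpha = 3, 16, 0.1

# Synthetic stand-ins for [CLS] embeddings of a 3-class task.
def make_embeddings(n_per_class):
    X, y = [], []
    for c in range(num_classes):
        center = np.zeros(dim)
        center[c] = 3.0
        X.append(center + rng.standard_normal((n_per_class, dim)))
        y.append(np.full(n_per_class, c))
    return np.vstack(X), np.concatenate(y)

X_cal, y_cal = make_embeddings(200)
X_test, y_test = make_embeddings(100)

# Nonconformity score: distance to the class centroid.
centroids = np.stack([X_cal[y_cal == c].mean(0) for c in range(num_classes)])

def score(X, c):
    return np.linalg.norm(X - centroids[c], axis=1)

# Class-conditional split-conformal calibration quantiles.
q = np.array([
    np.quantile(s, np.ceil((len(s) + 1) * (1 - alpha)) / len(s))
    for s in (score(X_cal[y_cal == c], c) for c in range(num_classes))
])

# Prediction set: every class whose score falls under its quantile.
pred_sets = np.stack(
    [score(X_test, c) <= q[c] for c in range(num_classes)], axis=1)
coverage = pred_sets[np.arange(len(y_test)), y_test].mean()
```

The resulting sets cover the true label at roughly the 1 − α rate per class, regardless of how well calibrated the underlying scores are.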
Results
CONFIDE demonstrated an improvement of up to 4.09% in test accuracy on BERT-tiny and achieved higher correct efficiency compared to NM2 and VanillaNN. The framework also showed that certain transformer layers yield better-calibrated representations, enhancing the interpretability of predictions.
Implications
The implications of this work suggest that CONFIDE can be effectively utilized in high-stakes applications requiring reliable uncertainty quantification and interpretability, such as healthcare and finance, where understanding model decisions is crucial.
Fairboard: a quantitative framework for equity assessment of healthcare models
Computer Vision
- Fairboard provides a comprehensive framework for assessing equity in healthcare AI models.
- Patient identity and clinical factors significantly influence model performance more than model architecture.
- Spatial biases in model performance are identified, revealing compartment-specific inequities.
- Newer models show improvements in equity, but none offer formal fairness guarantees.
Read more
Fairboard: a quantitative framework for equity assessment of healthcare models
Summary
The paper presents Fairboard, a novel framework designed to quantitatively assess the equity of healthcare models, specifically focusing on brain tumor segmentation. Despite the proliferation of AI medical devices, formal equity assessments remain scarce. The authors evaluate 18 open-source brain tumor segmentation models using data from 648 glioma patients across two independent datasets, totaling 11,664 model inferences. The study reveals that patient identity accounts for more performance variance than model choice, with clinical factors such as molecular diagnosis and tumor grade being more predictive of segmentation accuracy than the architecture of the model itself. A voxel-wise spatial meta-analysis uncovers neuroanatomically localized biases that are consistent across models. The research highlights that while newer models show improved equity, none guarantee fairness. The authors also introduce Fairboard, an open-source, no-code dashboard that facilitates equitable model monitoring in medical imaging, allowing users to assess model fairness without programming expertise.
Methodology
The authors conducted an extensive evaluation of 18 brain tumor segmentation models across four equity dimensions: univariate group comparisons, multivariable regression, spatial meta-analysis, and high-dimensional representational equity analysis. This involved analyzing data from two independent datasets of glioma patients and generating model-specific equity profiles.
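The headline comparison, how much performance variance is attributable to patients versus models, can be sketched as a toy two-way variance decomposition on synthetic Dice scores; all numbers below are made up for illustration and are not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_models = 60, 18

# Synthetic Dice scores where patient effects dominate model effects.
patient_eff = rng.normal(0.0, 0.10, n_patients)
model_eff = rng.normal(0.0, 0.02, n_models)
noise = rng.normal(0.0, 0.03, (n_patients, n_models))
dice = 0.8 + patient_eff[:, None] + model_eff[None, :] + noise

total_var = dice.var()
var_patient = dice.mean(axis=1).var()  # between-patient variance
var_model = dice.mean(axis=0).var()    # between-model variance
```

Comparing the two marginal variances against the total shows, in this toy setup, the same qualitative pattern the paper reports: who the patient is matters more than which model is used.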
Results
The study found that patient identity explained more variance in model performance than the choice of model architecture. Clinical factors were more predictive of segmentation accuracy than the models themselves. The voxel-wise analysis revealed consistent spatial biases across models, and the introduction of Fairboard allows for standardized equity assessments.
Implications
The findings underscore the need for routine equity evaluations in AI healthcare models to ensure fair treatment across diverse patient populations. Fairboard can serve as a critical tool for researchers and clinicians to monitor and improve the fairness of medical imaging models.
Active Bayesian Inference for Robust Control under Sensor False Data Injection Attacks
Robotics
Graph Learning
Theory
- Introduces a bipartite graph model for perception pipelines in CPSs.
- Develops the LASE-AD algorithm for maintaining beliefs over sensor attack states.
- Proposes an active probing strategy to enhance detection of compromised sensors.
- Demonstrates significant performance improvements over traditional methods in experimental settings.
Read more
Active Bayesian Inference for Robust Control under Sensor False Data Injection Attacks
Summary
This paper presents a novel framework that integrates sensor attack detection and recovery in cyber-physical systems (CPSs) by employing Active Bayesian Inference. The authors model complex perception pipelines as bipartite graphs, which, when combined with alerts from anomaly detectors, form a Bayesian network that infers compromised sensors. A key innovation is the introduction of an active probing strategy that exploits system nonlinearities to enhance the distinguishability between different attack hypotheses. This strategy allows for the selective disabling of compromised sensors, thereby maintaining reliable state estimation. The authors propose a threshold-based probing policy and validate its effectiveness through a simplified partially observable Markov decision process (POMDP) formulation. Experimental results on an inverted pendulum demonstrate that their approach significantly outperforms existing outlier-robust and prediction-based methods, particularly during prolonged sensor attacks, thus bridging the gap between detection and recovery in CPSs.
Methodology
The methodology involves modeling the perception pipeline as a bipartite graph that connects sensors to state estimate components. The framework utilizes Bayesian inference to interpret anomaly detector alerts and employs an active probing strategy that maximizes the distinguishability of attack hypotheses by exploiting system nonlinearities. The theoretical foundation is supported by a simplified POMDP analysis.
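A minimal sketch of a Bayesian belief update over attack hypotheses is shown below. The detector alert rates are illustrative, not from the paper, and this omits the bipartite-graph structure and the active probing policy of LASE-AD:

```python
import numpy as np

# Hypotheses: which sensor (if any) is under attack.
hypotheses = ["none", "sensor_0", "sensor_1"]
belief = np.array([0.8, 0.1, 0.1])  # prior over hypotheses

# P(alert fires | hypothesis) for two anomaly detectors, one per sensor.
p_alert = np.array([
    [0.05, 0.05],   # no attack: false-alarm rates
    [0.90, 0.05],   # sensor_0 attacked
    [0.05, 0.90],   # sensor_1 attacked
])

def update(belief, alerts):
    """One Bayesian update from a vector of binary detector alerts."""
    like = np.prod(np.where(alerts, p_alert, 1.0 - p_alert), axis=1)
    post = belief * like
    return post / post.sum()

# Detector 0 keeps firing while detector 1 stays quiet.
for _ in range(3):
    belief = update(belief, np.array([True, False]))
```

After a few consistent alerts the posterior concentrates on the "sensor_0 attacked" hypothesis, at which point the controller can disable that sensor.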
Results
The experimental results indicate that the proposed method outperforms existing outlier-robust and prediction-based baselines, particularly in scenarios involving prolonged sensor attacks. The LASE-AD algorithm effectively maintains accurate state estimation by selectively disabling compromised sensors.
Implications
The findings have significant implications for the design of robust control systems in safety-critical applications, such as autonomous vehicles and unmanned aerial systems, where sensor integrity is paramount. The framework can enhance resilience against cyber-attacks, ensuring reliable operation even in the presence of compromised sensors.
Beyond Fixed False Discovery Rates: Post-Hoc Conformal Selection with E-Variables
Theory
- PH-CS allows for post-hoc adjustments to candidate selection based on observed data, overcoming the limitations of fixed FDR levels.
- The method provides a path of candidate sets with associated FDP estimates, enabling flexible decision-making based on user-defined utility.
- PH-CS guarantees that the average estimated FDP is a valid upper bound on the true FDR, ensuring statistical validity.
- Experiments show that PH-CS can satisfy user utility constraints while maintaining competitive performance compared to traditional CS.
Read more
Beyond Fixed False Discovery Rates: Post-Hoc Conformal Selection with E-Variables
Summary
This paper addresses the limitations of existing conformal selection (CS) methods that require a fixed false discovery rate (FDR) before observing data. The authors introduce a novel approach called post-hoc conformal selection (PH-CS), which allows users to adaptively select candidate sets based on observed data while controlling the false discovery proportion (FDP). PH-CS generates a path of candidate selection sets, each associated with a data-driven FDP estimate, enabling users to choose an optimal operating point that balances selection size and FDR according to their specific utility needs. The methodology leverages conformal e-variables and the e-Benjamini-Hochberg procedure, ensuring that the estimated FDP serves as a valid upper bound on the true FDR. Experiments on synthetic and real-world datasets demonstrate that PH-CS consistently meets user-defined utility constraints while maintaining reliable FDP estimates and competitive FDR control, outperforming traditional CS methods.
Methodology
The authors develop PH-CS by utilizing conformal e-variables and the e-Benjamini-Hochberg procedure to create a path of candidate selection sets with corresponding FDP estimates. This allows users to select an optimal point based on a utility function that balances selection size and reliability post-hoc.
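The e-Benjamini-Hochberg step at the core of this pipeline is simple to state. Below is a sketch of standard e-BH on toy e-values, not the paper's full PH-CS path construction:

```python
import numpy as np

def e_benjamini_hochberg(e_values, alpha):
    """e-BH: keep the k* largest e-values, where k* is the largest k
    with e_(k) >= m / (alpha * k). Controls FDR at level alpha for
    arbitrary (even dependent) e-values."""
    e = np.asarray(e_values, dtype=float)
    m = len(e)
    order = np.argsort(-e)               # indices by descending e-value
    ks = np.arange(1, m + 1)
    ok = e[order] >= m / (alpha * ks)
    if not ok.any():
        return np.array([], dtype=int)
    k_star = ks[ok].max()
    return np.sort(order[:k_star])

# Toy e-values: three strong signals among nulls.
e_vals = [50.0, 0.5, 1.2, 40.0, 0.9, 0.7, 30.0, 1.0]
selected = e_benjamini_hochberg(e_vals, alpha=0.1)
```

PH-CS builds on this by sweeping the selection threshold to produce a whole path of candidate sets with data-driven FDP estimates, rather than committing to one α up front.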
Results
The experiments conducted on both synthetic and real-world datasets reveal that PH-CS consistently meets user-defined utility constraints while producing reliable FDP estimates. It demonstrates superior performance in maintaining FDR control compared to traditional conformal selection methods.
Implications
The introduction of PH-CS has significant implications for fields requiring flexible candidate selection under uncertainty, such as genomics and neuroimaging. It allows researchers to adapt their selection strategies based on observed data, potentially leading to more effective and efficient decision-making in various applications.
Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs
Generative Models
Theory
Optimization
- First convergence analysis for transformer-based diffusion models under DDPM loss.
- Quantifies the number of tokens and training iterations needed for convergence.
- Demonstrates that transformers can learn the oracle MMSE estimator for denoising.
- Identifies the impact of data distribution characteristics on convergence.
Read more
Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs
Summary
This paper presents a theoretical analysis of the training dynamics of transformer-based diffusion models, specifically focusing on the Denoising Diffusion Probabilistic Model (DDPM) objective for data following a multi-token Gaussian mixture distribution. The authors provide the first convergence analysis for training transformers under the DDPM loss, quantifying the number of tokens per data point and the number of training iterations required for global convergence to the Bayes optimal risk. They demonstrate that the self-attention mechanism in transformers effectively implements a mean denoising strategy, allowing the model to approximate the oracle Minimum Mean Squared Error (MMSE) estimator for the injected noise. The findings are supported by numerical experiments, which validate the theoretical results and highlight the model's ability to learn complex data patterns through gradient updates.
Methodology
The authors analyze a one-layer single-head transformer with softmax attention, focusing on the training dynamics and convergence properties when optimizing the DDPM loss. They consider a Multi-Token Gaussian Mixture (MTGM) data distribution and derive theoretical guarantees for the convergence of the model to the optimal denoiser.
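To make the "mean denoising" intuition concrete, here is a minimal sketch of the oracle posterior-mean denoiser for a simplified point-mass mixture (the paper treats a full multi-token Gaussian mixture; this is only the limiting case of vanishing component variance):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 4
mus = rng.standard_normal((K, d)) * 3.0  # mixture means, equal weights

def oracle_denoiser(y, sigma):
    """Posterior mean E[x | y] when x is uniform over the means and
    y = x + sigma * noise: a softmax over component scores, which is
    exactly the form a softmax-attention layer can represent."""
    logits = -np.sum((y - mus) ** 2, axis=1) / (2 * sigma ** 2)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ mus

sigma = 0.3
x = mus[2]                                # clean sample from component 2
y = x + sigma * rng.standard_normal(d)    # DDPM-style noisy observation
x_hat = oracle_denoiser(y, sigma)
```

When the noise level is small relative to the separation of the means, the softmax weights concentrate on the correct component and the denoiser recovers the clean point, which is the behavior the trained transformer is shown to approximate.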
Results
The study establishes that the trained transformer can converge to an oracle MMSE estimator, with the oracle denoising risk being close to the true Bayes risk when the number of tokens is sufficiently large. The results indicate that the training process leads to global convergence towards the optimal denoising model, supported by numerical experiments.
Implications
The findings have significant implications for the design and training of transformer-based generative models, particularly in applications requiring high-quality sample generation from complex data distributions. Understanding the convergence dynamics can enhance model performance and reliability in various generative tasks.
PokeRL: Reinforcement Learning for Pokemon Red
Reinforcement Learning
- PokeRL addresses the challenges of sparse rewards and partial observability in Pokemon Red.
- The system incorporates mechanisms to prevent common pitfalls such as action loops and button spamming.
- Training is structured as a curriculum over three specific early-game tasks.
- The environment is designed to enhance agent robustness and transparency in learning.
Read more
PokeRL: Reinforcement Learning for Pokemon Red
Summary
PokeRL presents a modular system designed to train deep reinforcement learning (RL) agents to complete early-game tasks in Pokemon Red, a challenging JRPG characterized by sparse rewards, partial observability, and complex control mechanics. The authors address the limitations of existing RL approaches, particularly those using Proximal Policy Optimization (PPO), which often lead to unproductive behaviors such as action loops and menu spamming. PokeRL introduces a loop-aware environment wrapper around the PyBoy emulator, incorporating map masking, a multi-layer anti-loop and anti-spam mechanism, and a dense hierarchical reward structure. The system is structured to train agents through a curriculum focused on three key tasks: exiting the player's house, exploring Pallet Town to reach tall grass, and winning the first rival battle. By explicitly modeling failure modes and enhancing the training environment, PokeRL aims to provide a more robust platform for RL research in complex game scenarios, moving beyond simple benchmarks to more practical applications.
Methodology
The authors developed a modular system that includes a loop-aware environment wrapper for the PyBoy emulator, which features map masking and a multi-layer anti-loop and anti-spam mechanism. They designed a dense hierarchical reward structure and structured the training process as a curriculum focused on three key tasks, allowing for systematic experimentation and improvement.
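The anti-loop idea can be sketched generically: hash recent observations and penalize revisiting the same state too often within a sliding window. All class and method names below are hypothetical, not PokeRL's actual code:

```python
from collections import deque

class AntiLoopWrapper:
    """Hypothetical loop-aware env wrapper: penalize the agent when it
    revisits the same observation too often in a sliding window."""

    def __init__(self, env, window=32, max_repeats=4, penalty=-0.1):
        self.env = env
        self.recent = deque(maxlen=window)   # hashes of recent observations
        self.max_repeats = max_repeats
        self.penalty = penalty

    def reset(self):
        self.recent.clear()
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        key = hash(bytes(obs))
        if self.recent.count(key) >= self.max_repeats:
            reward += self.penalty           # discourage action loops
        self.recent.append(key)
        return obs, reward, done, info

class _LoopyEnv:
    """Toy env whose observation never changes, mimicking a stuck agent."""
    def reset(self):
        return b"same"
    def step(self, action):
        return b"same", 0.0, False, {}

env = AntiLoopWrapper(_LoopyEnv())
env.reset()
rewards = [env.step(0)[1] for _ in range(10)]
```

Once the same observation repeats past the threshold, every further repeat is penalized, giving the policy gradient a direct signal against menu and button spamming.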
Results
PokeRL successfully trains agents to perform non-trivial behaviors in early-game Pokemon Red, demonstrating improved robustness compared to previous methods. The system effectively mitigates issues related to action loops, reward sparsity, and exploration inefficiencies, paving the way for more advanced RL applications in complex game environments.
Implications
The development of PokeRL highlights the importance of explicitly modeling failure modes in RL training environments. This approach can lead to more effective training strategies in other complex domains, potentially influencing future research in reinforcement learning and game AI.