AI-generated summaries

Today's ML research, without the noise.

Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.

48 Papers today
8h Update frequency
7 Days of history
SolarTformer: A Transformer Based Deep Learning Approach for Short Term Solar Power Forecasting
Ankan Basu, Jyotiraditya Roy, Aditya Datta, Prayas Sanyal, Sumanta Banerjee
Time Series
  • Introduction of SolarTformer, a transformer-based model for solar power forecasting.
  • Utilization of self-attention mechanisms to capture temporal and spatial dependencies.
  • Incorporation of power station-specific metadata to improve generalization.
  • Significant performance improvement over traditional forecasting models.
Read more
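The self-attention mechanism named in the first bullet is generic enough to sketch. This toy NumPy version (single head, random weights, made-up shapes — illustrative assumptions, not SolarTformer's actual architecture) shows how each time step mixes information from every other step:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of time
    steps (generic single-head sketch, not SolarTformer itself)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (T, T) pairwise affinities
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over time steps
    return weights @ V                         # each step attends to all steps

T, d = 6, 4  # e.g. 6 time steps of d-dimensional solar readings
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (6, 4)
```

With zero query/key weights the softmax is uniform and every output step is just the sequence mean, which makes the temporal mixing easy to verify by hand.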
GeoCert: Certified Geometric AI for Reliable Forecasting
Regina Zhang, Zongru Li, Honggang Wen, Xiaofeng Liu, Siu-Ming Yiu, Pietro Liò, Kwok-Yan Lam
Time Series Theory Efficient ML
  • GeoCert unifies forecasting, physical reasoning, and formal verification in a single framework.
  • The framework utilizes hyperbolic geometry to ensure robustness and efficient certification.
  • Achieves state-of-the-art accuracy with a 97.5% reduction in computational costs.
  • Empirical results show that verification and predictive performance can be synergistic.
Read more
An Automatic Ground Collision Avoidance System with Reinforcement Learning
Seyyid Osman Sevgili, Atahan Cilan, Mahir Demir, Özgün Can Yürütken, Ümit Can Bekar
Reinforcement Learning Robotics
  • Development of an AI-driven AGCAS for advanced jet trainers.
  • Integration of a digital terrain server and pseudo-lidar for enhanced collision avoidance.
  • Modification of the SAC algorithm with a custom CNN to improve state representation.
  • Creation of a sequential reward function to balance collision avoidance and flight stability.
Read more
Sum-of-Checks: Structured Reasoning for Surgical Safety with Large Vision-Language Models
Weiqiu You, Cassandra Goldberg, Amin Madani, Daniel A. Hashimoto, Eric Wong
Multimodal
  • Introduction of Sum-of-Checks framework for structured surgical safety assessment.
  • Framework improves accuracy and transparency of LVLM-based CVS evaluations.
  • LVLMs show reliable performance on observational checks but variability on anatomical evidence.
  • Explicitly structured decision processes are critical for reliable surgical reasoning.
Read more
A Brain-Inspired Deep Separation Network for Single Channel Raman Spectra Unmixing
Gaoruishu Long, Jinchao Liu, Bo Liu, Jie Liu, Xiaolin Hu
Audio & Speech
  • Introduction of RSSNet, a neural network for single-channel Raman spectrum unmixing.
  • Demonstrated superiority over traditional methods with a performance improvement of over 4 dB.
  • Strong generalization capabilities of RSSNet on real-world mixed spectra.
  • Addresses the limitations of existing sparse regression methods in noisy environments.
Read more
FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records
Hojjat Karami, David Atienza, Jean-Philippe Thiran, Anisoara Ionescu
Large Language Models Time Series
  • FeatEHR-LLM leverages LLMs to generate clinically meaningful features from irregularly sampled EHR time series.
  • The framework operates at the metadata level to reduce patient privacy exposure.
  • It incorporates a tool-augmented generation mechanism to handle irregular sampling and structural sparsity.
  • The iterative validation-in-the-loop process allows adaptive refinement of generated features.
Read more
Data-Free Contribution Estimation in Federated Learning using Gradient von Neumann Entropy
Asim Ukaye, Mubarak Abdu-Aguye, Nurbek Tastan, Karthik Nandakumar
Federated Learning
  • Introduces a data-free contribution estimation signal using von Neumann entropy.
  • Develops a Rank-Adaptive Kalman Filter to stabilize contribution estimates over time.
  • Presents two methods: SpectralFed for direct entropy weighting and SpectralFuse for fusion-based weighting.
  • Demonstrates strong correlation between entropy-derived weights and client performance across multiple datasets.
Read more
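The von Neumann entropy signal is concrete enough to sketch: treat a gradient matrix's normalized Gram matrix as a density matrix and take the entropy of its spectrum. This is one plausible reading of the bullet; the paper's exact construction may differ:

```python
import numpy as np

def gradient_von_neumann_entropy(grad):
    """Von Neumann entropy of grad's normalized Gram matrix
    (illustrative; the paper's exact construction may differ)."""
    gram = grad @ grad.T
    rho = gram / np.trace(gram)        # PSD with unit trace, like a density matrix
    eigvals = np.linalg.eigvalsh(rho)  # non-negative, sums to 1
    eigvals = eigvals[eigvals > 1e-12]
    return float(-np.sum(eigvals * np.log(eigvals)))

# A rank-1 gradient carries zero entropy; a maximally spread
# spectrum (identity) carries the maximum, log(dim).
print(round(gradient_von_neumann_entropy(np.eye(4)), 3))  # 1.386 == log(4)
```

Low entropy means the update is dominated by a few directions, which is one intuition for why the spectrum could proxy a client's contribution without touching its data.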
Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data
Hillary Mutisya, John Mugane
NLP
  • Neural models can recover historical lexical structures from modern data.
  • BantuMorph v7 identified 728 noun and 1,525 verb cognate candidates across 14 Bantu languages.
  • 90.9% of the top noun candidates align with previously reconstructed Proto-Bantu forms.
  • The model captures phylogenetic relationships consistent with established Guthrie-zone classifications.
Read more
AutoCompress: Critical Layer Isolation for Efficient Transformer Compression
Archit Thorat
NLP Large Language Models Efficient ML
  • Layer 0 in small transformers is disproportionately important, with a 60× higher importance score than other layers.
  • The Critical Layer Isolation (CLI) architecture preserves Layer 0 at full capacity while compressing intermediate layers.
  • CLI-GPT2 achieves a 59.5% parameter reduction while holding perplexity to 204.5.
  • The performance advantage of CLI is architecture-driven, as demonstrated by ablation studies.
Read more
Robust Fuzzy local k-plane clustering with mixture distance of hinge loss and L1 norm
Junjun Huang, Xiliang Lu, Xuelin Xie, Jerry Zhijian Yang
Computer Vision Optimization Theory
  • Introduction of RFLkPC method to enhance robustness against outliers in fuzzy k-plane clustering.
  • Utilization of a mixture distance of hinge loss and L1 norm for bounded plane clusters.
  • Demonstration of RFLkPC's efficiency through extensive experiments on simulated and real datasets.
  • Public availability of the RFLkPC source code for community use and further research.
Read more
Necessary and sufficient conditions for universality of Kolmogorov-Arnold networks
Vugar Ismailov
Theory
  • A single non-affine function is sufficient to ensure the universality of KANs.
  • Deep KANs with affine edge functions are not universal unless a non-affine function is included.
  • Universality can be maintained with a finite set of affine functions instead of the entire class.
  • KANs with spline-based edge parameterization are confirmed to be universal approximators.
Read more
When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer
Lucky Verma
NLP Large Language Models Theory
  • Dynamic Tanh (DyT) serves as a regime-dependent implicit regularizer, showing benefits and penalties based on model capacity and data scale.
  • Validation loss improvements with DyT are significant at lower data scales but diminish or reverse at higher scales.
  • Saturation levels of activations provide insight into the performance of DyT, with a proposed heuristic for screening saturation.
  • The effects of DyT are architecture-sensitive, with specific collapse modes identified in Llama models.
Read more
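Dynamic Tanh (DyT) itself is a one-liner: replace LayerNorm with y = γ · tanh(αx) + β, where α is a learnable scalar. A minimal NumPy sketch of that published form (training of α, γ, β omitted):

```python
import numpy as np

def dyt(x, alpha=0.5, gamma=1.0, beta=0.0):
    """Dynamic Tanh: an element-wise LayerNorm replacement.
    tanh bounds activations to (-gamma, gamma) -- the "activation
    bounding" the paper treats as an implicit regularizer."""
    return gamma * np.tanh(alpha * x) + beta

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(dyt(x))  # large inputs saturate toward +/-1
```

The saturation behavior is exactly what the third bullet's screening heuristic inspects: how much of the activation mass ends up pinned near the ±γ bounds.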
Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
Divakar Kumar Yadav, Tian Zhao, Deepak Kumar
Efficient ML Optimization Large Language Models
  • CuTile achieves up to 1,007 TFLOP/s for fused attention on Blackwell GPUs, outperforming FlashAttention-2 by 2.5x.
  • CuTile is more efficient than WMMA for GEMM, requiring significantly less code while delivering higher throughput.
  • CuTile's performance is architecture-dependent, with lower throughput on workstation-class GPUs compared to datacenter-class GPUs.
  • Triton demonstrates superior portability, maintaining 62-101% of cuBLAS performance across different architectures.
Read more
A Tale of Two Variances: When Single-Seed Benchmarks Fail in Bayesian Deep Learning
Qishi Zhan, Minxuan Hu, Liang He, Guansu Wang, Jiaxin Liu
Theory
  • Single-seed benchmarks in Bayesian deep learning can be unreliable due to inherent variability in evaluation metrics like CRPS.
  • Variance trajectories differ significantly across methods, with some methods showing pronounced peaks that indicate misestimation risks.
  • Local CRPS variance serves as a direct indicator of single-seed estimation error, while power-law fit quality summarizes method-level evaluation behavior.
  • Modifying the heteroscedastic training objective can help reduce instability in variance learning.
Read more
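CRPS, the metric whose seed-to-seed variance the paper analyzes, has a closed form for Gaussian forecasts (the standard Gneiting-Raftery expression). A small sketch — the 20-draw "seeds" loop is purely illustrative, not the paper's experiment:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) at
    observation y: sigma * (z*(2*Phi(z)-1) + 2*phi(z) - 1/sqrt(pi))."""
    z = (y - mu) / sigma
    pdf = exp(-z * z / 2.0) / sqrt(2.0 * pi)
    cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / sqrt(pi))

# Jitter the forecast mean across 20 mock "seeds" and look at the
# spread of the metric itself -- a single seed can misestimate it.
rng = np.random.default_rng(0)
scores = [crps_gaussian(mu, 1.0, 0.0) for mu in rng.normal(0.0, 0.3, 20)]
print(round(float(np.std(scores)), 4))
```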
SceneSelect: Selective Learning for Trajectory Scene Classification and Expert Scheduling
Xinrun Wang, Deshun Xia, Ke Xu, Weijie Zhu
Time Series Efficient ML Robotics
  • Introduces a scene-centric paradigm for trajectory prediction, moving away from traditional model-centric approaches.
  • Utilizes unsupervised clustering to create a latent taxonomy of scenes based on motion velocity, spatial density, and interaction patterns.
  • Employs a decoupled classification module for real-time input assignment to scene categories.
  • Demonstrates significant performance improvements over existing methods, with an average accuracy gain of 10.5% across benchmarks.
Read more
An Integrated Framework for Explainable, Fair, and Observable Hospital Readmission Prediction: Development and Validation on MIMIC-IV
Isaac Tosin Adisa
Interpretability
  • Proposes a novel framework for hospital readmission prediction that integrates explainability, fairness, and deployment reliability.
  • Utilizes SHAP for per-patient feature attributions, enhancing interpretability of predictions.
  • Achieves competitive performance with XGBoost (AUC-ROC 0.696) and strong calibration with LightGBM.
  • Ensures demographic fairness across 16 subgroups without post-processing.
Read more
Iterative Model-Learning Scheme via Gaussian Processes for Nonlinear Model Predictive Control of (Semi-)Batch Processes
Tai Xuan Tan, Alexander Mitsos, Eike Cramer
Optimization
  • Introduction of GP-MLMPC for effective NMPC in batch processes.
  • Iterative updates of the GP model enhance control performance with limited initial data.
  • Chance constraints ensure safe operation by quantifying uncertainty.
  • Significant improvements in tracking error and product yield demonstrated in simulations.
Read more
A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning
Ying-Tu Chen, Wei Hung, Bing-Shu Wu, Zhang-Wei Hong, Ping-Chun Hsieh
Reinforcement Learning Robotics Optimization
  • Introduces a novel perspective on MORL by integrating RFRL concepts.
  • Proposes a preference-guided exploration strategy for effective learning.
  • Demonstrates significant performance improvements over state-of-the-art MORL methods.
  • Highlights the benefits of decoupling environment knowledge from reward information.
Read more
CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning
Marcel Hedman, Kale-ab Abebe Tessera, Juan Claude Formanek, Anya Sims, Riccardo Zamboni, Trevor McInroe, John Torr, Elliot Fosong
Reinforcement Learning Robotics Generative Models
  • CODA addresses coordination failures in offline MARL by enabling co-adaptation among agents.
  • The method generates synthetic experiences based on the current joint policy using a diffusion model.
  • CODA is compatible with both model-free and model-based offline reinforcement learning algorithms.
  • Empirical results show significant improvements in coordination and performance on standard benchmarks.
Read more
Cortex-Inspired Continual Learning: Unsupervised Instantiation and Recovery of Functional Task Networks
Kevin McKee, Thomas Hazy, Yicong Zheng, Zacharie Bugaud, Thomas Miconi
Theory Efficient ML
  • FTN provides structural guarantees against catastrophic forgetting through parameter isolation.
  • The three-stage mask configuration allows for rapid unsupervised task detection.
  • FTN demonstrates effective performance on multiple continual learning benchmarks.
  • The method is inspired by biological neural mechanisms, enhancing its robustness and efficiency.
Read more
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
Haoze He, Xingyuan Ding, Xuan Jiang, Xinkai Zou, Alex Cheng, Yibo Zhao, Juncheng Billy Li, Heather Miller
Large Language Models Efficient ML NLP
  • Identifies the importance of rarely activated experts in MoE models for downstream tasks.
  • Proposes ExpertCondenser, a novel SFT framework that avoids auxiliary losses and promotes knowledge consolidation.
  • Demonstrates that pruning long-tailed experts leads to performance degradation, emphasizing the need to retain them.
  • Achieves an average performance gain of over 2.5% on key benchmarks compared to state-of-the-art methods.
Read more
Decoding High-Dimensional Finger Motion from EMG Using Riemannian Features and RNNs
Martin Colot, Cédric Simar, Guy Cheron, Ana Maria Cebolla Alvarez, Gianluca Bontempi
Robotics Time Series Multimodal
  • Introduces a novel end-to-end framework for continuous EMG-to-kinematics regression.
  • Develops the Temporal Riemannian Regressor (TRR) model that leverages Riemannian features.
  • Achieves superior performance compared to state-of-the-art methods in both intra- and cross-subject evaluations.
  • Demonstrates real-time deployment capabilities on consumer-grade hardware.
Read more
Towards Adaptive Continual Model Merging via Manifold-Aware Expert Evolution
Haiyun Qiu, Xingyu Wu, Kay Chen Tan
Efficient ML Theory
  • Introduces manifold geometry as a foundation for expert representation and management.
  • Proposes a dynamic expert evolution strategy that balances diversity and architectural parsimony.
  • Develops a data-free and training-free implicit routing mechanism for expert activation.
  • Demonstrates state-of-the-art performance in accuracy and robustness while reducing expert redundancy.
Read more
How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
Dharshan Kumaran, Viorica Patraucean, Simon Osindero, Petar Velickovic, Nathaniel Daw
NLP Large Language Models Theory
  • Verbal confidence is a strong predictor of error detection, surpassing token log-probabilities.
  • PANL activations provide insights into the correctability of answers, independent of behavioral signals.
  • Causal interventions confirm the critical role of PANL in error detection and correction.
  • The study highlights the distinction between first-order and second-order confidence models in LLMs.
Read more
Operational Feature Fingerprints of Graph Datasets via a White-Box Signal-Subspace Probe
Yuchen Xiong, Swee Keong Yeap, Zhen Hong Ban
Graph Learning
  • WG-SRC replaces learned message passing with an explicit graph-signal dictionary, enhancing interpretability.
  • The model serves as both a predictor and a diagnostic tool, revealing operational feature fingerprints.
  • Empirical validation shows WG-SRC's competitive performance against traditional graph baselines.
  • The generated fingerprints guide dataset-specific modifications and interventions.
Read more
Robust and Clinically Reliable EEG Biomarkers: A Cross Population Framework for Generalizable Parkinson's Disease Detection
Nicholas R. Rasmussen, Longwei Wang, Rodrigue Rizk, Md Rezwanul Akter Pallab, Samuel Stuwart, Martina Mancini, Arun Singh, KC Santosh
Time Series
  • Introduces a population-aware evaluation framework for EEG biomarkers in PD detection.
  • Demonstrates that traditional models often capture population-specific artifacts, leading to poor generalization.
  • Achieves up to 94.1% accuracy on held-out cohorts through a nested cross-validation approach.
  • Establishes that training on diverse populations enhances biomarker stability and accuracy.
Read more
Can an MLP Absorb Its Own Skip Connection?
Antonij Mijoski, Marko Karbevski
Theory
  • Absorption of skip connections into residual-free MLPs is possible under specific conditions.
  • For certain activation functions (ReLU², ReGLU), absorption is unconditionally impossible.
  • Gated activations (SwiGLU, GeGLU) also exhibit impossibility for absorption.
  • Ungated ReLU and GELU allow for absorption under specific weight configurations, but this is rare.
Read more
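For purely linear layers the question is trivial, which is what makes the nonlinear impossibility results the interesting part. A tiny sanity check of the linear case (illustrative only):

```python
import numpy as np

# A linear layer absorbs its skip exactly: f(x) = W x + x = (W + I) x.
# The paper asks when this survives a nonlinearity; for ReLU^2 and
# gated activations (SwiGLU, GeGLU) it finds absorption impossible.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

with_skip = W @ x + x
absorbed = (W + np.eye(4)) @ x
print(np.allclose(with_skip, absorbed))  # True
```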
MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches
Xin Wang, Chi Ma, Shaobin Chen, Pu Wang, Menglei Zhou, Junyi Qiu, Qiaorui Chen, Jiayu Sun, Shijie Liu, Zehuan Wang, Lei Yu, Chuan Liu, Fei Jiang, Wei Lin, Hao Wang, Jiawei Jiang, Xiao Yan
Generative Models Efficient ML Optimization
  • Introduces MTServe, a hierarchical cache management system for generative recommendation models.
  • Addresses the high inference costs associated with processing long user histories.
  • Utilizes both GPU memory and host RAM to optimize cache storage and retrieval.
  • Implements system-level optimizations including a hybrid storage layout and asynchronous data transfer.
Read more
Scalable Production Scheduling: Linear Complexity via Unified Homogeneous Graphs
Jonathan Hoss, Moritz Link, Noah Klarmann
Reinforcement Learning Graph Learning Optimization
  • Introduction of a linear-complexity unified graph representation for JSSP.
  • Feature-based homogenization allows the use of standard homogeneous GNNs without heterogeneous layers.
  • Identification of structural saturation as a critical point for effective scheduling policy training.
  • Demonstration of zero-shot generalization across problem sizes, enhancing scalability.
Read more
Scalable Hyperparameter-Divergent Ensemble Training with Automatic Learning Rate Exploration for Large Models
Hailing Cheng, Tao Huang, Chen Zhu, Antonio Alonso
Optimization Efficient ML
  • HDET allows simultaneous exploration of multiple learning rates across GPU replicas, enhancing optimization.
  • An automatic learning rate controller adapts the learning rate based on inter-replica performance signals.
  • The method requires no additional hyperparameter tuning or changes to existing model architectures.
  • Empirical results show significant improvements in model quality and training speed.
Read more
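The exploration idea can be caricatured in a few lines: run one trajectory per candidate learning rate and let cross-replica comparison pick the winner. The 1-D objective, finite-difference gradient, and argmin selection rule below are hypothetical stand-ins for the paper's adaptive controller:

```python
import numpy as np

def explore_learning_rates(loss_fn, w0, lrs, steps=50):
    """Toy divergent-ensemble sketch: one SGD run per candidate
    learning rate, best final loss wins (hypothetical stand-in
    for the paper's adaptive controller)."""
    results = {}
    for lr in lrs:
        w = np.array(w0, dtype=float)
        for _ in range(steps):
            # Central-difference gradient of a 1-D objective.
            g = (loss_fn(w + 1e-6) - loss_fn(w - 1e-6)) / 2e-6
            w = w - lr * g
        results[lr] = float(loss_fn(w))
    return min(results, key=results.get), results

# Too-small rates crawl, too-large rates diverge; the middle one wins.
best, losses = explore_learning_rates(lambda w: (w - 3.0) ** 2, 0.0,
                                      [1e-3, 0.3, 1.5])
print(best)  # 0.3
```

Running the candidates as parallel replicas (rather than sequentially, as here) is what makes the exploration essentially free on a multi-GPU ensemble.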
Protect the Brain When Treating the Heart: A Convolutional Neural Network for Detecting Emboli
Andrea Angino, Ken Trotti, Diego Ulisse Pizzagalli, Rolf Krause, Tiziano Torre, Stefanos Demertzis
Computer Vision
  • Introduction of a 2.5D U-Net architecture for GME detection in echocardiography.
  • Real-time processing capabilities allow for immediate feedback during cardiac procedures.
  • Development of a custom annotation tool to create a dataset for training the model.
  • Demonstrated high accuracy in segmenting GME against a dynamic cardiac background.
Read more
Associativity-Peakiness Metric for Contingency Tables
Naomi E. Zirkind, William J. Diehl
Theory
  • Introduction of the Associativity-Peakiness (AP) metric for evaluating clustering algorithms.
  • AP metric captures critical performance features of contingency tables not addressed by existing metrics.
  • Demonstrated higher dynamic range and computational efficiency of the AP metric through simulations.
  • Enables comparative performance analysis of unsupervised learning algorithms similar to supervised learning metrics.
Read more
Impact of Age Specialized Models for Hypoglycemia Classification
Beyza Cinar, Maria Maleshkova
Time Series
  • Age significantly influences hypoglycemia risk and classification performance in T1D patients.
  • A global population-based model can perform similarly or better than age-segmented models for hypoglycemia classification.
  • Transfer learning can enhance model individualization, particularly for specific age groups.
  • Children benefit most from age-specialized models, highlighting the need for tailored approaches in diabetes management.
Read more
LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs
Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric
Large Language Models Efficient ML
  • Introduces a sensitivity-driven layer selection strategy for attention modification.
  • Achieves up to 68% higher throughput compared to standard softmax attention.
  • Requires only 10 million tokens for performance recovery after architectural changes.
  • Maintains competitive performance on benchmarks while improving efficiency.
Read more
Deep Learning for Model Calibration in Simulation of Itaconic Acid Production
Daria Fokina, Marco Baldan, Constantin Romankiewicz, Wolfgang Laudensack, Roland Ulber, Michael Bortz
Optimization Generative Models Time Series
  • CFM consistently yields more accurate results than DDL in parameter estimation.
  • CFM provides better generalization and robustness across different operating conditions and scales.
  • The study demonstrates the effectiveness of deep learning in capturing complex relationships in bioprocess modeling.
  • Delay Differential Equations (DDEs) are utilized to account for time delay dynamics in microbial processes.
Read more
Revisiting Neural Activation Coverage for Uncertainty Estimation
Benedikt Franke, Nils Förster, Frank Köster, Asja Fischer, Markus Lange, Arne Raulf
Theory Interpretability Efficient ML
  • NAC is adapted for uncertainty estimation in regression tasks.
  • A new objective function is proposed to compute uncertainty scores for regression models.
  • NAC outperforms traditional methods like Monte Carlo Dropout in terms of meaningful uncertainty scores.
  • The authors provide an easy-to-use implementation of NAC for PyTorch.
Read more
Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling
Alex Nikulkov
NLP Large Language Models Reinforcement Learning
  • TCRM transforms intermediate outputs into meaningful predictive signals, improving interpretability.
  • Achieves 44.9% average F1 score on ProcessBench without requiring step-level supervision.
  • Unifies reward and value modeling in PPO, reducing peak GPU memory by 27% and training step time by 19%.
Read more
Optimal sequential decision-making for error propagation mitigation in digital twins
Annice Najafi, Shokoufeh Mirzaei
Reinforcement Learning Optimization Theory
  • Introduces a sequential decision-making framework for error propagation mitigation in digital twins.
  • Develops both MDP and POMDP models based on HMM-derived latent error regimes.
  • Demonstrates that MDP outperforms other intervention policies in terms of cumulative reward and operational efficiency.
  • POMDP recovers most of the MDP performance despite observation noise, emphasizing the importance of information quality.
Read more
Revisable by Design: A Theory of Streaming LLM Agent Execution
Zhiyuan Zhai, Ming Li, Xin Wang
Large Language Models Theory
  • Introduction of the stream paradigm for LLM agent execution, allowing for real-time user revisions.
  • Development of a reversibility taxonomy that classifies agent actions and defines their impact on flexibility.
  • Presentation of the Revision Absorber algorithm, which optimally manages concurrent user interventions.
  • Empirical validation showing the efficiency of the Revision Absorber compared to traditional methods.
Read more
Do Not Imitate, Reinforce: Iterative Classification via Belief Refinement
Mahdi Kallel, Johannes Tölle, Ahmed Hendawy, Carlo D'Eramo
Reinforcement Learning Computer Vision Efficient ML
  • RIC transforms the classification task from a rigid imitation learning approach to a dynamic, iterative decision-making process.
  • The optimization dynamics of RIC yield a geometrically weighted mixture of per-step log-scores, enhancing calibration and preventing overconfidence.
  • The framework allows for adaptive computation, enabling the model to allocate resources effectively based on the complexity of the input.
  • RIC achieves competitive accuracy compared to standard supervised methods while improving calibration across multiple datasets.
Read more
Machine learning models for estimating counterfactuals in a single-arm inflammatory bowel disease study
Dan Liu, Fida K. Dankar, Jennifer C. deBruyn, Amanda Ricciuto, Anne M. Griffiths, Thomas D. Walters, Khaled El Emam
Theory
  • Single-arm trials can accelerate study timelines but require alternative methods to estimate treatment effects.
  • Machine learning models can be trained on external control data to predict counterfactual outcomes for treatment arms.
  • Data augmentation using synthetic records improves the performance of ML models significantly.
  • The light gradient boosting machine model provided the best estimates, aligning closely with traditional propensity score matching results.
Read more
Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems
Meghana Karnam, Ananya Joshi
Large Language Models NLP Theory
  • Introduces a statistical framework for multi-agent LLM systems in behavioral health.
  • Implements adaptive sampling based on case complexity to enhance decision-making.
  • Demonstrates significant reduction in false positive rates while maintaining recall.
  • Provides explicit reliability guarantees for AI decisions in safety-critical environments.
Read more
Fast Neural-Network Approximation of Active Target Search Under Uncertainty
Bilal Yousuf, Zsofia Lendek, Lucian Busoniu
Robotics Optimization Efficient ML
  • Introduces a CNN-based approach to approximate Active Search (AS) and Intermittent Active Search (ASI) for target detection.
  • Utilizes a multi-channel grid representation to encode essential information for decision-making.
  • Demonstrates significant reductions in computational costs while maintaining high detection rates.
  • Validates the approach through extensive simulations with varying target distributions.
Read more
BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment
Md. Ashiq Ul Islam Sajid, Mohammad Sakib Mahmood, Md. Tareq Hasan, Md Abdur Rahim, Rafat Ara, Md. Arafat Hossain
Reinforcement Learning Large Language Models Efficient ML
  • Introduces BitRL, the first framework integrating 1-bit quantized LLMs with RL for edge deployment.
  • Achieves significant memory and energy efficiency improvements while retaining high task performance.
  • Provides theoretical insights into quantization effects and convergence in RL.
  • Identifies value estimation as a critical challenge under extreme quantization.
Read more
From Local to Cluster: A Unified Framework for Causal Discovery with Latent Variables
Zongyu Li
Graph Learning Theory Efficient ML
  • L2C framework automatically discovers clusters from local causal patterns.
  • Utilizes a cluster reduction theorem to maintain causal information while reducing cluster size.
  • Handles latent variables effectively without assuming causal sufficiency.
  • Proven to ensure soundness, atomic completeness, and computational efficiency.
Read more
Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair
Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song
Optimization Theory
  • Identification of the 'attenuate-then-adapt conflict' in gradient modification under Adam.
  • Demonstration of significant performance collapse in traditional shared-routing methods in continual learning tasks.
  • Introduction of Adaptive Decoupled Moment Routing as a robust solution to mitigate identified failures.
  • Empirical validation of the proposed method across various optimizer configurations and continual learning scenarios.
Read more
Symmetric Equilibrium Propagation for Thermodynamic Diffusion Training
Aditi De
Generative Models Efficient ML Theory
  • Introduces Symmetric Equilibrium Propagation for training diffusion models on analog substrates.
  • Demonstrates a significant energy efficiency improvement over traditional digital methods.
  • Establishes an unbiased estimator for denoising score-matching gradients.
  • Shows that symmetric nudging reduces bias scaling, enhancing training performance.
Read more
Removing Sandbagging in LLMs by Training with Weak Supervision
Emil Ryd, Henning Bartsch, Julian Stastny, Joe Benton, Vivek Hebbar
Large Language Models Reinforcement Learning NLP
  • Sandbagging in LLMs can lead to underperformance despite high capabilities.
  • Combining supervised fine-tuning (SFT) with reinforcement learning (RL) effectively mitigates sandbagging.
  • Training must be indistinguishable from deployment to elicit true model performance.
  • The study employs an adversarial game setup to evaluate training effectiveness.
Read more