AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
48 papers today · Updated every 8 hours · 7 days of history
Probabilistic Circuits for Irregular Multivariate Time Series Forecasting
Time Series
- CircuITS is a novel architecture that guarantees marginalization consistency in irregular multivariate time series (IMTS) forecasting.
- The model effectively captures intricate dependencies between time series channels.
- Extensive experiments show that CircuITS outperforms existing models in joint and marginal density estimation.
- The architecture is designed to handle irregular data and answer forecasting queries accurately.
Summary
This paper introduces CircuITS, a novel architecture designed for forecasting irregular multivariate time series (IMTS) by leveraging probabilistic circuits. The authors highlight the importance of joint probabilistic modeling to accurately quantify uncertainty in IMTS, addressing the limitations of existing models that struggle with marginalization consistency. CircuITS is structured to capture complex dependencies between time series channels while ensuring valid joint distributions. The model employs a hierarchical approach, utilizing sum and product nodes to flexibly represent joint distributions over forecasting queries. Through extensive experiments on four real-world datasets, CircuITS demonstrated superior performance in joint and marginal density estimation compared to state-of-the-art baselines, including ProFITi and MOSES. The results indicate that CircuITS not only resolves issues of marginalization consistency but also establishes a new benchmark in IMTS forecasting.
Methodology
The authors propose CircuITS, which utilizes a hierarchical structure of probabilistic circuits composed of sum nodes and product nodes to model joint distributions. This architecture allows for flexible representation of dependencies and independencies among time series channels. An encoder is introduced to manage irregular data and generate encodings for forecasting queries and circuit weights.
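For readers who want the core idea in miniature, here is a toy sum-product circuit over Gaussian leaves. The structure and parameters are invented for illustration and are not the CircuITS architecture, but the sketch shows why marginalization stays exact: a marginalized leaf simply evaluates to 1, so joint and marginal queries use the same pass.

```python
import numpy as np
from scipy.stats import norm

def leaf(x, mu, sigma):
    # Gaussian leaf density; marginalizing the variable out amounts to returning 1.0.
    return 1.0 if x is None else norm.pdf(x, mu, sigma)

def product(children):
    # Product node: factorized density over disjoint variable scopes.
    return np.prod(children)

def sum_node(weights, children):
    # Sum node: mixture over child densities; weights sum to one.
    return float(np.dot(weights, children))

# Joint density p(x1, x2) as a two-component mixture of factorized Gaussians.
def joint(x1, x2):
    c1 = product([leaf(x1, 0.0, 1.0), leaf(x2, 0.0, 1.0)])
    c2 = product([leaf(x1, 2.0, 0.5), leaf(x2, -1.0, 0.8)])
    return sum_node([0.3, 0.7], [c1, c2])

print(joint(0.5, -0.2))   # joint density
print(joint(0.5, None))   # marginal over x2, obtained consistently by the same circuit
```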
Results
CircuITS outperformed existing models, including MOSES and ProFITi, across all four datasets tested, achieving superior joint and marginal density estimation. The model's ability to maintain marginalization consistency led to more reliable and non-contradictory predictions.
Implications
The development of CircuITS has significant implications for applications requiring accurate forecasting of irregular multivariate time series, such as in finance, healthcare, and environmental monitoring. Its ability to quantify uncertainty and provide consistent predictions can enhance decision-making processes in high-stakes environments.
Toward Scalable SDN for LEO Mega-Constellations: A Graph Learning Approach
Graph Learning
Optimization
Theory
- Introduction of a scalable SDN framework for LEO mega-constellations.
- Utilization of GNNs to represent constellation topology and Koopman theory for linearizing dynamics.
- Development of the Graph Koopman Autoencoder (GKAE) for forecasting spatio-temporal behavior.
- Demonstrated improvements in spatial compression (42.8%) and temporal forecasting (10.81%) over existing methods.
Summary
This paper addresses the challenges of managing large-scale low Earth orbit (LEO) satellite mega-constellations through a novel software-defined networking (SDN) framework that utilizes graph neural networks (GNNs) and Koopman theory. The authors propose a hierarchical architecture that represents the satellite constellation as a dynamic graph, where nodes are satellites and edges are inter-satellite links (ISLs). To enhance scalability, the framework decomposes the constellation into distinct orbital shells, allowing localized processing of topologies. The Graph Koopman Autoencoder (GKAE) is introduced to forecast spatio-temporal behavior within a linear subspace for each shell, effectively transforming complex nonlinear dynamics into a linear representation. This enables long-term predictions with improved stability and interpretability. The central SDN controller aggregates predictions from each shell to coordinate global control. Simulations conducted on the Starlink constellation demonstrate significant improvements in spatial compression and temporal forecasting, showcasing the effectiveness of the proposed approach in managing the complexities of mega-constellations.
Methodology
The authors propose a hierarchical learning framework that decomposes the satellite constellation into orbital shells for spatial scalability and employs a Graph Koopman Autoencoder (GKAE) to model temporal dynamics. The GKAE combines GNNs with Koopman theory to linearize the representation of the network's dynamics, facilitating effective long-term predictions.
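A minimal PyTorch sketch of the Koopman-autoencoder idea follows: nonlinear encode, strictly linear latent dynamics, nonlinear decode. The graph-attention encoder over inter-satellite links is replaced by a plain MLP, and all dimensions are made up, so this only illustrates the linearization trick rather than the paper's GKAE.

```python
import torch
import torch.nn as nn

class KoopmanAutoencoder(nn.Module):
    """Illustrative Koopman autoencoder: encode, advance linearly, decode."""
    def __init__(self, state_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, state_dim))
        # Koopman operator: a single linear map acting in latent space.
        self.K = nn.Linear(latent_dim, latent_dim, bias=False)

    def forward(self, x_t, horizon=1):
        z = self.encoder(x_t)
        preds = []
        for _ in range(horizon):
            z = self.K(z)                     # linear temporal dynamics in latent space
            preds.append(self.decoder(z))
        return torch.stack(preds, dim=1)

model = KoopmanAutoencoder(state_dim=16, latent_dim=8)
x_t = torch.randn(4, 16)                      # batch of per-shell node states (toy)
x_future = torch.randn(4, 3, 16)              # ground-truth next 3 steps (toy)
loss = nn.functional.mse_loss(model(x_t, horizon=3), x_future)
loss.backward()
```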
Results
The proposed framework achieved at least a 42.8% improvement in spatial compression and a 10.81% improvement in temporal forecasting compared to established baselines, while utilizing a smaller model footprint in simulations of the Starlink constellation.
Implications
This research has significant implications for the management of future satellite networks, enabling more efficient and scalable communication services. The proposed framework could enhance global connectivity, particularly in remote areas, and improve the integration of non-terrestrial networks with terrestrial systems.
Bayesian policy gradient and actor-critic algorithms
Reinforcement Learning
Theory
Robotics
- Introduces a Bayesian framework for policy gradient methods to reduce sample variance.
- Models policy gradients as Gaussian processes, providing uncertainty estimates.
- Develops a new actor-critic model using Bayesian non-parametric critics.
- Demonstrates improved performance over traditional Monte-Carlo based methods.
Summary
This paper introduces a Bayesian framework for policy gradient methods in reinforcement learning, addressing the high variance associated with conventional Monte-Carlo techniques. By modeling the policy gradient as a Gaussian process, the authors reduce the sample size needed for accurate gradient estimates and provide uncertainty measures through gradient covariance. The framework is adaptable to partially observable problems but does not leverage the Markov property in Markovian systems. To enhance this framework, the authors propose a new actor-critic model that employs a Bayesian class of non-parametric critics, utilizing Gaussian process temporal difference learning to model action-value functions. This allows for the computation of posterior distributions over action-value functions using Bayes' rule. The paper includes extensive experimental comparisons of the proposed Bayesian methods against traditional Monte-Carlo based policy gradient methods, demonstrating their effectiveness across various reinforcement learning tasks.
Methodology
The authors propose a Bayesian approach to policy gradient methods, modeling the gradient as a Gaussian process to minimize sample requirements. They also introduce a novel actor-critic framework that employs Gaussian process temporal difference learning for action-value function estimation, allowing for posterior distribution calculations based on observed data.
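The core statistical move, treating a noisy gradient as a Gaussian process so that an estimate and its uncertainty come out together, can be sketched with ordinary GP regression. This one-dimensional toy (hypothetical kernel and sample values) is not the paper's Fisher-kernel construction; it only shows how a posterior mean and covariance are obtained jointly from few samples.

```python
import numpy as np

def gp_posterior(X_train, y_train, X_test, kernel, noise=1e-2):
    # Standard GP posterior mean and covariance, read here as a model of the
    # policy gradient as a function of the policy parameter.
    K = kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = kernel(X_train, X_test)
    K_ss = kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

rbf = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)
thetas = np.array([-1.0, 0.0, 1.0])      # policy parameters where gradients were sampled
grads = np.array([0.8, 0.1, -0.7])       # noisy Monte-Carlo gradient estimates (toy)
mean, cov = gp_posterior(thetas, grads, np.array([0.5]), rbf)
print(mean, np.sqrt(np.diag(cov)))       # gradient estimate with an uncertainty bar
```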
Results
The experimental results indicate that the proposed Bayesian policy gradient and actor-critic algorithms outperform traditional Monte-Carlo based methods in terms of convergence speed and accuracy across various reinforcement learning problems.
Implications
The findings suggest that Bayesian methods can significantly enhance the efficiency and reliability of reinforcement learning algorithms, particularly in environments with high variance and uncertainty. This could lead to broader applications in robotics, game playing, and other domains requiring adaptive decision-making.
Deep Kernel Learning for Stratifying Glaucoma Trajectories
Time Series
NLP
Multimodal
- Introduces a hybrid architecture combining clinical-BERT embeddings with a DKL algorithm for predicting glaucoma patient trajectories.
- Successfully identifies three clinically distinct patient subgroups based on risk trajectories rather than current disease state.
- Achieves improved predictive performance compared to standard time-series forecasting methods.
- Provides calibrated uncertainty quantification to aid in clinical decision-making.
Summary
This paper addresses the challenge of stratifying patient risk in glaucoma using a novel deep kernel learning (DKL) architecture that integrates a Gaussian Process (GP) backend. The proposed method utilizes a transformer-based feature extractor applied to clinical-BERT embeddings to analyze multimodal electronic health records (EHRs). The model effectively identifies three distinct patient subgroups based on their risk trajectories, revealing that some patients with moderate vision may have a high risk of progression despite better average visual acuity compared to others with stable poor vision. This capability allows for the decoupling of current disease severity from future progression risk, providing a valuable tool for clinical decision support. The authors demonstrate that their method outperforms traditional forecasting models in predicting visual acuity loss and offers calibrated uncertainty estimates, enhancing the management of glaucoma care.
Methodology
The authors developed a deep kernel learning architecture that employs a transformer-based feature extractor to process clinical-BERT embeddings from EHR data. This architecture is designed to handle irregularly sampled, high-dimensional data without relying on imputation methods. The Gaussian Process backend allows for probabilistic forecasting and uncertainty quantification.
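To make the deep-kernel construction concrete (not the paper's transformer or trained weights), the sketch below pushes inputs through a small feature network and applies an RBF kernel to the learned features; the GP posterior mean under that kernel is then a single linear solve. The 768-dimensional inputs stand in for clinical-BERT embeddings and are purely hypothetical.

```python
import numpy as np

def feature_extractor(X, W1, W2):
    # Stand-in for the transformer over clinical-BERT embeddings: a tiny MLP.
    return np.tanh(X @ W1) @ W2

def deep_rbf_kernel(X_a, X_b, params, lengthscale=1.0):
    # Deep kernel: an RBF kernel applied to learned features rather than raw inputs.
    Fa, Fb = feature_extractor(X_a, *params), feature_extractor(X_b, *params)
    d2 = ((Fa[:, None, :] - Fb[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
params = (rng.normal(size=(768, 32)), rng.normal(size=(32, 8)))
X_hist = rng.normal(size=(20, 768))      # hypothetical clinical-BERT embeddings
y = rng.normal(size=20)                  # visual-acuity targets (e.g. logMAR), toy values
K = deep_rbf_kernel(X_hist, X_hist, params) + 1e-2 * np.eye(20)
X_new = rng.normal(size=(1, 768))
k_star = deep_rbf_kernel(X_new, X_hist, params)
pred = k_star @ np.linalg.solve(K, y)    # GP posterior mean under the deep kernel
print(pred)
```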
Results
The proposed method achieved 53.06% accuracy within 0.1 logMAR on the SOURCE glaucoma dataset, outperforming traditional recurrent neural networks and transformer-based forecasting methods. The model successfully identified three distinct patient trajectories, highlighting the importance of monitoring patients with moderate vision but high trajectory variance.
Implications
The findings suggest that the proposed DKL framework can significantly enhance clinical decision support by identifying high-risk patients based on their progression trajectories. This approach could lead to more targeted interventions and improved management of glaucoma care, and it may be applicable to other chronic conditions characterized by irregular follow-up and heterogeneous trajectories.
Budget Constraints as Riemannian Manifolds
Optimization
Efficient ML
Theory
- Introduction of the budget manifold as a smooth Riemannian submanifold in logit space.
- Development of Riemannian Constrained Optimization (RCO) that enforces budget constraints without hyperparameters.
- Demonstration of the method's effectiveness on synthetic knapsack problems and LLM compression tasks.
- RCO achieves optimal solutions where traditional methods fail, particularly in high-compression scenarios.
Summary
This paper addresses the challenge of budget-constrained discrete assignment in machine learning, particularly in contexts like mixed-precision quantization and expert selection. The authors introduce a novel approach that leverages the geometry of budget constraints, conceptualizing them as a smooth Riemannian manifold in logit space. This budget manifold allows for efficient optimization of model loss under strict budget enforcement without the need for sensitive hyperparameters. The proposed Riemannian Constrained Optimization (RCO) method integrates tangent projection, binary-search retraction, and momentum transport around a standard Adam optimizer step. This results in first-order optimization of the actual loss while maintaining exact budget constraints. The paper demonstrates that RCO outperforms traditional penalty methods and evolutionary search techniques in various scenarios, including synthetic knapsack problems and large language model compression tasks, achieving optimal solutions and significant reductions in computational cost.
Methodology
The authors utilize a geometric framework to define budget constraints as a Riemannian manifold. They implement RCO by combining tangent projection, binary-search retraction, and momentum transport with a standard Adam step, enabling efficient optimization under exact budget enforcement. The method also incorporates Gumbel straight-through estimation for discrete feasibility and extends to handle multiple constraints.
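A stripped-down sketch of the geometric moves described above, on a toy constraint sum_i c_i * sigmoid(x_i) = B: project the gradient onto the tangent space of the constraint, take a plain gradient step (instead of Adam), and retract back onto the budget manifold with a binary search along the constraint normal. The costs, budget, and random "loss gradient" are all hypothetical; this is not the RCO implementation.

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

def budget(x, c):                # soft resource usage at logits x
    return c @ sigmoid(x)

def normal_dir(x, c):            # gradient of the budget constraint = manifold normal
    s = sigmoid(x)
    return c * s * (1 - s)

def tangent_project(v, n):       # remove the component of v along the constraint normal
    return v - (v @ n) / (n @ n) * n

def retract(x, c, B, lo=-10.0, hi=10.0, iters=50):
    # Binary search for a shift t along the normal so the budget is met exactly.
    n = normal_dir(x, c)
    for _ in range(iters):
        t = 0.5 * (lo + hi)
        if budget(x + t * n, c) > B:
            hi = t
        else:
            lo = t
    return x + 0.5 * (lo + hi) * n

rng = np.random.default_rng(0)
c = rng.uniform(1, 4, size=8)             # per-item costs (e.g. bits per layer), toy values
B = 0.5 * c.sum()                         # target budget
x = retract(np.zeros(8), c, B)            # start on the budget manifold
for _ in range(100):
    grad = rng.normal(size=8)             # stand-in for the task-loss gradient
    x = retract(x - 0.1 * tangent_project(grad, normal_dir(x, c)), c, B)
print(budget(x, c), B)                    # budget enforced exactly, up to tolerance
```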
Results
The empirical results show that RCO recovers optimal solutions in synthetic knapsack problems, outperforming penalty methods that plateau at 83% of optimal. In large language model compression tasks, RCO matches or exceeds the performance of evolutionary search methods while requiring 3–16 times less computation.
Implications
This work has significant implications for efficient machine learning, particularly in model compression and optimization under constraints. The RCO framework can be applied to various scenarios where budget constraints are critical, potentially leading to more efficient algorithms in resource-constrained environments.
Cross-Subject Generalization for EEG Decoding: A Survey of Deep Learning Methods
Time Series
- High inter-subject variability in EEG signals poses a significant challenge for deep learning models.
- The survey categorizes methodologies into four main families: feature alignment, adversarial learning, feature disentanglement, and contrastive learning.
- A rigorous evaluation framework is essential for assessing the effectiveness of cross-subject generalization methods.
- The paper emphasizes the need to utilize subject-level information to improve model robustness and generalizability.
Summary
This survey addresses the challenge of cross-subject generalization in EEG decoding using deep learning methods. High inter-subject variability in EEG signals creates significant domain shifts between training and testing subjects, leading to performance drops in neural network models. The authors formalize the cross-subject problem as a multi-source domain issue and propose rigorous evaluation protocols for assessment. They categorize existing methodologies into families: feature alignment, adversarial learning, feature disentanglement, and contrastive learning, each designed to leverage subject-specific information to improve generalization. The survey highlights the theoretical limitations of current approaches, the importance of subject identity, and the potential of EEG foundation models for enhancing real-world applications.
Methodology
The authors systematically categorize and analyze deep learning methodologies that address cross-subject generalization in EEG decoding. They formalize the problem as a multi-source domain challenge and propose evaluation protocols. The methodologies discussed span the four main families (feature alignment, adversarial learning, feature disentanglement, and contrastive learning) as well as meta-learning approaches, each focusing on leveraging subject-specific information to enhance model performance.
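As one concrete instance of the feature-alignment family, a CORAL-style penalty matches the second-order statistics of encoder features across subjects; the snippet below, with random stand-in features, shows the computation. It is a generic example of the family, not a method taken from the survey's experiments.

```python
import numpy as np

def coral_loss(F_src, F_tgt):
    # CORAL-style alignment penalty: squared Frobenius distance between the
    # feature covariances of two subjects, normalized by feature dimension.
    def cov(F):
        Fc = F - F.mean(axis=0, keepdims=True)
        return Fc.T @ Fc / (len(F) - 1)
    d = F_src.shape[1]
    return np.sum((cov(F_src) - cov(F_tgt)) ** 2) / (4 * d * d)

rng = np.random.default_rng(0)
feats_subject_a = rng.normal(size=(128, 32))            # encoder features, subject A
feats_subject_b = 1.5 * rng.normal(size=(128, 32)) + 0.2  # shifted features, subject B
print(coral_loss(feats_subject_a, feats_subject_b))
```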
Results
The survey provides a comprehensive overview of existing deep learning approaches for EEG decoding, highlighting their strengths and limitations. It identifies critical areas for future research, including the need for better theoretical frameworks and the integration of subject identity into model training.
Implications
The findings of this survey have significant implications for the development of more robust EEG decoding systems that can generalize across subjects. Improved methodologies could enhance applications in clinical diagnostics, brain-computer interfaces, and cognitive state analysis, ultimately leading to better patient outcomes and more effective technology.
Statistical Channel Fingerprint Construction for Massive MIMO: A Unified Tensor Learning Framework
Theory
Optimization
Efficient ML
- Introduction of statistical channel fingerprints (sCF) for massive MIMO systems.
- Unified tensor representation and dimension reduction of sCF using eigenvalue decomposition.
- Development of LPWTNet architecture for efficient inference and multi-scale frequency capture.
- Implementation of shared mask learning for adaptive refinement of sCF components.
Summary
This paper presents a novel approach to constructing statistical channel fingerprints (sCF) for massive MIMO communication systems, which are essential for acquiring channel state information (CSI). The authors establish a relationship between statistical CSI, represented by the channel spatial covariance matrix (CSCM), and the channel power angular spectrum (CPAS). They propose a unified tensor representation of the sCF, which is dimensionally reduced using eigenvalue decomposition of the CSCM and its correlation with the CPAS. The proposed method, LPWTNet, utilizes a Laplacian pyramid decomposition and reconstruction framework, enhancing inference efficiency while capturing multi-scale frequency characteristics of the sCF. A shared mask learning strategy is introduced to refine high-frequency components adaptively. Additionally, a small-kernel convolution mechanism based on wavelet transform is proposed to improve feature extraction efficiency. Extensive experiments demonstrate that LPWTNet achieves competitive reconstruction accuracy and computational efficiency compared to existing state-of-the-art methods across various sCF construction scenarios.
Methodology
The authors construct a unified tensor representation of the statistical channel fingerprint (sCF) and reduce its dimensionality through eigenvalue decomposition of the channel spatial covariance matrix (CSCM). They propose a tensor-based learning architecture, LPWTNet, which incorporates a Laplacian pyramid decomposition and reconstruction framework, alongside a shared mask learning strategy and a small-kernel convolution mechanism based on wavelet transform.
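The dimension-reduction step can be pictured with a toy example: form a channel spatial covariance matrix from simulated channel snapshots, eigendecompose it, and keep the dominant eigen-directions as a compact fingerprint basis. Array size, snapshot count, and the 95% energy cutoff are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical channel snapshots for a 32-antenna array (complex Gaussian stand-in).
H = (rng.normal(size=(1000, 32)) + 1j * rng.normal(size=(1000, 32))) / np.sqrt(2)
cscm = H.conj().T @ H / len(H)                 # channel spatial covariance matrix (CSCM)

# Eigenvalue decomposition: keep the dominant eigen-directions as a reduced representation.
eigvals, eigvecs = np.linalg.eigh(cscm)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), 0.95) + 1
reduced_basis = eigvecs[:, :k]                 # low-dimensional fingerprint basis
print(k, reduced_basis.shape)
```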
Results
The proposed LPWTNet architecture shows competitive reconstruction accuracy and computational efficiency in various scenarios for sCF construction, outperforming several state-of-the-art baselines.
Implications
The findings suggest that the proposed framework can significantly enhance the efficiency of channel state information acquisition in massive MIMO systems, which is crucial for the development of next-generation communication networks, such as 6G.
Bayesian Optimization in Linear Time
Optimization
- TreeBO reduces the computational complexity of Bayesian optimization from cubic to linear.
- The method improves the balance between local and global modeling of objective functions.
- Empirical results show superior optimization performance on seven test functions.
- TreeBO is simpler to tune than existing partitioning methods, requiring only one additional hyperparameter.
Summary
This paper addresses the limitations of standard Bayesian optimization, which suffers from cubic computational complexity and suboptimal global modeling for local minimization tasks. The authors propose a novel method called TreeBO that employs flexible and recursive binary partitioning of the search space. This approach allows for the adaptation of both the modeling and acquisition aspects of Bayesian optimization, significantly improving computational efficiency and optimization performance. By clustering and using binary classification, TreeBO creates a binary tree structure where each node represents a subregion of the domain, allowing for localized modeling of the objective function. The authors empirically validate their method against a widely used Bayesian optimization library, demonstrating superior performance across seven challenging test functions with varying dimensionalities. The results indicate that TreeBO not only reduces computational complexity from cubic to linear but also enhances the balance between local and global modeling, leading to faster and more effective optimization.
Methodology
The authors utilize a recursive binary partitioning approach to divide the search space into subregions, each modeled by its own Gaussian process. This allows for localized modeling and acquisition of observations, which is optimized through clustering and binary classification techniques. The method adapts the standard Bayesian optimization framework to work in harmony with the partitioning scheme, addressing its inherent shortcomings.
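The partitioning idea, separate from the Gaussian-process details, can be sketched as a recursive binary tree over observed points; each leaf is where a local GP would be fit and its acquisition function evaluated. The median split below is a simple stand-in for the paper's clustering-plus-classification rule.

```python
import numpy as np

def build_tree(X, y, max_leaf=10, depth=0):
    # Recursive binary partition of the search space; each leaf would hold its own
    # local GP in a TreeBO-style scheme.
    if len(X) <= max_leaf:
        return {"leaf": True, "X": X, "y": y}
    dim = depth % X.shape[1]                    # cycle through dimensions
    thresh = np.median(X[:, dim])
    left = X[:, dim] <= thresh
    return {"leaf": False, "dim": dim, "thresh": thresh,
            "lo": build_tree(X[left], y[left], max_leaf, depth + 1),
            "hi": build_tree(X[~left], y[~left], max_leaf, depth + 1)}

def find_leaf(node, x):
    while not node["leaf"]:
        node = node["lo"] if x[node["dim"]] <= node["thresh"] else node["hi"]
    return node                                  # fit/query a local GP on node["X"], node["y"]

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = np.sin(X).sum(axis=1)
tree = build_tree(X, y)
print(len(find_leaf(tree, np.array([0.2, 0.8, 0.5]))["X"]))
```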
Results
TreeBO outperformed the standard Bayesian optimization library DiceOptim on all seven test functions, which included high-dimensional challenges. The method demonstrated significant reductions in runtime, particularly for the most complex optimization problems, while maintaining or improving optimization performance.
Implications
The findings suggest that TreeBO can be effectively applied to various optimization problems in machine learning, engineering, and other fields where objective functions are expensive to evaluate. Its linear computational complexity makes it suitable for high-dimensional optimization tasks that were previously computationally prohibitive.
People-Centred Medical Image Analysis
Computer Vision
- PecMan framework integrates AI fairness, Learning to Defer (L2D), and Learning to Complement (L2C) to improve diagnostic accuracy and equity.
- Introduces the FairHAI benchmark for evaluating AI systems based on accuracy, fairness, and clinician workload.
- Demonstrates that addressing fairness and workflow integration together leads to better clinical outcomes.
- Experimental results show PecMan outperforms existing isolated approaches in medical image analysis.
Summary
The paper addresses the challenges of integrating AI into clinical workflows for medical image analysis, emphasizing the need for fair performance across diverse patient populations and seamless workflow integration. The authors propose a novel framework called People-Centred Medical Image Analysis (PecMan), which optimizes diagnostic accuracy, fairness, and workflow effectiveness through a dynamic gating mechanism that allocates cases to AI, clinicians, or both, based on clinician workload constraints. The framework integrates concepts from AI fairness and human-AI collaboration, specifically Learning to Defer (L2D) and Learning to Complement (L2C), which have typically been studied in isolation. Additionally, the authors introduce the Fairness and Human-Centred AI (FairHAI) benchmark, designed to evaluate the trade-offs between accuracy, fairness, and clinician workload. Experimental results demonstrate that PecMan consistently outperforms existing methods, highlighting its potential to enhance the reliability and acceptance of AI systems in clinical practice.
Methodology
The authors developed the PecMan framework, which employs a dynamic gating mechanism to assign cases to AI, clinicians, or both, based on workload constraints. The framework trains multiple group-specific AI models and combines their outputs to optimize diagnostic accuracy and fairness. The FairHAI benchmark was created to assess the performance of AI systems in terms of accuracy, fairness, and clinician workload.
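A schematic of the gating decision, with made-up thresholds and a hard-coded rule in place of the learned gate, helps make the L2D/L2C distinction concrete: confident cases go to the AI, uncertain ones are deferred to a clinician when capacity allows, and intermediate ones get a complementary joint read.

```python
def route_case(ai_confidence, clinician_load, capacity, high=0.9, low=0.6):
    # Illustrative gating rule in the spirit of PecMan's dynamic allocation;
    # the thresholds and logic are hypothetical, not the learned gate in the paper.
    if ai_confidence >= high:
        return "AI"                       # confident automated read
    if clinician_load < capacity and ai_confidence < low:
        return "clinician"                # learning-to-defer: hand off uncertain cases
    if clinician_load < capacity:
        return "both"                     # learning-to-complement: joint read
    return "AI"                           # capacity exhausted: fall back to the model

print(route_case(ai_confidence=0.95, clinician_load=3, capacity=5))
print(route_case(ai_confidence=0.55, clinician_load=3, capacity=5))
print(route_case(ai_confidence=0.70, clinician_load=5, capacity=5))
```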
Results
Experimental evaluations using the FairHAI benchmark indicated that PecMan significantly outperformed traditional methods that address AI fairness, L2D, and L2C in isolation, resulting in improved diagnostic accuracy and reduced performance disparities across patient groups.
Implications
The findings suggest that integrating fairness and workflow considerations in AI systems can enhance their clinical viability and acceptance, ultimately leading to better patient care. The PecMan framework could serve as a model for future AI implementations in healthcare, addressing both ethical and practical challenges.
AMGenC: Generating Charge Balanced Amorphous Materials
Generative Models
Optimization
- AMGENC guarantees the generation of charge balanced amorphous materials.
- The method introduces a novel approach combining element noise, soft projections, and discrete projections.
- AMGENC reduces the time to obtain charge balanced samples by up to two orders of magnitude compared to existing methods.
- Extensive experiments validate the effectiveness and accuracy of AMGENC across multiple configurations.
Summary
The paper presents AMGENC, a novel generative inverse design method specifically aimed at generating charge balanced amorphous materials. Amorphous materials, which lack a periodic atomic structure, are increasingly important in various applications such as energy storage and thermal management. Traditional methods for designing these materials often rely on trial-and-error approaches, which are resource-intensive and inefficient. AMGENC addresses a critical limitation of existing generative models: the inability to enforce charge balance in generated samples. The proposed method incorporates an optimal-transport coupled element noise to initiate the generation process around charge balance, a per-step soft projection to guide the elements toward charge balance during generation, and a final discrete projection to correct any remaining charge imbalance. Extensive experiments on two datasets demonstrate that AMGENC not only guarantees charge balanced outputs but also matches or surpasses the accuracy of existing methods while significantly reducing the computational time required to obtain charge balanced samples.
Methodology
AMGENC employs a flow-matching-based generative design approach that integrates three main components: an optimal-transport coupled element noise to center the generation around charge balance, a per-step soft Gauss-Newton projection to guide the elements toward charge balance, and a final discrete projection to resolve any residual charge imbalance through dynamic programming.
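The soft projection step can be illustrated in its simplest Euclidean form: project the current fractional composition onto the hyperplane of zero net charge. The oxidation states and fractions below are hypothetical, and the paper's per-step projection is a Gauss-Newton variant rather than this plain orthogonal projection.

```python
import numpy as np

def soft_charge_projection(x, q):
    # Euclidean projection of fractional element counts onto the charge-balance
    # hyperplane q·x = 0 (simplified stand-in for the paper's Gauss-Newton step).
    return x - q * (q @ x) / (q @ q)

q = np.array([+1.0, +2.0, -2.0, -1.0])      # hypothetical oxidation states
x = np.array([0.30, 0.20, 0.35, 0.10])      # generated element fractions (toy)
x_bal = soft_charge_projection(x, q)
print(q @ x, q @ x_bal)                     # residual charge before and after projection
```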
Results
The experimental results indicate that AMGENC successfully generates charge balanced samples while achieving or exceeding the inverse design accuracy of existing methods. The method significantly reduces the computational resources needed to generate charge balanced materials, demonstrating its efficiency.
Implications
The development of AMGENC has significant implications for the design of amorphous materials in various fields, including energy storage and advanced materials. By enabling efficient generation of charge balanced materials, this method can facilitate the exploration of new material compositions and properties, potentially leading to innovative applications.
Pessimism-Free Offline Learning in General-Sum Games via KL Regularization
Reinforcement Learning
Theory
Optimization
- KL regularization serves as an effective alternative to explicit pessimism in offline learning.
- The General-sum Anchored Nash Equilibrium (GANE) achieves an accelerated statistical rate of O(1/n) for Nash equilibria.
- The General-sum Anchored Mirror Descent (GAMD) algorithm provides a computationally efficient method for recovering Coarse Correlated Equilibria.
- The proposed methods eliminate the need for complex hyperparameter tuning associated with traditional pessimistic approaches.
Summary
This paper addresses the challenges of offline multi-agent reinforcement learning in general-sum games, particularly the distribution shift between logged datasets and target equilibrium policies. Traditional methods often rely on manual pessimistic penalties to mitigate this issue. The authors propose using KL regularization as a standalone mechanism to stabilize learning and achieve equilibrium recovery without the need for explicit pessimism. They introduce the General-sum Anchored Nash Equilibrium (GANE), which allows for the recovery of regularized Nash equilibria at an accelerated statistical rate of O(1/n). To facilitate computational efficiency, they develop the General-sum Anchored Mirror Descent (GAMD) algorithm, which converges to a Coarse Correlated Equilibrium at a standard rate of O(1/√n + 1/T). The findings suggest that KL regularization can effectively replace traditional pessimistic approaches, leading to improved statistical efficiency and tractability in multi-player general-sum games.
Methodology
The authors leverage KL regularization to stabilize offline learning in general-sum games, introducing the GANE framework for Nash equilibrium recovery and the GAMD algorithm for computational efficiency. They analyze the statistical rates of these methods and demonstrate their effectiveness through theoretical proofs.
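A generic anchored mirror-descent step shows how the KL term to the logged policy does the stabilizing work: the update below maximizes expected value minus a KL penalty to the behavior policy mu, so the learned policy cannot drift far from the data. The step size, regularization strength, and toy action values are assumptions; this is a single-agent illustration, not the multi-agent GAMD iteration itself.

```python
import numpy as np

def anchored_md_step(pi, mu, Q, eta=0.1, lam=1.0):
    # One mirror-descent step on  E_pi[Q] - lam * KL(pi || mu):
    # the KL anchor to mu plays the role that explicit pessimism plays elsewhere.
    logits = (1 - eta * lam) * np.log(pi) + eta * lam * np.log(mu) + eta * Q
    pi_new = np.exp(logits - logits.max())
    return pi_new / pi_new.sum()

mu = np.array([0.5, 0.3, 0.2])              # logged (behavior) policy
Q = np.array([1.0, 0.2, -0.5])              # action values estimated from offline data (toy)
pi = mu.copy()
for _ in range(200):
    pi = anchored_md_step(pi, mu, Q)
print(pi)                                    # favors high-Q actions but stays close to mu
```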
Results
The paper establishes that KL regularization can independently stabilize learning in general-sum games, achieving an O(1/n) statistical rate for Nash equilibria and an O(1/√n + 1/T) rate for Coarse Correlated Equilibria through the GAMD algorithm. These results indicate a significant improvement over traditional pessimistic methods.
Implications
The findings have potential applications in various strategic environments, including economic markets and multi-agent coordination tasks, where safe exploration is crucial. The proposed methods can enhance decision-making policies derived from offline datasets, making them more robust and efficient.
Model-Based Reinforcement Learning with Double Oracle Efficiency in Policy Optimization and Offline Estimation
Reinforcement Learning
Theory
Efficient ML
- Introduces a novel algorithm for offline oracle-efficient episodic reinforcement learning.
- Achieves optimal regret bounds with significantly reduced oracle call complexity.
- Generalizes the approach to linear MDPs with infinite state and action spaces.
- Demonstrates the first doubly oracle-efficient regret minimization algorithm for MDPs.
Summary
This paper addresses the computational challenges faced by model-based reinforcement learning (RL) in large environments, particularly the inefficiencies associated with traditional regret minimization algorithms that require extensive calls to planning and statistical estimation oracles. The authors propose a novel algorithm for offline oracle-efficient episodic RL that utilizes log-barrier and log-determinant regularization. This algorithm achieves an optimal regret bound while significantly reducing the number of required oracle calls, making it independent of the size of the state and action spaces. The proposed method is applicable to tabular Markov Decision Processes (MDPs) and is further generalized to linear MDPs with infinite state and action spaces, demonstrating the ability to achieve meaningful sub-linear regret. This work represents a significant advancement in the field by providing the first doubly oracle-efficient algorithm for MDPs, thus expanding the computational tractability of RL in complex environments.
Methodology
The authors utilize log-barrier and log-determinant regularization techniques to develop an algorithm that minimizes regret in offline episodic reinforcement learning. The algorithm is designed to require fewer calls to statistical estimation and planning oracles, achieving independence from the size of state and action spaces. The methodology is validated through theoretical proofs and generalizations to more complex MDP structures.
Results
The proposed algorithm achieves an optimal regret bound of O(√T) while requiring only O(H log log T) calls to both offline statistical estimation and planning oracles when T is known, and O(H log T) calls when T is unknown. The generalization to linear MDPs with infinite state and action spaces also results in meaningful sub-linear regret, showcasing the algorithm's versatility and efficiency.
Implications
This research has significant implications for the deployment of reinforcement learning in large and complex environments, such as robotics, autonomous systems, and operations research. By improving computational efficiency, the proposed methods can facilitate the practical application of RL in real-world scenarios where traditional methods are computationally prohibitive.
A Comparative Study of QSPR Methods on a Unique Multitask PAMPA dataset
Theory
Interpretability
- Introduces a unique multitask dataset of 143 drug molecules evaluated across six PAMPA setups.
- Compares various QSPR methods, highlighting the effectiveness of traditional descriptors over deep learning models for small datasets.
- Focuses on the balance between predictive accuracy and model interpretability in drug permeability predictions.
- Provides novel insights into membrane-specific permeability profiles, aiding in drug discovery processes.
Summary
This paper presents a comprehensive study on a unique multitask dataset consisting of 143 drug and drug candidate molecules evaluated through in vitro Parallel Artificial Membrane Permeability Assays (PAMPA) using six different model membranes. The authors systematically assess various molecular descriptors and regression models, ranging from simple linear regression to advanced pre-trained transformer architectures, to predict passive membrane permeability. The study emphasizes the trade-off between predictive performance and model interpretability, revealing that expert-designed physico-chemical property descriptors are more effective for limited sample sizes compared to deep learning representations. This research is notable for being the most extensive analysis of multiple organ-specific PAMPA membranes to date, providing new insights into membrane-specific permeability profiles and enhancing the understanding of drug absorption mechanisms in early drug discovery.
Methodology
The study employs a comparative analysis of various QSPR methods, utilizing a dataset derived from PAMPA assays. It evaluates multiple molecular descriptors and regression models, including linear regression and advanced machine learning techniques, to predict passive membrane permeability. The models are rigorously validated through internal and external validation techniques to ensure robustness and generalizability.
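The descriptor-plus-linear-model baseline that the study finds effective at this sample size can be reproduced in a few lines of scikit-learn; the descriptor matrix and permeability values below are random stand-ins, whereas the real study uses computed physico-chemical descriptors and measured PAMPA values.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 143 molecules, a handful of physico-chemical descriptors
# (e.g. logP, TPSA, molecular weight) and measured permeability for one membrane.
X_desc = rng.normal(size=(143, 8))
y_perm = X_desc @ rng.normal(size=8) + 0.3 * rng.normal(size=143)

# Internal validation of a simple, interpretable descriptor-based model.
scores = cross_val_score(Ridge(alpha=1.0), X_desc, y_perm, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```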
Results
The findings indicate that expert-designed physico-chemical property descriptors yield better predictive performance for the limited sample size of the permeability study compared to deep learning-based representations. The study also highlights the challenges associated with machine learning approaches in terms of interpretability and predictive accuracy.
Implications
The insights gained from this study can significantly impact the drug discovery process by improving the prediction of drug permeability across different biological membranes. This can lead to more effective drug design and development strategies, ultimately enhancing therapeutic outcomes.
Weisfeiler Lehman Test on Combinatorial Complexes: Generalized Expressive Power of Topological Neural Networks
Graph Learning
Theory
Efficient ML
- Introduction of the Combinatorial Complex Weisfeiler-Lehman (CCWL) test for topological neural networks.
- Establishment of a unified theoretical framework for topological message passing across various combinatorial structures.
- Proof that upper and lower neighborhood relations suffice for full expressivity in the CCWL framework.
- Development of the Combinatorial Complex Isomorphism Network (CCIN) that outperforms existing methods.
Summary
This paper introduces the Combinatorial Complex Weisfeiler-Lehman (CCWL) test, an extension of the Weisfeiler-Lehman (WL) test tailored for combinatorial complexes, which unify various topological structures such as graphs, hypergraphs, and simplicial complexes. The authors argue that existing topological neural networks lack a cohesive theoretical foundation, leading to fragmented approaches that do not adequately capture the expressive power of higher-order structures. The CCWL test formalizes topological message passing through four types of neighborhood relations, providing a comprehensive framework for understanding the expressivity of these structures. The authors prove that only upper and lower neighborhood relations are necessary to achieve the full expressivity of the CCWL framework. Additionally, they propose the Combinatorial Complex Isomorphism Network (CCIN), which is evaluated against synthetic and real-world datasets, demonstrating superior performance compared to baseline methods. This work addresses significant gaps in the literature regarding topological message passing and lays the groundwork for future research in topological deep learning.
Methodology
The authors developed the CCWL test as an axiomatic extension of the WL test, incorporating four neighborhood relations to facilitate topological message passing. They conducted theoretical analyses to establish the expressivity of the CCWL framework and proposed the CCIN architecture, which was empirically evaluated on both synthetic and real-world datasets.
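To see what a WL-style test over a combinatorial complex involves, the sketch below refines colors on cells using several named neighborhood relations at once (here just an "upper" and a "lower" relation on a tiny path graph). It is a bare color-refinement loop for intuition, not the CCWL test or the CCIN architecture.

```python
def wl_refine(cells, neighborhoods, rounds=3):
    # WL-style color refinement: each cell's new color hashes its old color together
    # with the multiset of neighbor colors under every relation.
    colors = {c: 0 for c in cells}
    for _ in range(rounds):
        signatures = {}
        for c in cells:
            signatures[c] = (colors[c], tuple(
                (name, tuple(sorted(colors[n] for n in adj.get(c, []))))
                for name, adj in sorted(neighborhoods.items())))
        relabel = {}
        for c in cells:
            relabel.setdefault(signatures[c], len(relabel))
        colors = {c: relabel[signatures[c]] for c in cells}
    return colors

cells = ["v1", "v2", "v3", "e12", "e23"]
upper = {"v1": ["e12"], "v2": ["e12", "e23"], "v3": ["e23"]}   # vertex -> incident edges
lower = {"e12": ["v1", "v2"], "e23": ["v2", "v3"]}             # edge -> boundary vertices
print(wl_refine(cells, {"upper": upper, "lower": lower}))
```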
Results
The CCIN demonstrated improved performance over baseline methods in various benchmarks, confirming the effectiveness of the CCWL framework in capturing the expressive power of combinatorial complexes. The theoretical results provided insights into the necessary conditions for achieving expressivity in topological neural networks.
Implications
This work has significant implications for the field of topological deep learning, offering a unified framework that can enhance the modeling of complex data structures in various applications, including social networks, biological systems, and geometric modeling. It also sets the stage for further exploration of higher-order interactions in machine learning.
Learning Rate Transfer in Normalized Transformers
Optimization
Theory
Efficient ML
- Introduction of νGPT, a novel parameterization for Normalized Transformers.
- Demonstrates effective learning rate transfer across model width, depth, and token horizon.
- Empirical validation shows no performance loss compared to the original nGPT.
- Utilizes alignment exponents to refine hyperparameter transfer techniques.
Summary
This paper addresses the challenge of learning rate transfer in Normalized Transformers (nGPT), which have shown remarkable training speedups and do not require weight decay or learning rate warmup. Despite these advantages, the authors found that nGPT does not exhibit effective learning rate transfer across different model dimensions and token horizons. To overcome this limitation, they introduce a new parameterization called νGPT, which is developed through a combination of numerical experiments and theoretical insights from alignment exponents. The νGPT model allows for effective learning rate transfer across model width, depth, and token horizon, demonstrating improved performance over the original nGPT without any loss in effectiveness. The authors validate their findings through extensive empirical testing, confirming that νGPT maintains stability and performance while enabling hyperparameter transfer across various model configurations.
Methodology
The authors conducted numerical experiments and theoretical analyses to develop the νGPT parameterization. They employed alignment exponents to understand weight-activation relationships in nGPT and modified existing hyperparameter transfer techniques to enhance learning rate transfer across various model configurations.
Results
The experiments revealed that νGPT successfully transfers learning rates across model width and depth, showing better performance than the original µP approach. The optimal learning rate was found to scale with the token horizon as #tokens^(-1/3), consistent with previous findings for un-normalized Transformers. Overall, νGPT achieved hyperparameter transfer without degrading performance compared to well-tuned nGPT models.
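The reported token-horizon relationship can be applied directly when reusing a tuned learning rate at a longer horizon; the helper below simply encodes the #tokens^(-1/3) scaling (the base values are examples, and width/depth transfer is handled by the νGPT parameterization itself rather than by this formula).

```python
def transfer_lr(base_lr, base_tokens, target_tokens, exponent=-1 / 3):
    # Scale a tuned learning rate to a new token horizon using the reported
    # #tokens^(-1/3) relationship; example values only.
    return base_lr * (target_tokens / base_tokens) ** exponent

# e.g. a learning rate tuned at 1B tokens, reused at 100B tokens
print(transfer_lr(base_lr=3e-3, base_tokens=1e9, target_tokens=1e11))
```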
Implications
The findings suggest that νGPT can facilitate more efficient training of large-scale models by allowing practitioners to leverage hyperparameter settings from smaller models, potentially reducing the computational burden associated with hyperparameter tuning. This advancement could lead to faster development cycles and improved performance in various applications of Normalized Transformers.
Co-Evolving Policy Distillation
Reinforcement Learning
Multimodal
- Identifies limitations of pipelines that combine reinforcement learning with verifiable rewards (RLVR) and on-policy distillation (OPD), caused by growing behavioral distance between teacher and student models.
- Proposes CoPD, which interleaves RLVR and mutual OPD for continuous co-evolution of expert models.
- Demonstrates that CoPD outperforms existing methods in multimodal reasoning tasks.
- Establishes that maintaining behavioral proximity enhances knowledge transfer during training.
Summary
The paper introduces Co-Evolving Policy Distillation (CoPD), a novel approach that addresses the limitations of existing post-training paradigms, specifically pipelines that mix reinforcement learning with verifiable rewards (RLVR) and on-policy distillation (OPD). The authors identify a critical issue: the behavioral distance between teacher and student models grows during the static OPD phase, leading to ineffective knowledge transfer. CoPD proposes a unified training framework in which multiple expert models are trained in parallel and serve as mutual teachers throughout the training process. This co-evolutionary approach maintains closer behavioral alignment between models, facilitating better knowledge absorption. The methodology alternates phases of RLVR and mutual OPD, ensuring that each model continuously learns while also teaching others. Experimental results demonstrate that CoPD significantly outperforms traditional methods across various tasks, including text, image, and video reasoning, achieving superior integration of multimodal capabilities. The findings suggest that CoPD could inspire new paradigms for scaling model training effectively.
Methodology
CoPD employs a dual-phase training approach where reinforcement learning is conducted alongside mutual on-policy distillation. Each expert model is trained on its specific capability while simultaneously providing and receiving knowledge from other models, ensuring they remain behaviorally aligned throughout the training process.
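A schematic of one CoPD round, with toy "models" and hypothetical callbacks standing in for the RLVR and distillation updates, makes the interleaving explicit: every expert first improves on its own domain, then every pair exchanges knowledge while their behaviors are still close.

```python
def copd_round(experts, rlvr_update, distill_update):
    # Phase 1: each expert takes an RL-with-verifiable-rewards step on its own domain.
    for name in experts:
        experts[name] = rlvr_update(experts[name], domain=name)
    # Phase 2: mutual on-policy distillation; every expert learns from every other expert.
    for student in experts:
        for teacher in experts:
            if teacher != student:
                experts[student] = distill_update(experts[student], experts[teacher])
    return experts

# Toy stand-ins: a "model" is a single skill score; the update callbacks are hypothetical.
experts = {"text": {"skill": 1.0}, "image": {"skill": 2.0}, "video": {"skill": 0.5}}
rlvr = lambda m, domain: {"skill": m["skill"] + 0.1}
distill = lambda s, t: {"skill": 0.9 * s["skill"] + 0.1 * t["skill"]}
for _ in range(5):
    experts = copd_round(experts, rlvr, distill)
print(experts)
```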
Results
CoPD consistently outperformed mixed RLVR and static OPD methods across various benchmarks, including text reasoning, image-text reasoning, and video understanding tasks. The results indicate that the co-evolutionary training framework leads to better integration of capabilities and surpasses the performance of domain-specific expert models.
Implications
The findings suggest that CoPD could reshape how multimodal models are trained, allowing for more efficient knowledge transfer and integration of diverse capabilities. This approach may lead to advancements in various applications, including AI systems that require robust reasoning across multiple modalities.
BoostLoRA: Growing Effective Rank by Boosting Adapters
NLP
Large Language Models
Efficient ML
- BoostLoRA allows for linear growth of effective rank through iterative training and merging of ultra-low-parameter adapters.
- The ROTATE SVD basis strategy ensures that each adapter operates in an orthogonal subspace, enhancing model expressivity.
- BoostLoRA achieves superior performance on benchmarks like GSM8K and MATH-500 compared to existing methods.
- The framework maintains zero inference overhead by discarding merged adapters after training.
Summary
The paper introduces BoostLoRA, a novel parameter-efficient fine-tuning (PEFT) framework that addresses the limitations of ultra-low-parameter adapters in terms of expressivity. Traditional methods, such as TinyLoRA, face a tradeoff between adapter size and performance, as they are confined to fixed low-rank subspaces. BoostLoRA employs a gradient-boosting approach, iteratively training and merging minimal adapters on examples where the current model fails. This method utilizes a ROTATE SVD basis strategy to assign each training round to an orthogonal subspace, allowing the cumulative effective rank to grow linearly with the number of rounds while maintaining ultra-low rank for each adapter. The framework discards merged adapters post-training, resulting in zero inference overhead. Experimental results demonstrate that BoostLoRA outperforms both the best single-shot ultra-low-parameter adapter and full fine-tuning across various tasks, including math problem-solving and code generation. Additionally, it shows promise in cross-architecture transfer tasks, indicating its versatility and effectiveness in enhancing model performance without significant increases in parameter count.
Methodology
BoostLoRA employs a gradient-boosting framework to iteratively train TinyLoRA adapters on misclassified examples. Each adapter is assigned to an orthogonal subspace using a ROTATE SVD basis strategy, allowing for cumulative effective rank growth while keeping individual adapters ultra-low-rank. After training, adapters are merged into the base weights and discarded to maintain efficiency during inference.
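The boosting mechanic can be isolated on a linear toy problem: each round fits a rank-1 "adapter" only on currently failing examples, constrained to a fresh orthogonal output direction from an SVD basis, and merges it into the weights. This is a deliberately simplified stand-in for the transformer adapters and the ROTATE strategy described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 200
W_true = rng.normal(size=(d, d))
X = rng.normal(size=(n, d))
Y = X @ W_true.T

W = rng.normal(size=(d, d)) * 0.1          # "base model" weights
U, _, _ = np.linalg.svd(W)                 # fixed orthonormal basis; one column per round

for r in range(8):                          # boosting rounds, each an ultra-low-rank adapter
    resid = Y - X @ W.T
    err = np.linalg.norm(resid, axis=1)
    hard = err > np.median(err)             # train only on currently failing examples
    u = U[:, r:r + 1]                       # fresh orthogonal output direction
    # Rank-1 adapter delta = u @ v.T, with v fit by least squares on the hard examples.
    v, *_ = np.linalg.lstsq(X[hard], resid[hard] @ u, rcond=None)
    W = W + u @ v.T                         # merge the adapter into the base weights
print(np.linalg.norm(Y - X @ W.T))
```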
Results
BoostLoRA achieved 89.1% accuracy on GSM8K and 68.8% on MATH-500, surpassing TinyLoRA's best performance of 87.2% and full fine-tuning at 87.0%. In code generation tasks, it reached 57.2% on MBPP and 80.4% on HumanEval. The method also outperformed TinyLoRA in protein binding classification tasks across various parameter scales.
Implications
BoostLoRA's approach to growing effective rank during training could significantly enhance the performance of large language models in various applications, including math problem-solving, code generation, and potentially other domains requiring efficient model adaptation. Its ability to maintain low parameter counts while improving expressivity may lead to more accessible and efficient model fine-tuning strategies.
Information-Theoretic Generalization Bounds for Stochastic Gradient Descent with Predictable Virtual Noise
Theory
Optimization
- Introduces history-adaptive virtual perturbations for SGD analysis.
- Replaces fixed perturbation geometries with adaptive covariances based on past optimization history.
- Establishes information-theoretic generalization bounds that account for dynamic optimization processes.
- Demonstrates that the framework can recover existing bounds as special cases.
Summary
This paper presents a novel framework for analyzing stochastic gradient descent (SGD) through information-theoretic generalization bounds that incorporate predictable history-adaptive virtual perturbations. Traditional approaches to bounding the generalization error in SGD often rely on fixed perturbation geometries, which do not account for the dynamic nature of optimization processes. The author introduces a method where the perturbation covariance at each iteration can depend on the past SGD history, allowing for a more flexible and accurate representation of the optimization landscape. This predictability condition facilitates a conditional Gaussian relative-entropy argument, leading to bounds that replace fixed local gradient-deviation and gradient-sensitivity quantities with adaptive counterparts. The framework also includes an adaptive output-sensitivity penalty based on the accumulated perturbation covariance. The results show that the proposed bounds can recover fixed isotropic and geometry-aware virtual perturbation bounds as special cases, while extending the analysis to history-dependent stochastic optimization without altering the underlying SGD algorithm.
Methodology
The paper employs an information-theoretic approach to derive generalization bounds for SGD by introducing predictable history-adaptive virtual perturbations. This involves using conditional Gaussian relative-entropy arguments to analyze the relationship between learned parameters and training data, while allowing the perturbation covariance to adapt based on the optimization history.
Results
The proposed framework yields generalization bounds that are more representative of the actual optimization trajectory compared to traditional fixed-noise approaches. It successfully incorporates adaptive perturbation geometries, leading to improved insights into the generalization behavior of SGD in complex learning scenarios.
Implications
The findings suggest that incorporating history-dependent perturbations can lead to better theoretical understanding and practical implementations of SGD in machine learning, particularly in high-dimensional and nonconvex optimization problems. This could enhance the performance of various learning systems that rely on SGD.
Online semi-supervised perception: Real-time learning without explicit feedback
Computer Vision
Graph Learning
Theory
- Proposes a novel algorithm for real-time learning without explicit feedback.
- Combines semi-supervised learning on graphs with online learning techniques.
- Demonstrates superior performance in real-time face recognition tasks.
- Establishes a regret bound for the quality of solutions provided by the algorithm.
Summary
This paper introduces an innovative algorithm for real-time learning that operates without explicit feedback, merging concepts from semi-supervised learning on graphs and online learning. The algorithm constructs a graphical representation of its environment and updates it iteratively with observed examples. Initially, labeled examples provide a bias, while a continuous stream of unlabeled examples is collected online to refine this bias. The authors demonstrate the algorithm's efficacy by applying it to the challenging task of real-time face recognition, achieving superior precision and recall across three distinct video datasets. The paper also discusses the efficient implementation of the algorithm, establishes a regret bound on the quality of its solutions, and emphasizes the algorithm's adaptability to changing data over time. The main contribution lies in showcasing the potential of unlabeled data in enhancing learning algorithms for real-world applications, particularly in the domain of face recognition.
Methodology
The proposed algorithm iteratively builds a graphical representation of the environment, utilizing a harmonic function solution on the data adjacency graph to infer labels for unlabeled examples. The algorithm begins with a set of labeled examples and continuously updates its bias using a stream of unlabeled data. The implementation is designed to be efficient, allowing for real-time processing and adaptation to changes in the data manifold.
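The harmonic-function step itself has a standard closed form: labels on unlabeled nodes solve a Laplacian system driven by the labeled nodes. The four-node chain below (not from the paper) shows it in full; the paper's contribution is maintaining and updating this solution online as new unlabeled observations arrive.

```python
import numpy as np

def harmonic_labels(W, f_labeled, labeled_idx, unlabeled_idx):
    # Harmonic-function label propagation: solve (D_uu - W_uu) f_u = W_ul f_l.
    D = np.diag(W.sum(axis=1))
    L = D - W                                      # graph Laplacian
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    return np.linalg.solve(L_uu, W_ul @ f_labeled)

# Tiny chain graph: nodes 0 and 3 are labeled, 1 and 2 are not.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(harmonic_labels(W, f_labeled=np.array([0.0, 1.0]),
                      labeled_idx=[0, 3], unlabeled_idx=[1, 2]))
```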
Results
The algorithm was empirically evaluated on three challenging video datasets for face recognition, achieving superior precision and recall compared to existing methods. The results validate the effectiveness of leveraging unlabeled data in enhancing the performance of the face recognizer.
Implications
The findings suggest that the proposed approach can significantly improve adaptive machine learning algorithms in real-world scenarios where labeled data is scarce. The ability to learn from unlabeled data in real-time opens new avenues for applications in various domains, including security, surveillance, and human-computer interaction.
Trading off rewards and errors in multi-armed bandits
Reinforcement Learning
Theory
Optimization
- Introduces a new objective function for balancing rewards and estimation errors in MAB settings.
- Develops the ForcingBalance algorithm, which optimizes the proposed objective function.
- Proves that ForcingBalance achieves asymptotic regret rates comparable to the best strategies for both cumulative reward and active exploration.
- Demonstrates the algorithm's effectiveness through empirical simulations on educational data.
Summary
This paper addresses the challenge of balancing cumulative reward maximization and accurate estimation of arm values in multi-armed bandit (MAB) problems. Traditional MAB objectives often focus solely on maximizing rewards or minimizing estimation errors, but in many practical scenarios, such as educational games, it is crucial to consider both aspects simultaneously. The authors formalize this trade-off and introduce the ForcingBalance algorithm, which is designed to optimize a new objective function that integrates both rewards and estimation errors. The paper demonstrates that ForcingBalance achieves a regret that asymptotically matches the best possible rates for both cumulative regret minimization and active exploration, indicating that balancing these objectives is not inherently more difficult than optimizing for either one alone. Empirical results on real-world educational data validate the effectiveness of the ForcingBalance algorithm, showing that it can provide valuable insights without compromising overall rewards.
Methodology
The authors propose a new objective function that allows designers to weigh the trade-off between cumulative rewards and estimation errors. They introduce the ForcingBalance algorithm, which operates under the assumption that arm distributions are unknown. The algorithm is analyzed for its regret performance, showing that it can achieve optimal rates for both objectives. The analysis relies on properties of strong convexity and smoothness of the objective function, making the approach extensible.
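A forced-exploration loop conveys the trade-off in a few lines: every arm is guaranteed a minimum, slowly growing number of pulls so its value can be estimated, while the remaining pulls go to the empirically best arm for reward. The forcing schedule and allocation rule here are illustrative, not the ForcingBalance algorithm's actual weighting of the two objectives.

```python
import numpy as np

def forcing_balance(means, horizon, force=lambda t: int(np.ceil(np.sqrt(t)))):
    # Forced-exploration bandit loop: guarantee each arm a growing pull count
    # (estimation accuracy), otherwise exploit the empirical best arm (reward).
    rng = np.random.default_rng(0)
    K = len(means)
    counts, sums = np.zeros(K), np.zeros(K)
    for t in range(1, horizon + 1):
        under = np.where(counts < force(t) / K)[0]
        arm = under[0] if len(under) else int(np.argmax(sums / np.maximum(counts, 1)))
        reward = rng.normal(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return sums.sum(), sums / counts              # cumulative reward, estimated arm means

total, est = forcing_balance(means=np.array([0.2, 0.5, 0.8]), horizon=5000)
print(total, est)
```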
Results
The ForcingBalance algorithm was shown to incur a regret that asymptotically matches the minimax rate for cumulative regret minimization and the performance of active exploration algorithms. Empirical simulations on both synthetic and educational datasets corroborated the theoretical findings, demonstrating that the algorithm effectively balances the trade-off between rewards and errors.
Implications
The findings suggest that the ForcingBalance algorithm can be applied in various domains where both user satisfaction and accurate data collection are critical, such as education and healthcare. This approach can enhance the design of interactive systems that require a balance between providing immediate rewards to users and gathering reliable information for future decision-making.
Scalable Context-Aware Graph Attention for Unsupervised Anomaly Detection in Large-Scale Mobile Networks
Graph Learning
Time Series
- Introduction of C-MTAD-GAT, a centralized context-aware anomaly detection system for telecom networks.
- Development of a domain-agnostic calibration protocol based solely on validation errors.
- Validation across multiple datasets, including TELCO, RAN, and EPC control-plane data.
- Demonstration of scalability and stability in anomaly detection as the number of network elements increases.
Summary
This paper presents C-MTAD-GAT, a novel framework for unsupervised anomaly detection in large-scale mobile networks, addressing the challenges posed by high-dimensional key performance indicator (KPI) time series from heterogeneous network elements. The authors highlight the impracticality of supervised approaches due to the scale and cost of incident labeling, thus motivating the need for a robust unsupervised method. C-MTAD-GAT integrates temporal and feature-wise graph attention with lightweight static and dynamic context conditioning, along with a dual-head decoder for reconstruction and multi-step forecasting. The model generates per-element, per-feature anomaly scores, which are converted to alerts using fully unsupervised thresholds based on validation residuals. The framework was evaluated on the TELCO dataset and demonstrated improvements in event-level affiliation and pointwise F1 scores while generating fewer alarms compared to existing baselines. Furthermore, the model was successfully applied to real-world datasets from a national mobile network operator, receiving positive feedback for its actionable alerts and scalability across different domains without relying on labeled incidents.
Methodology
C-MTAD-GAT employs a context-aware graph attention mechanism that incorporates both static and dynamic metadata for anomaly detection. The model is designed to operate as a single shared architecture across diverse network elements, producing anomaly scores that are calibrated without the need for labeled data. It utilizes a dual-head decoder for both reconstruction and forecasting tasks, enhancing its ability to detect anomalies in multivariate time series data.
Results
The C-MTAD-GAT framework achieved superior performance on the TELCO dataset, improving event-level affiliation and pointwise F1 scores while reducing the number of false alarms compared to previous graph-attention and VAE-based methods. In practical deployment within a national mobile network, the model provided actionable alerts that supported daily monitoring, demonstrating its effectiveness and scalability.
Implications
The proposed framework has significant implications for mobile network operators, allowing for efficient and scalable anomaly detection across large and heterogeneous network infrastructures. Its unsupervised nature reduces the dependency on costly labeled data, making it a practical solution for real-time monitoring and incident response in telecom environments.
Physical Foundation Models: Fixed hardware implementations of large-scale neural networks
Efficient ML
Large Language Models
Theory
- PFMs could significantly reduce energy consumption and improve performance for large-scale AI models.
- The paper advocates for hardware implementations that utilize the physical properties of materials for computation.
- PFMs may enable the deployment of AI models with parameter counts reaching 10^18.
- The authors discuss the challenges and open questions in realizing PFMs in practical applications.
Read more
Physical Foundation Models: Fixed hardware implementations of large-scale neural networks
Summary
This paper discusses the concept of Physical Foundation Models (PFMs), which are specialized hardware implementations of large-scale neural networks designed to enhance energy efficiency, speed, and parameter density. The authors argue that the rise of foundation models, such as GPT-5 and Gemini 3, presents a unique opportunity for hardware engineers to create fixed hardware solutions that can be manufactured and released in sync with advancements in AI models. Unlike traditional digital-electronic implementations, PFMs leverage the natural physical dynamics of materials to perform computations directly, potentially achieving significant improvements in performance and energy consumption. The paper highlights the challenges posed by the increasing energy demands of AI systems and proposes that PFMs could enable the deployment of models with up to 10^18 parameters, far exceeding current capabilities. The authors provide calculations illustrating the potential of PFMs using optical examples and discuss the implications for various physical platforms, including nanoelectronics. They conclude by outlining the research challenges that must be addressed to realize the vision of trillion-parameter PFMs.
Methodology
The authors propose a conceptual framework for Physical Foundation Models that utilizes analog physical media to perform neural network computations directly through the hardware's natural dynamics. They present back-of-the-envelope calculations to illustrate the scaling potential of PFMs, particularly using optical examples.
Results
The paper suggests that PFMs could achieve orders-of-magnitude improvements in energy efficiency and speed compared to traditional digital implementations. The calculations indicate that PFMs could support models with significantly higher parameter counts than currently feasible.
Implications
The development of PFMs could revolutionize the deployment of AI by making it feasible to run larger models in energy-constrained environments, such as edge devices, and could alleviate the growing energy demands of datacenters. This could lead to more sustainable AI practices and enhanced capabilities in various applications.
AdaBFL: Multi-Layer Defensive Adaptive Aggregation for Byzantine-Robust Federated Learning
Federated Learning
- Introduction of AdaBFL, a multi-layer defensive aggregation method for Byzantine-robust federated learning.
- Development of a three-layer defense mechanism that adapts to different types of poisoning attacks.
- Theoretical proof of convergence for AdaBFL under non-convex settings with non-iid data.
- Extensive experimental validation demonstrating AdaBFL's effectiveness compared to existing methods.
Read more
AdaBFL: Multi-Layer Defensive Adaptive Aggregation for Byzantine-Robust Federated Learning
Summary
The paper presents AdaBFL, a novel approach to enhance the robustness of federated learning (FL) against Byzantine attacks, which are malicious attempts to corrupt the training process by submitting poisoned model updates. Traditional FL methods are vulnerable to such attacks, as even a single malicious client can significantly degrade the performance of the global model. The authors propose a multi-layer defensive adaptive aggregation mechanism that employs a three-layer defense strategy, allowing for adaptive adjustment of defense weights to counter various attack types. The paper also establishes theoretical convergence properties of AdaBFL under non-convex settings with non-iid data distributions, demonstrating its effectiveness in maintaining model integrity. Extensive experiments across multiple datasets validate AdaBFL's superiority over existing aggregation methods, showcasing its resilience against sophisticated poisoning attacks and its ability to improve the overall performance of federated learning systems.
Methodology
The AdaBFL method utilizes a three-layer defensive mechanism that adaptively adjusts the weights of different aggregation strategies based on the nature of the attacks. The authors theoretically analyze the convergence of the method and conduct comprehensive experiments across various datasets and attack scenarios to validate its performance.
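As a rough illustration of adaptive, multi-layer robust aggregation (not AdaBFL's actual defense layers), the sketch below combines several standard robust aggregators and re-weights them by how closely the bulk of client updates agrees with each. The candidate aggregators, the softmax weighting, and `trim_frac` are assumptions of this sketch.

```python
import numpy as np

def adaptive_robust_aggregate(updates, trim_frac=0.2, temp=1.0):
    """Illustrative adaptive aggregation: several robust estimators are blended
    with weights adapted to how tightly the uploaded updates cluster around
    them. The candidates and weighting rule are assumptions, not AdaBFL's."""
    U = np.asarray(updates)                               # shape (n_clients, dim)
    n = len(U)
    k = int(trim_frac * n)
    srt = np.sort(U, axis=0)
    candidates = {
        "mean": U.mean(axis=0),
        "median": np.median(U, axis=0),
        "trimmed": srt[k:n - k].mean(axis=0) if n - 2 * k > 0 else np.median(U, axis=0),
    }
    # favor candidates that the majority of clients agree with
    scores = {name: -np.median(np.linalg.norm(U - c, axis=1)) for name, c in candidates.items()}
    s = np.array(list(scores.values())) / temp
    w = np.exp(s - s.max()); w /= w.sum()
    return sum(wi * c for wi, c in zip(w, candidates.values()))

rng = np.random.default_rng(0)
honest = rng.normal(0, 0.1, size=(8, 4))
poisoned = np.vstack([honest, np.full((2, 4), 5.0)])      # two Byzantine clients
print(adaptive_robust_aggregate(poisoned))                # stays close to the honest updates
```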
Results
The experiments show that AdaBFL outperforms existing Byzantine-robust aggregation methods in terms of maintaining the accuracy and integrity of the global model under various poisoning attacks. The theoretical analysis confirms that AdaBFL converges effectively even in challenging non-convex and non-iid settings.
Implications
AdaBFL has significant implications for enhancing the security and reliability of federated learning systems, particularly in sensitive applications where data privacy is paramount. Its adaptive nature allows for better resilience against evolving attack strategies, making it suitable for real-world deployment in diverse domains such as healthcare, finance, and beyond.
Optimal Spatio-Temporal Decoupling for Bayesian Conformal Prediction
Time Series
Theory
Optimization
- Introduction of State-Adaptive Bayesian Conformal Prediction (SA-BCP) framework.
- SA-BCP effectively decouples temporal and spatial components for improved prediction intervals.
- Rigorous theoretical analysis establishes a minimax bias-variance tradeoff.
- Empirical results show significant improvements in minimizing under-coverage and interval bloat.
Read more
Optimal Spatio-Temporal Decoupling for Bayesian Conformal Prediction
Summary
This paper addresses the challenges of Online Conformal Prediction (CP) in non-stationary time series, particularly the balance between temporal adaptability and structural stability. Existing methods, such as Adaptive Conformal Inference (ACI) and temporally discounted Bayesian CP, face issues like systemic under-coverage and uncalibrated interval bloat during abrupt shifts. The authors propose a novel framework called State-Adaptive Bayesian Conformal Prediction (SA-BCP), which decouples temporal and spatial components to enhance prediction intervals. SA-BCP utilizes a gating mechanism that combines long-term temporal inertia with spatial kernel-density evidence, allowing it to adaptively expand intervals for recognized historical regimes while maintaining efficiency during stable periods. The theoretical foundation of SA-BCP is rigorously established, demonstrating an optimal minimax bias-variance tradeoff governed by an evidence threshold. Empirical evaluations on volatile financial datasets show that SA-BCP consistently minimizes the Winkler score across various confidence levels, effectively addressing the under-coverage of ACI and reducing interval bloat by 10% to 37% under high-confidence requests. This work presents a significant advancement in uncertainty quantification for non-stationary time series forecasting.
Methodology
The methodology involves a spatio-temporal decoupling approach where SA-BCP combines temporal base density with spatial kernel density estimation. The temporal component captures recent volatility, while the spatial component accounts for historical state similarities. The framework is designed to adaptively adjust prediction intervals based on recognized historical regimes and current states, ensuring both reliability and efficiency.
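A minimal sketch of the spatio-temporal idea: a temporal conformal quantile over recent residuals is blended with a spatial, kernel-weighted quantile over residuals from historically similar states, with a gate that only trusts the spatial evidence when enough similar states exist. The gating rule and every parameter here (bandwidth, window, evidence threshold) are assumptions of this sketch, not SA-BCP's exact construction.

```python
import numpy as np

def spatio_temporal_interval(residual_hist, state_hist, x_now, alpha=0.1,
                             bandwidth=1.0, window=50, evid_threshold=5.0):
    """Illustrative interval half-width blending temporal and spatial evidence.
    All parameters are assumptions of this sketch."""
    res = np.asarray(residual_hist, dtype=float)
    states = np.asarray(state_hist, dtype=float)
    # temporal component: quantile of the most recent absolute residuals
    q_temporal = np.quantile(np.abs(res[-window:]), 1 - alpha)
    # spatial component: kernel-weighted quantile over historically similar states
    d2 = np.sum((states - x_now) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / bandwidth ** 2)
    evidence = w.sum()
    order = np.argsort(np.abs(res))
    cw = np.cumsum(w[order]) / evidence
    q_spatial = np.abs(res)[order][np.searchsorted(cw, 1 - alpha)]
    g = min(1.0, evidence / evid_threshold)               # gate: trust spatial evidence if ample
    return g * q_spatial + (1 - g) * q_temporal

rng = np.random.default_rng(1)
hist_states = rng.normal(size=(200, 3))
hist_res = rng.normal(scale=1 + np.abs(hist_states[:, 0]))
print(spatio_temporal_interval(hist_res, hist_states, hist_states[-1]))
```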
Results
SA-BCP was empirically validated on financial datasets, showing consistent minimization of the Winkler score across different confidence levels. It resolved the systematic under-coverage issues found in ACI variants and reduced uncalibrated interval bloat by 10% to 37% for high-confidence predictions, demonstrating its effectiveness in real-world applications.
Implications
The findings suggest that SA-BCP can be applied in various domains requiring robust uncertainty quantification in non-stationary environments, such as finance, economics, and other fields involving time series data. Its ability to adaptively manage prediction intervals could enhance decision-making processes in volatile settings.
Learning from a single labeled face and a stream of unlabeled data
Computer Vision
- Introduces Online Manifold Tracking (OMT) for face recognition from a single labeled image and unlabeled data.
- Frames the problem as one-class classification, addressing the lack of negative examples.
- Achieves 90% identification accuracy with nearly zero false positives, outperforming existing methods.
- Demonstrates real-time performance with an average recognition time of 0.05 seconds.
Read more
Learning from a single labeled face and a stream of unlabeled data
Summary
This paper addresses the challenge of face recognition when only a single labeled image per person is available, a common scenario in personal authentication systems. The authors propose a novel approach termed Online Manifold Tracking (OMT), which leverages a stream of unlabeled data to enhance the learning process. The method is framed as a one-class classification problem, where the goal is to learn a non-parametric model of the face from the labeled image and the abundant unlabeled data. The OMT algorithm adapts to changes in the data and learns the underlying manifold structure in real-time, without the need for extensive offline training. The authors demonstrate that their method significantly outperforms traditional techniques, achieving a 90% identification rate with nearly zero false positives, which is a 15% improvement over the Fisherfaces method at the same false positive rate. Furthermore, the paper includes a sensitivity analysis of the method, providing guidelines for parameter settings to optimize performance.
Methodology
The authors propose the Online Manifold Tracking (OMT) algorithm, which learns the structure of the face manifold from a single labeled image and a continuous stream of unlabeled data. The method is non-parametric and adapts to variations in facial expressions and poses, utilizing the manifold structure of the data to improve recognition accuracy.
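The following is a minimal sketch of one-class manifold tracking: starting from a single labeled embedding, unlabeled samples close enough to the current exemplar set are absorbed so the model follows appearance changes over time. The distance threshold and exemplar budget are assumptions of this sketch, not the paper's OMT algorithm.

```python
import numpy as np

class OnlineManifoldTracker:
    """Illustrative one-class tracker seeded by a single labeled embedding.
    Threshold and exemplar budget are assumptions of this sketch."""
    def __init__(self, labeled_embedding, threshold=0.8, max_exemplars=500):
        self.exemplars = [np.asarray(labeled_embedding, dtype=float)]
        self.threshold = threshold
        self.max_exemplars = max_exemplars

    def score(self, x):
        return min(np.linalg.norm(x - e) for e in self.exemplars)   # small = likely the target

    def update(self, x):
        x = np.asarray(x, dtype=float)
        if self.score(x) < self.threshold:        # absorb samples close to the current manifold
            self.exemplars.append(x)
            if len(self.exemplars) > self.max_exemplars:
                self.exemplars.pop(1)             # always keep the original labeled exemplar
            return True
        return False

rng = np.random.default_rng(0)
tracker = OnlineManifoldTracker(rng.normal(size=16))
stream = tracker.exemplars[0] + 0.1 * rng.normal(size=(100, 16))     # unlabeled stream
accepted = sum(tracker.update(x) for x in stream)
print(accepted, len(tracker.exemplars))
```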
Results
The OMT method achieved a 90% identification rate with almost zero false positives when evaluated on a dataset of 43 individuals. This performance is notably 15% better than the Fisherfaces method at the same false positive rate. The algorithm also demonstrated real-time processing capabilities, recognizing faces in an average of 0.05 seconds.
Implications
This research has significant implications for personal authentication systems, particularly in environments where only a single labeled image is available. The ability to effectively utilize unlabeled data can enhance security measures in various applications, such as mobile devices and computer systems, making them more robust against unauthorized access.
AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
Large Language Models
NLP
Efficient ML
- AutoSP is the first automated solution for optimizing LLM training for long-context tasks.
- It integrates sequence parallelism and activation-checkpointing into a PyTorch-native compilation framework.
- The method significantly increases the maximum input context length for LLMs without compromising training speed.
- AutoSP is compatible with both NVIDIA and AMD hardware, demonstrating versatility in application.
Read more
AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
Summary
The paper introduces AutoSP, an innovative solution aimed at optimizing the training of large language models (LLMs) for long-context tasks, which require processing extensive input sequences. Traditional LLM training libraries focus on optimizing for model parameter counts rather than long-context capabilities, leading to challenges in integrating sequence parallelism (SP) into training pipelines. AutoSP addresses this by providing an automated, compiler-based approach that simplifies the implementation of sequence parallelism and long-context aware activation-checkpointing. This enables researchers to train models with significantly longer input contexts without the need for extensive manual code modifications. The evaluation of AutoSP on both NVIDIA and AMD hardware shows that it can increase training contexts by up to 2.7× and 2.5×, respectively, while maintaining runtime performance. This advancement not only enhances the trainability of LLMs but also improves developer productivity by reducing the complexity of integrating long-context optimizations.
Methodology
AutoSP employs a compiler-based approach to implement sequence parallelism in PyTorch-2.0. It consists of two main components: a sequence-parallel transformation pass that automates the insertion of communication collectives and reshapes activations, and a sequence-aware activation checkpointing pass that optimizes memory usage during long-context training. This allows for seamless integration into existing training pipelines with minimal code changes.
Results
The evaluation of AutoSP reveals that it can extend training contexts by up to 2.7× on NVIDIA GPUs and 2.5× on AMD GPUs compared to competitive hand-written baselines, all while incurring negligible costs to runtime performance.
Implications
The development of AutoSP has significant implications for the training of large language models, particularly in applications requiring long-context understanding, such as document analysis, multi-step reasoning, and extended dialogues. By simplifying the integration of complex optimizations, AutoSP can accelerate research and development in the field of NLP and enhance the capabilities of LLMs.
An adaptive wavelet-based PINN for problems with localized high-magnitude source
Theory
Optimization
Efficient ML
- AW-PINN effectively addresses loss imbalance in PINNs for PDEs with localized high-magnitude sources.
- The framework adapts wavelet basis functions dynamically, improving efficiency and accuracy.
- AW-PINN does not require automatic differentiation, accelerating the training process.
- The method shows superior performance on various PDEs compared to existing techniques.
Read more
An adaptive wavelet-based PINN for problems with localized high-magnitude source
Summary
This paper introduces an adaptive wavelet-based physics-informed neural network (AW-PINN) to address the challenges of loss imbalance and spectral bias in traditional PINNs when solving partial differential equations (PDEs) with localized high-magnitude source terms. The AW-PINN framework dynamically adjusts the wavelet basis functions based on the residual and supervised loss, allowing it to effectively manage high-scale features without excessive memory usage. The method operates in two stages: an initial pre-training phase to select relevant wavelet families followed by adaptive refinement of scales and translations. Notably, AW-PINN does not depend on automatic differentiation for calculating derivatives in the loss function, which enhances training speed. Theoretical analysis demonstrates that AW-PINN can be associated with a Gaussian process limit and its neural tangent kernel (NTK) structure. The performance of AW-PINN is evaluated on various challenging PDEs, including transient heat conduction and Maxwell’s equations, showing significant improvements over existing methods, particularly in scenarios with extreme loss imbalances.
Methodology
The AW-PINN framework operates in two phases: an initial pre-training phase with fixed wavelet bases to identify relevant wavelet families, followed by an adaptive refinement phase that adjusts scales and translations based on the loss dynamics. This approach allows for efficient handling of high-frequency components without the need for high-resolution bases across the entire domain.
Results
AW-PINN was tested on several PDEs featuring localized high-magnitude source terms, achieving performance improvements over existing methods, particularly in cases with loss imbalances as high as 10^10:1. The results indicate that AW-PINN consistently outperforms traditional PINNs and other loss balancing techniques.
Implications
The AW-PINN framework has significant implications for various fields requiring the solution of PDEs with localized sources, such as thermal processing, electromagnetics, and fluid dynamics. Its ability to manage loss imbalance and spectral bias could enhance the accuracy and efficiency of simulations in these domains.
Batch Normalization for Neural Networks on Complex Domains
Theory
- Introduction of batch normalization layers for neural networks on complex domains.
- Focus on less-studied complex domains like the Siegel disk and complex unit ball.
- Demonstrated improvements in training stability and accuracy in various machine learning tasks.
- Connection to existing Riemannian batch normalization layers.
Read more
Batch Normalization for Neural Networks on Complex Domains
Summary
This paper introduces a novel approach to batch normalization (BN) specifically tailored for neural networks operating on complex domains, such as the Siegel disk domain and the complex unit ball. The authors highlight the significance of Riemannian neural networks in various machine learning applications and propose a general BN layer that is applicable to both complex domains and traditional Riemannian manifolds. The paper outlines the mathematical foundations necessary for implementing BN layers in these less-explored complex domains, emphasizing the potential of these spaces in capturing intricate geometrical structures. Through a series of experiments, the authors demonstrate that their proposed BN layers significantly enhance training stability and improve classification accuracy across tasks such as radar clutter classification, node classification, and action recognition. The findings suggest that the integration of BN layers into complex domain neural networks can lead to systematic performance improvements over existing methods.
Methodology
The authors derive essential components for implementing batch normalization layers in complex domains, leveraging the Kobayashi pseudodistance as a foundational metric. They conduct experiments to evaluate the performance of their proposed BN layers in various machine learning tasks, comparing results with existing methods.
Results
The proposed batch normalization layers for complex domains resulted in systematic improvements in training stability and accuracy across several tasks, including radar clutter classification, node classification, and action recognition, outperforming existing techniques.
Implications
The findings suggest that incorporating batch normalization in complex domain neural networks can enhance their applicability in machine learning tasks, particularly in areas requiring the modeling of complex geometrical structures. This could lead to advancements in fields such as signal processing and graph-structured data analysis.
AirFM-DDA: Air-Interface Foundation Model in the Delay-Doppler-Angle Domain for AI-Native 6G
Optimization
Efficient ML
Theory
- AirFM-DDA operates in the Delay-Doppler-Angle domain, improving the representation of multipath components.
- The model utilizes a window-based attention mechanism to reduce computational complexity.
- AirFM-DDA achieves superior zero-shot generalization and outperforms existing models in channel-related tasks.
- The model demonstrates robustness under high mobility and severe noise conditions.
Read more
AirFM-DDA: Air-Interface Foundation Model in the Delay-Doppler-Angle Domain for AI-Native 6G
Summary
The paper introduces AirFM-DDA, a novel Air-interface Foundation Model designed for the physical layer of AI-native 6G networks. Unlike existing models that operate in the space-time-frequency (STF) domain, which complicates the learning of universal channel representations due to the superimposition of multipath components, AirFM-DDA reparameterizes channel state information (CSI) into the Delay-Doppler-Angle (DDA) domain. This transformation allows for a clearer separation of multipath components along meaningful physical axes, enhancing representation learning. The model employs a window-based attention mechanism combined with frame-structure-aware positional encoding (FS-PE), significantly reducing computational overhead compared to traditional global attention mechanisms. Extensive experiments demonstrate that AirFM-DDA excels in zero-shot generalization across various unseen scenarios and datasets, outperforming baseline models in channel prediction and estimation tasks. Additionally, the model shows robustness under challenging conditions, such as high mobility and severe noise, while reducing training and inference costs by nearly an order of magnitude.
Methodology
AirFM-DDA employs a four-dimensional Fourier transform to reparameterize CSI from the STF domain to the DDA domain, allowing for clearer representation of multipath components. It integrates a window-based attention mechanism with FS-PE to enhance learning efficiency and reduce computational overhead.
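A minimal sketch of the domain change: channel state information indexed over antennas, OFDM symbols, and subcarriers is mapped to a delay-Doppler-angle-like representation with a multidimensional Fourier transform. The axis order (rx antennas, tx antennas, symbols, subcarriers) and the FFT conventions are assumptions of this sketch, not the paper's exact transform.

```python
import numpy as np

def stf_to_dda(csi):
    """Illustrative space-time-frequency -> delay-Doppler-angle mapping.
    Axis order and FFT conventions are assumptions of this sketch."""
    # frequency -> delay (IFFT over subcarriers), time -> Doppler (FFT over symbols),
    # antenna axes -> angle (FFT over spatial dimensions)
    delay = np.fft.ifft(csi, axis=3)
    doppler = np.fft.fft(delay, axis=2)
    angle = np.fft.fft(np.fft.fft(doppler, axis=0), axis=1)
    return np.fft.fftshift(angle, axes=(0, 1, 2))

rng = np.random.default_rng(0)
csi = rng.normal(size=(4, 2, 14, 64)) + 1j * rng.normal(size=(4, 2, 14, 64))
print(stf_to_dda(csi).shape)   # (4, 2, 14, 64), now indexed by angle/Doppler/delay
```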
Results
The model consistently outperformed baseline models in channel prediction and estimation tasks, demonstrating superior zero-shot generalization across various datasets and scenarios. It also showed a significant reduction in training and inference costs, nearly an order of magnitude lower than models using global attention mechanisms.
Implications
The development of AirFM-DDA has significant implications for the design of AI-native 6G networks, enabling more efficient and robust physical layer tasks such as channel estimation, prediction, and beam management. Its approach could lead to advancements in wireless communication technologies, particularly in high-mobility environments.
Cost-Aware Learning
Reinforcement Learning
Large Language Models
Optimization
- Introduction of Cost-Aware Learning framework for machine learning with varying sample costs.
- Development of Cost-Aware SGD algorithm with theoretical guarantees on cost and error.
- Proposal of Cost-Aware GRPO for efficient policy optimization in reinforcement learning.
- Empirical results indicate significant reductions in training costs while preserving model performance.
Read more
Cost-Aware Learning
Summary
This paper introduces the concept of Cost-Aware Learning, which addresses the challenge of varying computational costs associated with different training samples in machine learning. The authors propose a novel algorithm, Cost-Aware Stochastic Gradient Descent (SGD), tailored for convex functions, which minimizes the total training cost while achieving a target error. They derive the cost complexity for reaching a specified error level and establish a lower bound for this approach. Additionally, a subset selection algorithm is introduced to further reduce training costs. The theoretical framework is applied to reinforcement learning, specifically in the context of language models, leading to the development of the Cost-Aware GRPO algorithm. This algorithm optimizes policy gradients by considering the computational cost associated with sequence length, demonstrating significant reductions in token usage during training without sacrificing accuracy. Empirical evaluations on large language models (LLMs) show that the proposed methods can reduce the tokens used in policy optimization by up to 30% while maintaining or improving baseline accuracy.
Methodology
The authors develop the Cost-Aware SGD algorithm, which utilizes principles of importance sampling to optimize the training process by balancing the gradient magnitude against the sample cost. They also introduce a dataset selection strategy to reduce training costs further. For reinforcement learning applications, they adapt these principles to create the Cost-Aware GRPO algorithm, which modifies the sampling strategy used in policy optimization.
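To illustrate the importance-sampling idea, here is a minimal cost-aware SGD sketch for least squares: samples are drawn with probability favoring high gradient norm per unit cost, and each gradient is re-weighted by 1/(n·p_i) so the update stays unbiased. The sampling rule and hyperparameters are assumptions of this sketch, not the paper's algorithm or its guarantees.

```python
import numpy as np

def cost_aware_sgd(X, y, costs, steps=500, lr=0.05, seed=0):
    """Illustrative cost-aware SGD for 0.5*(x.w - y)^2 losses with per-sample
    costs. Sampling rule and hyperparameters are assumptions of this sketch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    total_cost = 0.0
    for _ in range(steps):
        grad_norms = np.abs(X @ w - y) * np.linalg.norm(X, axis=1)   # per-sample gradient norms
        p = (grad_norms + 1e-8) / costs                              # favor informative, cheap samples
        p /= p.sum()
        i = rng.choice(n, p=p)
        g = (X[i] @ w - y[i]) * X[i] / (n * p[i])                    # importance-weighted gradient
        w -= lr * g
        total_cost += costs[i]
    return w, total_cost

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=200)
costs = rng.uniform(1, 10, size=200)
w_hat, spent = cost_aware_sgd(X, y, costs)
print(np.linalg.norm(w_hat - w_true), spent)
```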
Results
The empirical results demonstrate that the Cost-Aware GRPO algorithm can achieve up to 30% reduction in tokens used for policy optimization compared to standard methods, while either matching or exceeding the accuracy of baseline models. The theoretical analysis provides finite-time guarantees and a lower bound on the cost required to achieve a specified error.
Implications
The findings suggest that incorporating cost-awareness into machine learning training processes can lead to more efficient resource utilization, particularly in large-scale models and reinforcement learning scenarios. This approach can significantly reduce computational costs, making it more feasible to train complex models in resource-constrained environments.
Preserving Temporal Dynamics in Time Series Generation
Generative Models
Time Series
- Proposes an MCMC-based framework to preserve temporal dynamics in synthetic time series generation.
- Highlights the limitations of existing GAN approaches that focus on marginal distribution matching.
- Demonstrates that the MCMC framework improves temporal fidelity and predictive performance across multiple datasets.
- Provides a theoretical analysis of distribution shift in autoregressive generation.
Read more
Preserving Temporal Dynamics in Time Series Generation
Summary
This paper addresses the challenge of generating synthetic time series data while preserving the underlying temporal dynamics, which is crucial for regression-oriented forecasting tasks. Existing generative models, particularly Generative Adversarial Networks (GANs), often focus on matching marginal distributions and neglect the temporal relationships inherent in multivariate time series. This oversight can lead to distribution shifts and temporal drifts, ultimately degrading the quality of the generated sequences. The authors propose a model-agnostic Markov Chain Monte Carlo (MCMC)-based framework that corrects these discrepancies by enforcing consistency with empirical transition statistics between neighboring time points. Through theoretical analysis, the paper elucidates how deviations accumulate in conditional generative models during autoregressive generation. The proposed MCMC framework is evaluated against several benchmark datasets, demonstrating significant improvements in temporal fidelity and predictive performance compared to state-of-the-art GAN architectures. The findings suggest that preserving temporal dynamics is essential for generating high-quality synthetic time series data.
Methodology
The authors introduce a Markov Chain Monte Carlo (MCMC) correction module that explicitly addresses distributional and dynamical deviations in generated time series. The framework enforces consistency with empirical transition statistics, thereby correcting errors that accumulate during autoregressive generation.
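A minimal sketch of the correction idea for one autoregressive step: candidate next values from the generator are resampled with a Metropolis scheme whose target enforces consistency with empirical one-step transition statistics. Fitting a single Gaussian to observed increments and the specific Metropolis loop are assumptions of this sketch, not the paper's correction module.

```python
import numpy as np

def mcmc_correct_step(x_prev, proposals, real_increments, n_sweeps=50, seed=0):
    """Illustrative MCMC correction: resample generator proposals so the
    implied increment matches empirical transition statistics (here a Gaussian
    fit to observed increments). All modeling choices are assumptions."""
    rng = np.random.default_rng(seed)
    mu, sigma = real_increments.mean(), real_increments.std() + 1e-8
    log_p = lambda x: -0.5 * ((x - x_prev - mu) / sigma) ** 2    # transition log-density (up to const.)
    current = proposals[rng.integers(len(proposals))]
    for _ in range(n_sweeps):
        cand = proposals[rng.integers(len(proposals))]           # uniform (symmetric) proposal
        if np.log(rng.uniform()) < log_p(cand) - log_p(current):
            current = cand
    return current

rng = np.random.default_rng(1)
real_inc = rng.normal(0.2, 0.05, size=1000)          # real data drifts upward by ~0.2 per step
gen_candidates = rng.normal(0.0, 0.5, size=64)       # generator's (biased) proposals for x_t
print(mcmc_correct_step(x_prev=0.0, proposals=gen_candidates, real_increments=real_inc))
```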
Results
Extensive experiments on datasets such as Lorenz, Licor, ETTh, and ILI show that the proposed MCMC framework consistently enhances autocorrelation alignment, skewness error, kurtosis error, R2, discriminative score, and predictive score compared to existing GAN architectures like RCGAN, GCWGAN, TimeGAN, SigCWGAN, and AECGAN.
Implications
The findings suggest that for effective time series generation, models should prioritize the preservation of temporal dynamics over merely matching marginal distributions. This has implications for various applications in forecasting and data augmentation in fields such as finance, environmental monitoring, and healthcare.
Context-Aware Graph Attention for Unsupervised Telco Anomaly Detection
Graph Learning
Time Series
Efficient ML
- C-MTAD-GAT is a context-aware, unsupervised anomaly detection model tailored for mobile network KPIs.
- The model combines graph attention with context embeddings to effectively handle multivariate time series data.
- Detection thresholds are calibrated without labeled data, maintaining a fully unsupervised pipeline.
- C-MTAD-GAT outperforms existing models in both precision and recall while minimizing false alarms.
Read more
Context-Aware Graph Attention for Unsupervised Telco Anomaly Detection
Summary
This paper introduces C-MTAD-GAT, an innovative unsupervised anomaly detection model specifically designed for multivariate time series data from mobile networks. The model integrates graph attention mechanisms with lightweight context embeddings to enhance its performance in detecting anomalies without the need for labeled data. C-MTAD-GAT employs a deterministic reconstruction head and a multi-step forecaster to generate anomaly scores, while calibration of detection thresholds is achieved through validation residuals, ensuring a fully unsupervised approach. The authors validate the model on the TELCO dataset, demonstrating that C-MTAD-GAT outperforms existing state-of-the-art methods, including MTAD-GAT and DC-VAE, in terms of event-level and pointwise F1 scores, while also reducing the number of false alarms. Furthermore, the model has been successfully deployed in the Core network of a national mobile operator, showcasing its practical applicability and robustness in real-world scenarios.
Methodology
C-MTAD-GAT utilizes a centralized architecture that processes multivariate time series data from various network elements. It incorporates context embeddings for both static and dynamic features, enhancing the model's ability to adapt to diverse network conditions. The model employs a deterministic GRU-based reconstruction head and a multi-step forecasting mechanism to derive anomaly scores from residuals. Anomaly detection thresholds are determined using simple statistics on validation errors, avoiding the need for labeled data.
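A minimal sketch of the label-free calibration step: anomaly scores combine reconstruction and forecast residuals, and per-feature thresholds are simple statistics of validation residuals, so no incident labels are needed. The additive score and the choice of k are assumptions of this sketch, not the paper's exact calibration protocol.

```python
import numpy as np

def calibrate_thresholds(val_recon_err, val_forecast_err, k=3.0):
    """Illustrative unsupervised calibration: per-feature threshold = mean +
    k * std of validation residuals. k and the additive score are assumptions."""
    score_val = val_recon_err + val_forecast_err                  # (n_windows, n_features)
    return score_val.mean(axis=0) + k * score_val.std(axis=0)

def detect(recon_err, forecast_err, thresholds):
    return (recon_err + forecast_err) > thresholds                # per-element, per-feature alerts

rng = np.random.default_rng(0)
val = rng.gamma(2.0, 0.1, size=(500, 8)), rng.gamma(2.0, 0.1, size=(500, 8))
thr = calibrate_thresholds(*val)
test_recon = rng.gamma(2.0, 0.1, size=(10, 8)); test_recon[3, 2] = 5.0   # inject an anomaly
test_fcst = rng.gamma(2.0, 0.1, size=(10, 8))
print(detect(test_recon, test_fcst, thr)[3, 2])                          # True
```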
Results
In experiments on the TELCO dataset, C-MTAD-GAT consistently achieved superior performance compared to MTAD-GAT and DC-VAE, with improved event-level and pointwise F1 scores. The model also demonstrated a significant reduction in false alarms, indicating its effectiveness in practical applications. The deployment in a national mobile operator's Core network further validated its robustness and reliability in real-world conditions.
Implications
The development of C-MTAD-GAT has significant implications for the telecommunications industry, particularly in enhancing the efficiency and reliability of anomaly detection systems. Its fully unsupervised nature allows for continuous monitoring of network performance without the need for extensive labeled datasets, making it a valuable tool for operational environments. This model can potentially be adapted for use in other domains requiring anomaly detection in multivariate time series data.
ChipLingo: A Systematic Training Framework for Large Language Models in EDA
Large Language Models
NLP
- ChipLingo provides a systematic training pipeline for adapting LLMs to the EDA domain.
- The framework includes data curation, domain-adaptive pretraining, and RAG scenario training.
- Experimental results demonstrate significant performance improvements over baseline models.
- The study highlights the importance of QA augmentation and specific training strategies for domain adaptation.
Read more
ChipLingo: A Systematic Training Framework for Large Language Models in EDA
Summary
The paper introduces ChipLingo, a systematic training framework designed to adapt large language models (LLMs) for Electronic Design Automation (EDA) tasks. The complexity of EDA tools necessitates a tailored approach to leverage LLMs effectively, as general models often lack domain-specific knowledge and struggle with cross-tool terminology. ChipLingo's training pipeline consists of three stages: constructing domain-specific corpora through multi-source data curation and question-answering augmentation, conducting domain-adaptive pretraining with various parameter training strategies, and enhancing retrieval-augmented generation (RAG) capabilities through instruction alignment and scenario training. The authors evaluate the performance of their models using a newly curated benchmark, EDA-Bench, which includes various EDA tool scenarios. Experimental results show that ChipLingo-8B achieves 59.7% accuracy on EDA-Bench, while ChipLingo-32B reaches 70.02%, indicating significant improvements over baseline models and approaching the performance of leading commercial models. The findings suggest that systematic domain training can enhance LLM performance in knowledge-intensive EDA tasks and facilitate the development of intelligent EDA agents.
Methodology
The methodology involves a three-stage training pipeline: 1) constructing domain-specific corpora through multi-source data curation and QA augmentation, 2) conducting domain-adaptive pretraining with comparisons of different parameter training strategies, and 3) enhancing model capabilities through instruction alignment and RAG scenario training.
Results
ChipLingo-8B achieved 59.7% accuracy on the EDA-Bench, outperforming its base model and some larger general-purpose models. ChipLingo-32B reached 70.02% accuracy, nearing the performance of leading commercial models. The study found that QA augmentation improved domain performance and that Partial FT provided a better balance between domain adaptation and general capability retention.
Implications
The findings suggest that ChipLingo can significantly enhance the performance of LLMs in EDA tasks, paving the way for the development of intelligent EDA agents that can efficiently utilize external knowledge and improve design workflows.
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
NLP
Large Language Models
Reinforcement Learning
- ResRL effectively decouples the semantic distributions of positive and negative responses to enhance reasoning diversity.
- The framework introduces a theoretical connection between Lazy Likelihood Displacement and gradient interference, providing a new proxy for gradient updates.
- Empirical results show ResRL surpasses existing methods like NSR and GRPO in various reasoning tasks.
- The method employs low-rank approximation for computational efficiency while maintaining performance.
Read more
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
Summary
This paper introduces ResRL, a novel framework designed to enhance the reasoning capabilities of Large Language Models (LLMs) through a method called Negative Sample Projection Residual Reinforcement Learning. The authors identify that existing approaches, particularly Reinforcement Learning with Verifiable Rewards (RLVR) and Negative Sample Reinforcement (NSR), often lead to reduced generation diversity due to over-incentivization of positive rewards. ResRL addresses this by decoupling the semantic distributions of positive and negative responses, thereby allowing for improved reasoning without sacrificing diversity. The methodology involves projecting negative-token hidden representations onto a low-rank positive subspace and using the projection residuals to adjust negative gradients. Theoretical foundations are established linking Lazy Likelihood Displacement (LLD) to gradient interference, leading to a proxy metric that guides conservative advantage reweighting. Extensive experiments demonstrate that ResRL outperforms strong baselines across twelve benchmarks, particularly excelling in mathematical reasoning tasks.
Methodology
ResRL utilizes a novel approach that projects negative sample representations onto a low-rank positive subspace, allowing for the modulation of negative gradients based on projection residuals. This is coupled with a theoretical framework that links LLD to gradient interference, enabling a more effective gradient update mechanism that preserves valid semantic components while suppressing erroneous reasoning patterns.
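To make the projection-residual idea concrete, here is a minimal sketch: negative-sample hidden states are projected onto a low-rank basis of positive hidden states, and the relative residual norm is used as a per-token weight, so negatives that overlap with valid semantics are down-weighted. The SVD rank and the normalization are assumptions of this sketch, not ResRL's exact reweighting rule.

```python
import numpy as np

def projection_residual_weights(pos_hidden, neg_hidden, rank=8):
    """Illustrative negative-sample weights from projection residuals onto a
    low-rank positive subspace. Rank and normalization are assumptions."""
    _, _, vt = np.linalg.svd(pos_hidden, full_matrices=False)
    basis = vt[:rank]                                  # (rank, d) basis of the positive subspace
    proj = neg_hidden @ basis.T @ basis                # component shared with positive semantics
    residual = neg_hidden - proj
    res_norm = np.linalg.norm(residual, axis=1)
    return res_norm / (np.linalg.norm(neg_hidden, axis=1) + 1e-8)   # in [0, 1]

rng = np.random.default_rng(0)
basis_true = rng.normal(size=(8, 32))
pos = rng.normal(size=(64, 8)) @ basis_true + 0.01 * rng.normal(size=(64, 32))
neg_overlap = rng.normal(size=(4, 8)) @ basis_true     # negatives sharing valid semantics
neg_off = rng.normal(size=(4, 32)) * 3.0               # negatives off the positive manifold
weights = projection_residual_weights(pos, np.vstack([neg_overlap, neg_off]))
print(weights.round(2))   # near 0 for the overlapping negatives, larger for the off-manifold ones
```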
Results
ResRL achieved state-of-the-art performance across twelve benchmarks, with notable improvements in mathematical reasoning metrics, outperforming NSR by 9.4% in Avg@16 and 7.0% in Pass@128. The method demonstrated enhanced reasoning capabilities while maintaining output diversity, addressing the limitations of previous RLVR approaches.
Implications
The findings suggest that ResRL can be applied to improve reasoning in various LLM applications, particularly in scenarios requiring high accuracy and diversity in output. This could lead to advancements in fields such as automated reasoning, code generation, and interactive AI systems.
Federated Learning with Hypergradient-based Online Update of Aggregation Weights
Federated Learning
- Introduction of FedHAW for online aggregation weight updates in federated learning.
- Utilization of hypergradient descent for efficient adaptation to heterogeneous data and communication environments.
- Elimination of the need for additional training data compared to existing methods like FedLAW.
- Demonstration of high generalization performance and robustness to communication errors through simulations.
Read more
Federated Learning with Hypergradient-based Online Update of Aggregation Weights
Summary
The paper presents FedHAW, a novel approach to federated learning (FL) that addresses the challenges posed by heterogeneous client data distributions and unstable communication environments. FedHAW employs hypergradient descent to enable online updates of aggregation weights during the FL training process. This method allows for adaptive adjustments to the learning process without the need for additional training data, thereby reducing computational overhead. The authors highlight the limitations of existing methods, such as FedLAW, which require pre-prepared data for aggregation weight learning. By integrating hypergradient-based updates, FedHAW enhances the robustness and generalization performance of FL systems, particularly in scenarios with varying client capabilities and communication errors. Simulation results demonstrate that FedHAW achieves high generalization performance and adaptability, making it a promising solution for federated learning applications in mobile and IoT environments.
Methodology
The methodology involves the application of hypergradient descent to update aggregation weights in each round of the federated learning process. This allows for real-time adjustments based on the current learning environment, enhancing the adaptability of the model to heterogeneous data distributions and communication conditions. The approach is designed to minimize computational overhead and does not require pre-prepared datasets for weight learning.
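A minimal sketch of the hypergradient step for one round: with the aggregate theta' = theta + Σ_i w_i·δ_i, the gradient of a validation loss with respect to w_i is the inner product of the validation gradient at theta' with client update δ_i. The softmax parametrization that keeps weights on the simplex and the learning rate are assumptions of this sketch, not FedHAW's exact update.

```python
import numpy as np

def hypergrad_weight_round(theta, client_updates, val_grad_fn, weights, lr_w=0.5):
    """Illustrative online aggregation-weight update via the hypergradient of
    a validation loss. Parametrization and learning rate are assumptions."""
    logits = np.log(weights + 1e-12)
    deltas = np.asarray(client_updates)                  # (n_clients, dim)
    theta_new = theta + weights @ deltas
    g_val = val_grad_fn(theta_new)                       # gradient of validation loss at theta'
    hypergrad_w = deltas @ g_val                         # d L_val / d w_i = <g_val, delta_i>
    jac = np.diag(weights) - np.outer(weights, weights)  # softmax Jacobian
    logits -= lr_w * (jac @ hypergrad_w)
    new_w = np.exp(logits - logits.max()); new_w /= new_w.sum()
    return theta + new_w @ deltas, new_w

# toy validation loss L(theta) = 0.5 * ||theta - target||^2
target = np.array([1.0, -1.0, 0.5])
val_grad = lambda th: th - target
updates = np.array([[1.0, -1.0, 0.5],                   # a helpful client
                    [-2.0, 2.0, -1.0]])                 # a harmful / noisy client
theta, w = hypergrad_weight_round(np.zeros(3), updates, val_grad, np.array([0.5, 0.5]))
print(w.round(3))   # weight shifts toward the client whose update lowers validation loss
```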
Results
Simulation experiments indicate that FedHAW significantly improves generalization performance in heterogeneous environments and demonstrates robustness against communication errors. The method's online capability allows it to effectively track changes in the learning environment, leading to better model performance compared to traditional federated learning methods.
Implications
The findings suggest that FedHAW can be effectively applied in real-world federated learning scenarios, particularly in mobile and IoT applications where data privacy and communication reliability are critical. The ability to adaptively update aggregation weights can lead to more efficient and accurate federated learning systems, enhancing user experience and data security.
Towards Robust and Scalable Density-based Clustering via Graph Propagation
Graph Learning
Efficient ML
Theory
- CluProp redefines density-based clustering as a graph propagation process, improving scalability and robustness.
- The DANE algorithm enables efficient label propagation from local density peaks, enhancing clustering in heterogeneous datasets.
- CluProp achieves superior performance on large-scale datasets, processing millions of points quickly while maintaining high accuracy.
Read more
Towards Robust and Scalable Density-based Clustering via Graph Propagation
Summary
This paper introduces CluProp, a novel framework that enhances density-based clustering by framing it as a label propagation process over neighborhood graphs. The authors address the limitations of traditional density-based methods, such as DBSCAN and Density Peak Clustering (DPC), which struggle with varied-density datasets due to their reliance on global parameters. CluProp employs a deterministic density-based propagation strategy called DANE (Density-Aware Neighborhood Expansion) that allows for scalable neighborhood identification and clustering without the rigid connectivity thresholds of traditional methods. The framework is metric-agnostic and leverages efficient modularity-based propagation techniques, such as Louvain and Leiden, to improve clustering accuracy. The authors demonstrate that CluProp can process millions of points in minutes, significantly outperforming existing methods in terms of accuracy and runtime efficiency.
Methodology
The authors propose CluProp, which utilizes a deterministic density-based propagation strategy (DANE) over approximate k-nearest neighbor graphs. This method allows for efficient neighborhood identification and label propagation, effectively linking classical density-based clustering with graph-based approaches. The framework is designed to be scalable and metric-agnostic, leveraging optimized implementations for clustering and label propagation.
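The sketch below is a simplified stand-in for the propagation idea (not the paper's DANE implementation): local density is estimated from k-NN distances, well-separated density peaks seed clusters, labels spread over the k-NN graph toward neighbors of equal or lower density, and leftover points attach to a labeled neighbor. The values of k, the number of peaks, and the peak-separation rule are assumptions.

```python
import numpy as np
from collections import deque
from sklearn.neighbors import NearestNeighbors

def density_propagation_clustering(X, k=10, n_peaks=3, min_sep=1.0):
    """Illustrative density-based label propagation over a k-NN graph.
    k, n_peaks, and min_sep are assumptions of this sketch."""
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)
    labels = -np.ones(len(X), dtype=int)
    peaks = []
    for p in np.argsort(density)[::-1]:                  # greedy, well-separated density peaks
        if all(np.linalg.norm(X[p] - X[q]) > min_sep for q in peaks):
            peaks.append(p)
        if len(peaks) == n_peaks:
            break
    queue = deque()
    for c, p in enumerate(peaks):
        labels[p] = c
        queue.append(p)
    while queue:                                         # propagate "downhill" in density
        i = queue.popleft()
        for j in idx[i, 1:]:
            if labels[j] == -1 and density[j] <= density[i]:
                labels[j] = labels[i]
                queue.append(j)
    for i in np.where(labels == -1)[0]:                  # attach stragglers to a labeled neighbor
        known = labels[idx[i, 1:]][labels[idx[i, 1:]] >= 0]
        labels[i] = known[0] if known.size else 0
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [4, 0], [0, 4])])
print(np.bincount(density_propagation_clustering(X)))   # roughly 100 points per cluster
```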
Results
CluProp demonstrated remarkable performance on the MNIST dataset, achieving a 90% Adjusted Mutual Information (AMI) score in just 20 seconds, while existing methods like DCN took over 30 minutes to reach a 75% AMI score. On the larger MNIST8M dataset, CluProp completed clustering in under 15 minutes, achieving an 80% Normalized Mutual Information (NMI), significantly outperforming kernel k-means, which only achieved 41% NMI on a supercomputing cluster.
Implications
The proposed framework has significant implications for clustering tasks in high-dimensional and varied-density datasets, making it suitable for applications in fields such as image processing, data mining, and any domain requiring efficient clustering of large datasets.
Learning from Disagreement: Clinician Overrides as Implicit Preference Signals for Clinical AI in Value-Based Care
Reinforcement Learning
Theory
Optimization
- Overrides should be viewed as implicit preference signals rather than compliance failures.
- A dual learning architecture is proposed to train both reward and capability models simultaneously.
- Override data in chronic disease management has unique properties that enhance preference learning.
- Clinician capability significantly influences decision-making and should be factored into AI training.
Read more
Learning from Disagreement: Clinician Overrides as Implicit Preference Signals for Clinical AI in Value-Based Care
Summary
This paper reframes clinician overrides of clinical AI recommendations as valuable implicit preference data, akin to reinforcement learning from human feedback (RLHF). The authors propose a formal framework that enhances standard preference learning through three main contributions: a five-category taxonomy of overrides that links them to specific model update targets; a preference formulation that considers patient state, organizational context, and clinician capability; and a dual learning architecture that trains both a reward model and a capability model to mitigate suppression bias. The authors argue that chronic disease management under outcome-based payment contracts generates override data with unique characteristics, such as longitudinal density and observable outcomes, which are essential for developing a reward model that aligns with patient trajectories rather than encounter economics. This framework was developed from practical efforts to enhance clinician capability in a real-world value-based care setting.
Methodology
The authors developed a formal framework that includes a taxonomy of clinician overrides, a preference formulation based on various contextual factors, and a dual learning architecture that allows for alternating optimization of reward and capability models.
Results
The proposed framework successfully identifies and utilizes clinician overrides as rich signals for preference learning, demonstrating that clinician capability affects decision-making and that training AI systems with this data can lead to better alignment with patient outcomes.
Implications
This research has significant implications for the development of clinical decision support systems, suggesting that AI can be improved by incorporating clinician feedback as preference signals, ultimately enhancing patient care in value-based healthcare settings.
Fair Dataset Distillation via Cross-Group Barycenter Alignment
Theory
Optimization
- Bias amplification in dataset distillation arises from the interaction between group imbalance and representational separation.
- COBRA framework introduces a barycenter alignment approach to ensure fair representation across demographic groups.
- The proposed method is compatible with existing dataset distillation techniques.
- Empirical results show significant fairness improvements across various datasets and distillation methods.
Read more
Fair Dataset Distillation via Cross-Group Barycenter Alignment
Summary
This paper addresses the challenge of fairness in dataset distillation, which compresses large datasets into smaller synthetic ones while maintaining predictive performance. The authors identify that different demographic groups exhibit distinct predictive patterns, leading to performance drops for certain subgroups when using traditional distillation methods. They argue that these fairness gaps arise not only from group size imbalances but also from fundamental differences in subgroup predictive patterns. To tackle this issue, the authors propose a new framework called COBRA (Cross-group Barycenter Alignment), which computes a barycenter of subgroup representations that is agnostic to group imbalance. By distilling data towards this barycenter, the method ensures that all subgroups are represented fairly, thereby reducing bias. The paper provides theoretical analysis and empirical validation, demonstrating that COBRA effectively mitigates bias amplification in dataset distillation and achieves state-of-the-art fairness performance compared to existing methods.
Methodology
The authors develop COBRA, which involves a two-step process: first, computing a barycenter of subgroup representations that minimizes distance to each group, and second, distilling synthetic data towards this barycenter. This approach is designed to be agnostic to group sizes, ensuring equitable representation.
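As a rough illustration of the size-agnostic target, the sketch below uses the unweighted mean of per-group means as the "barycenter" (a simplification of the paper's construction) and a single gradient-style step pulling synthetic representations toward it; both simplifications and the step size are assumptions.

```python
import numpy as np

def cross_group_barycenter(features, group_ids):
    """Illustrative size-agnostic barycenter: each group contributes equally,
    regardless of its sample count. A simplification of the paper's barycenter."""
    groups = np.unique(group_ids)
    group_means = np.stack([features[group_ids == g].mean(axis=0) for g in groups])
    return group_means.mean(axis=0)

def distill_step(synthetic, barycenter, lr=0.1):
    """One illustrative step pulling synthetic representations toward the
    barycenter (on top of whatever task loss the base distillation uses)."""
    return synthetic + lr * (barycenter - synthetic.mean(axis=0))

rng = np.random.default_rng(0)
majority = rng.normal(0.0, 1.0, size=(900, 16))
minority = rng.normal(3.0, 1.0, size=(100, 16))
feats = np.vstack([majority, minority])
gids = np.array([0] * 900 + [1] * 100)
bc = cross_group_barycenter(feats, gids)
print(bc[:3].round(2))   # near 1.5, midway between the groups, not near the majority mean
```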
Results
The experiments demonstrate that COBRA reduces bias amplification significantly compared to traditional dataset distillation methods. It achieves state-of-the-art fairness performance, showing consistent improvements across diverse datasets and distillation techniques.
Implications
The findings suggest that incorporating fairness considerations into dataset distillation is crucial, especially for applications in sensitive areas like healthcare and finance. The COBRA framework can be applied to enhance fairness in machine learning models trained on distilled datasets.
A Dirac-Frenkel-Onsager principle: Instantaneous residual minimization with gauge momentum for nonlinear parametrizations of PDE solutions
Optimization
Theory
- Introduces the Dirac-Frenkel-Onsager principle to address non-uniqueness in parameter dynamics.
- Utilizes a history variable as momentum to promote smooth parameter evolution.
- Maintains the instantaneous residual minimization property of the Dirac-Frenkel principle.
- Demonstrates increased robustness in singular and near-singular regimes.
Read more
A Dirac-Frenkel-Onsager principle: Instantaneous residual minimization with gauge momentum for nonlinear parametrizations of PDE solutions
Summary
This paper introduces the Dirac-Frenkel-Onsager (DFO) principle, which addresses the challenges of non-unique parameter dynamics and ill-conditioning in the context of nonlinear parametrizations of partial differential equations (PDEs). The authors interpret the non-uniqueness of parameter dynamics as a form of gauge freedom, allowing for the selection of better-conditioned parameter velocities. By incorporating a history variable interpreted as momentum, the DFO principle optimally updates this variable along nullspace directions, ensuring that the instantaneous residual minimization condition of the Dirac-Frenkel principle remains intact. This approach promotes smoother parameter evolutions and enhances robustness in singular and near-singular regimes. The paper demonstrates the effectiveness of the DFO principle through examples, showcasing its advantages over traditional regularization methods that may introduce bias.
Methodology
The authors build on the Dirac-Frenkel variational principle, incorporating a history variable that acts as momentum to optimize parameter velocities. This momentum is injected only along nullspace directions of the parametrization Jacobian, ensuring that the Dirac-Frenkel optimality condition is preserved while promoting temporal smoothness in parameter updates.
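A minimal numerical sketch of the gauge-momentum idea: the parameter velocity is the least-squares solution of J v = r (instantaneous residual minimization), plus a momentum term injected only along nullspace directions of the Jacobian, which cannot change the residual. The momentum recursion and beta are assumptions of this sketch, not the paper's exact DFO dynamics.

```python
import numpy as np

def dfo_velocity(J, r, momentum, beta=0.9):
    """Illustrative DFO-style velocity: least-squares part plus nullspace-
    projected momentum. The momentum rule and beta are assumptions."""
    J_pinv = np.linalg.pinv(J)
    v_ls = J_pinv @ r                                    # minimum-norm residual-minimizing velocity
    null_proj = np.eye(J.shape[1]) - J_pinv @ J          # projector onto null(J)
    v = v_ls + null_proj @ momentum                      # gauge freedom used for smoothness
    new_momentum = beta * momentum + (1 - beta) * v
    return v, new_momentum

rng = np.random.default_rng(0)
J = rng.normal(size=(3, 5))                              # wide Jacobian: 2-dimensional nullspace
r = rng.normal(size=3)
v, m = dfo_velocity(J, r, np.zeros(5))
print(np.allclose(J @ v, r))                             # residual-minimization condition preserved
```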
Results
The proposed DFO dynamics lead to improved robustness in the training of neural networks for PDE solutions, particularly in challenging singular and near-singular scenarios. The examples provided in the paper illustrate the effectiveness of the DFO principle in achieving smoother and more stable parameter evolutions compared to traditional methods.
Implications
The DFO principle has potential applications in various fields that utilize neural networks for solving PDEs, such as computational physics, engineering, and applied mathematics. Its ability to enhance robustness and stability in parameter dynamics could lead to more reliable models in complex systems.
Anomaly Detection in Soil Heavy Metal Contamination Using Unsupervised Learning for Environmental Risk Assessment
Theory
- Unsupervised learning effectively identifies anomalous heavy metal contamination in soil.
- Isolation Forest and PCA reconstruction error detected significant anomalies in soil samples.
- The study found that anomalies had 70-80% higher health risk indices compared to normal samples.
- Three distinct types of anomalies were identified, indicating varied contamination patterns.
Read more
Anomaly Detection in Soil Heavy Metal Contamination Using Unsupervised Learning for Environmental Risk Assessment
Summary
This study addresses the critical issue of soil contamination by heavy metals in Ghana, particularly in rapidly urbanizing areas with unregulated waste disposal. The authors employ an unsupervised machine learning framework to detect and characterize anomalous heavy metal contamination patterns in soil samples from twelve waste sites and residential controls in the Central Region of Ghana. They analyze concentrations of eight heavy metals (As, Cd, Cr, Cu, Hg, Ni, Pb, Zn) alongside health risk indices, including the Hazard Index (HI) and Incremental Lifetime Cancer Risk (ILCR). The study utilizes several anomaly detection algorithms, including Isolation Forest and PCA reconstruction error, which identified 12 anomalous samples (15.4% of 78 samples). A consensus approach further isolated six robust anomalies (7.7%), all located at a single site (S3). The anomalies exhibited significantly higher mean HI values than normal samples, indicating a pressing health risk. The research highlights the effectiveness of unsupervised learning in providing detailed insights into contamination patterns, enabling targeted environmental management and risk assessment.
Methodology
The study employed unsupervised machine learning techniques, specifically Isolation Forest, DBSCAN, and PCA-based reconstruction error, to analyze soil samples for heavy metal contamination. The data included concentrations of eight metals and health risk indices. Anomaly detection algorithms were compared to identify significant deviations from normal contamination patterns.
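The sketch below shows a consensus detector in the spirit of the study, intersecting Isolation Forest flags with PCA reconstruction-error flags on synthetic eight-metal data; the contamination rate, the single retained component, the 90th-percentile threshold, and the toy data are assumptions of this sketch, not the study's exact settings or measurements.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def consensus_soil_anomalies(X, contamination=0.1, n_components=1):
    """Illustrative consensus anomaly detection: Isolation Forest flags are
    intersected with PCA reconstruction-error flags. Parameter values are
    assumptions of this sketch."""
    Xs = StandardScaler().fit_transform(X)
    iso_flags = IsolationForest(contamination=contamination, random_state=0).fit_predict(Xs) == -1
    pca = PCA(n_components=n_components).fit(Xs)          # keep the dominant geochemical factor
    recon_err = np.linalg.norm(Xs - pca.inverse_transform(pca.transform(Xs)), axis=1)
    pca_flags = recon_err > np.quantile(recon_err, 0.90)
    return iso_flags & pca_flags

rng = np.random.default_rng(0)
factor = rng.normal(size=(72, 1))                         # shared geochemical background factor
background = factor @ np.ones((1, 8)) + 0.2 * rng.normal(size=(72, 8))   # 8 correlated metals
hotspots = 0.2 * rng.normal(size=(6, 8))
hotspots[:, :2] += 6.0                                    # spikes in two metals break the pattern
flags = consensus_soil_anomalies(np.vstack([background, hotspots]))
print(np.where(flags)[0])                                 # the injected hotspots are indices 72-77
```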
Results
The analysis revealed 12 anomalous samples, with a consensus approach isolating six robust anomalies concentrated at site S3. These anomalies showed mean HI values exceeding the threshold of 1, indicating serious health risks. The PCA reconstruction error demonstrated a strong correlation with HI, suggesting consistency between multivariate deviations and health risks.
Implications
The findings underscore the potential of unsupervised machine learning in environmental monitoring, allowing for more precise identification of contamination hotspots. This can inform targeted interventions and improve public health outcomes in affected regions.
PROMISE-AD: Progression-aware Multi-horizon Survival Estimation for Alzheimer's Disease Progression and Dynamic Tracking
Time Series
- PROMISE-AD effectively handles irregular clinical histories and missing data through progression-aware visit tokenization.
- The framework employs a temporal Transformer to balance long-term progression history with recent clinical states.
- It utilizes a hybrid approach combining discrete-time mixture hazards with various regularization techniques for calibrated risk estimation.
- The model achieved the lowest integrated Brier score for CN-to-MCI conversion and the highest C-index for MCI-to-AD conversion among compared methods.
Read more
PROMISE-AD: Progression-aware Multi-horizon Survival Estimation for Alzheimer's Disease Progression and Dynamic Tracking
Summary
The paper presents PROMISE-AD, a novel framework designed for predicting the progression of Alzheimer's disease (AD) from cognitively normal (CN) status to mild cognitive impairment (MCI) and subsequently to AD dementia. The framework addresses key challenges in AD progression prediction, including irregular visit patterns, censoring of data, and the need for calibrated risk estimates over multiple time horizons. PROMISE-AD utilizes a unique tokenization method to encode clinical visit data, incorporating standardized measurements, missingness indicators, longitudinal changes, and non-diagnostic attributes while avoiding diagnostic leakage. A temporal Transformer model is employed to fuse various representations of the data, enabling the estimation of progression scores and latent discrete-time mixture hazards. The training process integrates multiple objectives, including survival likelihood and horizon-specific risk loss, followed by isotonic calibration for risk estimation at 1, 2, 3, and 5 years. The framework was evaluated using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and TADPOLE Challenge, demonstrating superior performance in predicting CN-to-MCI and MCI-to-AD conversions compared to existing methods.
Methodology
PROMISE-AD employs a progression-aware visit tokenization strategy to encode clinical visit data, which includes various attributes such as measurements, missingness masks, and longitudinal changes. A temporal Transformer model is used to fuse these representations, allowing for the estimation of progression scores and latent hazards. The training process incorporates survival likelihood, horizon-specific focal risk loss, and several regularization techniques to ensure robust and calibrated risk estimates.
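A minimal sketch of the multi-horizon readout from discrete-time hazards: survival is the running product of (1 - h_k) over intervals, and the risk at a horizon is one minus survival up to that horizon. The mixture components and isotonic calibration used in the paper are omitted, and the yearly bin edges and toy hazard values are assumptions of this sketch.

```python
import numpy as np

def multi_horizon_risk(hazards, horizons, bin_edges):
    """Illustrative discrete-time survival readout: S(t) = prod_k (1 - h_k),
    risk(t) = 1 - S(t). Bin edges and hazards are assumptions of this sketch."""
    survival = np.cumprod(1.0 - np.asarray(hazards))
    risks = {}
    for h in horizons:
        k = np.searchsorted(bin_edges, h, side="right") - 1   # last completed interval
        risks[h] = 1.0 - (survival[k] if k >= 0 else 1.0)
    return risks

# toy per-year hazards (probability of converting within each year, given no prior conversion)
hazards = [0.03, 0.05, 0.08, 0.10, 0.12, 0.15]
bin_edges = [1, 2, 3, 4, 5, 6]                             # interval k ends at year k+1
print(multi_horizon_risk(hazards, horizons=[1, 2, 3, 5], bin_edges=bin_edges))
```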
Results
In testing, PROMISE-AD achieved an integrated Brier score of 0.085 ± 0.012 and a C-index of 0.808 ± 0.015 for CN-to-MCI conversion, outperforming other methods. For MCI-to-AD conversion, it achieved a C-index of 0.894 ± 0.018 and near-ceiling performance in 5-year discrimination metrics, indicating high predictive accuracy.
Implications
The findings suggest that PROMISE-AD can significantly enhance the prediction of Alzheimer's disease progression, which could lead to better patient management and targeted interventions. The framework's ability to provide interpretable multi-horizon risk estimates may also facilitate clinical decision-making and trial enrichment strategies.
Distributional Alignment Games for Answer-Level Fine-Tuning
NLP
Large Language Models
Optimization
- Introduces a game-theoretical framework for optimizing language models based on answer correctness.
- Transforms the intractable marginalization problem into a tractable projection problem using Distributional Alignment Games.
- Unifies various alignment strategies, including diversity and coherence, under a single theoretical lens.
- Demonstrates significant gains on reasoning tasks through efficient algorithms such as Coherence-GRPO.
Read more
Distributional Alignment Games for Answer-Level Fine-Tuning
Summary
This paper addresses the challenge of Answer-Level Fine-Tuning (ALFT) for language models, focusing on optimizing the correctness of final answers rather than the reasoning paths that lead to them. The authors propose a novel game-theoretical framework termed Distributional Alignment Games, which reformulates ALFT as a two-player game involving a Policy (the generator) and a Target (an auxiliary distribution). By proving that the Nash Equilibrium of this game corresponds to the solution of the original optimization problem, they transform the intractable marginalization of latent reasoning paths into a tractable projection problem. This framework not only unifies various approaches to diversity and coherence in language models but also introduces efficient algorithms compatible with Group Relative Policy Optimization (GRPO). The authors demonstrate significant improvements in mathematical reasoning tasks, showcasing the effectiveness of their approach in enhancing the performance of language models while maintaining flexibility in exploring diverse reasoning paths.
Methodology
The authors formulate ALFT as a two-player game using Fenchel duality, where the Policy minimizes the divergence from a Target distribution that adapts to enforce desired properties like coherence and diversity. They develop algorithms compatible with Group Relative Policy Optimization (GRPO) to efficiently solve this game.
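A minimal sketch of one alternating Policy/Target step is given below, assuming a softmax Target over a group of sampled answers and a weighted log-likelihood Policy update. The temperature, group size, and reward definition are illustrative and do not reproduce the paper's exact Coherence-GRPO objective.

```python
# Hedged sketch of one Policy/Target alternation in the spirit of the game described above.
import torch

def target_weights(rewards: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    """Target step: re-weight a group of sampled reasoning paths toward correct answers."""
    return torch.softmax(rewards / beta, dim=0)

def policy_loss(logprobs: torch.Tensor, rewards: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    """Policy step: minimize divergence to the Target over the sampled group,
    which reduces to a weighted negative log-likelihood on the samples."""
    w = target_weights(rewards, beta).detach()
    return -(w * logprobs).sum()

# usage: logprobs are the policy's sequence log-probabilities for G sampled answers
logprobs = torch.tensor([-12.3, -15.1, -11.8, -14.0], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])      # 1 = final answer correct
loss = policy_loss(logprobs, rewards)
loss.backward()
```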
Results
The proposed framework and algorithms yield significant improvements in answer-level coherence and reasoning tasks, as evidenced by experimental results on datasets like GSM8K and TriviaQA. The approach effectively reduces variance in many-to-one mappings and enhances the overall performance of language models.
Implications
This work has potential applications in improving the performance of language models in reasoning-intensive tasks, enabling more robust and flexible models that can explore diverse reasoning paths while ensuring correctness in final answers. It also provides a theoretical foundation for future research in answer-level optimization and alignment strategies.
Free Energy Surface Sampling via Reduced Flow Matching
Efficient ML
Theory
- Introduction of FES-FM, a reduced flow matching method for free energy sampling.
- Utilization of a Hessian-informed prior distribution for many-particle systems.
- Significant reduction in computational costs while improving sampling accuracy.
- Demonstration of the method's effectiveness across various benchmark potentials.
Read more
Free Energy Surface Sampling via Reduced Flow Matching
Summary
This paper addresses the challenge of sampling the free energy surface, which is essential for understanding chemical reactions and conformational transitions in statistical physics. Traditional methods for sampling involve high-dimensional simulations, which can be computationally expensive. The authors propose a novel method called FES-FM (Free Energy Sampling via Flow Matching), which utilizes a reduced flow matching approach to directly sample the free energy surface in the space of collective variables (CVs). By training a dynamical transport map in the CV space, the method significantly reduces computational costs while maintaining high accuracy. The authors also introduce a Hessian-informed prior distribution that ensures the generated samples are physically meaningful and invariant under rotation and translation. The effectiveness of FES-FM is demonstrated through comparative experiments against traditional full-space generative methods, showing that it achieves superior accuracy per unit sampling time across various potential functions and CVs.
Methodology
The authors develop a reduced-space flow matching framework that learns a transport map in the CV space to sample the free energy surface directly. The training objective is derived from the transport equation and combined with non-equilibrium reweighting techniques. The method avoids full-space simulations during the generation of samples, which leads to a substantial reduction in computational costs.
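The sketch below illustrates a plain flow-matching objective in a low-dimensional CV space. The Gaussian prior, straight-line probability path, and network size are assumptions; FES-FM itself uses a Hessian-informed prior and a reduced transport objective combined with non-equilibrium reweighting.

```python
# Hedged sketch of flow-matching training in collective-variable (CV) space.
import torch
import torch.nn as nn

cv_dim = 2
velocity_net = nn.Sequential(nn.Linear(cv_dim + 1, 128), nn.SiLU(), nn.Linear(128, cv_dim))
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

def flow_matching_step(cv_samples: torch.Tensor) -> torch.Tensor:
    """One training step on a batch of CV samples drawn from reference simulations."""
    z1 = cv_samples                                  # data endpoint in CV space
    z0 = torch.randn_like(z1)                        # prior endpoint (plain Gaussian stand-in)
    t = torch.rand(z1.size(0), 1)
    zt = (1 - t) * z0 + t * z1                       # straight-line probability path
    target_v = z1 - z0                               # conditional target velocity
    pred_v = velocity_net(torch.cat([zt, t], dim=1))
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss
```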
Results
FES-FM was evaluated against traditional full-space generative baselines. The results indicate that FES-FM not only reduces the computational cost per sample but also provides superior accuracy in sampling the free energy surface across multiple benchmark potentials.
Implications
The proposed method has significant implications for computational chemistry and materials science, where efficient and accurate sampling of free energy surfaces is crucial for understanding complex molecular systems and reactions. It can potentially enhance the efficiency of simulations in various applications, including drug discovery and materials design.
From Prediction to Practice: A Task-Aware Evaluation Framework for Blood Glucose Forecasting
Time Series
- Standard aggregate metrics can obscure critical failures in blood glucose forecasting models.
- The proposed task-aware evaluation framework includes both observational and interventional evaluation arms.
- Models may perform well on average but fail in high-risk scenarios, particularly post-bolus periods.
- The interventional evaluation reveals that many models struggle to predict the consequences of altered insulin dosing.
Read more
From Prediction to Practice: A Task-Aware Evaluation Framework for Blood Glucose Forecasting
Summary
This paper addresses the challenges of evaluating blood glucose forecasting models in clinical settings, emphasizing that standard aggregate metrics may not accurately reflect a model's utility in safety-critical scenarios. The authors propose a task-aware evaluation framework that focuses on two primary applications: hypoglycemia early warning and insulin dosing decision support. They evaluate models using real clinical data from three cohorts, employing metrics that capture operational alarm burdens rather than just aggregate accuracy. The findings reveal that models with high overall recall can still perform poorly in critical situations, such as post-bolus periods where insulin levels are elevated. Additionally, the authors introduce an interventional evaluation using the FDA-approved UVA/Padova simulator to assess how well models predict glucose responses to changes in insulin dosing. This approach highlights a significant gap between forecasting accuracy and practical usefulness, as many models fail to accurately predict the effects of insulin interventions. The paper concludes by releasing a reproducible toolkit that includes a benchmark, preprocessing pipeline, and simulator-based dataset for future research.
Methodology
The authors developed a two-arm evaluation framework: the first arm assesses model performance on real observational data using event-level recall and false alarms per patient-day, while the second arm employs the UVA/Padova simulator to evaluate models' predictions of glucose responses to altered insulin dosing in counterfactual scenarios.
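The sketch below illustrates the observational arm's alarm metrics for a single patient on a 5-minute glucose grid. The 70 mg/dL hypoglycemia threshold, the alarm horizon, and the event-matching rule are assumptions rather than the paper's exact protocol.

```python
# Hedged sketch of event-level recall and false alarms per patient-day.
import numpy as np

def event_level_metrics(y_true, y_pred, horizon=6, thresh=70.0, samples_per_day=288):
    """y_true, y_pred: arrays of actual and forecast glucose for one patient (5-min grid)."""
    alarms = y_pred < thresh                              # alarm whenever the forecast dips below threshold
    below = y_true < thresh
    onsets = np.flatnonzero(below[1:] & ~below[:-1]) + 1  # indices where glucose crosses below threshold
    # an event counts as detected if any alarm fired within `horizon` steps before its onset
    detected = sum(alarms[max(0, e - horizon):e + 1].any() for e in onsets)
    recall = detected / len(onsets) if len(onsets) else float("nan")
    # alarms raised outside true hypoglycemic periods count as false alarms
    false_alarms = int((alarms & ~below).sum())
    fa_per_day = false_alarms / (len(y_true) / samples_per_day)
    return recall, fa_per_day
```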
Results
The study found that models with high overall recall could still miss critical hypoglycemic events in specific clinical contexts, such as post-bolus periods. Furthermore, many models failed to accurately predict the direction and magnitude of glucose responses to changes in insulin dosing, indicating a disconnect between forecasting accuracy and clinical applicability.
Implications
The findings suggest that blood glucose forecasting models need to be evaluated with a focus on their practical utility in clinical settings, particularly for safety-critical applications. The proposed framework and toolkit can guide future research and development of more reliable forecasting models in diabetes management.
Meritocratic Fairness in Budgeted Combinatorial Multi-armed Bandits via Shapley Values
Federated Learning
Theory
Optimization
- Introduces the K-Shapley value to measure arm contributions in BCMAB-FBF settings.
- Proposes the K-SVFair-FBF algorithm, which balances fairness and effective learning under full-bandit feedback.
- Achieves a theoretical regret bound of O(T^{3/4}), addressing noise from both learning and Monte Carlo estimation.
- Demonstrates improved fairness and performance over existing methods in practical applications.
Read more
Meritocratic Fairness in Budgeted Combinatorial Multi-armed Bandits via Shapley Values
Summary
This paper introduces a novel framework for achieving meritocratic fairness in budgeted combinatorial multi-armed bandits with full-bandit feedback (BCMAB-FBF). The authors extend the classical Shapley value from cooperative game theory to define the K-Shapley value, which captures the marginal contributions of arms within a restricted coalition size of K. This extension addresses the challenge of measuring individual arm contributions when only cumulative rewards are observed. The proposed K-SVFair-FBF algorithm adaptively estimates the K-Shapley value while mitigating noise from Monte Carlo approximations. Theoretical analysis shows that K-SVFair-FBF achieves a regret bound of O(T^{3/4}) on fairness regret, which is competitive given the additional complexities of the full-bandit feedback setting. Experimental results on federated learning and social influence maximization datasets demonstrate that the proposed approach not only ensures fairness but also outperforms existing baselines in terms of effectiveness.
Methodology
The authors extend the Shapley value to the K-Shapley value to define merit in BCMAB-FBF. They develop the K-SVFair-FBF algorithm, which estimates the K-Shapley value while addressing noise from Monte Carlo approximations. The algorithm operates under the constraints of full-bandit feedback, allowing it to learn the valuation function adaptively.
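For intuition, the sketch below estimates a K-restricted Shapley value by Monte Carlo over random arm orderings, using only coalition-level rewards. The number of permutations and the reward oracle are assumptions; in the bandit setting the valuation function must itself be learned from noisy full-bandit pulls rather than queried directly.

```python
# Hedged sketch of Monte Carlo estimation of a K-restricted Shapley value.
import random

def k_shapley(arms, K, coalition_reward, n_perm=1000):
    """Estimate each arm's average marginal contribution within coalitions of size at most K."""
    phi = {a: 0.0 for a in arms}
    for _ in range(n_perm):
        order = random.sample(arms, len(arms))     # random permutation of the arms
        prev_value, coalition = 0.0, []
        for arm in order[:K]:                      # only the first K arms can enter the coalition
            coalition.append(arm)
            value = coalition_reward(coalition)    # coalition-level (full-bandit) reward feedback
            phi[arm] += value - prev_value         # marginal contribution of this arm
            prev_value = value
    return {a: v / n_perm for a, v in phi.items()}
```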
Results
The K-SVFair-FBF algorithm achieves a regret bound of O(T^{3/4}) for fairness regret, which is slightly higher than the best achievable regret of O(T^{2/3}) for standard bandit algorithms under full-bandit feedback. Experimental evaluations show that K-SVFair-FBF outperforms existing baselines in terms of both fairness and effectiveness on federated learning and social influence maximization datasets.
Implications
The findings suggest that incorporating meritocratic fairness into bandit algorithms can enhance participation and representation in applications like federated learning and social influence maximization. This approach can lead to more equitable resource distribution and improved outcomes in various combinatorial decision-making tasks.
Binomial flows: Denoising and flow matching for discrete ordinal data
Generative Models
Theory
Optimization
- Introduction of Binomial flows for generative modeling of discrete ordinal data.
- Establishment of a discrete analogue to Tweedie's formula using Binomial noise.
- Development of a framework that allows for denoising, sampling, and exact likelihood estimation.
- Validation of the methodology on synthetic and real-world datasets with competitive results.
Read more
Binomial flows: Denoising and flow matching for discrete ordinal data
Summary
This paper addresses the gap in flow-based generative modeling for discrete ordinal data, which has been largely unexplored compared to continuous data. The authors introduce a novel framework called Binomial flows, which utilizes Binomial noise as an analogue to Gaussian noise in continuous settings. This approach allows for the simultaneous learning of a denoiser, sampling, and exact likelihood estimation for discrete non-negative ordinal data. The framework is built upon the discrete Tweedie’s formula, which establishes a relationship between the denoiser and the data distribution. The authors validate their methodology through experiments on synthetic and real-world datasets, demonstrating competitive performance in generating samples and estimating likelihoods. The findings suggest that Binomial flows can effectively bridge the gap between continuous and discrete generative modeling, providing a robust tool for applications involving discrete ordinal data.
Methodology
The authors propose a framework that employs Binomial noise to perturb discrete ordinal data, allowing for the training of a denoiser that adheres to a discrete version of Tweedie's formula. This involves defining a family of conditional distributions that facilitate easy sampling and the recovery of clean data from noisy versions. The optimization process minimizes a Bregman divergence to learn the denoiser, which is then used to generate new samples from the data distribution.
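The sketch below illustrates one plausible instantiation: binomial thinning as the forward corruption and squared error (one member of the Bregman family) as the training loss. The thinning schedule and the loss choice are assumptions; the paper's construction follows its discrete Tweedie's formula more closely.

```python
# Hedged sketch of Binomial-noise corruption and a denoising objective for ordinal counts.
import torch

def binomial_thin(x: torch.Tensor, keep_prob: torch.Tensor) -> torch.Tensor:
    """Corrupt integer counts x by keeping each unit independently with probability keep_prob."""
    return torch.binomial(x.float(), keep_prob * torch.ones_like(x, dtype=torch.float))

def denoising_loss(denoiser, x: torch.Tensor) -> torch.Tensor:
    """Train the denoiser to predict E[x | x_t, t]; squared error is one valid Bregman divergence."""
    t = torch.rand(x.size(0), 1)                 # per-sample keep probability in (0, 1)
    x_t = binomial_thin(x, t)                    # smaller t means heavier corruption
    pred = denoiser(x_t, t)
    return ((pred - x.float()) ** 2).mean()
```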
Results
The methodology was validated through experiments on various datasets, yielding competitive Fréchet Inception Distance (FID) values on imaging datasets. The results indicate that the Binomial flow framework effectively captures the underlying data distribution and provides accurate likelihood estimates.
Implications
The introduction of Binomial flows has significant implications for generative modeling in discrete spaces, particularly in fields such as image generation, text processing, and any domain involving ordinal data. This framework could enhance the stability and performance of generative models in these areas, leading to better applications in real-world scenarios.
Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
Large Language Models
Efficient ML
Optimization
- ARHQ effectively reduces error propagation in low-bit quantization of LLMs.
- The method isolates error-sensitive weight directions using a residual Hessian approach.
- ARHQ significantly improves layer-wise SNR and preserves reasoning performance under aggressive quantization.
- The approach is adaptable to specific quantization hardware and calibration distributions.
Read more
Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
Summary
This technical report introduces Activation Residual Hessian Quantization (ARHQ), a novel post-training weight splitting method aimed at reducing error propagation in low-bit quantization of Large Language Models (LLMs). Traditional quantization methods often struggle with the trade-off between compression efficiency and inference fidelity, primarily due to the amplification of quantization noise from input activations. ARHQ addresses this issue by constructing an input-side residual Hessian from activation quantization residuals, allowing for the identification of error-sensitive weight directions. By applying a closed-form truncated Singular Value Decomposition (SVD) on the scaled weight matrix, ARHQ isolates these sensitive directions into a high-precision low-rank branch. Experimental evaluations on the Qwen3-4B-Thinking-2507 model demonstrate that ARHQ significantly enhances layer-wise Signal-to-Noise Ratio (SNR) and maintains downstream reasoning performance on ZebraLogic, even under aggressive quantization conditions. The method's adaptability to specific quantization hardware and its focus on mitigating noise amplification mark a significant advancement in low-bit quantization techniques.
Methodology
ARHQ employs a post-training weight splitting technique that constructs a residual Hessian from activation quantization residuals. It isolates error-sensitive weight directions into a high-precision low-rank branch using closed-form truncated SVD, optimizing a low-rank approximation problem based on the quantization residual covariance.
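The sketch below shows one way such a residual-Hessian-guided split could look: form a covariance from activation quantization residuals, scale the weight by its symmetric square root, take a truncated SVD, and keep the leading directions in a high-precision low-rank branch. The rank, the naive symmetric quantizer, and the square-root scaling are assumptions standing in for ARHQ's exact recipe.

```python
# Hedged sketch of a residual-Hessian-guided weight split into low-rank + quantized branches.
import torch

def arhq_style_split(W, act_residuals, rank=16, n_bits=4):
    """W: (out, in) weight; act_residuals: (n_samples, in) activation quantization errors x - Q(x)."""
    H = act_residuals.T @ act_residuals / act_residuals.size(0)             # input-side residual covariance
    eigvals, eigvecs = torch.linalg.eigh(H)
    S = eigvecs @ torch.diag(eigvals.clamp_min(1e-8).sqrt()) @ eigvecs.T    # symmetric square root of H
    U, s, Vh = torch.linalg.svd(W @ S, full_matrices=False)                 # sensitivity-scaled SVD
    L = (U[:, :rank] * s[:rank]) @ Vh[:rank] @ torch.linalg.inv(S)          # high-precision low-rank branch
    residual = W - L
    scale = residual.abs().max() / (2 ** (n_bits - 1) - 1)                  # naive symmetric quantizer
    W_q = torch.round(residual / scale).clamp(-(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1) * scale
    return L, W_q                                                           # effective weight is approx. L + W_q
```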
Results
Experimental results indicate that ARHQ leads to a marked improvement in layer-wise SNR and effectively preserves the reasoning performance of LLMs under aggressive quantization, as demonstrated on the Qwen3-4B-Thinking-2507 model.
Implications
The findings suggest that ARHQ can enhance the performance of low-bit quantized LLMs, making them more efficient for deployment in resource-constrained environments while maintaining high fidelity in inference tasks.
Unlearning Offline Stochastic Multi-Armed Bandits
Reinforcement Learning
Theory
Efficient ML
- Introduces the first study of unlearning in offline stochastic multi-armed bandits.
- Formalizes privacy constraints and utility measurement in the context of unlearning.
- Develops adaptive algorithms that switch between the Gaussian mechanism and rollback depending on the data regime.
- Establishes theoretical performance guarantees and lower bounds for unlearning scenarios.
Read more
Unlearning Offline Stochastic Multi-Armed Bandits
Summary
This paper addresses the challenge of machine unlearning in the context of offline stochastic multi-armed bandits (MAB), a foundational problem in sequential decision-making. The authors formalize the concept of unlearning by introducing a privacy constraint and measuring utility through post-unlearning decision quality. They explore both single-source and multi-source unlearning scenarios under two data-generation models: the fixed-sample model and the distribution model. The proposed algorithms are based on two canonical methods, the Gaussian mechanism and rollback, and adaptively switch between them based on the data regime and privacy constraints. The paper provides theoretical performance guarantees and establishes lower bounds for the proposed methods, demonstrating their effectiveness through experiments that validate the predicted trade-offs between privacy and decision quality.
Methodology
The authors propose two base algorithms, the Gaussian mechanism and rollback, to address unlearning in offline MAB. They develop adaptive algorithms that switch between these methods based on the data regime and privacy constraints. The study includes a systematic analysis of single-source and multi-source unlearning scenarios under fixed-sample and distribution models, providing theoretical performance guarantees and lower bounds.
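The sketch below illustrates the two base routines on per-arm sufficient statistics (reward sums and pull counts). The noise calibration and the rule for switching between them are illustrative assumptions, not the paper's exact algorithms.

```python
# Hedged sketch of the two base unlearning routines for offline stochastic MAB.
import numpy as np

def unlearn_gaussian(sums, counts, user_sums, user_counts, sigma=1.0, rng=None):
    """Gaussian mechanism: subtract the deleted user's contributions, then add calibrated noise
    so the released statistics resemble retraining without that user."""
    rng = rng or np.random.default_rng()
    new_sums = sums - user_sums + rng.normal(0.0, sigma, size=sums.shape)
    new_counts = np.maximum(counts - user_counts, 1)
    return new_sums / new_counts                        # noisy post-unlearning mean reward estimates

def unlearn_rollback(all_rewards: dict, deleted_user) -> dict:
    """Rollback: recompute each arm's empirical mean exactly from the remaining users' observations."""
    means = {}
    for arm, observations in all_rewards.items():       # observations: list of (user_id, reward)
        kept = [r for u, r in observations if u != deleted_user]
        means[arm] = float(np.mean(kept)) if kept else 0.0
    return means

# An adaptive algorithm would pick whichever routine gives the better privacy/decision-quality
# trade-off for the current data regime, then recommend the arm with the highest estimate.
```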
Results
The paper presents upper and lower bounds for the performance of the proposed unlearning algorithms across various settings. The results indicate that the algorithms can effectively balance the trade-off between privacy and decision quality, with performance guarantees established for both single-source and multi-source unlearning scenarios.
Implications
The findings of this study have significant implications for privacy-preserving machine learning, particularly in applications where data deletion requests are common. The proposed methods can enhance the privacy of decision-making systems, making them more robust against inference attacks while maintaining decision quality.