AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 48
Update frequency: 8h
Days of history: 7
SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation
Time Series
- SL-BiLEM integrates behavioral dynamics into epidemic modeling, addressing feedback loops in disease spread.
- The model shows a 76% improvement over neural-mechanistic baselines and maintains robustness under distribution shifts.
- It provides interpretable components for effective transmission, facilitating counterfactual analysis for policy evaluation.
- SL-BiLEM achieves 100% bootstrap confidence interval coverage across synthetic counterfactual experiments.
Summary
The paper addresses the challenges of epidemic forecasting and policy evaluation, emphasizing the dynamic nature of human behavior in response to disease spread. The authors introduce SL-BiLEM (Structured Learnable Behavior-in-the-Loop Epidemic Model), which integrates physical constraints to enhance the robustness of epidemic models against distribution shifts caused by policy interventions. The model decomposes effective transmission into interpretable components, allowing for both accurate forecasting and counterfactual analysis. By leveraging real-world datasets, including cruise ship and school influenza data, SL-BiLEM demonstrates significant improvements over traditional neural-mechanistic models, achieving better predictive validity and treatment effect accuracy. This framework aims to provide public health decision-makers with a reliable tool for intervention planning and resource allocation, addressing the critical need for unified epidemic modeling approaches that can handle behavioral dynamics and policy changes.
Methodology
SL-BiLEM employs a structured approach to epidemic modeling by decomposing effective transmission into multiple factors influenced by policy, media, and compliance dynamics. The model incorporates physical constraints to ensure that the learned compliance function adheres to realistic behavioral patterns, enhancing its predictive validity and robustness against distribution shifts.
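The decomposition described above can be illustrated with a toy compartmental model. The following is a minimal sketch, not the authors' implementation: the SIR structure, the function name `simulate_sir`, the 0.6/0.4 weighting of policy and media, and the relaxation rate `k` are all assumptions chosen for illustration. It shows effective transmission written as a base rate scaled by a bounded compliance term, and a counterfactual comparison between two policy trajectories.

```python
def simulate_sir(beta0, gamma, policy, media, days, n=1000.0, i0=1.0, k=0.5):
    """Discrete-time SIR whose transmission rate is scaled by a behavioral factor.

    All parameter names and the functional form are hypothetical.
    """
    s, i, r = n - i0, i0, 0.0
    compliance = 0.0
    infected = []
    for t in range(days):
        # Compliance relaxes toward a target driven by policy and media signals.
        # Clamping the target to [0, 1] stands in for a physical constraint.
        target = min(1.0, max(0.0, 0.6 * policy[t] + 0.4 * media[t]))
        compliance += k * (target - compliance)
        beta_eff = beta0 * (1.0 - compliance)  # effective transmission
        new_inf = min(s, beta_eff * s * i / n)
        new_rec = gamma * i
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        infected.append(i)
    return infected


days = 100
# Counterfactual comparison: no intervention versus full intervention.
baseline = simulate_sir(0.4, 0.1, [0.0] * days, [0.0] * days, days)
intervention = simulate_sir(0.4, 0.1, [1.0] * days, [1.0] * days, days)
peak_reduction = max(baseline) - max(intervention)
```

Because each factor is an explicit term, a counterfactual is obtained by swapping the policy trajectory and rerunning the same model.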
Results
The model demonstrated a 76% improvement in forecasting accuracy compared to neural-mechanistic baselines, with only 53% out-of-distribution degradation versus 1142% for traditional neural models. It also achieved 100% bootstrap confidence interval coverage in synthetic counterfactual experiments and a treatment effect accuracy exceeding 0.85.
Implications
SL-BiLEM provides a novel framework for public health officials to make informed decisions regarding epidemic interventions. Its ability to accurately forecast and evaluate counterfactual scenarios can significantly enhance resource allocation and policy design during health crises.
SYNAPSE: Neuro-Symbolic Visual Thought-to-Text Decoding via Topological Semantic Denoising
NLP
Large Language Models
Multimodal
- Introduction of SYNAPSE, a lightweight neuro-symbolic framework for EEG-to-text decoding.
- Implementation of a graph purification mechanism to enhance semantic stability.
- Development of a latent exemplar retrieval strategy for improved text generation.
- Demonstrated robust performance across multiple EEG benchmarks and frozen LLMs.
Summary
The paper presents SYNAPSE, a novel neuro-symbolic framework designed to enhance the stability of EEG-to-text decoding by addressing the challenges posed by biological noise in neural signals. Traditional methods for translating EEG signals into coherent text often suffer from inaccuracies due to noise, leading to hallucinations or semantically unstable outputs. SYNAPSE introduces a symbolic regularization mechanism that purifies EEG-derived semantic candidates using a commonsense graph structure, thereby improving semantic stability without the need for end-to-end fine-tuning of large language models (LLMs). The framework includes a graph purification stage that filters out disconnected semantic noise while retaining high-priority neural targets. Additionally, it employs a latent exemplar retrieval strategy to inject syntactic templates, further stabilizing the text generation process. The authors evaluate SYNAPSE on two EEG decoding benchmarks, demonstrating significant performance improvements over existing methods, particularly in scenarios with noisy data. The results indicate that SYNAPSE achieves competitive decoding performance while ensuring biometric privacy by keeping raw EEG processing localized within the encoder stack.
Methodology
SYNAPSE employs a neuro-symbolic approach that integrates symbolic regularization during inference. It utilizes a commonsense multigraph for graph purification to filter out irrelevant semantic noise and retains high-confidence neural intents. The framework also incorporates a latent exemplar retrieval mechanism to provide syntactic templates, enhancing the stability of the generated text under noisy conditions.
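A rough sketch of the graph purification idea follows. It is an illustration under stated assumptions, not the paper's algorithm: the function `purify`, the `keep_threshold` parameter, and the toy graph are hypothetical. A candidate concept is kept if it is linked to another candidate in the commonsense graph, or if its decoder confidence is high enough to be treated as a priority target.

```python
def purify(candidates, graph, keep_threshold=0.8):
    """Filter semantic candidates using graph connectivity.

    candidates: dict mapping concept -> confidence from the EEG encoder.
    graph: dict mapping concept -> set of neighboring concepts.
    Both the rule and the threshold are assumptions for illustration.
    """
    kept = {}
    for concept, conf in candidates.items():
        neighbors = graph.get(concept, set())
        connected = any(
            other in neighbors for other in candidates if other != concept
        )
        if connected or conf >= keep_threshold:
            kept[concept] = conf
    return kept


graph = {
    "dog": {"leash", "park"},
    "leash": {"dog"},
    "park": {"dog"},
    "piano": {"music"},
}
candidates = {"dog": 0.9, "leash": 0.4, "piano": 0.3, "rocket": 0.85}
purified = purify(candidates, graph)
```

Here "piano" is dropped because it is both isolated from the other candidates and low confidence, while "rocket" survives on confidence alone.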
Results
SYNAPSE showed clear improvements over unconstrained prompting and prior decoupled decoding methods on the CVPR2017 and THINGS EEG2 benchmarks. It achieved performance levels comparable to or exceeding those of more resource-intensive fine-tuned systems, particularly under conditions of aggressive object-label ablation, indicating enhanced semantic stability and robustness.
Implications
The findings suggest that SYNAPSE could be applied in assistive communication technologies, immersive interfaces, and neurological healthcare, providing a more reliable means of translating neural activity into coherent language while ensuring user privacy.
Online Irregular Multivariate Time Series Forecasting via Uncertainty-Driven Dual-Expert Calibration
Time Series
- Introduces Under-Cali, a framework for online IMTS forecasting that adapts to distribution shifts.
- Utilizes an uncertainty estimator to guide the routing of samples to different calibration experts.
- Maintains a frozen source forecasting model while enabling lightweight, model-agnostic adaptation.
- Demonstrates significant performance improvements in dynamic scenarios with evolving data distributions.
Summary
This paper addresses the challenge of online forecasting for irregular multivariate time series (IMTS), which are critical in various applications but often suffer from performance degradation due to dynamic shifts in data distribution. The authors propose a novel framework called Under-Cali, which incorporates an uncertainty-driven dual-expert calibration approach. This framework consists of three main components: an uncertainty estimator that assesses the uncertainty of incoming data batches, a dual-expert calibration module that separates samples into reliable and unreliable pathways based on their uncertainty, and an adaptive routing module that determines when to adapt the calibration experts. By focusing on uncertainty as a key indicator of distribution shifts, Under-Cali allows for efficient online adaptation without the need to modify the source forecasting model. The proposed method demonstrates consistent improvements in forecasting accuracy across various IMTS benchmarks while maintaining low computational costs.
Methodology
The Under-Cali framework consists of an uncertainty estimator that evaluates the uncertainty of incoming samples, a dual-expert calibration module that processes samples through reliable and unreliable pathways based on their uncertainty, and an adaptive routing module that autonomously decides when to adapt the calibration experts. This approach allows the model to effectively manage inference and adaptation processes in real-time.
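The routing step can be sketched in a few lines. This is a simplified illustration rather than the Under-Cali implementation: `BiasExpert`, `frozen_model`, the additive correction, and the fixed `threshold` are all hypothetical stand-ins, and the adaptive module that decides when to update the experts is not modeled.

```python
class BiasExpert:
    """Hypothetical calibration expert that learns an additive correction."""

    def __init__(self, lr=0.5):
        self.bias = 0.0
        self.lr = lr

    def calibrate(self, pred):
        return pred + self.bias

    def update(self, pred, target):
        # One online step toward the observed target.
        self.bias += self.lr * (target - self.calibrate(pred))


def frozen_model(x):
    # Stand-in for the frozen source forecaster; its weights never change.
    return 2.0 * x


def forecast(x, uncertainty, experts, threshold=0.5):
    """Route a sample to one of two experts based on its uncertainty."""
    key = "reliable" if uncertainty <= threshold else "unreliable"
    return key, experts[key].calibrate(frozen_model(x))


experts = {"reliable": BiasExpert(), "unreliable": BiasExpert()}
# Adapt only the expert that handles high-uncertainty samples.
experts["unreliable"].update(frozen_model(1.0), target=3.0)
route_hi, pred_hi = forecast(1.0, uncertainty=0.9, experts=experts)
route_lo, pred_lo = forecast(1.0, uncertainty=0.1, experts=experts)
```

Only the small experts carry trainable state, which is what keeps the adaptation lightweight and independent of the forecaster's architecture.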
Results
Extensive experiments on IMTS benchmarks show that Under-Cali consistently outperforms existing methods in terms of forecasting accuracy while requiring minimal computational resources. The framework effectively adapts to dynamic shifts in data distribution, demonstrating its robustness in online learning scenarios.
Implications
The proposed framework has significant implications for real-world applications where irregular multivariate time series data is prevalent, such as healthcare, finance, and climate monitoring. By enabling efficient online adaptation, Under-Cali can improve decision-making processes in environments characterized by rapidly changing data patterns.
Dimensionality Reduction for Robust Federated Learning: A Theoretical Analysis and Convergence Guarantee
Federated Learning
Theory
Efficient ML
- Introduction of Projected Dimensionality Reduction (PDR) framework for efficient robust aggregation in FL.
- Theoretical convergence guarantees established for both non-convex and strongly convex functions.
- Significant reduction in server computational complexity to O(Mp), where M is the number of clients and p is the model dimension.
- Empirical results demonstrate orders of magnitude speedups in execution time while maintaining accuracy.
Summary
This paper addresses the computational challenges faced by Federated Learning (FL) systems, particularly in the context of Byzantine attacks, which can compromise the integrity of model training. The authors propose a novel framework called Projected Dimensionality Reduction (PDR) that significantly reduces the computational burden associated with robust gradient aggregation. By employing sparse random projection, PDR compresses high-dimensional gradients into a lower-dimensional subspace, allowing for efficient computation of reliability weights necessary for robust aggregation. The authors establish theoretical convergence guarantees for PDR, demonstrating optimal convergence rates for both non-convex and strongly convex functions. Experimental results on benchmark datasets, including TinyImageNet and CIFAR, show that PDR achieves substantial speedups in server execution time while maintaining competitive accuracy and effectively mitigating Byzantine threats. This work bridges the gap between dimensionality reduction techniques and robust aggregation in FL, providing a scalable solution to enhance the efficiency of federated systems.
Methodology
The PDR framework utilizes sparse random projection to compress gradients into a lower-dimensional subspace before performing robust aggregation. This approach leverages the Subspace Embedding Theorem to ensure that the relative distances among client updates are preserved, allowing for efficient computation of reliability weights in the reduced space.
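To make the distance-preservation claim concrete, here is a minimal sketch using an Achlioptas-style sparse random projection. It is a stand-in for whatever projection the paper uses: the sparsity parameter `s`, the dimensions, and the helper names are assumptions. The sketch compares pairwise distances between simulated client gradients before and after projection.

```python
import math
import random


def sparse_projection(p, k, s=3, seed=0):
    """Return a k x p matrix with entries in {+c, 0, -c}, where c = sqrt(s / k).

    Each entry is nonzero with probability 1/s, so squared norms are
    preserved in expectation.
    """
    rng = random.Random(seed)
    c = math.sqrt(s / k)

    def entry():
        u = rng.random()
        if u < 1.0 / (2 * s):
            return c
        if u < 1.0 / s:
            return -c
        return 0.0

    return [[entry() for _ in range(p)] for _ in range(k)]


def project(R, g):
    # Skip zero entries; this sparsity is where the savings come from.
    return [sum(r * x for r, x in zip(row, g) if r != 0.0) for row in R]


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(1)
p, k = 1000, 256
grads = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(4)]
grads.append([10.0 * rng.gauss(0.0, 1.0) for _ in range(p)])  # outlier client
R = sparse_projection(p, k)
low = [project(R, g) for g in grads]
ratios = [
    dist(low[i], low[j]) / dist(grads[i], grads[j])
    for i in range(len(grads))
    for j in range(i + 1, len(grads))
]
```

Ratios close to 1 indicate that the outlier client remains far from the honest clients in the reduced space, so reliability weights computed there should rank clients similarly.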
Results
PDR delivered significant speedups in server execution time, and its analysis establishes optimal convergence rates of O(1/√T) for non-convex functions and O(1/T) for strongly convex functions. The method effectively neutralized Byzantine threats in non-IID settings while maintaining competitive accuracy across various benchmark datasets.
Implications
The findings suggest that PDR can enhance the scalability and efficiency of federated learning systems, making them more robust against Byzantine attacks without incurring prohibitive computational costs. This has potential applications in privacy-sensitive domains where federated learning is critical.
Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior
NLP
Large Language Models
Efficient ML
- Introduction of Latent Recurrent Transformer (LRT) that reuses previous token hidden states as recurrent memory.
- Development of interleaved parallel training to efficiently pretrain LRT without sacrificing parallelism.
- LRT maintains standard transformer architecture while enhancing context utilization through recurrent pathways.
- Empirical results show improved language modeling and in-context learning with minimal parameter increase.
Summary
The paper introduces the Latent Recurrent Transformer (LRT), a novel architecture that enhances autoregressive transformers by incorporating a lightweight recurrent memory mechanism. This mechanism reuses high-level hidden states from previous tokens as memory for subsequent tokens, allowing for improved context utilization without increasing computational depth or introducing additional decoding steps. The authors propose a training strategy called interleaved parallel training, which enables efficient pretraining by refining disjoint subsets of token positions while maintaining parallelism. The LRT architecture preserves the standard attention mechanism and key-value cache interface, facilitating a seamless integration of recurrent memory. Empirical evaluations demonstrate that LRT significantly improves language modeling loss and in-context learning performance across various model sizes and parameter budgets, achieving better compute-quality trade-offs with minimal parameter increase. The findings suggest that using high-level representations as memory can enhance the performance of autoregressive models while keeping computational costs manageable.
Methodology
The authors propose the LRT architecture, which integrates a recurrent memory mechanism by reusing hidden states from previous tokens. They introduce interleaved parallel training to efficiently pretrain the model without sequentially unrolling the transformer, allowing for parallel refinement of token subsets. This approach preserves the standard transformer structure and enhances the model's ability to leverage high-level information across token positions.
Results
LRT demonstrates significant improvements in language modeling loss and in-context learning performance compared to baseline autoregressive transformers, achieving better scaling behavior and compute-quality trade-offs. The default shared-projection variant of LRT adds only 0.3% to the model's parameters while enhancing performance across various configurations.
Implications
The findings suggest that LRT can be effectively applied to improve the performance of autoregressive language models, making it a valuable approach for tasks requiring efficient context utilization and enhanced learning capabilities. This could have implications for various NLP applications, including text generation, dialogue systems, and more.
Symbolic Regression via Latent Iterative Refinement
Interpretability
Optimization
Theory
- Introduction of Latent Equation Embedding (LEE) for symbolic regression.
- Iterative amortized inference closes the gap between one-shot predictions and true posteriors.
- LEE produces significantly simpler expressions compared to existing methods.
- Combines iterative refinement with continuous gradient descent for enhanced robustness.
Summary
This paper introduces a novel framework for symbolic regression (SR) called Latent Equation Embedding (LEE), which aims to find closed-form mathematical expressions that fit observed data. Traditional neural SR methods often suffer from an 'amortization gap' due to their one-shot predictions, which do not accurately reflect the true posterior. LEE addresses this issue by employing iterative amortized inference in a functionally-grounded latent space. The framework consists of three main components: an encoder that embeds symbolic tokens and numerical observations into a shared latent vector, an expression decoder that reconstructs formulas from this latent vector, and an evaluation decoder that predicts function values, grounding the latent space in functional behavior. During inference, LEE iteratively refines the latent representation by re-encoding decoded expressions alongside observations, effectively closing the amortization gap. The methodology combines iterative refinement with continuous gradient descent, resulting in a more robust search process. Experimental results on the SRBench benchmark demonstrate that LEE produces expressions that are 2-10 times simpler than the best existing accuracy-oriented baselines, marking a significant advancement in the accuracy-complexity Pareto frontier.
Methodology
LEE employs a shared latent space with an encoder for embedding symbolic and numerical data, an expression decoder for reconstructing formulas, and an evaluation decoder for predicting function values. The inference process involves iterative refinement of the latent representation by re-encoding decoded expressions with observations, interleaved with continuous gradient descent to optimize the search process.
Results
On the SRBench benchmark, LEE achieved expressions that were 2-10 times simpler than the strongest accuracy-oriented baselines, demonstrating its effectiveness in producing interpretable mathematical expressions while maintaining high accuracy.
Implications
The findings suggest that LEE can significantly enhance the interpretability of models in symbolic regression tasks, making it a valuable tool for applications requiring both accuracy and simplicity in mathematical modeling.
SilIF: Silhouette-Augmented Isolation Forest for Unsupervised Transaction Fraud Detection
Theory
Interpretability
Efficient ML
- SilIF enhances Isolation Forest by adding a silhouette-based scoring layer.
- Demonstrated a statistically significant improvement in fraud detection performance on the IEEE-CIS benchmark.
- Identified conditions under which the silhouette augmentation is effective or ineffective.
- Provides open-source code for reproducibility of experiments.
Summary
This paper introduces SilIF, a novel augmentation of the Isolation Forest (IF) algorithm designed for unsupervised transaction fraud detection. The primary challenge in fraud detection is the scarcity of labeled data, making unsupervised methods essential. While IF is a widely used method due to its scalability and simplicity, it primarily relies on a single scalar score that summarizes anomaly detection, potentially overlooking valuable structural information from the trees. SilIF addresses this by incorporating a silhouette-based scoring mechanism that evaluates how well a transaction fits within its assigned structural group compared to others. The silhouette score is computed from the path lengths of each transaction across the trees, creating a 'fingerprint' representation. This additional scoring layer is combined with the base IF score using a hyperparameter, α. The authors validate SilIF on the IEEE-CIS Fraud Detection benchmark, demonstrating an average improvement of +0.0080 AUC-PR over standard IF across multiple seeds, with statistical significance. However, they also report that SilIF does not consistently outperform IF on a synthetic dataset, highlighting the conditions under which the augmentation is beneficial. The paper emphasizes the importance of honest reporting in machine learning enhancements and provides open-source code for reproducibility.
Methodology
SilIF utilizes the Isolation Forest algorithm to generate an ensemble of trees, from which it derives per-tree path lengths for each transaction. These path lengths are clustered to form structural groups, and a silhouette score is computed to assess the fit of each transaction within its group. This score is then combined with the base Isolation Forest score using a hyperparameter, α, allowing for a tunable enhancement.
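The scoring layer can be sketched as follows. This is an illustration only: the path-length fingerprints, cluster labels, and base scores below are made-up toy values, and the additive rule in `combine` (weighted by a parameter called `alpha` here) is an assumption, since the summary does not state the exact combination formula.

```python
import math


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette(points, labels):
    """Per-point silhouette: (b - a) / max(a, b), with 0 for degenerate cases."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(euclid(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def combine(base_scores, sil_scores, alpha):
    # Hypothetical rule: a poor structural fit (low silhouette) adds anomaly mass.
    return [b + alpha * (1.0 - s) / 2.0 for b, s in zip(base_scores, sil_scores)]


# Toy fingerprints: one row per transaction, one column per tree (path length).
fingerprints = [[8, 9, 8], [9, 8, 9], [8, 8, 9], [3, 2, 3], [2, 3, 2], [6, 5, 6]]
labels = [0, 0, 0, 1, 1, 1]
base = [0.40, 0.42, 0.41, 0.55, 0.56, 0.50]
sil = silhouette(fingerprints, labels)
combined = combine(base, sil, alpha=1.0)
```

In this toy data the last transaction sits between the two groups, so its silhouette is negative and its combined score rises above transactions that the base score alone ranked higher. Setting `alpha` to 0 recovers the base scores.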
Results
On the IEEE-CIS Fraud Detection benchmark, SilIF with α = 1.0 achieved an average improvement of +0.0080 AUC-PR compared to plain Isolation Forest, with statistical significance (p = 0.046). The method outperformed other techniques like HBOS and ECOD. However, on the synthetic Sparkov dataset, the silhouette augmentation did not yield improvements, indicating variability in effectiveness based on dataset characteristics.
Implications
SilIF presents a promising enhancement to existing anomaly detection methods, particularly in the financial sector for fraud detection. Its tunable nature allows practitioners to adapt the method to specific datasets and conditions, while the transparent reporting of results aids in understanding when such enhancements are beneficial.
Machine Learning methods for event classification and vertex reconstruction of the 12C + 12C reaction with the MATE-TPC
Computer Vision
- Machine learning techniques can effectively classify nuclear reaction events.
- High classification accuracy achieved with deep learning models (up to 97% for simulated data).
- Successful identification of misclassified events compared to traditional methods.
- Development of a CNN model for accurate reaction vertex reconstruction.
Summary
This paper presents the application of machine learning techniques to analyze data from the 12C + 12C fusion reaction using the MATE-TPC (multi-purpose active-target time projection chamber). The authors employed various deep learning models, including Residual Neural Networks (ResNet-50, ResNet-34, ResNet-18) and VGG-19, to classify elastic scattering and fusion reaction events. The models achieved high classification accuracies of approximately 97% for simulated data and 90% for experimental data, successfully identifying events misclassified by traditional methods. Additionally, the models were utilized to classify events from different fusion reaction channels with an accuracy of around 95% on simulated data. A Convolutional Neural Network (CNN) was also developed for reaction vertex reconstruction, demonstrating an effective alternative strategy. The findings indicate that machine learning can significantly enhance the classification of nuclear reaction events and improve vertex reconstruction, facilitating future analyses of complex nuclear reaction data.
Methodology
The study utilized deep learning models, specifically Residual Neural Networks (ResNet variants) and VGG-19, for event classification, and a Convolutional Neural Network (CNN) for vertex reconstruction. The models were trained on data from the MATE-TPC, focusing on distinguishing between elastic scattering and fusion reaction events.
Results
The machine learning models achieved classification accuracies of approximately 97% for simulated data and 90% for experimental data. They also classified different fusion reaction channels with about 95% accuracy on simulated data. The CNN model for vertex reconstruction provided an alternative and effective strategy for this task.
Implications
The successful application of machine learning techniques in this study suggests that they can significantly enhance the analysis of nuclear reaction data, particularly in identifying complex event types and improving reconstruction methods. This could lead to more efficient data processing in future nuclear physics experiments.
Gradient Transformer: Learning to Generate Updates for LLMs
NLP
Large Language Models
Efficient ML
- Introduces the first data-free weak-to-strong knowledge distillation method for LLMs.
- Develops GRAD-TRANSFORMER, a Transformer-based model that generates LLM update vectors from TinyLM updates.
- Demonstrates superior performance compared to state-of-the-art knowledge distillation methods.
- Enables privacy-preserving fine-tuning of LLMs without accessing sensitive data.
Summary
The paper addresses the challenge of fine-tuning large language models (LLMs) on private data, which is often infeasible for organizations with limited computational resources. The authors propose a novel data-free knowledge distillation framework called Gradient Transformer (GRAD-TRANSFORMER), which generates update vectors for LLMs based on fine-tuned TinyLMs. This approach captures the cumulative gradient effects during fine-tuning without requiring access to private data. The GRAD-TRANSFORMER utilizes a Transformer-based encoder-decoder architecture to transform TinyLM update vectors into LLM update vectors, facilitating multi-organization collaboration while maintaining privacy. Extensive experiments demonstrate that GRAD-TRANSFORMER significantly outperforms existing knowledge distillation methods, achieving an average performance gain of 55.89% over the best-performing baseline across various language modeling and reasoning tasks, even under strict differential privacy constraints.
Methodology
The authors developed GRAD-TRANSFORMER, which segments the parameters of fine-tuned TinyLMs into block-wise update vectors. These vectors are projected into a common hidden state dimension, allowing the model to autoregressively generate corresponding update vectors for the target LLMs. The training of GRAD-TRANSFORMER is based on a curated dataset of update vector tuples that capture the correlation between TinyLM and LLM updates.
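The segmentation step can be pictured with a small sketch. It assumes, for illustration only, that an update vector is the element-wise difference between fine-tuned and base parameters, flattened and cut into fixed-size blocks with zero padding; the function name `block_updates` and the padding choice are hypothetical.

```python
def block_updates(base, tuned, block_size):
    """Split the parameter delta (tuned - base) into fixed-size blocks.

    The final block is zero-padded so every block has the same length.
    """
    delta = [t - b for b, t in zip(base, tuned)]
    pad = (-len(delta)) % block_size
    delta += [0.0] * pad
    return [delta[i:i + block_size] for i in range(0, len(delta), block_size)]


base_params = [0.0, 0.0, 0.0, 0.0, 0.0]
tuned_params = [1.0, 2.0, 3.0, 4.0, 5.0]
blocks = block_updates(base_params, tuned_params, block_size=2)
```

Each block would then be projected to the shared hidden dimension and treated as one token of the sequence the model consumes.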
Results
GRAD-TRANSFORMER achieved an average performance gain of 55.89% over the best-performing baseline, with an average performance gap recovered (PGR) of 91.88% across six benchmark datasets. The model maintained competitive performance even under strict differential privacy protections applied to the TinyLMs.
Implications
This work has significant implications for organizations looking to leverage LLMs while adhering to privacy constraints. It enables efficient model updates without the need for data sharing, fostering collaboration among organizations and enhancing the utility of LLMs in sensitive applications.
Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling
NLP
Large Language Models
Efficient ML
- Introduction of Dense2MoE, a unified framework combining pruning and upcycling for on-device LLMs.
- Utilization of Layer-Fusion UpCycling (LF-UC) to maintain model accuracy while reducing redundancy.
- Demonstrated significant improvements in inference latency versus model accuracy, pushing the Pareto frontier.
- Validated across multiple benchmarks and model scales, confirming broad applicability.
Summary
The paper presents Dense2MoE, a novel framework that integrates structured layer pruning and Mixture of Experts (MoE) upcycling to enhance the efficiency of on-device large language models (LLMs). Traditional methods of converting dense models to MoEs often lead to parameter redundancy, which negatively impacts inference efficiency. Dense2MoE addresses this issue by employing Layer-Fusion UpCycling (LF-UC), which prunes bandwidth-heavy attention modules while repurposing the Multi-Layer Perceptrons (MLPs) from pruned layers into MoE experts. This approach preserves the model's core capabilities and optimizes the number of active parameters through selective token routing. The framework is guided by hardware Roofline theory to effectively reduce memory bandwidth consumption. With minimal continual pre-training, Dense2MoE transforms dense LLMs into efficient MoE models suitable for resource-constrained environments. Experimental results demonstrate that Dense2MoE significantly improves the trade-off between inference latency and model accuracy, outperforming existing dense baselines and state-of-the-art compression methods across various benchmarks.
Methodology
Dense2MoE employs a dual-constraint similarity metric to identify redundant decoder layers, selectively pruning bandwidth-heavy attention modules while preserving MLPs. The LF-UC technique integrates these MLPs as heterogeneous experts into retained layers, optimizing both inference efficiency and model accuracy. The framework is grounded in hardware Roofline theory to systematically address memory bandwidth challenges.
Results
Dense2MoE was evaluated across nine diverse benchmarks, showing significant advancements in the Pareto frontier of inference latency versus model accuracy. It outperformed dense baselines, state-of-the-art compression techniques, and standard upcycling methods, demonstrating robust scalability across various model sizes and architectures.
Implications
The Dense2MoE framework has the potential to facilitate the deployment of large language models in resource-constrained environments, such as autonomous driving and edge AI, by providing efficient model architectures that maintain high performance without excessive computational costs.
Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents
Efficient ML
- LEARNWEAK automates domain specialization for small CUAs, addressing performance gaps with large models.
- The framework synthesizes targeted training data based on identified weaknesses, eliminating the need for manual annotation.
- An error-aware specialization objective allows for precise updates by distinguishing between planning and execution errors.
- LEARNWEAK outperforms existing models, achieving significant performance improvements across multiple domains.
Summary
This paper presents LEARNWEAK, an innovative framework designed for the automated domain specialization of small computer-use agents (CUAs). The authors highlight the challenges faced by small CUAs, which, despite their practicality, often exhibit significant performance gaps compared to larger proprietary models. Traditional methods of enhancing performance through large-scale data synthesis have proven insufficient. LEARNWEAK addresses this by employing a reference agent to identify weaknesses in the target domain, enabling the synthesis of targeted tasks without the need for manual annotation. The framework introduces an error-aware specialization objective that differentiates between planning and execution errors, allowing for more precise updates to agent behavior. Evaluations on the OSWorld benchmark demonstrate that LEARNWEAK significantly improves performance, achieving average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively. The results underscore the importance of student-awareness in both data generation and training, paving the way for more efficient specialization of small CUAs across diverse domains.
Methodology
LEARNWEAK employs a two-stage approach: (1) dataset generation through an annotation-free pipeline that utilizes teacher-student comparisons to identify weaknesses and synthesize queries, and (2) agent training with an error-aware preference optimization that adapts the training objective based on the type of errors encountered (planning vs. execution).
Results
The framework was evaluated across eight domains in the OSWorld benchmark, resulting in average performance improvements of 11.6 percentage points over EvoCUA-8B and 11.1 percentage points over OpenCUA-7B. Specialized small agents even surpassed their reference teachers in several domains, demonstrating the effectiveness of the proposed methods.
Implications
The findings suggest that LEARNWEAK could lead to more efficient and targeted specialization of small CUAs, making them more competitive with larger models. This has potential applications in various fields where small, efficient agents are needed, such as edge computing and privacy-sensitive environments.
Function-Valued Causal Influence in Nonlinear Time Series
Time Series
- Scalar edge scores in nonlinear causal models can misrepresent the complexity of causal relationships.
- Function-valued causal influence provides a more accurate representation of state-dependent effects.
- The proposed framework allows for direct estimation of causal response functions from trained models.
- Synthetic experiments demonstrate that qualitatively different causal mechanisms can yield similar scalar scores.
Summary
This paper addresses the limitations of scalar edge scores in representing causal relationships in nonlinear time series models. The authors argue that scalar summaries obscure the true nature of causal influence, which is inherently state-dependent and can vary across different contexts. They formalize the concept of function-valued causal influence for additive, contribution-decomposable autoregressive models, demonstrating that scalar scores create an information bottleneck by conflating between-state variation with within-state noise. The authors introduce a framework based on Individual Conditional Expectation (ICE) to estimate causal response functions directly from trained models. Through synthetic experiments, they show that edges with similar scalar scores can exhibit qualitatively different behaviors, such as monotonic or threshold effects. An applied case study on democratic development illustrates that function-valued analysis reveals regime-specific causal structures that scalar approaches miss, highlighting the need for more nuanced interpretations of causal relationships in complex systems.
Methodology
The authors formalize function-valued causal influence for a class of additive autoregressive models and introduce a framework based on Individual Conditional Expectation (ICE) to estimate causal response functions. They conduct controlled synthetic experiments and apply their methodology to a panel dataset on democratic development.
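A minimal sketch of the ICE idea, assuming a toy two-input model rather than the authors' trained autoregressive models: for each observed row, one input is swept over a grid while the others are held fixed, and the prediction is recorded. The names `ice_curves`, `toy_model`, and `curve_range` are hypothetical, and the range of each curve stands in for a scalar edge score.

```python
from typing import Callable, List, Sequence


def ice_curves(
    model: Callable[[Sequence[float]], float],
    rows: Sequence[Sequence[float]],
    feature: int,
    grid: Sequence[float],
) -> List[List[float]]:
    """One curve per row: sweep `feature` over `grid`, hold other inputs fixed."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            x = list(row)
            x[feature] = value
            curve.append(model(x))
        curves.append(curve)
    return curves


def toy_model(x: Sequence[float]) -> float:
    """Hypothetical additive model: a linear effect plus a threshold effect."""
    return 0.25 * x[0] + (1.0 if x[1] > 0.0 else 0.0)


def curve_range(curve: Sequence[float]) -> float:
    """Stand-in scalar score: total change in prediction along the curve."""
    return max(curve) - min(curve)


rows = [[-1.0, -1.0], [0.0, 0.5], [2.0, 1.5]]
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]

linear_curves = ice_curves(toy_model, rows, 0, grid)
threshold_curves = ice_curves(toy_model, rows, 1, grid)

linear_scores = [curve_range(c) for c in linear_curves]
threshold_scores = [curve_range(c) for c in threshold_curves]
```

In this toy setup both inputs receive the same scalar score, yet one curve rises steadily and the other jumps at zero, which is the kind of difference the summary says scalar scores hide.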
Results
The study shows that scalar causal scores can mask important differences in causal mechanisms, leading to misleading interpretations. The function-valued analysis reveals distinct causal behaviors that are regime-dependent and asymmetric, which are not captured by scalar summaries.
Implications
This work suggests that adopting function-valued causal influence can enhance the interpretability of causal relationships in nonlinear time series, particularly in fields like social sciences where understanding the mechanisms of influence is crucial for policy-making and theoretical development.
Balancing Plasticity and Stability with Fast and Slow Successor Features
Reinforcement Learning
Robotics
Theory
- Introduces a continual RL setup with smooth, continuous non-stationarity.
- Demonstrates that performance degradation is primarily due to instability rather than insufficient plasticity.
- Proposes a framework integrating Successor Features with multi-timescale synaptic consolidation.
- Uses cross-attention over SFs to interpret the contributions of stability and plasticity across timescales.
Summary
This paper addresses the challenge of adapting deep reinforcement learning (RL) agents to non-stationary environments, which evolve gradually rather than through abrupt changes. The authors introduce a continual RL evaluation protocol that simulates naturalistic, continuous non-stationarity using stochastic drift processes in 3D Miniworld and MuJoCo environments. They investigate the balance between stability and plasticity, finding that methods emphasizing stability, such as synaptic consolidation, outperform those focused solely on plasticity, like parameter resetting. The study further explores the use of Successor Features (SFs) as a more effective target for consolidation compared to traditional Q-values. By applying neuro-inspired synaptic consolidation to SFs across multiple timescales, the authors demonstrate improved performance in environments with gradual changes. The findings suggest that prioritizing stability is crucial in continual learning settings characterized by gradual environmental shifts, and that multi-timescale consolidation of predictive representations can enhance adaptability while mitigating interference.
Methodology
The authors modified existing environments to incorporate continuous non-stationarity and systematically compared mechanisms that enhance plasticity against those that preserve stability. They utilized Successor Features for predictive representation and applied synaptic consolidation across multiple timescales to evaluate performance in changing environments.
Results
The study found that stability-focused methods significantly outperformed plasticity-focused methods in environments with gradual changes. The integration of SFs with multi-timescale consolidation led to superior performance, highlighting the importance of stability in continual learning.
Implications
The findings have implications for the design of RL algorithms that can effectively learn in real-world environments characterized by gradual changes, suggesting that prioritizing stability and using multi-timescale representations can enhance adaptability and performance.
Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity
NLP
Large Language Models
Efficient ML
- Integration of exponentially decaying memory into the RAT+ attention framework improves query-aware sparse inference.
- RAT+ consistently enhances accuracy over standard attention mechanisms across various sparse budgets.
- Significant performance improvements were observed in tasks using Quest, MoBA, and SnapKV with the memory-augmented architecture.
- Two hypotheses are proposed to explain the benefits of the memory module: improved critical-token selection and enhanced information retention.
Summary
This paper explores the enhancement of query-aware sparse inference methods through the integration of exponentially decaying memory into the RAT+ attention framework. The authors build upon their previous work, RAT+, which introduced a recurrence-augmented attention mechanism that allows for flexible dilated attention during inference. The study investigates whether this memory augmentation can also enhance existing query-aware sparse inference techniques, specifically focusing on three representative methods: Quest, MoBA, and SnapKV. The authors demonstrate that the RAT+ architecture consistently outperforms standard attention mechanisms across various sparse budgets in eight challenging needle-in-a-haystack tasks. They validate their findings using both the released RAT+ checkpoints and the OLMo2-7B model, which was further pretrained with the memory module. The results indicate significant improvements in accuracy, with SnapKV showing notable gains under reduced budgets. The paper also proposes two hypotheses regarding the benefits of the memory module: it enhances critical-token selection accuracy and provides an additional information pathway for selected candidates, both of which are supported by targeted experiments.
Methodology
The authors employed a recurrence-augmented attention mechanism (RAT+) that utilizes exponentially decaying memory to enhance standard attention. They evaluated the performance of this architecture against three query-aware sparse inference methods (Quest, MoBA, SnapKV) on needle-in-a-haystack tasks, comparing the results with those obtained from standard attention mechanisms. The study involved targeted experiments to validate the proposed hypotheses regarding the memory module's benefits.
Results
The results showed that the RAT+ architecture significantly improved the accuracy of the evaluated sparse inference methods. For instance, SnapKV achieved an average improvement of 34.11 and 40.03 points across eight tasks under 1/4 and 1/8 budgets, respectively. Additionally, Quest's accuracy increased from 68.0 to 98.6 on MK-2 under a 1/16 budget, while MoBA improved from 53.6 to 94.8 on MK-3.
Implications
The findings suggest that augmenting attention mechanisms with memory can lead to more efficient inference in long-context language models, potentially enabling better performance in real-world applications where computational resources are limited. This approach may also inspire future research on designing architectures that inherently support sparse inference.
Semi-Supervised Hypothesis Testing by Betting on Predictions
Theory
- Introduces a testing-by-betting framework for semi-supervised hypothesis testing.
- Develops an e-statistic to construct valid sequential tests under label and concept shifts.
- Demonstrates that the proposed tests retain validity and power despite inaccurate predictions.
- Shows significant power gains in simulations and applications to large language models.
Summary
This paper introduces a novel framework for semi-supervised hypothesis testing that utilizes predictions from unlabeled data to improve the efficacy of sequential hypothesis testing. The authors propose a testing-by-betting approach, where predictions on unlabeled samples enhance the power of tests concerning the distribution of Y and the conditional distribution Y|X. The framework is built around an e-statistic that facilitates the construction of valid sequential tests under conditions of label shift or concept shift. The authors demonstrate that their tests maintain validity and power even when the predictions are inaccurate. Through simulations and applications to large language model evaluations, they show significant power gains over traditional methods, even with limited unlabeled data and low prediction accuracy. The paper emphasizes the importance of leveraging unlabeled data to enhance hypothesis testing in various applications, particularly in scenarios where labeled data is scarce.
Methodology
The authors construct a sequential testing framework that incorporates predictions from a predictive model fitted on labeled data. They formalize conditions under which conclusions about Y and Y|X can be drawn from both labeled and unlabeled samples. The e-statistic is used to create valid tests that control type-I error dynamically as data is collected.
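For orientation, here is a sketch of plain testing by betting on a Bernoulli mean. It is not the paper's semi-supervised e-statistic, and it uses no predictions or unlabeled data. The function `betting_test` and its fixed `bet` parameter are assumptions made for illustration. Under the null hypothesis the wealth process is a nonnegative martingale, so by Ville's inequality rejecting once wealth reaches 1/alpha keeps the type-I error at or below alpha at any stopping time.

```python
from typing import Iterable, Optional, Tuple


def betting_test(
    observations: Iterable[int],
    p0: float,
    bet: float,
    alpha: float = 0.05,
) -> Tuple[Optional[int], float]:
    """Sequentially test H0: E[Y] = p0 for binary Y with a fixed bet size.

    Returns (rejection_time, final_wealth); rejection_time is None if the
    wealth never reaches 1 / alpha.
    """
    # The bet must keep every wealth factor nonnegative for y in {0, 1}.
    if not (-1.0 / (1.0 - p0) <= bet <= 1.0 / p0):
        raise ValueError("bet outside the range that keeps wealth nonnegative")

    wealth = 1.0
    threshold = 1.0 / alpha
    for t, y in enumerate(observations, start=1):
        # Fair payoff under H0: the expected factor is exactly 1.
        wealth *= 1.0 + bet * (y - p0)
        if wealth >= threshold:
            return t, wealth
    return None, wealth


# Data that contradict H0 (all ones when p0 = 0.5): wealth grows by 1.5x per step.
reject_time, reject_wealth = betting_test([1] * 20, p0=0.5, bet=1.0)

# Data consistent with H0 (alternating outcomes): wealth shrinks, no rejection.
null_time, null_wealth = betting_test([1, 0] * 10, p0=0.5, bet=1.0)
```

Because the bet is fixed here, power depends heavily on choosing it well. The summary indicates the paper improves power by using predictions on unlabeled samples, which this sketch does not attempt.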
Results
The proposed tests exhibit anytime validity and demonstrate non-trivial power, particularly in binary data scenarios. The simulations reveal that the tests outperform baseline approaches, including prediction-powered inference, even with limited unlabeled data and low prediction accuracy.
Implications
This work has significant implications for statistical inference in machine learning, particularly in fields where labeled data is expensive or difficult to obtain. It suggests that leveraging unlabeled data can enhance hypothesis testing power, which is crucial for applications like model evaluation and clinical trials.
Transfer Learning using 66 Diseases for Disease Forecasting Applications
Time Series
- Incorporating multiple data streams significantly enhances disease forecasting accuracy.
- The study compiles a publicly available database of infectious disease data spanning 66 diseases.
- Data quality is crucial; irrelevant data can degrade forecasting performance.
- The research categorizes data inputs into four classes to assess their impact on model performance.
Summary
This paper explores the application of transfer learning in disease forecasting by leveraging data from 66 infectious diseases. Traditional forecasting models often rely on a single data stream, which can lead to poor performance in data-sparse or noisy conditions. The authors investigate the benefits of incorporating multiple data streams, demonstrating that this approach improves forecasting accuracy in 84.9% of cases analyzed. They compile a comprehensive database of infectious disease data, which is publicly available for the forecasting community. The study emphasizes the importance of data quality, noting that irrelevant or vastly different data can negatively impact model performance. The authors categorize their analysis into four classes of data input, ranging from single data streams to all available data across diseases and transmission modes, and evaluate the performance of various forecasting models across 20 different disease data streams.
Methodology
The authors compiled and cleaned a large dataset of infectious disease case, death, and hospitalization data from 66 diseases across 13 data sources. They categorized the data into four classes based on the type of input used for training machine learning models. The performance of various forecasting models was then evaluated across 20 distinct disease data streams.
Results
The study found that incorporating additional data streams improved forecasting accuracy in 84.9% of the time series and model structures analyzed. However, it also highlighted instances where adding dissimilar data could lead to negative transfer, adversely affecting model performance.
Implications
The findings suggest that leveraging transfer learning and diverse data sources can significantly enhance the robustness of disease forecasting models. This approach could be applied to improve public health responses and resource allocation during infectious disease outbreaks.
Model Merging on Loss Landscape: A Geometry Perspective
Theory
Optimization
Efficient ML
- EpiMer introduces a geometry-based approach to model merging, addressing the limitations of existing flat-geometry methods.
- The framework utilizes the Fréchet mean on a Riemannian manifold, focusing on a low-rank subspace to simplify computations.
- Theoretical analysis provides insights into the conditions under which curvature-aware merging is beneficial.
- Empirical results show that EpiMer outperforms traditional merging methods across various tasks.
Summary
This paper introduces EpiMer, a novel framework for model merging that addresses the limitations of existing methods by incorporating the geometry of the loss landscape. Traditional merging techniques often overlook the curvature of the loss landscape or rely on impractical full-space Hessian approximations. EpiMer formulates model merging as finding the Fréchet mean on a Riemannian manifold, utilizing the expected Hessian as the metric and focusing on a low-rank subspace defined by task vectors. The authors provide a theoretical analysis that decomposes the merging error into subspace Fréchet variance and residual energy, establishing conditions under which curvature-aware merging outperforms flat-geometry methods. The framework also unifies various merging strategies, including curvature-aware and spectral methods, under a common geometric perspective. Empirical results demonstrate that EpiMer significantly enhances performance when merging fine-tuned CLIP-ViT models across multiple image classification tasks, outperforming baseline methods in terms of average and worst-task accuracy.
Methodology
EpiMer casts model merging as the problem of finding the Fréchet mean on a Riemannian manifold, using the expected Hessian as the metric. It restricts computations to a low-rank subspace spanned by task vectors, allowing for practical curvature-aware merging. The authors also conduct a theoretical analysis to derive error bounds and conditions for effective merging.
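For reference, the general definition of a weighted Fréchet mean can be written as follows. The symbols $\theta_i$ (fine-tuned parameters), $w_i$ (nonnegative weights), $K$ (number of models), and $d_g$ (geodesic distance induced by a metric $g$) are notation introduced here. The summary does not give the paper's exact formulation, so this is a sketch of the concept rather than EpiMer's objective.

```latex
\theta^{\star}
  = \arg\min_{\theta \in \mathcal{M}}
    \sum_{i=1}^{K} w_i \, d_g(\theta, \theta_i)^{2}
% Flat (Euclidean) special case, d_g(\theta, \theta_i) = \lVert \theta - \theta_i \rVert_2 :
\theta^{\star}_{\text{flat}}
  = \frac{\sum_{i=1}^{K} w_i \, \theta_i}{\sum_{i=1}^{K} w_i}
```

The flat case reduces to weighted parameter averaging. Per the summary, a curvature-aware method replaces the Euclidean distance with one derived from the expected Hessian.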
Results
The experiments demonstrate that EpiMer consistently outperforms baseline merging methods on all three CLIP-ViT backbones across eight image classification tasks, improving both average and worst-task accuracy.
Implications
The proposed framework has significant implications for model integration in machine learning, particularly in scenarios where multiple task-specific models need to be merged efficiently without retraining. It enhances the understanding of how geometric properties of loss landscapes affect model performance.
Benchmarking Inductive Biases for Multivariate Time-Series Anomaly Detection with a Robust Multi-View Channel-Graph Detector
Time Series
- Introduces a unified benchmarking study for MTS anomaly detection methods.
- Evaluates ten detectors across five datasets under consistent protocols.
- Finds that no single inductive bias dominates across all datasets.
- CCG-MSD achieves the best performance and robustness metrics.
Summary
This paper presents a comprehensive benchmarking study of multivariate time-series (MTS) anomaly detection methods, evaluating ten representative detectors across five datasets (SMD, MSL, SMAP, PSM, and MSDS). The study aims to provide empirical guidance on the effectiveness of various inductive biases for different MTS workloads. The authors introduce a new adaptive detector, CCG-MSD, which combines multiple views of data through a directed channel-graph approach. The benchmark reveals that no single inductive bias consistently outperforms others across all datasets, and it highlights the importance of using absolute perturbation VUS-ROC as a more informative metric than retention ratios. CCG-MSD achieves the highest macro-average VUS-ROC score of 0.675, outperforming other methods and demonstrating superior robustness across various perturbations. The paper also emphasizes the need for a unified evaluation protocol in MTS anomaly detection research and provides a reproducibility repository with all necessary resources for further research.
Methodology
The study evaluates ten family-representative MTS anomaly detectors, including statistical, reconstruction, association, frequency, and transformer-based methods. The evaluation is conducted under four dimensions: effectiveness, efficiency, robustness, and cross-dataset generalization, using identical training and evaluation protocols. The proposed CCG-MSD detector integrates a directed channel-graph view with additional patch-attention and temporal-association cues, adapting its configuration based on the dataset.
Results
CCG-MSD achieves a macro-average VUS-ROC of 0.675, ranking first overall and placing in the top-3 across all five datasets. It demonstrates significant robustness against various perturbations, outperforming other methods, particularly in noise, channel dropout, and time-shift scenarios. The benchmark also reveals that retention ratios can misrepresent robustness for weaker detectors.
Implications
The findings suggest that practitioners should carefully select anomaly detection methods based on the specific characteristics of their MTS workloads. The release of the benchmarking framework and datasets can facilitate further research and development in the field of MTS anomaly detection, promoting more effective and robust solutions.
Stage-wise Distortion-Perception Traversal in Zero-shot Inverse Problems with Diffusion Models
Generative Models
Computer Vision
Efficient ML
- Introduction of a stage-wise framework for D-P traversal in zero-shot inverse problems using diffusion models.
- The MAP-RPS method combines MAP estimation with re-noised posterior sampling to balance distortion and perception.
- Extension to latent space with LMAP-RPS enhances applicability and efficiency.
- Theoretical analyses support the effectiveness of the proposed methods.
Summary
This paper addresses the distortion-perception (D-P) tradeoff in Bayesian inverse problems, which is crucial for balancing reconstruction fidelity and perceptual quality. The authors propose a novel stage-wise framework called MAP-RPS that utilizes a single diffusion model for zero-shot inverse problems. The framework consists of two stages: the first stage employs maximum a posteriori (MAP) estimation to provide a low-distortion initialization, while the second stage enhances perceptual quality through a re-noised posterior sampling strategy. The authors extend this approach to latent space, resulting in LMAP-RPS, which leverages pre-trained latent diffusion models for broader applicability. Theoretical analyses validate the effectiveness of both stages, and extensive experiments demonstrate that MAP-RPS and LMAP-RPS achieve superior D-P traversal across various tasks, outperforming existing methods and showcasing their potential as efficient solvers for real-world inverse problems.
Methodology
The proposed methodology consists of a two-stage framework: the first stage uses MAP estimation to approximate the MMSE solution, providing a low-distortion initialization. The second stage employs a re-noised posterior sampling strategy to progressively enhance perceptual quality. The approach is further extended to latent space, leveraging pre-trained diffusion models for broader applicability.
Results
The experimental results indicate that both MAP-RPS and LMAP-RPS closely approximate the ideal D-P curve across various image inverse problems. LMAP-RPS, in particular, demonstrates superior performance on near-real-world inverse problems, achieving better results with lower computational complexity compared to existing latent diffusion-based algorithms.
Implications
The findings suggest that the proposed stage-wise framework can significantly improve the efficiency and effectiveness of diffusion-based inverse algorithms, making it a valuable tool for practical applications in fields such as image restoration, medical imaging, and signal processing.
Thinned Mean Field Langevin Dynamics
Theory
Optimization
Efficient ML
- KT-MFLD reduces computational complexity from O(N^2) to O(N^(3/2)) by using a thinned particle coreset.
- The method retains convergence guarantees similar to traditional MFLD under mild conditions.
- Empirical validation shows KT-MFLD outperforms Random-MFLD and other coreset methods in multiple tasks.
- The approach is particularly beneficial for high-dimensional and multimodal distributions.
Summary
This paper introduces a novel approach to Mean Field Langevin Dynamics (MFLD) called KT-MFLD, which addresses the computational inefficiency associated with simulating large particle systems in entropy-regularized learning tasks. Traditional MFLD requires O(N^2) computational complexity due to pairwise interactions among N particles, which becomes prohibitive for high-dimensional and multimodal distributions. The proposed KT-MFLD method reduces this complexity to O(N^(3/2)) by implementing a thinning technique that allows each particle to interact only with a selected subset of particles, significantly improving scalability. The authors provide a theoretical framework demonstrating that KT-MFLD maintains convergence guarantees comparable to MFLD under mild regularity conditions. Empirical results validate the effectiveness of KT-MFLD across various applications, including training neural networks and quantization tasks, showing superior performance compared to traditional MFLD and other acceleration methods.
Methodology
The authors propose KT-MFLD, which involves selecting a representative subset of particles at each iteration through a thinning process. This subset interacts with the remaining particles, reducing the computational burden. The selection is performed using a near-linear time algorithm called KT-SPLIT-Compress, which ensures that the accuracy of the particle approximation remains comparable to MFLD.
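The cost argument can be sketched by counting kernel evaluations. This example assumes a coreset of about sqrt(N) particles, which is consistent with the stated O(N^(3/2)) cost but is not confirmed by the summary. It also uses uniform random subsampling as a placeholder for KT-SPLIT-Compress, which is not reproduced here. `CountingKernel`, `interaction_drift`, and `sqrt_coreset` are hypothetical names.

```python
import math
import random
from typing import List, Sequence


class CountingKernel:
    """Toy attraction kernel k(x, y) = y - x that counts its evaluations."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: float, y: float) -> float:
        self.calls += 1
        return y - x


def interaction_drift(
    particles: Sequence[float], subset: Sequence[float], kernel: CountingKernel
) -> List[float]:
    """Average interaction of every particle with the particles in `subset`."""
    m = len(subset)
    return [sum(kernel(x, y) for y in subset) / m for x in particles]


def sqrt_coreset(particles: Sequence[float], rng: random.Random) -> List[float]:
    """Placeholder thinning: a uniform random subset of size ceil(sqrt(N))."""
    m = math.isqrt(len(particles) - 1) + 1
    return rng.sample(list(particles), m)


rng = random.Random(0)
particles = [rng.gauss(0.0, 1.0) for _ in range(400)]

full_kernel = CountingKernel()
full_drift = interaction_drift(particles, particles, full_kernel)

thin_kernel = CountingKernel()
coreset = sqrt_coreset(particles, rng)
thin_drift = interaction_drift(particles, coreset, thin_kernel)

full_calls = full_kernel.calls   # N * N evaluations
thin_calls = thin_kernel.calls   # N * ceil(sqrt(N)) evaluations
```

With N = 400 the full update needs 160,000 kernel evaluations and the thinned one 8,000. Choosing the subset so that accuracy is preserved is the part the paper addresses and this sketch does not.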
Results
The theoretical analysis shows that the finite-particle approximation error of KT-MFLD scales as O(N^(-1)(log N)^2), matching the approximation error of MFLD up to logarithmic factors. Empirical results demonstrate that KT-MFLD consistently outperforms Random-MFLD and other related methods across various tasks, confirming its effectiveness in practical applications.
Implications
The KT-MFLD approach has significant implications for scalable particle-based simulations in machine learning, particularly in fields requiring efficient computation for high-dimensional distributions. It can enhance the performance of neural networks, improve quantization methods, and facilitate more efficient inference in post-Bayesian frameworks.
$E^3$-Agent: An Executable and Evolving Agent for Resource Management of Edge Generative Inference
NLP
Large Language Models
Generative Models
- E3-Agent features a dual-layer architecture for efficient resource management in edge environments.
- It adapts to non-stationary conditions through online learning from execution feedback.
- The system significantly reduces latency and improves performance stability compared to static resource management approaches.
- E3-Agent is designed to handle the complexities of heterogeneous devices and dynamic workloads in edge AI applications.
Summary
The paper introduces E3-Agent, an innovative resource management system designed for edge deployments of generative inference. Traditional resource managers often struggle with the non-stationary nature of edge environments, where per-device performance can vary due to factors such as user-driven events and device churn. E3-Agent addresses these challenges by implementing a dual-layer architecture: a fast-path router for immediate decision-making and a slow-path meta-controller that adapts to changing conditions through an explicit control interface. This design allows E3-Agent to learn from real-time execution feedback, continuously optimizing resource allocation and mitigating performance degradation. The authors evaluate E3-Agent using a discrete-event simulator based on MLPerf-derived metrics, demonstrating significant improvements in latency and stability across various dynamic scenarios, including semantic dynamics and device churn. The results indicate that E3-Agent can reduce average latency by 65%-73% compared to static baselines while maintaining performance close to an ideal oracle, showcasing its effectiveness in managing edge resources for generative AI applications.
Methodology
E3-Agent employs a dual-layer architecture consisting of a fast-path router for immediate decision-making and a slow-path meta-controller that adapts to changing conditions. The system learns online from execution feedback, allowing it to continuously optimize resource allocation. The evaluation is conducted using a discrete-event simulator informed by MLPerf-derived metrics, simulating various dynamic scenarios to assess performance.
Results
E3-Agent achieves a 65%-73% reduction in average latency compared to the best static baseline and maintains performance within 7%-10% of an online full-information oracle. It effectively suppresses stutter rates under conditions of semantic degradation, demonstrating its robustness in managing edge resources.
Implications
The findings suggest that E3-Agent can significantly enhance the efficiency and responsiveness of edge generative inference systems, making it a valuable tool for applications requiring real-time resource management in dynamic environments. Its design principles may also inform future developments in edge computing and AI-driven resource management.
Meta-Attention: Bayesian Per-Token Routing for Efficient Transformer Inference
NLP
Large Language Models
Efficient ML
- Meta-Attention dynamically selects attention mechanisms for each token based on contextual demands.
- The Bayesian Meta-Controller uses a Dirichlet prior to inform routing decisions, improving compute-performance trade-offs.
- Empirical results show a 2.4× reduction in projected normalized FLOP cost compared to prior-free routing methods.
- The framework effectively prevents routing collapse while maintaining task performance.
Summary
The paper introduces Meta-Attention, a novel framework designed to enhance the efficiency of transformer architectures by dynamically routing each token to the most suitable attention strategy: full softmax attention, linear (kernel) attention, or sliding-window local attention. This is achieved through a Bayesian Meta-Controller that treats the selection of attention mechanisms as a posterior inference problem, guided by a compute-aware Dirichlet prior. The proposed method addresses the limitations of existing routing techniques, which often rely on deterministic or prior-free learned routing, by providing principled routing uncertainty estimates that facilitate a smooth transition from soft to hard routing. The framework is validated through empirical results on a Tiny LM benchmark, demonstrating significant reductions in computational costs and routing entropy compared to prior-free baselines. The paper also details the architecture of the Bayesian controller, the Evidence Lower Bound (ELBO) training objective, and presents a Phase 1 PyTorch prototype that confirms the correctness and effectiveness of the proposed approach.
Methodology
The methodology involves a Bayesian Meta-Controller that infers the optimal attention mechanism for each token based on a compute-aware Dirichlet prior. The routing weights are derived from an amortized variational posterior trained with an ELBO objective, which balances task performance and attention mechanism costs. The framework includes both soft and hard routing variants, with a focus on preventing routing collapse.
Results
The empirical validation on a Tiny LM benchmark indicates that the Bayesian controller achieves a projected normalized FLOP cost of 25.1% under hard routing, significantly lower than the 59.3% cost of the prior-free baseline. Additionally, routing entropy was reduced from 55.8% to 43.3%, demonstrating the effectiveness of the Dirichlet prior in preventing routing collapse.
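The headline reduction factor follows directly from the two reported costs. The short check below uses only the figures quoted above; the variable names are illustrative.

```python
# Projected normalized FLOP cost under hard routing, in percent (from the summary).
prior_free_cost = 59.3
bayesian_cost = 25.1

# Ratio of baseline cost to Bayesian-controller cost.
cost_reduction_factor = prior_free_cost / bayesian_cost

# Routing entropy change in percentage points (from the summary).
entropy_drop_points = 55.8 - 43.3

print(f"cost reduction: {cost_reduction_factor:.2f}x")
print(f"entropy drop: {entropy_drop_points:.1f} points")
```

The ratio is about 2.36, which rounds to the reported 2.4× reduction.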
Implications
The findings suggest that Meta-Attention can lead to more efficient transformer models, enabling better resource utilization in applications requiring large-scale language processing. This approach could be particularly beneficial in scenarios where computational budgets are constrained, such as mobile or edge devices.
SPHERE-JEPA: Spherical Prediction with Homogeneous Embeddings
Computer Vision
Theory
Efficient ML
- Extends theoretical analysis of optimal SSL representations from Euclidean to Riemannian manifolds.
- Proves that uniform distributions on the hypersphere are optimal for k-NN and kernel ridge regression.
- Introduces SUSReg, a projection-based regularizer that enforces hyperspherical uniformity.
- Demonstrates significant performance improvements over Gaussian-based regularization on standard benchmarks.
Summary
This paper addresses a critical gap in self-supervised learning (SSL) by exploring the optimal geometry of learned representations on lower-dimensional manifolds, specifically the hypersphere. Building on LeJEPA, which identified isotropic Gaussian embeddings as optimal in Euclidean spaces, the authors extend this analysis to Riemannian manifolds. They demonstrate that uniform distributions on these manifolds are optimal for k-nearest neighbors (k-NN) and kernel ridge regression, contrasting with Gaussian embeddings that induce anisotropic neighborhoods. To enforce this uniformity, the authors introduce SPHERE-JEPA, a new SSL framework that adapts the Cramér–Wold projection mechanism to promote hyperspherical uniformity instead of Gaussian priors. Empirical evaluations show that SPHERE-JEPA significantly improves performance on standard benchmarks, including a 1.8% accuracy gain on ImageNet-1K and over 6% improvement in texture retrieval tasks, while maintaining computational efficiency.
Methodology
The authors extend the minimax analysis of representation geometry to smooth distributions on Riemannian manifolds. They derive optimality principles for local and kernel-based learning methods, proving that uniform distributions are optimal for k-NN and kernel ridge regression. SPHERE-JEPA employs a projection-based regularization mechanism, SUSReg, to enforce uniformity on the hypersphere, adapting techniques from the Cramér–Wold theorem.
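To illustrate what a projection-based uniformity check can look like, the sketch below projects unit-norm embeddings onto random directions and compares the first two moments with those of a uniform distribution on the sphere: mean 0 and second moment 1/d for any unit direction. This is not SUSReg, whose definition the summary does not give. Matching two moments is only a necessary condition for uniformity, and `projection_uniformity_penalty` is a hypothetical name.

```python
import math
import random
from typing import List, Sequence


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Uniform random direction: a normalized standard Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def projection_uniformity_penalty(
    embeddings: Sequence[Sequence[float]], num_directions: int, rng: random.Random
) -> float:
    """Average squared deviation of 1-D projection moments from the uniform case."""
    dim = len(embeddings[0])
    n = len(embeddings)
    total = 0.0
    for _ in range(num_directions):
        u = random_unit_vector(dim, rng)
        proj = [sum(a * b for a, b in zip(u, x)) for x in embeddings]
        mean = sum(proj) / n
        second = sum(p * p for p in proj) / n
        # Uniform on the unit sphere: E[<u, x>] = 0 and E[<u, x>^2] = 1 / dim.
        total += mean ** 2 + (second - 1.0 / dim) ** 2
    return total / num_directions


rng = random.Random(0)
dim, n = 8, 500

# Near-uniform embeddings: normalized Gaussian samples.
uniform_embeddings = [random_unit_vector(dim, rng) for _ in range(n)]
# Collapsed embeddings: every point sits at the same location on the sphere.
collapsed_embeddings = [[1.0] + [0.0] * (dim - 1) for _ in range(n)]

uniform_penalty = projection_uniformity_penalty(uniform_embeddings, 16, rng)
collapsed_penalty = projection_uniformity_penalty(collapsed_embeddings, 16, rng)
```

The near-uniform sample yields a penalty close to zero, while the collapsed one is penalized because its projections have a nonzero mean along most directions.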
Results
SPHERE-JEPA shows consistent improvements over Gaussian-based regularization methods, achieving a 1.8% accuracy gain in linear probing on ImageNet-1K and over 6% improvement in mean average precision for texture retrieval tasks. The framework matches or outperforms existing SSL methods across various benchmarks.
Implications
The findings suggest that enforcing uniformity in learned representations can enhance the performance of SSL methods, particularly in nonparametric settings. This approach could lead to more effective representation learning in various applications, including computer vision and beyond.
Law of Neural Interaction: Depth-Width Shape, Interaction Efficiency, and Generalization
Large Language Models
Theory
Efficient ML
- Generalization in neural networks is influenced by interaction efficiency, which can be optimized by adjusting the depth-width ratio.
- The concept of 'neural interaction' is introduced, extending superposition to gradient space and providing new metrics for quantifying interaction.
- An efficient interaction interval exists under fixed budgets, which remains stable as resource budgets scale up.
- Models closer to the efficient interaction interval demonstrate superior performance on benchmarks like MMLU-Pro.
Law of Neural Interaction: Depth-Width Shape, Interaction Efficiency, and Generalization
Summary
This paper investigates the efficiency of resource utilization in large language models (LLMs) under fixed budgets, focusing on the relationship between model architecture (depth and width) and generalization performance. The authors extend the concept of superposition, previously defined in parameter space, to gradient space, introducing the notion of 'neural interaction.' By employing the Neural Feature Ansatz (NFA) and the Average Gradient Outer Product (AGOP), they define two metrics for quantifying interaction efficiency: absolute interaction energy (AOFE) and interaction contribution (AOFE-ratio). The study reveals that models with a favorable depth-width ratio (RD/W) exhibit better generalization by optimizing interaction efficiency. The authors propose the Law of Neural Interaction, asserting that generalization is influenced not only by model capacity but also by the efficiency of converting representational resources into reusable structures. Empirical results indicate that models operating within an efficient interaction interval tend to perform better on the MMLU-Pro benchmark, suggesting that tuning the depth-width shape is crucial for maximizing performance under fixed resource constraints.
Methodology
The authors utilize the Neural Feature Ansatz (NFA) and the Average Gradient Outer Product (AGOP) to analyze neural interactions in multilayer networks. They define and quantify interaction efficiency through two metrics: absolute interaction energy (AOFE) and interaction contribution (AOFE-ratio). The study involves systematic experimentation with various depth-width ratios across different network architectures to assess their impact on generalization performance under fixed resource budgets.
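For readers unfamiliar with the AGOP, a minimal sketch of the quantity follows: the average over inputs of the outer product of the model's input gradient with itself. The toy model, its weight vector `w`, and the helper names are assumptions for illustration; the AOFE and AOFE-ratio metrics are not defined in this summary, so they are not reproduced here.

```python
import numpy as np

def agop(grad_fn, X):
    """Average gradient outer product: mean over rows x of grad f(x) grad f(x)^T."""
    G = np.stack([grad_fn(x) for x in X])   # (n, d) input gradients
    return G.T @ G / len(X)                 # (d, d), symmetric positive semidefinite

# Toy model (assumption): f(x) = tanh(w . x), with input gradient (1 - tanh^2(w . x)) * w.
w = np.array([1.0, -2.0, 0.0, 0.5])

def grad_f(x):
    return (1.0 - np.tanh(w @ x) ** 2) * w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
M = agop(grad_f, X)
eigvals = np.linalg.eigvalsh(M)             # ascending order
```

Because this toy model depends on the input only through `w . x`, its AGOP is a positive multiple of the outer product of `w` with itself, so a single eigenvalue dominates and the unused third coordinate contributes nothing.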
Results
The findings indicate that models with a high AOFE-ratio and low AOFE achieve better generalization, confirming the existence of an efficient interaction interval. Empirical comparisons of small dense LLMs reveal that those operating near this interval perform better on the MMLU-Pro benchmark, supporting the proposed Law of Neural Interaction.
Implications
This research provides insights into model architecture design for LLMs, emphasizing the importance of interaction efficiency in resource-constrained environments. It suggests that optimizing the depth-width shape can lead to improved generalization, which is crucial for developing more efficient and effective neural networks.
Convergence of Spectral Descent for Non-smooth Optimization
Optimization
Theory
Efficient ML
- Establishment of global linear convergence for Spectral Descent and Truncated Spectral Descent in non-smooth optimization.
- Introduction of a neighborhood-based subgradient selection mechanism to stabilize optimization trajectories.
- Theoretical guarantees for robust low-rank matrix recovery under mixed noise conditions.
- Empirical validation showing superior performance of Muon-type optimizers compared to traditional methods.
Convergence of Spectral Descent for Non-smooth Optimization
Summary
This paper investigates the convergence behavior of Spectral Descent (SD) and its truncated variant (TSD) in the context of non-smooth optimization, particularly focusing on their application to large language models (LLMs). The authors establish global linear convergence guarantees for both SD and TSD under conditions of convexity, Lipschitz continuity, and sharpness. They also explore the effects of decoupled weight decay, linking regularized spectral updates to the Conditional Subgradient (Frank-Wolfe) method, and provide sublinear convergence guarantees for regularized versions of these methods. The theoretical framework is applied to robust low-rank matrix recovery under mixed noise regimes, yielding rigorous recovery guarantees. Extensive numerical experiments validate the theoretical findings, demonstrating that Muon-type optimizers, including SD and TSD, significantly outperform traditional methods like gradient descent and Adam in non-smooth settings, particularly in terms of convergence speed and robustness to learning rate schedules.
Methodology
The authors analyze the convergence of SD and TSD under specific structural assumptions and introduce a neighborhood-based subgradient selection mechanism for optimization stability. They derive theoretical guarantees for both global linear and sublinear convergence, and apply their framework to robust low-rank matrix recovery problems.
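As a rough illustration of what a spectral descent step looks like on a non-smooth objective, the sketch below replaces a subgradient by its orthogonal (polar) factor, so every singular value of the update equals one. The l1 toy loss, the geometric step-size decay, and all function names are assumptions chosen for this example; the paper's neighborhood-based subgradient selection and the truncated variant are not reproduced.

```python
import numpy as np

def spectral_direction(G):
    """Orthogonal (polar) factor U V^T of G: all singular values set to 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def spectral_descent_l1(M, eta0=0.1, decay=0.95, steps=200):
    """Run spectral steps on the non-smooth toy loss f(W) = sum |W - M|."""
    W = np.zeros_like(M)
    history = [np.abs(W - M).sum()]
    for t in range(steps):
        G = np.sign(W - M)                     # a subgradient of the l1 loss
        W = W - eta0 * decay**t * spectral_direction(G)
        history.append(np.abs(W - M).sum())
    return W, history

rng = np.random.default_rng(0)
# Toy target with entries bounded away from zero, so the first step crosses no kink.
M = rng.choice([-1.0, 1.0], size=(6, 4)) * (0.5 + rng.random((6, 4)))
W, history = spectral_descent_l1(M)
D = spectral_direction(M)
```

The inner product of a matrix with its own polar factor equals its nuclear norm, which is why this direction is the steepest one under a spectral-norm constraint on the step.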
Results
The paper presents rigorous global linear convergence results for SD and TSD under non-smooth conditions, as well as sublinear convergence guarantees for regularized variants. The empirical results demonstrate that Muon-type optimizers achieve faster convergence and improved robustness in training large language models compared to standard optimizers.
Implications
The findings suggest that Muon-type optimizers can be effectively utilized in training large-scale models, particularly in scenarios involving non-smooth optimization problems. This could lead to more efficient training processes and better generalization in various machine learning applications.
Time Series Causal Discovery via Context-Conditioned and Causality-Augmented Pretraining
Time Series
- PTCD is a pretraining framework that enhances causal discovery in time series data.
- It utilizes a dual-scale iterative attention mechanism to model complex temporal dependencies.
- The framework incorporates intervention-based learning to break spurious correlations.
- PTCD shows superior performance on multiple real-world out-of-distribution datasets.
Time Series Causal Discovery via Context-Conditioned and Causality-Augmented Pretraining
Summary
This paper addresses the challenge of causal discovery in time series data, which is essential for applications like anomaly detection and system failure analysis. The authors introduce PTCD, a novel pretraining framework that enhances cross-task generalization through context-conditioned modeling and transferable causal augmentation. PTCD employs a dual-scale iterative attention mechanism to capture complex temporal causal dependencies and a Gaussian mixture model with context-level routing to manage heterogeneous exogenous distributions. To tackle distribution shifts across causal graphs, PTCD utilizes a pretraining paradigm on synthetic datasets that incorporates intervention-based learning and a causal mixup strategy. This approach promotes stable causal discovery and improved generalization across various real-world out-of-distribution datasets. The experiments demonstrate that PTCD significantly outperforms existing methods in both causal discovery and root cause identification, showcasing its effectiveness in adapting to new time series governed by diverse causal mechanisms.
Methodology
The methodology includes a hierarchical context-conditioned modeling approach that captures both intra-window and inter-window dependencies in time series. It employs a dual-scale iterative attention mechanism for representation enhancement and a context-conditioned Gaussian mixture model for exogenous variable estimation. The pretraining strategy integrates intervention-based tasks and causal mixup to improve generalization and robustness against distribution shifts.
Results
Extensive experiments on various real-world datasets indicate that PTCD consistently outperforms existing causal discovery methods, demonstrating its effectiveness in accurately identifying causal relationships and root causes in time series data.
Implications
The findings suggest that PTCD can be applied in diverse fields requiring causal analysis from time series data, such as finance, environmental monitoring, and system diagnostics, enabling better decision-making and understanding of complex systems.
Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity
Reinforcement Learning
Federated Learning
Robotics
- Introduction of Personalized Observation Normalization (PON) for FedRL to address input distribution heterogeneity.
- Demonstration that shared normalization parameters are ineffective due to diverse local distributions.
- PON utilizes continuously updated statistics for local normalization, improving consistency across agents.
- Experimental results show PON accelerates training and outperforms baseline methods in heterogeneous environments.
Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity
Summary
This paper addresses the challenges faced by Federated Reinforcement Learning (FedRL) in heterogeneous environments, where agents experience diverse state-transition dynamics leading to non-identical input distributions and imbalanced parameter updates during aggregation. The authors propose a novel method called Personalized Observation Normalization (PON), which allows each agent to locally normalize raw state inputs using a continuously updated running mean and variance. This approach ensures consistent scaling of local features without overshadowing across agents during the aggregation process. The study demonstrates that sharing normalization parameters across agents is ineffective due to the diverse local input distributions, underscoring the necessity for personalized statistics. Experiments conducted on heterogeneous MuJoCo tasks reveal that PON significantly accelerates training and achieves superior performance compared to baseline methods, thereby enhancing the effectiveness of FedRL in practical applications.
Methodology
The authors developed the Personalized Observation Normalization (PON) method, which allows agents to locally normalize their state inputs using a continuously updated running mean and variance. This method was tested in heterogeneous environments using MuJoCo tasks, where agents interacted with distinct environments, and the performance was compared against baseline FedRL methods.
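The running mean and variance described above can be sketched as follows. This is a generic batch-merging update (the parallel form of Welford's algorithm), kept locally by each agent and never aggregated; the class name, the `eps` constant, and the batch interface are assumptions for this sketch rather than details from the paper.

```python
import numpy as np

class RunningNormalizer:
    """Per-agent running mean and variance, merged batch by batch."""

    def __init__(self, dim, eps=1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = 0
        self.eps = eps

    def update(self, batch):
        batch = np.atleast_2d(batch)
        b_mean, b_var, b_n = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        total = self.count + b_n
        delta = b_mean - self.mean
        # Combine the stored second moment with the batch's second moment.
        m2 = self.var * self.count + b_var * b_n + delta**2 * self.count * b_n / total
        self.mean = self.mean + delta * b_n / total
        self.var = m2 / total
        self.count = total

    def normalize(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + self.eps)

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=(1000, 3))   # one agent's raw observations
norm = RunningNormalizer(dim=3)
for chunk in np.array_split(data, 10):
    norm.update(chunk)
normalized = norm.normalize(data)
```

In a federated setup along these lines, each agent would hold its own `RunningNormalizer` and feed normalized observations to the shared policy, so that only policy parameters take part in aggregation.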
Results
The experimental results indicated that the PON method significantly accelerated the training process and achieved better performance metrics compared to traditional baseline methods in heterogeneous environments. This improvement was attributed to the effective handling of diverse input distributions through personalized normalization.
Implications
The findings suggest that personalized observation normalization can enhance the performance of federated reinforcement learning systems, making them more applicable in real-world scenarios where data privacy and heterogeneous environments are critical. This approach could be particularly beneficial in domains such as autonomous driving, IoT systems, and other applications requiring collaborative learning without data sharing.
Teacher-Student Representational Alignment for Reinforcement Learning-Driven Imitation Learning
Reinforcement Learning
Robotics
- Introduces a novel method to bridge the imitation gap between RL-based teacher and IL-based student policies.
- Utilizes a shared embedding space to hide private information from the teacher, enabling direct imitation by the student.
- Demonstrates improved student performance with reduced imitation gap across multiple challenging environments.
- The method is adaptable to existing frameworks with minimal hyperparameter tuning.
Teacher-Student Representational Alignment for Reinforcement Learning-Driven Imitation Learning
Summary
This paper addresses the challenge of the irreducible imitation gap that arises when a teacher policy, trained using reinforcement learning (RL), relies on privileged state information that a student policy cannot access. The authors propose a novel algorithm that learns a shared embedding space to hide agent-specific observations, allowing the teacher to produce imitable policies by construction. The approach utilizes self-supervised contrastive learning to train the shared embedding space in parallel with the teacher policy, ensuring that the teacher does not extract private information. Evaluations across various domains demonstrate that this method significantly enhances student performance while reducing the imitation gap compared to state-of-the-art baselines. The proposed method is task-agnostic, requires no modifications to the reward function, and shows consistent improvements in environments designed to expose the imitation gap.
Methodology
The authors propose a combined reinforcement learning and imitation learning approach that learns a low-dimensional common representation space. This space retains only shared observations between the teacher and student while removing agent-specific state variables. The teacher policy is trained using this representation, which is updated through self-supervised contrastive learning, ensuring that the teacher's gradients do not exploit private information. The training alternates between the embedding space and policy learning, incorporating alignment and stability losses to enhance representational similarity.
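The summary does not give the exact contrastive objective, so the sketch below shows a generic InfoNCE loss as one plausible form: the teacher and student embeddings of the same state are treated as a positive pair, and all other pairings in the batch act as negatives. The function name, the temperature value, and the use of cosine similarity are assumptions; the paper's alignment and stability losses are not reproduced.

```python
import numpy as np

def info_nce(z_teacher, z_student, temperature=0.1):
    """InfoNCE loss where row i of z_teacher should match row i of z_student."""
    zt = z_teacher / np.linalg.norm(z_teacher, axis=1, keepdims=True)
    zs = z_student / np.linalg.norm(z_student, axis=1, keepdims=True)
    logits = zt @ zs.T / temperature                      # (n, n) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # positives sit on the diagonal

z = np.eye(4)
aligned_loss = info_nce(z, z)                             # matching pairs
shuffled_loss = info_nce(z, np.roll(z, 1, axis=0))        # mismatched pairs
```

The loss is low when each teacher embedding is closest to its own student embedding and high when the pairing is scrambled, which is the signal that pulls the two views toward a shared space.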
Results
The proposed method consistently outperformed strong baseline methods in terms of student performance across various environments designed to expose the imitation gap. The evaluations showed that the student policies distilled from the teacher using the shared embedding space were more effective than those trained with traditional methods, demonstrating a significant reduction in the imitation gap.
Implications
This work has significant implications for robotics and other domains where imitation learning is critical. By enabling the direct imitation of teacher policies without requiring extensive RL fine-tuning, the proposed method can streamline the training process and improve the efficiency of learning in high-dimensional observation spaces.
Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection
Time Series
- CoAD unifies classification and reconstruction methods to enhance time series anomaly detection.
- The framework addresses limitations of existing OE and MAE methods, improving generalization and masking accuracy.
- CoAD employs probability-informed soft masking to better identify subtle anomalies.
- Extensive experiments show CoAD outperforms state-of-the-art methods in accuracy and efficiency.
Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection
Summary
This paper addresses the challenges in time series anomaly detection (TSAD) by proposing a novel framework called CoAD, which integrates classification and reconstruction methods to enhance detection capabilities. Traditional deep learning methods for TSAD have been criticized for their inability to effectively identify subtle and prolonged anomalies. The authors identify limitations in existing Outlier Exposure (OE) and Masked Autoencoder (MAE) approaches, such as poor generalization in OE methods and masking misalignment in MAE methods. CoAD leverages the strengths of both paradigms by using a classification module to generate probability-informed soft masks for the reconstruction module. This cooperative design allows CoAD to better detect complex anomalies while addressing issues related to classification granularity and frequency information. The framework is validated through extensive experiments on benchmark datasets, demonstrating significant improvements over state-of-the-art methods in both accuracy and speed, making it suitable for large-scale, real-time applications.
Methodology
The CoAD framework consists of a classification module that generates soft masks based on anomaly probabilities, which are then used by a reconstruction module to improve anomaly detection accuracy. This approach mitigates the generalization issues of classification methods and the masking misalignment problems of reconstruction methods.
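To make the idea of a probability-informed soft mask concrete, here is a toy sketch under strong assumptions: each time step is blended toward a mask value in proportion to its anomaly probability, and a centered moving average stands in for the learned reconstruction module. The blending rule, the moving-average reconstructor, and the function names are all hypothetical and are not taken from the paper.

```python
import numpy as np

def soft_mask(x, anomaly_prob, mask_value=0.0):
    """Blend each time step toward mask_value in proportion to its anomaly probability."""
    p = np.clip(anomaly_prob, 0.0, 1.0)
    return (1.0 - p) * x + p * mask_value

def moving_average(x, window=5):
    """Stand-in reconstruction model: a centered moving average."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")

def anomaly_score(x, anomaly_prob):
    """Absolute error between the series and its reconstruction from the masked input."""
    return np.abs(x - moving_average(soft_mask(x, anomaly_prob)))

x = np.zeros(20)
x[10] = 5.0                      # a single injected spike
prob = np.zeros(20)
prob[10] = 1.0                   # the classifier is assumed to flag the spike
score = anomaly_score(x, prob)
```

Because the flagged point is masked before reconstruction, the reconstructor cannot simply copy the spike, so the reconstruction error concentrates on the anomalous step.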
Results
CoAD significantly outperformed both traditional data mining methods and state-of-the-art deep learning approaches in time series anomaly detection tasks. The framework demonstrated improved detection of subtle and prolonged anomalies and was found to be faster and more efficient, making it practical for real-time applications.
Implications
The findings suggest that integrating classification and reconstruction methods can lead to more effective anomaly detection in time series data, with potential applications in various fields such as finance, healthcare, and industrial monitoring where timely anomaly detection is critical.
Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference
Generative Models
Reinforcement Learning
Robotics
- FAV is a general alignment framework that does not rely on restrictive assumptions about generative models.
- The method utilizes Stein Variational Gradient Descent for sample-based variational inference.
- FAV outperforms existing policy extraction methods in robotics manipulation tasks.
- The framework successfully fine-tunes various few-step generative models for image generation.
Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference
Summary
The paper addresses the challenge of aligning few-step generative models, which often do not conform to the restrictive assumptions of existing alignment frameworks such as tractable likelihoods or specific solvers. The authors introduce FAV (Few-step Generative Models Alignment via Sample-based Variational Inference), a novel framework that requires only sample access to both the generator and the reference distribution. FAV formulates the alignment problem as sampling from a reward-tilted distribution, leveraging Stein Variational Gradient Descent (SVGD) for sample-based variational inference. The method amortizes the updates into the generator parameters through fixed-point regression, allowing it to fine-tune various few-step generators, including GANs and VAEs, without needing specific model families or sampling dynamics. The authors evaluate FAV in two domains: robotics manipulation and image generator alignment, demonstrating its effectiveness and versatility across different tasks and models.
Methodology
FAV formulates the alignment as sampling from a reward-tilted distribution, approximating the target distribution using Stein Variational Gradient Descent. The updates from SVGD are amortized into the generator parameters via fixed-point regression, and the score of the reference distribution is estimated nonparametrically using kernel density estimation.
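A small sketch of the SVGD update may help here. It moves one-dimensional particles toward a reward-tilted target, assumed for this example to have the form p*(x) proportional to p_ref(x) exp(r(x) / beta), with a standard normal reference and the linear reward r(x) = 2x, so the target is a unit-variance Gaussian centered at 2 when beta = 1. The tilted form, the analytic score (the paper estimates the reference score with kernel density estimation), the fixed bandwidth, and the function names are assumptions; the amortization into generator parameters is not shown.

```python
import numpy as np

def svgd_step(x, score_fn, step=0.1, bandwidth=1.0):
    """One SVGD update for 1-D particles x with an RBF kernel."""
    diff = x[:, None] - x[None, :]                  # diff[i, j] = x_i - x_j
    k = np.exp(-diff**2 / (2 * bandwidth**2))       # k[i, j] = k(x_i, x_j)
    attract = k @ score_fn(x)                       # kernel-weighted scores
    repulse = (diff * k).sum(axis=1) / bandwidth**2 # gradient of k w.r.t. x_j, summed over j
    return x + step * (attract + repulse) / len(x)

def tilted_score(x, beta=1.0):
    """Score of the assumed target p_ref(x) * exp(r(x) / beta)."""
    ref_score = -x                                  # d/dx log N(x; 0, 1)
    reward_grad = 2.0 * np.ones_like(x)             # d/dx r(x) for r(x) = 2x
    return ref_score + reward_grad / beta

rng = np.random.default_rng(0)
particles = rng.standard_normal(30)                 # samples from the reference
for _ in range(500):
    particles = svgd_step(particles, tilted_score)
```

The attraction term pulls particles toward high-density regions of the tilted target, while the repulsion term keeps them spread out instead of collapsing onto the mode.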
Results
FAV achieved state-of-the-art performance on 56 offline reinforcement learning tasks and 30 offline-to-online tasks, surpassing traditional policy extraction methods. In image generation, FAV effectively fine-tuned various few-step generators, improving the quality of generated images across different resolutions.
Implications
The proposed framework has significant implications for improving the alignment of generative models in various applications, including robotics and image synthesis. By decoupling the alignment process from specific model assumptions, FAV can enhance the performance and reliability of generative models in real-world tasks.
The Role of Causal Features in Strategic Classification for Robustness and Alignment
Theory
- Causal classifiers can achieve optimal classification error after sufficient user adaptation.
- Out-of-distribution risk can be decomposed into bias and feature utilization, highlighting the advantages of causal classifiers.
- Causal features can align long-term incentives between institutions and users, potentially improving user outcomes.
- Empirical results validate theoretical predictions regarding user behavior in strategic classification contexts.
The Role of Causal Features in Strategic Classification for Robustness and Alignment
Summary
This paper explores the intersection of causal features and strategic classification, where institutions anticipate user adaptations to maximize utility in classification tasks. The authors establish that causal models can effectively manage distribution shifts caused by user adaptations, leading to improved robustness and alignment of incentives between institutions and users. Key findings include the ability of causal classifiers to achieve optimal classification error after sufficient adaptation, the decomposition of out-of-distribution (OOD) risk into bias and feature utilization terms, and the potential for causal features to align long-term incentives, contrasting with previous literature that emphasizes social costs. Empirical validation on synthetic data supports the theoretical claims, demonstrating that causal classifiers can maintain accuracy while fostering beneficial interactions between institutions and agents.
Methodology
The authors utilize causal models to analyze the impact of user adaptations on classification performance. They establish theoretical results linking causal features to optimal classification error and OOD risk, and conduct empirical validation using synthetic data to demonstrate the practical implications of their findings.
Results
The study shows that causal classifiers outperform traditional classifiers in terms of robustness to user adaptations, achieving optimal classification error under certain conditions. The decomposition of OOD risk reveals that causal features can mitigate the adverse effects of spurious correlations, leading to better alignment of incentives between institutions and users.
Implications
The findings suggest that incorporating causal features into strategic classification systems can enhance decision-making processes in various domains, such as finance and healthcare, where user behavior significantly influences outcomes. This approach may lead to more equitable and effective classification systems that consider long-term user welfare.
Adversarial Dual On-Policy Distillation from Expressive Flow-based Teacher
Reinforcement Learning
Robotics
Generative Models
- FA-OPD combines adversarial dual on-policy distillation with a Flow Matching teacher for improved learning from demonstrations.
- The method provides two complementary supervision channels: a reward channel for exploration and an action channel for stabilization.
- FA-OPD shows significant improvements over traditional behavioral cloning and existing on-policy distillation methods.
- The approach is validated across multiple robotic tasks, showcasing robustness against noisy and limited demonstrations.
Adversarial Dual On-Policy Distillation from Expressive Flow-based Teacher
Summary
This paper introduces FA-OPD, an innovative method for on-policy distillation in the context of learning from demonstrations in embodied control tasks. Traditional behavioral cloning approaches, while effective, suffer from exposure bias as they rely solely on expert demonstrations without corrective feedback during deployment. FA-OPD addresses this limitation by employing a Flow Matching (FM) teacher that is co-trained with a lightweight MLP student. The FM teacher provides two distinct channels of supervision: a reward channel that evaluates the expert-likeness of the student's actions and encourages exploration, and an action channel that offers dense local targets for stabilization. This dual approach allows the student to generalize beyond the specific demonstrations while maintaining a focus on expert-like behavior. The authors validate their method across six benchmarks in robot navigation, manipulation, and locomotion, demonstrating that FA-OPD outperforms strong baselines and exhibits enhanced robustness in scenarios with noisy or limited demonstrations.
Methodology
FA-OPD utilizes a Flow Matching teacher that is learned from demonstrations and co-trained with a lightweight MLP student. The teacher provides two types of feedback: a reward signal based on expert-likeness for exploration and action targets for stabilization. This dual feedback mechanism is integrated into a standard Proximal Policy Optimization (PPO) framework, allowing for efficient training and deployment.
Results
The experimental results indicate that FA-OPD consistently outperforms strong baseline methods across six different robotic tasks, demonstrating superior performance and robustness, particularly in environments with noisy or limited demonstrations.
Implications
The proposed method has significant implications for improving learning from demonstrations in robotics, particularly in scenarios where expert demonstrations are scarce or noisy. It opens avenues for more effective training of robotic systems in real-world applications, enhancing their adaptability and performance.
Resource-Constrained Affect Modelling via Variance Regularisation Pruning
Efficient ML
- Introduces Variance-Regularised Pruning (VR) to enhance model robustness in affective computing.
- Evaluates model parameters based on their contribution to both accuracy and variability across users.
- Achieves up to 80% sparsity while maintaining near-baseline performance on the AGAIN dataset.
- Addresses the need for efficient and reliable affective models in resource-constrained environments.
Resource-Constrained Affect Modelling via Variance Regularisation Pruning
Summary
This paper addresses the challenge of developing affective computing systems that can operate efficiently in resource-constrained environments while maintaining reliable predictions across diverse users. The authors introduce a novel pruning framework called Variance-Regularised Pruning (VR), which enhances traditional model pruning techniques by incorporating cross-participant stability into the sparsification process. Unlike existing methods that focus solely on sparsity or average prediction accuracy, VR evaluates the contribution of each model parameter based on its joint impact on both prediction accuracy and variability across users. This approach is particularly relevant in affective computing, where individual differences in expressivity and situational factors can lead to high variability in model performance. The authors validate their method using the AGAIN dataset, which features arousal annotations from various affect-eliciting game environments. Experimental results indicate that VR can achieve up to 80% sparsity while maintaining competitive performance as measured by the Concordance Correlation Coefficient (CCC), demonstrating its potential for real-world applications in interactive and adaptive systems.
Methodology
The authors propose a post-training pruning framework that evaluates model parameters based on their joint contribution to prediction accuracy and variability across participants. This method does not require architectural changes or retraining, allowing for efficient model compression while maintaining robustness.
Results
The experimental evaluation on the AGAIN dataset shows that the VR approach can maintain competitive CCC performance even at 80% sparsity, indicating that the method effectively balances model efficiency with reliability across diverse users.
Implications
The findings suggest that VR can facilitate the deployment of affective computing systems in real-world interactive environments, such as adaptive games and assistive technologies, where both computational efficiency and user variability are critical considerations.
When Interpretability Is Unequally Distributed: Fairness in Hybrid Interpretable Models
Interpretability
- Introduction of Interpretability Coverage Disparity (ICD) as a measure of fairness in hybrid interpretable models.
- Demonstration of significant disparities in interpretability allocation across demographic groups.
- Development of methods to mitigate ICD with minimal impact on model performance.
- Highlighting the importance of auditing models for both predictive and interpretability fairness.
When Interpretability Is Unequally Distributed: Fairness in Hybrid Interpretable Models
Summary
This paper addresses the fairness concerns associated with hybrid interpretable models, which combine transparent components with black-box models. The authors introduce the concept of Interpretability Coverage Disparity (ICD), which measures the unequal distribution of interpretability across demographic groups due to the model's routing decisions. Through extensive experiments on four hybrid interpretable learning methods and three standard fairness benchmark datasets, the authors demonstrate that significant disparities in interpretability can arise, particularly in intermediate transparency regimes. They find that certain demographic groups may be disproportionately assigned to the black-box component, leading to inequitable access to interpretable decisions. The paper also explores methods to mitigate ICD by implementing coverage-disparity constraints, which can reduce disparities without significantly impacting model accuracy or sparsity. The findings emphasize the need for auditing hybrid interpretable models not only for predictive fairness but also for their interpretability distribution across different individuals and groups.
Methodology
The authors formalize ICD and utilize predictive multiplicity tools to analyze interpretability distribution across various hybrid interpretable learning methods. They conduct experiments on real-world datasets to assess the extent of ICD and implement coverage-disparity constraints to mitigate the identified disparities.
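The formal definition of ICD is not given in this summary. One simple reading, sketched below as an assumption, is the largest gap between groups in the fraction of samples routed to the transparent component; the function names and the max-minus-min form are hypothetical.

```python
import numpy as np

def coverage_by_group(routed_to_interpretable, group):
    """Fraction of each group's samples handled by the transparent component."""
    routed = np.asarray(routed_to_interpretable, dtype=bool)
    group = np.asarray(group)
    return {g: float(routed[group == g].mean()) for g in np.unique(group).tolist()}

def coverage_disparity(routed_to_interpretable, group):
    """Largest gap in interpretable coverage between any two groups."""
    cov = coverage_by_group(routed_to_interpretable, group)
    return max(cov.values()) - min(cov.values())

# Toy routing decisions: 1 = interpretable component, 0 = black box.
routed = [1, 1, 1, 0, 1, 0, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]
coverage = coverage_by_group(routed, group)
disparity = coverage_disparity(routed, group)
```

In this toy example group "a" receives interpretable decisions for three quarters of its samples and group "b" for one quarter, so the disparity is 0.5. A constraint on this gap is one way a coverage-disparity constraint could be expressed.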
Results
The experiments reveal substantial ICD in hybrid interpretable models, particularly in scenarios where both interpretable and black-box components are utilized. The introduction of coverage-disparity constraints effectively reduces ICD while maintaining model accuracy and sparsity. In some cases, mitigating ICD also enhances standard algorithmic fairness metrics.
Implications
The findings suggest that hybrid interpretable models should be evaluated for their interpretability distribution, not just their predictive performance. This has implications for regulatory compliance in high-stakes applications, ensuring that all demographic groups have equitable access to interpretable AI decisions.
Single-Rollout Hidden-State Dynamics for Training-Free RLVR Data Selection
Reinforcement Learning
Large Language Models
Efficient ML
- SHIFT enables one-shot, training-free, and label-free data selection for RLVR.
- Utilizes Reasoning-Induced Representation Shift (RIRS) as a proxy for instance utility.
- Achieves better performance than existing training-free diversity and uncertainty-based selection methods.
- Demonstrates effectiveness in ultra-low budget scenarios across various benchmarks.
Summary
This paper addresses the challenge of data selection in Reinforcement Learning with Verifiable Rewards (RLVR), which is sensitive to the choice of training instances. Traditional methods rely on training-time optimization signals and require access to verifiable rewards, which can be impractical in specialized domains. The authors introduce SHIFT, a training-free data selection method that operates before any RL training and does not require labels or reward evaluations on the entire candidate pool. SHIFT utilizes a single deterministic reasoning rollout to compute a Reasoning-Induced Representation Shift (RIRS), which serves as a proxy for instance utility. By applying a farthest-first CoreSet strategy in an RIRS-augmented feature space, SHIFT selects a compact subset of instances that maximizes informativeness and coverage. The method demonstrates superior performance across mathematical reasoning and medical QA benchmarks, outperforming existing training-free baselines and showing improved in-domain accuracy and transferability to more challenging evaluation settings. The findings suggest that RIRS-based coverage and quality-weighting provide complementary benefits, and the results are not merely explained by input/output length statistics.
Methodology
The proposed method, SHIFT, runs a single deterministic reasoning rollout for each candidate instance to compute the RIRS, which captures the start-to-end hidden-state delta. It then selects a compact subset of instances by optimizing for informativeness (using RIRS magnitude) and coverage (using a farthest-first CoreSet strategy in an RIRS-augmented feature space). This approach allows for efficient selection without requiring training-time signals or ground-truth rewards.
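The selection step can be sketched roughly as follows, assuming the hidden states have already been extracted. The start-to-end delta and the farthest-first rule come from the description above; seeding the selection with the largest-magnitude instance, concatenating the delta onto the input embedding, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math

def rirs(h_start, h_end):
    """Start-to-end hidden-state delta for one reasoning rollout."""
    return [e - s for s, e in zip(h_start, h_end)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def farthest_first(features, scores, k):
    """Greedy k-center (CoreSet) selection, seeded with the highest-scoring item."""
    n = len(features)
    first = max(range(n), key=lambda i: scores[i])
    selected = [first]
    # Distance from every point to its nearest selected point.
    min_dist = [math.dist(f, features[first]) for f in features]
    while len(selected) < min(k, n):
        rest = [i for i in range(n) if i not in selected]
        nxt = max(rest, key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, math.dist(f, features[nxt]))
                    for d, f in zip(min_dist, features)]
    return selected

# Toy data: 2-D input embeddings and 1-D hidden states (both made up).
base = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
starts = [[0.0], [0.0], [0.0], [0.0]]
ends = [[1.0], [0.2], [3.0], [0.1]]

deltas = [rirs(s, e) for s, e in zip(starts, ends)]
features = [b + d for b, d in zip(base, deltas)]  # RIRS-augmented features
scores = [norm(d) for d in deltas]                # RIRS magnitude
picked = farthest_first(features, scores, 2)
print(picked)
```

The first pick is the instance with the largest shift; the second is the point farthest from it in the augmented space, which is how the coverage term spreads the budget across the pool.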
Results
SHIFT consistently outperforms training-free diversity and difficulty/uncertainty baselines across mathematical reasoning and medical QA benchmarks. It improves both in-domain accuracy and transfer to more challenging evaluation settings, demonstrating the effectiveness of RIRS-based coverage and quality-weighting. Ablation studies confirm that these components contribute complementary gains to the overall performance.
Implications
The findings suggest that SHIFT can significantly reduce the overhead associated with reward annotation and training in RLVR, making it more accessible for low-resource settings. This approach could be applied in various domains where obtaining labeled data is costly or impractical, such as medical QA and specialized scientific reasoning.
The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution
Efficient ML
Theory
- NVIDIA's GB10 hardware lacks the necessary interfaces for process-level energy attribution.
- Agentic AI workloads consume significantly more energy than linear workflows, with orchestration structure being a major factor.
- The absence of CPU energy counters on ARM-based systems limits reproducibility in energy measurement.
- The paper proposes a hardware specification for energy attribution and interim calibration methods.
Summary
This paper investigates the limitations of NVIDIA's GB10-based edge AI hardware, specifically the ASUS Ascent GX10, in supporting process-level energy attribution for agentic AI workloads. The authors highlight that agentic AI, which involves multi-step orchestration and tool calls, significantly increases energy consumption compared to linear workflows. Their systematic audit reveals that the ASUS Ascent GX10 lacks essential energy measurement interfaces, such as CPU energy counters and power monitoring protocols, which are crucial for accurate energy attribution. The absence of these capabilities impedes the ability to measure orchestration overhead, which is a significant contributor to energy costs in agentic AI systems. The paper proposes a hardware requirements specification for energy-attributed AI and suggests interim solutions using external metering. The findings emphasize the need for the low-carbon computing community to advocate for improved energy observability in hardware design.
Methodology
The authors conducted a systematic hardware audit of the ASUS Ascent GX10, examining all known energy measurement interfaces available on the platform. They evaluated the presence of standardized energy measurement protocols and assessed the implications of their absence on energy attribution for agentic AI workloads.
Results
The audit revealed that the ASUS Ascent GX10 does not support any per-process energy attribution due to the lack of CPU energy measurement interfaces. The only available telemetry is for GPU power, which limits the ability to measure orchestration overhead accurately. The findings indicate that the ARM edge AI ecosystem is generally lacking in energy observability compared to x86 systems.
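To make the limitation concrete, here is a minimal sketch of what external metering combined with GPU-only telemetry can and cannot provide. It integrates sampled power into energy and subtracts the GPU share from the wall-meter total. The sample values and function name are invented for this example, and the resulting remainder covers everything that is not the GPU (CPU, memory, conversion losses), so it still cannot be split per process.

```python
def energy_joules(samples):
    """Trapezoidal integral of (timestamp_seconds, power_watts) samples."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

# Made-up readings over a two-second window.
wall = [(0.0, 60.0), (1.0, 80.0), (2.0, 100.0)]  # external wall meter
gpu = [(0.0, 20.0), (1.0, 30.0), (2.0, 40.0)]    # GPU power telemetry

wall_j = energy_joules(wall)
gpu_j = energy_joules(gpu)
non_gpu_j = wall_j - gpu_j  # lumped remainder, not attributable to a process
print(wall_j, gpu_j, non_gpu_j)
```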
Implications
The findings suggest that without proper energy measurement capabilities, compliance with emerging regulations on energy consumption in AI systems will be challenging. The paper calls for hardware manufacturers to prioritize energy observability in the design of edge AI systems to support sustainable computing practices.
Trust Region Q Adjoint Matching
Reinforcement Learning
Optimization
Robotics
- Introduces TRQAM, which stabilizes off-policy RL by controlling deviations from pretrained policies.
- Proves that the path-space KL divergence can be expressed as a function of the trust-region parameter λ.
- Demonstrates that TRQAM outperforms existing methods in offline RL and offline-to-online RL settings.
- Identifies the amplification of critic errors as a critical issue in existing adjoint matching methods.
Summary
The paper addresses the challenges of off-policy reinforcement learning (RL) using pretrained flow policies, particularly the instability caused by multi-step sampling processes. It introduces Trust Region Q-Adjoint Matching (TRQAM), a novel algorithm that enhances stability in fine-tuning pretrained policies by incorporating a trust-region parameter into the stochastic optimal control (SOC) dynamics. This parameter, denoted as λ, is optimized through projected dual descent to maintain a controlled deviation from the pretrained policy, thereby preventing destructive drift caused by critic errors. The authors prove that the path-space Kullback-Leibler (KL) divergence can be expressed as a closed-form function of λ, allowing for precise control over policy deviations. Experimental results on 50 OGBench tasks demonstrate that TRQAM significantly outperforms existing methods, achieving a success rate of 68% in offline RL, compared to the strongest baseline at 46%. This work not only highlights the fragility of critic-guided improvements but also provides a robust framework for stable off-policy RL fine-tuning.
Methodology
The methodology involves the formulation of a trust-region parameter λ within the SOC dynamics of Q-learning with Adjoint Matching (QAM). The authors utilize projected dual descent to adaptively optimize λ, ensuring that the KL divergence between the fine-tuned and pretrained policies remains within a specified bound. This approach contrasts with conventional methods that apply KL regularization at the loss level, which can lead to instability.
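A generic projected update on a dual variable might look like the sketch below. It is not the paper's algorithm: it treats λ as a non-negative multiplier for the constraint KL ≤ bound, assumes a larger λ pulls the policy closer to the pretrained one, and uses a hypothetical stand-in `toy_kl` in place of the closed-form KL expression, which this summary does not give. Sign conventions vary between formulations.

```python
def dual_step(lam, kl_value, kl_bound, step_size):
    """One step on the multiplier for KL <= kl_bound, projected onto lam >= 0."""
    return max(0.0, lam + step_size * (kl_value - kl_bound))

def toy_kl(lam):
    # Hypothetical stand-in: any KL that shrinks as lam grows would do here.
    return 1.0 / (1.0 + lam) ** 2

lam = 0.0
for _ in range(200):
    lam = dual_step(lam, toy_kl(lam), kl_bound=0.25, step_size=0.5)

print(lam, toy_kl(lam))
```

When the KL exceeds the bound the multiplier grows and tightens the trust region; when the KL is below the bound it shrinks, and the projection keeps it from going negative. In this toy setup the iteration settles where the KL equals the bound.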
Results
TRQAM was tested on 50 OGBench tasks, where it consistently outperformed prior methods in both offline RL and offline-to-online RL scenarios. The algorithm achieved an overall success rate of 68% in offline RL, significantly surpassing the strongest baseline performance of 46%. The results indicate that TRQAM effectively mitigates the risks associated with critic-induced instability.
Implications
The findings suggest that TRQAM can be a valuable tool for practitioners in reinforcement learning, particularly in scenarios where pretrained policies are utilized. Its ability to maintain stability during fine-tuning could lead to more reliable and effective applications in various domains, including robotics and autonomous systems, where robust decision-making is critical.
HRVConformer: Neonatal Hypoxic-Ischemic Encephalopathy Classification from the Heart Rate signals
Time Series
- HRVConformer processes raw heart rate signals directly, eliminating the need for handcrafted features.
- The architecture combines convolutional layers and Transformer-based attention mechanisms for improved classification performance.
- The model achieved an AUC of 83.23% and accuracy of 74.56%, surpassing several baseline models.
- The improved Pan-Tompkins algorithm enhanced the quality of heart rate signal extraction from ECG recordings.
Summary
The paper introduces HRVConformer, a novel deep learning architecture designed for classifying hypoxic-ischemic encephalopathy (HIE) using instantaneous heart rate (HR) signals. Unlike traditional methods that depend on handcrafted features, HRVConformer processes raw HR signals in an end-to-end fashion, effectively capturing both local and long-range dependencies through a hybrid Convolution-Transformer framework. This architecture combines convolutional layers for local feature extraction with Transformer-based attention mechanisms for global context modeling, enhancing signal representation and classification performance. The model was trained on a substantial dataset of 1,573 one-hour epochs, including 259 expert-annotated epochs and a large set of weakly labeled data. A 314-hour validation set was utilized for performance estimation, while an independent 215-hour dataset with expert annotations was reserved for final testing. The heart rate signals were extracted from ECG recordings using an improved Pan-Tompkins algorithm, which significantly improved signal quality and data availability. Experimental results showed that HRVConformer achieved an AUC of 83.23% and an accuracy of 74.56% on the test set, outperforming baseline models such as Transformer, ResNet50, and fully convolutional networks. This work highlights the potential of integrating convolutional and Transformer components for more accurate and automated HIE classification using HR signals.
Methodology
The HRVConformer architecture employs a hybrid approach that integrates convolutional layers for local feature extraction and Transformer-based attention mechanisms for global context modeling. The model was trained using supervised learning on a large dataset of heart rate signals, with an emphasis on improving signal extraction through an enhanced Pan-Tompkins algorithm.
Results
HRVConformer achieved an AUC of 83.23% and an accuracy of 74.56% on the independent test set, outperforming baseline models such as Transformer, ResNet50, and fully convolutional networks, demonstrating the effectiveness of the hybrid architecture in classifying HIE from heart rate signals.
Implications
The proposed method offers a more accurate and automated approach for assessing hypoxic-ischemic encephalopathy in neonates, potentially leading to improved early diagnosis and treatment strategies. The integration of heart rate variability analysis into clinical practice could facilitate personalized treatment plans and better long-term outcomes for affected infants.
Revisiting Metafeatures to Explain Model Differences on Tabular Data
Interpretability
- Dataset meta-features do not robustly explain performance gaps between neural networks and tree-based models.
- One weak association is found for non-foundation vs. foundation model gaps, which does not generalize well.
- A robust association between TabICLv2 and TabPFN-2.6 improves held-out predictions.
- Meta-feature predictors do not significantly outperform simple baseline models.
Summary
This paper investigates the role of dataset meta-features in explaining performance differences between various model families on tabular data tasks. With the emergence of tabular foundation models, the challenge of selecting the most suitable model for a given dataset has intensified. The authors utilize the TabArena benchmark to analyze performance gaps across 51 datasets and correlate them with model-agnostic dataset descriptors. Through rigorous statistical testing, the study reveals that no meta-features robustly explain the performance gap between neural networks and tree-based models. For the comparison between non-foundation and foundation models, a single association is identified, but it fails to generalize in leave-one-dataset-out predictions. Additionally, one robust association between TabICLv2 and TabPFN-2.6 improves held-out predictions. The findings indicate that while meta-features may provide some explanatory power, they do not significantly enhance predictive capabilities over simple baseline models, highlighting the heterogeneous nature of tabular datasets and the limitations of global meta-feature approaches.
Methodology
The authors employed a systematic approach using the TabArena benchmark, which includes a controlled evaluation setup with predefined train/validation/test splits. They analyzed dataset-level performance gaps by comparing the best-performing models from different families based on validation errors and conducted statistical tests to assess the significance of associations between meta-features and performance differences.
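The leave-one-dataset-out comparison against a simple baseline can be sketched as follows. The single-feature linear predictor, the mean baseline, and the synthetic numbers are assumptions chosen for illustration; they are not the paper's predictors or data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx

def lodo_mae(xs, ys, use_feature):
    """Mean absolute error when each dataset is held out in turn."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if use_feature:
            slope, intercept = fit_line(train_x, train_y)
            pred = slope * xs[i] + intercept
        else:
            pred = sum(train_y) / len(train_y)  # baseline: mean gap of the rest
        errors.append(abs(pred - ys[i]))
    return sum(errors) / len(errors)

# Synthetic example: one meta-feature value and one performance gap per dataset.
meta_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
perf_gap = [2.0, 4.0, 6.0, 8.0, 10.0]

feature_mae = lodo_mae(meta_feature, perf_gap, use_feature=True)
baseline_mae = lodo_mae(meta_feature, perf_gap, use_feature=False)
print(feature_mae, baseline_mae)
```

The synthetic gap here is perfectly linear in the meta-feature, so the feature-based predictor wins by construction. The paper's finding is that on real benchmarks this advantage mostly fails to appear.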
Results
The analysis revealed that for the neural network versus tree-based model comparison, no robust meta-feature associations were found. For the non-foundation versus foundation model comparison, only one weak association was identified, which did not generalize well in leave-one-dataset-out tests. Conversely, a robust association between TabICLv2 and TabPFN-2.6 was found, which improved held-out predictions.
Implications
The findings suggest that relying on global meta-features for model selection in tabular data tasks may be ineffective due to the diversity of datasets. This has implications for practitioners in the field, indicating that more nuanced approaches may be necessary for model selection and that further research is needed to develop better predictive frameworks.
Learning Energy-Based Models from Stochastic Interpolants using Spatiotemporal Differences
Generative Models
Theory
Efficient ML
- Introduces a unifying taxonomy for energy-based model training methods based on spatial and temporal variations.
- Identifies limitations of existing temporal and spatial methods, particularly in multi-modal settings and support mismatch.
- Proposes Spatiotemporal Noise-Contrastive Estimation (stNCE) as a solution to jointly learn spatial and temporal differences.
- Demonstrates that stNCE leads to new training objectives that outperform existing methods on various benchmarks.
Summary
This paper addresses the challenge of learning energy-based models (EBMs) from data samples, focusing on the limitations of existing methods that utilize stochastic interpolants to corrupt data samples at various noise levels. The authors identify distinct failure modes in both temporal and spatial methods for estimating density variations. To overcome these limitations, they propose a novel framework called Spatiotemporal Noise-Contrastive Estimation (stNCE), which jointly learns spatial and temporal variations. This framework not only unifies existing methods but also introduces new training objectives that enhance performance. The experiments conducted on images and molecular data demonstrate that stNCE achieves competitive results compared to state-of-the-art density estimation techniques, providing a clearer understanding of the conditions under which these methods succeed.
Methodology
The authors develop a framework called Spatiotemporal Noise-Contrastive Estimation (stNCE) that combines spatial and temporal learning approaches. This framework is based on the idea of jointly estimating variations in density across both data points and time, addressing the shortcomings of existing methods that focus solely on one aspect. The methodology involves defining a joint density over data and time, and training the model to learn this density through a binary classification task.
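For readers unfamiliar with density estimation by classification, the sketch below shows the standard noise-contrastive loss that this family of methods builds on. It is plain NCE, not the spatiotemporal objective proposed in the paper, and the `log_ratio` callable (log model density minus log noise density) is a placeholder assumed for this example.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def nce_loss(log_ratio, data_points, noise_points):
    """Binary cross-entropy whose logit is log p_model(x) - log p_noise(x)."""
    pos = sum(log_sigmoid(log_ratio(x)) for x in data_points)
    neg = sum(log_sigmoid(-log_ratio(x)) for x in noise_points)
    return -(pos + neg) / (len(data_points) + len(noise_points))

data = [0.1, 0.2]
noise = [0.3]

# A classifier that cannot tell data from noise sits at log(2).
chance = nce_loss(lambda x: 0.0, data, noise)
# A classifier that separates the two toy sets gets a lower loss.
better = nce_loss(lambda x: 2.0 if x < 0.25 else -2.0, data, noise)
print(chance, better)
```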
Results
The experiments show that the stNCE framework achieves performance that is competitive with state-of-the-art density estimation methods across synthetic, image, and molecular benchmarks. The results indicate that the joint learning of spatial and temporal differences significantly reduces estimation errors compared to traditional methods.
Implications
The findings suggest that a unified approach to learning energy-based models can enhance their applicability in various fields, including computational chemistry and generative modeling. By addressing the limitations of existing methods, stNCE could lead to more robust and efficient density estimation techniques, facilitating better sampling and alignment with human preferences in AI systems.
Kan Extension Transformers: A Categorical Unification of Attention, Diffusion, and Predict-Detach Self-Conditioning
NLP
Large Language Models
Theory
- KETs provide a categorical framework that unifies various Transformer implementations.
- The use of detached predictive carriers allows for effective self-conditioning without leaking future information.
- Quadratic KET outperforms other causal architectures on larger datasets like WikiText-2 and WikiText-103.
- The predict-detach regime yields the most substantial performance gains across all datasets.
Summary
This paper introduces Kan Extension Transformers (KETs), a novel categorical framework that unifies various Transformer architectures by viewing Transformer layers as weighted structured extension operators. The author argues that standard attention represents a singleton-neighborhood case, while KETs operate in a higher-order simplicial context, bridging to diffusion-style completion. The paper emphasizes the importance of using detached predictive carriers instead of teacher-forced hidden states to avoid leaking future information, thus enabling a valid self-conditioning mechanism. The experimental validation includes 12 different Transformer implementations across datasets like Penn Treebank, WikiText-2, and WikiText-103, demonstrating that while quadratic KET performs strongly in strict-causal settings, the most significant performance improvements arise from the predict-detach regime rather than merely altering neighborhood structures.
Methodology
The methodology involves framing Transformer layers as weighted left-Kan-style extensions over simplicial neighborhoods. The paper explores different neighborhood systems, including token-level, topological, and simplicial structures, and compares the performance of various Transformer models through comprehensive experiments on benchmark datasets.
Results
The results indicate that the quadratic KET model is the strongest in strict-causal settings on WikiText-2 and WikiText-103. However, the most significant performance improvements are attributed to the predict-detach regime, which allows for richer noncausal neighborhoods when using detached predictive carriers.
Implications
The findings suggest that KETs could enhance the design of future Transformer models by integrating categorical approaches and self-conditioning mechanisms, potentially leading to better generalization and performance in various NLP tasks.
WINDQuant: Weight-Informed Neural Decision-Making for Global Mixed-Precision LLM Quantization
Large Language Models
Reinforcement Learning
Efficient ML
- WINDQuant reformulates mixed-precision quantization as a sequential decision-making problem, allowing for adaptive bit-width allocation.
- The framework operates at a fine-grained level, enabling more precise and flexible quantization strategies compared to existing methods.
- WINDQuant demonstrates competitive performance on LLaMA models with up to 70 billion parameters without requiring full model retraining.
- The approach integrates activation-aware mechanisms and supports a range of quantization operators from 1-bit to 8-bit.
Summary
WINDQuant introduces a novel reinforcement learning-based framework for mixed-precision quantization of large language models (LLMs), addressing the challenges of maintaining performance while reducing memory and computational costs. Traditional quantization methods, such as post-training quantization (PTQ) and quantization-aware training (QAT), often lead to significant accuracy degradation or require extensive retraining. WINDQuant overcomes these limitations by employing a fine-grained, column-chunk level approach to allocate bit-widths and quantization treatments under a global storage budget. The framework utilizes Proximal Policy Optimization (PPO) combined with activation-aware calibration and effective-bit accounting to optimize the quantization process. Experimental results on various LLaMA models demonstrate that WINDQuant achieves competitive performance in ultra-low-bit settings while minimizing optimization overhead compared to retraining-based methods. This work highlights the potential of reinforcement learning as an effective controller for adaptive mixed-precision quantization in LLMs.
Methodology
WINDQuant employs a reinforcement learning framework where quantization is treated as a finite-horizon sequential decision problem. The agent observes features of column-level chunks within a layer and selects actions to determine the appropriate bit-width for quantization. The training process utilizes Proximal Policy Optimization (PPO) with budget-aware action masking and a quality penalty based on perplexity to guide the learning of effective quantization policies.
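Budget-aware action masking can be illustrated with a small sketch: before the agent chooses a bit-width for a chunk, any option that would leave too little budget for the remaining chunks is removed. The function below assumes equally sized chunks and ignores the storage overhead of scales and zero points; it is a hypothetical illustration, not WINDQuant's actual masking rule.

```python
def allowed_bits(options, remaining_budget_bits, chunk_size, chunks_left_after, min_bits):
    """Bit-widths that still leave every later chunk at least min_bits per weight."""
    reserve = chunks_left_after * chunk_size * min_bits
    return [b for b in options if b * chunk_size + reserve <= remaining_budget_bits]

options = [1, 2, 3, 4, 8]
chunk_size = 128
budget = 3 * chunk_size * 3  # three chunks at an average of 3 bits per weight

step1 = allowed_bits(options, budget, chunk_size, chunks_left_after=2, min_bits=1)
budget -= 4 * chunk_size     # suppose the agent picks 4 bits
step2 = allowed_bits(options, budget, chunk_size, chunks_left_after=1, min_bits=1)
budget -= 4 * chunk_size     # and 4 bits again
step3 = allowed_bits(options, budget, chunk_size, chunks_left_after=0, min_bits=1)
print(step1, step2, step3)
```

Spending generously early on shrinks the menu for later chunks, which is why the allocation is naturally a sequential decision under a global budget.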
Results
WINDQuant achieves competitive performance in ultra-low-bit quantization settings across various LLaMA models, significantly reducing computational and memory overhead compared to traditional retraining-based quantization methods. The framework successfully maintains model accuracy while allowing for flexible precision assignments.
Implications
The findings suggest that WINDQuant can be applied to enhance the efficiency of deploying large language models in resource-constrained environments, making it feasible to run complex models on edge devices. The adaptive quantization strategy could lead to broader applications in various domains requiring efficient model inference.
On the Learnability of Test-Time Adaptation: A Recovery Complexity Perspective
Theory
- Introduces a theoretical framework for studying TTA learnability.
- Defines (ϵ, δ)-Recovery Complexity and (ϵ, Ο)-TTA Learnability metrics.
- Develops a unified model for analyzing non-stationary test streams.
- Derives bounds on recovery complexity, highlighting fundamental limits of TTA.
Summary
This paper addresses the learnability of Test-Time Adaptation (TTA) in the context of non-stationary test streams, a topic that has not been thoroughly explored despite TTA's empirical success. The authors introduce a theoretical framework that includes (ϵ, δ)-Recovery Complexity and (ϵ, Ο)-TTA Learnability to measure TTA's effectiveness in adapting to distribution shifts without labeled data. They propose a novel discrete surrogate model for non-stationary test streams, which allows for a unified analysis of both gradual and abrupt shifts. The framework leads to the derivation of order-wise matching lower and upper bounds on recovery complexity, revealing fundamental limits of TTA and an intrinsic trade-off between recovery speed and excess risk. The study also connects (ϵ, Ο)-TTA Learnability to dynamic regret, clarifying the relationship and differences from existing regret-based analyses. Overall, this work provides a principled approach to understanding the learnability of TTA under complex conditions, offering insights into its reliability and performance over time.
Methodology
The authors develop a theoretical framework that combines a Wasserstein-quantized surrogate for distribution shifts with a Ο-mixing model for temporal dependence. This allows for a comprehensive analysis of TTA under non-stationary conditions. They derive recovery complexity metrics and establish bounds to evaluate TTA algorithms' performance.
Results
The paper derives order-wise matching lower and upper bounds on recovery complexity, revealing intrinsic limits of TTA and a trade-off between recovery speed and excess risk. The findings provide unified learnability guarantees for TTA, enhancing the understanding of its reliability in non-stationary environments.
Implications
This work has significant implications for the development of robust machine learning models that can adapt to changing environments without labeled data. It offers a theoretical foundation for future research in TTA, potentially improving applications in various domains such as computer vision, natural language processing, and beyond.
Detectability in Diversity: Improved Canary Crafting for Privacy Auditing in One Run
Optimization
Efficient ML
Theory
- Introduces an efficient method for crafting canaries for one-run privacy auditing.
- Combines influence functions with bilevel optimization to enhance canary detectability and diversity.
- Empirical validation shows improved privacy leakage estimates with reduced computational costs.
- Addresses the issue of interference among canaries, which affects membership inference accuracy.
Summary
This paper addresses the challenge of privacy auditing in machine learning models, specifically focusing on the crafting of 'canary' data points used in membership inference attacks (MIAs). The authors propose a novel approach to optimize canaries for one-run privacy auditing, which is more efficient than traditional multi-run methods. They introduce a greedy initialization based on influence functions to select canaries that are both highly detectable and minimally interfering. Additionally, a bilevel optimization framework is developed that promotes diversity among canaries in the embedding space. This method allows for the efficient updating of a single model during canary optimization, significantly reducing computational costs. Experimental results demonstrate that the proposed method yields stronger privacy leakage estimates compared to existing canary crafting techniques while maintaining lower computational demands.
Methodology
The authors utilize a greedy selection algorithm based on influence functions to identify canaries with high self-influence and low cross-influence. They then employ a bilevel optimization approach that incorporates a diversity regularizer to ensure canaries are orthogonal in the representation space, allowing for efficient updates to a single model rather than retraining from scratch.
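The greedy initialization can be sketched as below, assuming an influence matrix has already been estimated, where entry `[i][j]` approximates how much training on candidate `i` affects the loss on candidate `j`. The scoring rule (self-influence minus a penalty on absolute cross-influence with already chosen canaries) and the `penalty` weight are assumptions for illustration; the paper's exact criterion is not given in this summary.

```python
def greedy_canaries(influence, k, penalty=1.0):
    """Pick k candidates with high self-influence and low cross-influence."""
    n = len(influence)
    selected = []
    while len(selected) < min(k, n):
        def gain(i):
            cross = sum(abs(influence[i][j]) + abs(influence[j][i])
                        for j in selected)
            return influence[i][i] - penalty * cross
        best = max((i for i in range(n) if i not in selected), key=gain)
        selected.append(best)
    return selected

# Toy matrix: candidates 0 and 1 are both detectable but interfere strongly.
influence = [
    [5.0, 4.0, 0.1],
    [4.0, 4.5, 0.2],
    [0.1, 0.2, 3.0],
]
chosen = greedy_canaries(influence, 2)
print(chosen)
```

Candidate 1 has higher self-influence than candidate 2, but it is skipped because it interferes with candidate 0, which was selected first.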
Results
The proposed method achieves competitive or superior privacy leakage estimates compared to existing canary crafting methods, while requiring significantly less computational resources. The experiments conducted on WideResNet and CNN architectures validate the effectiveness of the approach.
Implications
This research has significant implications for enhancing privacy auditing techniques in machine learning, particularly in contexts where computational efficiency is critical. The findings could lead to more robust privacy-preserving methods in various applications, including healthcare, finance, and any domain where sensitive data is used for model training.
Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storage
Optimization
Efficient ML
- Introduces a reparametrization of Shampoo-based methods to support BFP16 storage.
- Reduces computational overhead by updating only part of the basis through QR decomposition in a subspace.
- Improves performance of KL-SOAP to match or exceed KL-Shampoo under BFP16 storage.
- Compatible with various subspace selection strategies, enhancing flexibility in optimization.
Summary
This paper presents a novel reparametrization of Shampoo-based optimization methods, specifically KL-Shampoo and SOAP, to enhance their efficiency when using BFloat16 (BFP16) storage. Traditional implementations of these methods rely on QR decomposition, which is computationally expensive and requires single-precision arithmetic, leading to increased memory and time costs, especially with large preconditioning matrices. The authors propose a new approach that allows for the storage of preconditioning factors in BFP16 while maintaining performance. By updating only a subset of basis vectors through QR decomposition in a subspace, the proposed method reduces computational overhead and mitigates performance degradation associated with BFP16 storage. The reparametrization enables the combination of updated and unchanged basis vectors, forming a complete basis and allowing for compatibility with various subspace selection strategies. Empirical results demonstrate that this approach improves the performance of SOAP-type methods and allows KL-SOAP to match or exceed the performance of KL-Shampoo, thereby making Shampoo-based methods more efficient in terms of memory and computation.
Methodology
The authors propose a reparametrization of the preconditioning factors in Shampoo-based methods. Instead of storing the full preconditioning matrix, they store a modified set of parameters that allows for efficient updates of a subset of basis vectors. This is achieved through QR decomposition in a subspace, which significantly reduces computational costs while supporting BFP16 storage.
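To see why 16-bit storage of preconditioning factors is delicate, the sketch below rounds a value to bfloat16 precision using only the standard library. bfloat16 keeps the upper 16 bits of a float32 (8 exponent bits, 7 explicit mantissa bits), so values near 1 are spaced 2^-7 apart. The helper is an illustration of the number format only, not part of the paper's method, and it does not handle NaN inputs.

```python
import struct

def to_bfloat16(x):
    """Round a float to the nearest bfloat16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a rounding bias so that truncating the low 16 bits rounds to nearest even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

exact_small = to_bfloat16(1.0 + 2 ** -7)  # representable
lost_small = to_bfloat16(1.0 + 2 ** -8)   # falls between two bfloat16 values
rounded = to_bfloat16(3.14159)
print(exact_small, lost_small, rounded)
```

A perturbation of 2^-8 on a value near 1 vanishes entirely after rounding, which gives a sense of how small corrections to a stored basis or preconditioner can be lost unless the parametrization is chosen with the storage format in mind.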
Results
Empirical evaluations show that the reparametrized methods outperform traditional implementations, particularly in scenarios utilizing BFP16 storage. The performance gap between KL-Shampoo and KL-SOAP is effectively closed, demonstrating the efficacy of the proposed approach in enhancing both memory efficiency and computational speed.
Implications
This work has significant implications for optimizing neural network training processes, particularly in resource-constrained environments where memory and computational efficiency are critical. The proposed methods can be applied broadly to various Shampoo-based optimization techniques, potentially leading to faster training times and reduced resource consumption.
PLS in the Mirror of Self-Attention
Theory
Optimization
- PLS can be viewed as a linearized version of self-attention, bridging traditional statistical methods and modern neural network paradigms.
- The reformulation of PLS as a regression problem allows for greater flexibility in modeling relationships between predictors and responses.
- Introducing a modified cost function enhances the ability of PLS to handle non-orthogonal transformations and nonlinear activations.
- The study provides insights into dimensionality normalization in self-attention mechanisms, suggesting potential improvements in learning efficiency.
Summary
This paper explores the relationship between Partial Least Squares (PLS) and self-attention mechanisms within neural networks. The author presents PLS as a linearized form of self-attention, allowing for its study in the context of neural architectures. The paper discusses the mathematical foundations of PLS, including its formulation for dimensionality reduction and predictor selection, and contrasts these with the self-attention mechanism used in Transformer models. The author reformulates PLS as a regression problem rather than a cross-covariance maximization task, introducing a modified cost function that incorporates reconstruction errors for both predictors and responses. This approach aims to enhance the flexibility of PLS by allowing non-orthogonal transformations and nonlinear activations, potentially leading to improved predictive performance. The paper concludes by discussing the implications of this reformulation for future research and applications in machine learning.
Methodology
The paper employs mathematical formulations to relate PLS and self-attention, reformulating the PLS optimization problem into a regression framework. It introduces a new cost function that balances latent variable extraction with regression accuracy, allowing for gradient descent methods to solve the optimization problem.
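The relationship described above can be written out as a short worked sketch. The first two lines are the standard scaled dot-product attention and its softmax-free (linearized) variant; the third is the classical PLS objective for the first pair of weight vectors. The last line is only an illustrative regression-style cost with reconstruction terms for predictors and responses: the symbols W, P, Q, the activation φ, and the weight λ are assumptions made here for illustration, not the paper's exact formulation.

```latex
% Standard scaled dot-product self-attention (Transformer form).
\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{(X W_Q)(X W_K)^{\top}}{\sqrt{d_k}}\right) X W_V

% Linearized variant: the softmax is dropped, leaving a product of linear maps.
\mathrm{Attn}_{\mathrm{lin}}(X) = \frac{1}{\sqrt{d_k}} \, (X W_Q)(X W_K)^{\top} X W_V

% Classical PLS: first pair of weight vectors maximizes the cross-covariance
% between predictor scores Xw and response scores Yc.
\max_{\|w\| = \|c\| = 1} \; w^{\top} X^{\top} Y c

% Illustrative regression-style cost (assumed form): latent scores \phi(XW)
% reconstruct both X and Y; W is not constrained to be orthogonal.
J(W, P, Q) = \left\| X - \phi(XW) P^{\top} \right\|_F^2
           + \lambda \left\| Y - \phi(XW) Q^{\top} \right\|_F^2
```

With φ set to the identity the cost is a smooth function of W, P, and Q, so it can be minimized by gradient descent, which matches the summary's description of the optimization.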
Results
The proposed reformulation of PLS demonstrates that it can effectively model relationships in data while accommodating non-linear transformations. The new cost function provides a more flexible approach to PLS, potentially leading to better predictive performance in various applications.
Implications
This work suggests that integrating concepts from traditional statistical methods like PLS with modern neural network architectures can lead to enhanced learning algorithms. It opens avenues for further research into hybrid models that leverage the strengths of both approaches, particularly in fields requiring predictive modeling with high-dimensional data.
MobileMoE: Scaling On-Device Mixture of Experts
NLP
Large Language Models
Efficient ML
- MobileMoE establishes a new Pareto frontier for on-device LLMs with sub-billion active parameters.
- The proposed MoE scaling law optimizes architecture for mobile memory and compute constraints.
- MobileMoE models use 2-4× fewer inference FLOPs than leading dense LLMs.
- Efficient MoE inference is demonstrated on commodity smartphones for the first time, with significant speedups over dense baselines.
Summary
The paper introduces MobileMoE, a family of Mixture-of-Experts (MoE) language models with sub-billion active parameters, designed for efficient deployment on mobile devices. The authors identify a new Pareto frontier for on-device large language models (LLMs) by optimizing MoE architectures under mobile memory and compute constraints. They propose an on-device MoE scaling law that favors moderate sparsity and shared experts, which leads to optimal memory and compute usage. MobileMoE is trained with a four-stage recipe on open-source datasets: pre-training, mid-training, instruction fine-tuning, and quantization-aware training. The results show that MobileMoE models match or exceed the accuracy of existing dense LLMs while requiring significantly fewer inference FLOPs. The paper also demonstrates practical deployment of MobileMoE on flagship smartphones, achieving faster inference than dense baselines while maintaining lower memory usage. This work highlights the potential of MoE architectures for efficient on-device AI applications in edge computing environments.
Methodology
The authors developed MobileMoE by formulating a novel MoE scaling law tailored for sub-billion active parameters. They designed a four-stage training pipeline that includes pre-training, mid-training, instruction fine-tuning, and quantization-aware training. The models were trained on open-source datasets, and custom fused MoE kernels were created for efficient inference on mobile devices.
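The distinction between total and active parameters, which underlies the "sub-billion active parameters" framing, can be illustrated with a small calculation. This is a minimal sketch, not the paper's scaling law: the `MoEConfig` fields, the gated-FFN parameter count, and all numeric values are hypothetical, and attention, embedding, and router parameters are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # All fields and values are hypothetical, chosen only to show the arithmetic.
    n_layers: int   # number of Transformer layers with an MoE feed-forward block
    d_model: int    # model (residual stream) width
    d_expert: int   # hidden width of each expert feed-forward network
    n_routed: int   # routed experts per layer (stored in memory)
    n_shared: int   # shared experts that process every token
    top_k: int      # routed experts selected per token


def expert_params(cfg: MoEConfig) -> int:
    """Parameters in one expert, assuming a gated FFN with three weight matrices."""
    return 3 * cfg.d_model * cfg.d_expert


def ffn_param_counts(cfg: MoEConfig) -> tuple[int, int]:
    """Return (total, active) feed-forward parameters across all layers.

    Total counts every expert that must be stored; active counts only the
    experts a single token passes through (top_k routed plus all shared).
    """
    per_expert = expert_params(cfg)
    total = cfg.n_layers * (cfg.n_routed + cfg.n_shared) * per_expert
    active = cfg.n_layers * (cfg.top_k + cfg.n_shared) * per_expert
    return total, active


if __name__ == "__main__":
    cfg = MoEConfig(n_layers=12, d_model=1024, d_expert=2048,
                    n_routed=8, n_shared=1, top_k=2)
    total, active = ffn_param_counts(cfg)
    print(f"total FFN params:  {total:,}")
    print(f"active FFN params: {active:,}")
    print(f"total / active:    {total / active:.1f}")
```

Memory footprint tracks the total count, while per-token feed-forward compute scales roughly with the active count, which is why sparsity and the number of shared experts trade off against each other under mobile constraints.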
Results
MobileMoE models were evaluated across 14 benchmarks, showing that they match or exceed the performance of existing dense LLMs while using 2-4× fewer inference FLOPs. MobileMoE-S, for instance, achieved 1.8-3.8× faster prefill and 2.2-3.4× faster decoding compared to the dense baseline MobileLLM-Pro. Additionally, MobileMoE-M matched the accuracy of the state-of-the-art MoE OLMoE-1B-7B with approximately 60% fewer parameters.
Implications
The findings suggest that MoE architectures can significantly enhance the efficiency and performance of on-device AI applications, enabling low-latency and cost-effective solutions for smartphones and other edge devices. This work could lead to broader adoption of advanced LLMs in mobile applications, improving user experiences while maintaining privacy.
Bayesian Deployment Approval for Learned Landing Controllers under Finite Rollout Validation
Reinforcement Learning
Robotics
Theory
- Introduces a Bayesian framework for validating learned landing controllers under uncertainty.
- Defines deployment capability as the probability of meeting safety constraints during landing.
- Utilizes Bayesian inference to quantify uncertainty in deployment readiness.
- Develops a sequential validation mechanism for real-time decision-making during testing.
Summary
This paper addresses the challenge of validating learned autonomous landing controllers under uncertainty, particularly in safety-critical applications. Traditional evaluation metrics such as cumulative reward and empirical success frequency are insufficient for assessing deployment readiness. The authors propose a Bayesian approval framework that quantifies the probability of a controller meeting safety constraints during landing under uncertain operating conditions. This framework incorporates Bayesian posterior inference to assess uncertainty in deployment capability and introduces posterior approval probability and risk metrics for deployment-oriented evaluation. A sequential validation mechanism is also developed to facilitate decision-making during rollout testing. Simulation experiments demonstrate that conventional empirical metrics can lead to overconfident deployment interpretations, while the proposed Bayesian approach offers a more calibrated assessment of readiness for deployment. The methodology is designed to be independent of the underlying learning algorithm, making it applicable to various types of controllers, including reinforcement learning policies and heuristic systems.
Methodology
The authors develop a Bayesian deployment approval framework that models landing trajectories as Bernoulli safety outcomes. They use Bayesian updating to quantify posterior uncertainty regarding deployment capability and introduce metrics for posterior approval probability and false-approval risk. A sequential validation mechanism is implemented to support ongoing decision-making during rollout testing.
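The Bernoulli-outcome view above lends itself to a Beta-Bernoulli sketch. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Beta prior with integer parameters (uniform by default), and the function names, the required success probability `p_min`, and the `approve_at` / `reject_at` thresholds are all hypothetical. For integer Beta parameters the posterior tail probability has an exact binomial-sum form, so only the standard library is needed.

```python
from math import comb


def approval_probability(successes: int, failures: int, p_min: float,
                         prior_a: int = 1, prior_b: int = 1) -> float:
    """Posterior probability that the true safe-landing rate is at least p_min.

    Assumes a Beta(prior_a, prior_b) prior with integer parameters, giving a
    Beta(a, b) posterior. For integer a and b,
    P(theta >= p_min) = P(Binomial(a + b - 1, p_min) <= a - 1).
    """
    a = prior_a + successes
    b = prior_b + failures
    n = a + b - 1
    return sum(comb(n, j) * p_min**j * (1 - p_min)**(n - j) for j in range(a))


def sequential_validation(outcomes, p_min: float,
                          approve_at: float = 0.95, reject_at: float = 0.05):
    """Update the posterior after each rollout and stop once a threshold is crossed.

    outcomes is an iterable of 1 (safe landing) or 0 (constraint violation).
    Returns (decision, rollouts_used, approval_probability); the stopping rule
    is a simple illustrative choice, not the paper's mechanism.
    """
    successes = failures = 0
    prob = approval_probability(0, 0, p_min)
    for outcome in outcomes:
        if outcome:
            successes += 1
        else:
            failures += 1
        prob = approval_probability(successes, failures, p_min)
        if prob >= approve_at:
            return "approve", successes + failures, prob
        if prob <= reject_at:
            return "reject", successes + failures, prob
    return "continue", successes + failures, prob


if __name__ == "__main__":
    # 50 safe landings out of 50: the empirical success rate is 1.0, yet the
    # posterior probability of a 99% safe-landing rate is far from certain.
    print(approval_probability(50, 0, p_min=0.99))
    print(sequential_validation([1] * 40, p_min=0.90))
```

Under the uniform prior, 50 successes in 50 rollouts give a posterior probability of only about 0.40 that the true safe-landing rate is at least 0.99, which illustrates how an empirical success frequency alone can read as overconfident.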
Results
Simulation studies indicate that the Bayesian framework provides a better-calibrated assessment of deployment readiness than traditional empirical metrics. Evaluating learned controllers solely on empirical success rates can overstate their deployment capability, which highlights the importance of uncertainty quantification.
Implications
The proposed framework has significant implications for the deployment of autonomous systems in safety-critical environments, such as aviation and robotics. It offers a statistically grounded approach to ensure that learned controllers are adequately validated before deployment, potentially reducing the risk of unsafe operations.