AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
64
Papers today
8h
Update frequency
7
Days of history
ConceptTracer: Interactive Analysis of Concept Saliency and Selectivity in Neural Representations
Interpretability
- ConceptTracer is an interactive application for analyzing neural representations based on human-understandable concepts.
- It incorporates information-theoretic measures to quantify concept saliency and selectivity in neural activations.
- The tool was validated using representations from TabPFN, demonstrating its utility in identifying interpretable neurons.
- ConceptTracer enhances the understanding of how neural networks encode information, contributing to mechanistic interpretability.
ConceptTracer: Interactive Analysis of Concept Saliency and Selectivity in Neural Representations
Summary
The paper introduces ConceptTracer, an interactive tool designed to analyze neural representations through human-interpretable concepts. Despite the impressive performance of neural networks, their decision-making processes remain opaque, which can hinder their application in critical fields. ConceptTracer addresses this gap by integrating two information-theoretic measures—concept saliency and selectivity—that allow users to identify neurons that respond strongly to specific concepts. The authors demonstrate the effectiveness of ConceptTracer using representations learned by TabPFN, showcasing its ability to facilitate the discovery of interpretable neurons. This tool provides a practical framework for understanding how neural networks encode concept-level information, thereby enhancing mechanistic interpretability and user trust in AI systems.
Methodology
The authors developed two information-theoretic measures: saliency, which quantifies the strength of association between neuron activations and concepts using normalized mutual information, and selectivity, which assesses the specialization of neurons for particular concepts. They employed nonparametric permutation testing to establish empirical null distributions for these metrics, allowing for the evaluation of neuron-concept associations against chance levels.
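The saliency measure described above can be sketched in plain NumPy: normalized mutual information between a neuron's (median-binarized) activations and a binary concept label, tested against a permutation null. This is an illustrative reconstruction, not the authors' released tool; the median binarization and the geometric-mean normalization of the mutual information are assumptions.

```python
import numpy as np

def normalized_mi(x, y):
    """Normalized mutual information between two discrete label arrays."""
    xs, ys = np.unique(x), np.unique(y)
    joint = np.array([[np.mean((x == a) & (y == b)) for b in ys] for a in xs])
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return mi / max(np.sqrt(hx * hy), 1e-12)

def saliency_pvalue(activations, concept, n_perm=1000, seed=0):
    """Saliency = NMI between binarized activations and a concept label,
    with an empirical null from shuffling the concept labels."""
    rng = np.random.default_rng(seed)
    binarized = (activations > np.median(activations)).astype(int)
    observed = normalized_mi(binarized, concept)
    null = np.array([normalized_mi(binarized, rng.permutation(concept))
                     for _ in range(n_perm)])
    pval = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, pval
```

A neuron that fires only for one concept yields a saliency near 1 with a small p-value, while an unrelated neuron's saliency falls inside the permutation null.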
Results
The application of ConceptTracer on TabPFN representations revealed distinct neurons that are strongly associated with specific concepts, thereby facilitating the identification of interpretable neurons. The results indicate that the tool effectively supports the analysis of neural representations, providing insights into the encoding of concept-level information.
Implications
ConceptTracer has significant implications for improving the interpretability of neural networks, particularly in high-stakes domains such as healthcare. By enabling researchers and practitioners to explore and understand the inner workings of neural models, it fosters greater trust and transparency in AI systems. Additionally, the tool can serve as a foundation for future research in mechanistic interpretability and concept-based explainability.
Fraud Detection System for Banking Transactions
Theory
Efficient ML
Optimization
- The framework utilizes the PaySim dataset to simulate financial transactions for effective fraud detection.
- Employs CRISP-DM methodology for structured data analysis and model development.
- Implements SMOTE to address class imbalance in the dataset, improving minority class detection.
- Evaluates multiple machine learning models, with hyperparameter tuning to enhance performance.
Fraud Detection System for Banking Transactions
Summary
This paper addresses the growing challenge of fraud detection in digital banking transactions, exacerbated by the rise of online payment systems. The authors propose a machine learning-based framework that leverages the PaySim synthetic financial transaction dataset to enhance fraud detection capabilities. Utilizing the CRISP-DM methodology, the study conducts a thorough exploratory analysis and feature refinement, followed by a comparative evaluation of various classification models, including Logistic Regression, Decision Tree, Random Forest, and XGBoost. To counteract the class imbalance inherent in financial transaction data, the Synthetic Minority Over-sampling Technique (SMOTE) is employed, and model performance is further optimized through hyperparameter tuning using GridSearchCV. The results indicate that the proposed framework not only improves detection accuracy but also provides a scalable solution for FinTech environments, ultimately aiming to reduce false-positive rates and enhance fraud prevention strategies.
Methodology
The study follows the CRISP-DM framework, which includes stages of business understanding, data understanding, data preparation, modeling, evaluation, and deployment. It involves hypothesis-driven exploratory analysis, feature refinement, and comparative assessment of various classification models, while addressing class imbalance using SMOTE and optimizing model performance through GridSearchCV.
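SMOTE's core step, interpolating between a minority-class sample and one of its nearest minority-class neighbours, can be sketched as follows. This is a minimal stand-in for library implementations such as imbalanced-learn's `SMOTE`, not the paper's pipeline; the neighbour count `k` and the uniform interpolation coefficient follow the standard algorithm.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE-style oversampling: each synthetic sample is a random
    interpolation between a minority sample and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # pairwise squared distances within the minority class
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # exclude self-matches
    neighbours = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])
```

Because every synthetic point lies on a segment between two real minority samples, the method densifies the minority region rather than duplicating records, which is what improves recall on the fraud class.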
Results
The comparative analysis of models revealed that tree-based classifiers, particularly XGBoost, outperformed traditional models in terms of detection accuracy. The application of SMOTE significantly improved the representation of fraudulent transactions, leading to enhanced model recall and reduced false-positive rates. The optimized model demonstrated robust performance, making it suitable for deployment in real-world FinTech systems.
Implications
The findings suggest that machine learning frameworks can significantly improve fraud detection in banking transactions, providing a scalable solution that can adapt to evolving fraud strategies. This has implications for enhancing security measures in digital payment systems and reducing financial losses due to fraud.
ODE-free Neural Flow Matching for One-Step Generative Modeling
Generative Models
- OT-NFM allows for one-step image generation without ODE solvers at inference.
- Mean collapse is identified as a unique failure mode in neural flow models, necessitating optimal transport for effective learning.
- Two scalable optimal transport coupling strategies are introduced, enhancing the practicality of OT-NFM for large-scale applications.
- Empirical results show OT-NFM's competitive performance in generating high-quality samples with reduced computational cost.
ODE-free Neural Flow Matching for One-Step Generative Modeling
Summary
This paper introduces Optimal Transport Neural Flow Matching (OT-NFM), a novel generative modeling framework that directly learns the transport map from noise to data without relying on ordinary differential equations (ODEs) during inference. Traditional diffusion and flow matching models require multiple evaluations of neural networks to generate samples, which can be computationally expensive. OT-NFM circumvents this by enabling one-step generation through a single forward pass. The authors identify a critical issue known as 'mean collapse,' where naive training leads to outputs converging to the mean of the data due to inconsistent noise-data pairings. To address this, they establish that consistent coupling is essential for effective learning and propose optimal transport pairings as a solution. The paper also presents scalable coupling strategies that allow for efficient training without the need for precomputing full transport plans. Empirical results on synthetic benchmarks and image datasets (MNIST and CIFAR-10) demonstrate that OT-NFM achieves competitive sample quality while significantly reducing inference costs compared to multi-step methods.
Methodology
The authors propose a framework that learns the flow map directly using neural flows, avoiding the need for velocity fields or ODE integration. They analyze the training process to ensure consistent coupling between noise and data, employing optimal transport strategies to mitigate mean collapse. Two coupling strategies, precomputed minibatch OT and online refinement via LOOM, are introduced to facilitate efficient training.
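The precomputed minibatch OT coupling can be illustrated on a single batch: solve an exact assignment between noise and data samples under squared-Euclidean cost, then train on the resulting consistent pairs. This sketch assumes uniform minibatch weights and the Hungarian solver from SciPy; it is not the paper's code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def minibatch_ot_pairing(noise, data):
    """Pair each noise sample with a data sample by solving the minibatch
    optimal-transport assignment under squared-Euclidean cost. Training on
    these consistent pairs is what prevents the mean-collapse failure mode
    that random noise-data pairings induce."""
    cost = ((noise[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)     # exact OT for uniform minibatches
    return noise[rows], data[cols]
```

With random pairings the regression target for a given noise point changes every epoch, so the network averages them (mean collapse); the OT pairing fixes a single nearby target per noise point.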
Results
Experiments on synthetic benchmarks and image generation tasks (MNIST and CIFAR-10) reveal that OT-NFM achieves sharp and diverse samples comparable to those generated by multi-step methods, but with significantly lower inference costs, validating the effectiveness of the proposed framework.
Implications
The development of OT-NFM could streamline generative modeling processes in various applications, particularly in scenarios requiring rapid sample generation, such as real-time image synthesis and interactive applications in computer vision.
PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC
Reinforcement Learning
Robotics
Optimization
- Introduces PriPG-RL, a framework for RL in partially observable environments using a privileged planner.
- Utilizes an anytime-feasible MPC algorithm (REAP) to provide structured guidance to the learning agent.
- Develops the Planner-to-Policy Soft Actor-Critic (P2P-SAC) method to distill knowledge from the planner.
- Demonstrates improved sample efficiency and policy performance in simulations and real-world applications.
PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC
Summary
This paper presents PriPG-RL, a novel framework for training reinforcement learning (RL) agents in partially observable environments by leveraging a privileged planner agent during training. The framework is formalized as a Partially Observable Markov Decision Process (POMDP), where the planner has access to an approximate dynamical model and privileged state information, while the learning agent operates with limited observations. The authors introduce an anytime-feasible Model Predictive Control (MPC) algorithm, REAP, to serve as the planner agent, which guarantees feasible solutions at any computation point. The learning agent employs a Planner-to-Policy Soft Actor-Critic (P2P-SAC) method that distills knowledge from the planner to enhance sample efficiency and policy performance. The theoretical foundations of the framework are rigorously analyzed, and the approach is validated through simulations in NVIDIA Isaac Lab and real-world deployment on a Unitree Go2 quadruped robot navigating complex environments. The results demonstrate significant improvements in learning efficiency and policy robustness compared to traditional RL methods under partial observability.
Methodology
The PriPG-RL framework consists of two agents: a planner agent that uses an anytime-feasible MPC algorithm (REAP) for guidance and a learning agent that employs the P2P-SAC method. The planner agent provides privileged information and feasible solutions during training, while the learning agent distills this knowledge to improve its performance in a partially observable setting.
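The distillation idea behind P2P-SAC can be sketched as an actor objective that adds a behaviour-cloning penalty towards the planner's actions on top of the usual SAC term. This is a hedged sketch of the general planner-to-policy pattern, not the paper's exact loss; the coefficient `beta` and the mean-squared penalty are assumptions.

```python
import numpy as np

def p2p_actor_loss(q_values, log_probs, policy_actions, planner_actions,
                   alpha=0.2, beta=1.0):
    """Sketch of a planner-distillation actor objective: the usual SAC term
    (entropy-regularised Q maximisation) plus a behaviour-cloning penalty
    pulling the policy towards the privileged planner's actions. `beta` is
    typically annealed towards zero so the final policy is not limited by
    the planner's performance."""
    sac_term = np.mean(alpha * log_probs - q_values)
    distill_term = np.mean((policy_actions - planner_actions) ** 2)
    return sac_term + beta * distill_term
```

Because the planner sees privileged state, its actions are a strong learning signal early in training; annealing `beta` lets the agent eventually rely on its own partial observations.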
Results
The proposed PriPG-RL framework was validated through simulations in NVIDIA Isaac Lab and successfully deployed on a Unitree Go2 quadruped robot. The results indicated that the learning agent achieved higher sample efficiency and better final policy performance compared to traditional RL approaches, effectively navigating complex, obstacle-rich environments.
Implications
The PriPG-RL framework has potential applications in robotics, particularly in scenarios where agents must operate under partial observability. The ability to leverage privileged information during training can lead to more robust and efficient learning processes in real-world tasks, enhancing the performance of autonomous systems in dynamic environments.
A Novel Edge-Assisted Quantum-Classical Hybrid Framework for Crime Pattern Learning and Classification
Optimization
- First comprehensive quantum-classical comparison for crime analytics with statistical validation.
- Novel quantum circuit architecture exploits crime feature correlations through targeted entanglement.
- Hybrid architectures (Q→C and C→Q) enhance classification performance and efficiency.
- Quantum-inspired methods show competitive accuracy and reduced parameter requirements.
A Novel Edge-Assisted Quantum-Classical Hybrid Framework for Crime Pattern Learning and Classification
Summary
This paper presents a novel framework for crime pattern analysis that combines quantum and classical machine learning techniques to address the challenges posed by high-dimensional, imbalanced crime datasets. The authors evaluate four computational paradigms: pure quantum models, classical baseline machine learning models, and two hybrid quantum-classical architectures (quantum-then-classical, Q→C, and classical-then-quantum, C→Q). Using 16 years of crime statistics from Bangladesh, the study assesses classification performance and computational efficiency through rigorous cross-validation. The results indicate that quantum-inspired approaches, particularly the Quantum Approximate Optimization Algorithm (QAOA), achieve up to 84.6% accuracy while requiring fewer trainable parameters than classical methods. The proposed correlation-aware circuit design incorporates domain-specific feature relationships, enhancing the performance of quantum models. The hybrid approaches demonstrate competitive training efficiency, making them suitable for resource-constrained environments, such as wireless sensor networks in smart city surveillance systems. The findings suggest that quantum-enhanced machine learning can effectively analyze structured crime data and encourage further exploration with larger datasets and realistic quantum hardware.
Methodology
The study employs a systematic four-paradigm comparison framework that includes pure quantum models, classical machine learning models, and hybrid architectures. It utilizes 16 years of crime statistics data, applying rigorous cross-validation methods to evaluate classification performance and computational efficiency. The quantum circuit design is informed by Spearman correlation analysis to exploit feature relationships.
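The correlation-informed circuit design can be illustrated at the feature-analysis stage: rank feature pairs by absolute Spearman correlation and target entanglement at the strongest ones. The function name and `top_k` cut-off are illustrative assumptions; the actual circuit construction is not reproduced here.

```python
import numpy as np
from scipy.stats import spearmanr

def select_entangling_pairs(features, top_k=3):
    """Pick which qubit (feature) pairs to entangle: rank all feature pairs
    by absolute Spearman correlation and keep the top_k. This mirrors a
    correlation-aware circuit that places entanglement between strongly
    related crime features."""
    rho, _ = spearmanr(features)                 # (d, d) correlation matrix
    d = features.shape[1]
    pairs = [(i, j, abs(rho[i, j]))
             for i in range(d) for j in range(i + 1, d)]
    pairs.sort(key=lambda t: -t[2])
    return [(i, j) for i, j, _ in pairs[:top_k]]
```

Spearman (rank) correlation is a sensible choice here because crime counts are skewed and monotone relationships matter more than linear ones.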
Results
The experimental results reveal that quantum-inspired approaches, especially QAOA, achieve up to 84.6% accuracy, outperforming classical baselines in terms of accuracy while requiring fewer trainable parameters. The hybrid approaches also demonstrate competitive training efficiency, indicating their potential for practical deployment in edge computing scenarios.
Implications
The proposed framework has significant implications for law enforcement and predictive policing, particularly in urban environments where crime data is complex and imbalanced. The ability to perform localized crime analytics with minimal communication costs makes it suitable for deployment in smart city infrastructures, enhancing public safety through efficient resource allocation and intervention strategies.
Bayesian Optimization for Mixed-Variable Problems in the Natural Sciences
Optimization
- Generalizes the probabilistic reparameterization approach to handle non-equidistant discrete variables.
- Demonstrates the effectiveness of Bayesian optimization in mixed-variable settings using Gaussian process surrogates.
- Conducts systematic benchmarks to optimize kernel formulations and acquisition functions.
- Establishes a practical framework for optimizing mixed-variable problems in natural sciences.
Bayesian Optimization for Mixed-Variable Problems in the Natural Sciences
Summary
This paper addresses the challenge of optimizing expensive black-box objectives in mixed-variable search spaces, which is prevalent in the natural sciences. The authors propose a generalized approach to the probabilistic reparameterization (PR) method to accommodate non-equidistant discrete variables, enabling gradient-based optimization in fully mixed-variable settings using Gaussian process (GP) surrogates. They conduct systematic benchmarks on both synthetic and experimental objectives to optimize kernel formulations and demonstrate the robustness of their generalized PR method. The study shows that when combined with a modified Bayesian optimization (BO) workflow, their approach can effectively optimize highly discontinuous and discretized objective landscapes. This work establishes a practical BO framework tailored for mixed optimization problems in scientific contexts, particularly beneficial for autonomous laboratory settings characterized by noise, discretization, and limited data.
Methodology
The authors extend the probabilistic reparameterization framework to support discrete variables, optimizing the acquisition function and kernel specifications for Gaussian processes. They benchmark their method against synthetic and real-world problems, comparing the performance of different acquisition functions, specifically Expected Improvement and Upper/Lower Confidence Bound methods, to assess the trade-off between exploration and exploitation.
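The core of probabilistic reparameterization for one discrete variable can be sketched directly: replace the hard choice among (possibly non-equidistant) values with a categorical distribution over logits, and average the acquisition over it. The expectation is smooth in the logits, so gradient ascent applies even though the variable itself is discrete. This is a generic sketch of the PR idea, not the paper's implementation; the softmax parameterization and exact enumeration are assumptions that only work for small discrete sets.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def expected_acquisition(logits, discrete_values, x_cont, acq):
    """Expected acquisition value under a categorical relaxation of one
    discrete variable. `discrete_values` may be non-equidistant; `x_cont`
    carries the continuous part of the search point."""
    probs = softmax(logits)
    return sum(p * acq(v, x_cont) for p, v in zip(probs, discrete_values))
```

As the logits concentrate on one category, the expectation recovers the acquisition value at that discrete point, so the relaxed optimum can be rounded back to a valid candidate.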
Results
The generalized PR method shows robust performance across various optimization landscapes, effectively handling mixed-variable problems. The benchmarks indicate that the proposed approach can efficiently optimize objective functions that are highly discontinuous and discretized, outperforming traditional methods in real-world applications.
Implications
This research has significant implications for optimizing experimental and simulation-based tasks in the natural sciences, particularly in autonomous laboratory settings. The proposed framework can enhance the efficiency of material discovery and other scientific optimizations by reducing the number of required evaluations and improving the exploration of complex search spaces.
DSPR: Dual-Stream Physics-Residual Networks for Trustworthy Industrial Time Series Forecasting
Time Series
Graph Learning
Interpretability
- DSPR effectively decouples stable trends from regime-dependent dynamics in industrial time series forecasting.
- The framework incorporates an Adaptive Window for transport delays and a Physics-Guided Dynamic Graph for interaction structures.
- DSPR achieves state-of-the-art predictive performance with high accuracy and physical plausibility.
- The model provides interpretable insights consistent with known domain mechanisms, enhancing scientific understanding.
DSPR: Dual-Stream Physics-Residual Networks for Trustworthy Industrial Time Series Forecasting
Summary
The paper introduces DSPR (Dual-Stream Physics-Residual Networks), a novel framework designed for accurate forecasting of industrial time series while ensuring physical plausibility under varying operational conditions. Traditional data-driven models often excel in statistical performance but fail to account for regime-dependent dynamics and transport delays, leading to untrustworthy predictions. DSPR addresses these challenges by decoupling stable temporal patterns from regime-dependent residual dynamics through two distinct streams: the Trend Stream, which captures statistical temporal evolution, and the Residual Stream, which focuses on residual dynamics using an Adaptive Window module for flow-dependent transport delays and a Physics-Guided Dynamic Graph for time-varying interaction structures. The framework was validated on four industrial benchmarks, demonstrating significant improvements in forecasting accuracy and robustness during regime shifts, achieving state-of-the-art performance metrics. Additionally, DSPR provides interpretable insights into learned interaction structures and adaptive lags, aligning with known physical mechanisms. This work highlights the potential of integrating physics-informed inductive biases into machine learning architectures for trustworthy industrial forecasting.
Methodology
DSPR employs a dual-stream architecture that separates the modeling of stable inertial trends from regime-dependent residual dynamics. It utilizes an Adaptive Window module to learn flow-dependent transport delays and a Physics-Guided Dynamic Graph to incorporate physical priors, allowing the model to capture time-varying interactions while suppressing spurious correlations.
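The trend/residual decoupling can be illustrated with a toy two-stream forecaster: a moving average stands in for the Trend Stream, the remainder for the Residual Stream, and the one-step forecast recombines a per-stream prediction. DSPR's learned modules (Adaptive Window, Physics-Guided Dynamic Graph) replace these hand-coded pieces; everything below is an assumption-laden stand-in.

```python
import numpy as np

def dual_stream_forecast(series, window=5):
    """Toy dual-stream one-step forecast: smooth the series into a trend,
    model the residual separately, and sum the two stream forecasts."""
    kernel = np.ones(window) / window
    trend = np.convolve(series, kernel, mode="valid")
    residual = series[window - 1:] - trend
    # Trend Stream: linear extrapolation of the smoothed component
    trend_next = trend[-1] + (trend[-1] - trend[-2])
    # Residual Stream: one-step AR(1) fit on the remainder
    x, y = residual[:-1], residual[1:]
    phi = (x @ y) / max(x @ x, 1e-12)
    return trend_next + phi * residual[-1]
```

Separating the streams keeps the stable inertial component from being distorted by regime-dependent residual dynamics, which is the failure mode DSPR attributes to monolithic forecasters.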
Results
DSPR demonstrated significant improvements in forecasting accuracy and robustness across four industrial datasets, achieving a Mean Conservation Accuracy exceeding 99% and a Total Variation Ratio of up to 97.2%. The model's ability to maintain physical plausibility while providing interpretable insights into dynamic interactions was validated through extensive experiments.
Implications
The findings suggest that integrating physics-informed inductive biases into machine learning frameworks can lead to more reliable and interpretable forecasting models in industrial applications. This approach not only enhances predictive accuracy but also supports mechanism-level analysis, which is crucial for safety-critical systems.
Extraction of linearized models from pre-trained networks via knowledge distillation
Efficient ML
Theory
- Proposes a framework for extracting linearized models from pre-trained neural networks using knowledge distillation.
- Integrates Koopman operator theory to approximate nonlinear transformations as linear systems.
- Demonstrates improved classification accuracy and numerical stability over conventional methods.
- Utilizes principal component analysis to incorporate weak nonlinearity in the model.
Extraction of linearized models from pre-trained networks via knowledge distillation
Summary
This paper addresses the challenge of improving the energy efficiency of machine learning architectures, particularly in the context of optical devices that excel in linear operations. The authors propose a novel framework that leverages knowledge distillation to extract linearized models from pre-trained neural networks, specifically targeting classification tasks. By integrating Koopman operator theory, the study approximates the nonlinear transformations of hidden layers as a linear system in a higher-dimensional observable space. This approach not only enhances the accuracy of classification tasks but also allows for the use of weak nonlinearity through principal component analysis (PCA). The proposed method is validated through numerical experiments on the MNIST and Fashion-MNIST datasets, demonstrating superior performance compared to traditional least-squares-based Koopman approximations in terms of classification accuracy and numerical stability. The findings suggest a promising direction for developing energy-efficient machine learning systems compatible with optical hardware.
Methodology
The authors utilize knowledge distillation to extract linearized models from pre-trained networks, applying Koopman operator theory to approximate nonlinear transformations in hidden layers. The methodology combines regression characteristics from Koopman theory with classification tasks, and incorporates PCA for initial-stage processing.
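The extraction step can be sketched as a least-squares distillation: lift the teacher's activations with a weak nonlinearity (here, squares of top principal-component scores, a stand-in for Koopman observables), then fit a single linear map from the lifted features to the teacher's outputs. This sketch is not the paper's method in detail; the choice of observables and the plain least-squares solve are assumptions.

```python
import numpy as np

def fit_linear_student(H, Y, n_pc=2):
    """Distil a linear student from teacher activations H (n, d) to teacher
    outputs Y (n, k): lift H with squared top-PC scores, then solve a
    least-squares problem on the lifted features."""
    mu = H.mean(0)
    _, _, Vt = np.linalg.svd(H - mu, full_matrices=False)
    pcs = Vt[:n_pc]
    lifted = np.hstack([H, ((H - mu) @ pcs.T) ** 2, np.ones((len(H), 1))])
    W, *_ = np.linalg.lstsq(lifted, Y, rcond=None)
    return W, pcs, mu

def student_predict(H, W, pcs, mu):
    lifted = np.hstack([H, ((H - mu) @ pcs.T) ** 2, np.ones((len(H), 1))])
    return lifted @ W
```

At inference the student is a single matrix multiply on lifted features, which is the kind of operation optical hardware performs cheaply.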
Results
The proposed method outperformed conventional least-squares-based Koopman approximations in classification tasks on the MNIST and Fashion-MNIST datasets, achieving higher accuracy and greater numerical stability.
Implications
The findings have significant implications for the development of energy-efficient machine learning systems, particularly in the context of optical devices. By reducing reliance on nonlinear operations, the proposed framework could facilitate the implementation of machine learning models in hardware with limited computational resources.
SBBTS: A Unified Schrödinger-Bass Framework for Synthetic Financial Time Series
Generative Models
Time Series
Optimization
- Introduces SBBTS, a unified framework for generating synthetic financial time series.
- Jointly models drift and stochastic volatility, overcoming limitations of existing methods.
- Demonstrates improved forecasting performance and data augmentation capabilities.
- Empirical validation on both synthetic benchmarks and real financial data.
SBBTS: A Unified Schrödinger-Bass Framework for Synthetic Financial Time Series
Summary
This paper addresses the challenge of generating synthetic financial time series that accurately reflect both marginal distributions and temporal dynamics, which is crucial for applications in finance. Traditional methods often struggle to simultaneously model drift and stochastic volatility. The authors propose the Schrödinger-Bass Bridge for Time Series (SBBTS), a novel framework that extends the Schrödinger-Bass formulation to multi-step time series. SBBTS constructs a diffusion process that jointly calibrates drift and volatility, allowing for a tractable decomposition into conditional transport problems, which facilitates efficient learning. The framework is empirically validated through numerical experiments on the Heston model, demonstrating its ability to recover stochastic volatility and correlation parameters that previous methods could not capture. When applied to S&P 500 data, synthetic time series generated by SBBTS significantly enhance downstream forecasting performance, yielding improved classification accuracy and Sharpe ratios compared to models trained solely on real data. This indicates that SBBTS is a practical and effective tool for realistic time series generation and data augmentation in financial contexts.
Methodology
The SBBTS framework combines optimal transport with modern machine learning techniques to reproduce both marginal distributions and temporal dynamics. It extends the Schrödinger-Bass problem from two-marginal settings to full time series distributions, allowing for joint calibration of drift and volatility. The resulting problem is decomposed into a sequence of conditional optimal transport problems, which are computationally tractable. A scalable neural implementation captures path-dependent dynamics.
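The static core of a Schrödinger bridge between two empirical marginals is entropic optimal transport, solvable by Sinkhorn iterations; SBBTS chains such conditional couplings across the time steps of the series. The sketch below is this generic building block under uniform marginals, not the paper's SBBTS algorithm.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iter=1000):
    """Entropic optimal transport between two uniform empirical marginals.
    Returns the coupling matrix; its row/column sums match the marginals.
    The static projection of a Schroedinger bridge between two marginals
    is exactly this entropic OT coupling."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

Smaller `eps` sharpens the coupling towards the unregularized transport plan at the cost of slower, less stable iterations, which is why scalable neural parameterizations are used for full time-series problems.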
Results
The SBBTS framework successfully recovers stochastic volatility and correlation structures in numerical experiments. When applied to S&P 500 data, the synthetic time series generated by SBBTS consistently improve classification accuracy and Sharpe ratios in downstream forecasting tasks compared to training on real data alone.
Implications
The SBBTS framework has significant implications for financial modeling, particularly in scenarios where real data is scarce or sensitive. It can be used for stress testing, risk management, and enhancing predictive models through data augmentation.
Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding
Computer Vision
Generative Models
Multimodal
- Introduces BrainCoDec, a training-free method for cross-subject brain decoding.
- Achieves generalization to novel subjects without fine-tuning or anatomical alignment.
- Utilizes a hierarchical inference process for robust visual decoding.
- Demonstrates strong performance across diverse visual backbones and scanning protocols.
Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding
Summary
This paper addresses the challenge of visual decoding from brain signals, particularly the variability in neural representations across individuals that necessitates bespoke models or fine-tuning for each subject. The authors propose a novel meta-optimized approach for semantic visual decoding from fMRI data, which allows for generalization to new subjects without any fine-tuning. The method, named BrainCoDec (Brain In-Context Decoding), leverages a hierarchical inference process to infer unique neural encoding patterns from a small set of image-brain activation examples. This approach consists of two stages: first, estimating per-voxel visual response encoder parameters by constructing a context over multiple stimuli and responses; second, performing aggregated functional inversion using a context of encoder parameters and response values across multiple voxels. The results demonstrate strong cross-subject and cross-scanner generalization, requiring neither anatomical alignment nor stimulus overlap. This work represents a significant advancement towards a generalizable foundation model for non-invasive brain decoding, with implications for applications in brain-computer interfaces, cognitive assessment, and personalized diagnostics.
Methodology
The methodology involves a two-stage hierarchical inference process. In the first stage, the model estimates visual response function weights for individual voxels using a context constructed from stimuli and brain activity pairs. In the second stage, it integrates these parameters across multiple voxels to perform functional inversion, reconstructing visual stimuli from brain activity without requiring extensive retraining or fine-tuning.
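The two stages can be illustrated with linear stand-ins: stage 1 fits per-voxel encoder weights from a small context of (stimulus feature, response) pairs via ridge regression, and stage 2 inverts the estimated encoders to recover the stimulus features behind a new response pattern. The paper's meta-learned in-context modules replace both ridge solves; the regularizer `lam` and the linear encoding model are assumptions of this sketch.

```python
import numpy as np

def estimate_voxel_encoders(stim_feats, responses, lam=1.0):
    """Stage 1: per-voxel linear response weights from a context of
    (stimulus feature, brain response) pairs, via ridge regression.
    stim_feats: (n_examples, d), responses: (n_examples, n_voxels)."""
    d = stim_feats.shape[1]
    A = stim_feats.T @ stim_feats + lam * np.eye(d)
    return np.linalg.solve(A, stim_feats.T @ responses)   # (d, n_voxels)

def invert_responses(W, new_responses, lam=1.0):
    """Stage 2: aggregated functional inversion — recover the stimulus
    features that best explain the observed responses under the
    estimated encoders."""
    d = W.shape[0]
    A = W @ W.T + lam * np.eye(d)
    return np.linalg.solve(A, W @ new_responses)          # (d,)
```

Because the inversion aggregates evidence across many voxels, it degrades gracefully as individual voxels become noisy, matching the reported robustness to input variability.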
Results
The proposed method shows strong generalization capabilities across different subjects and scanners, achieving effective visual decoding without the need for anatomical alignment or stimulus overlap. The performance improves with the number of images and voxels used, indicating robustness to input variability.
Implications
This research has significant implications for developing universal models of brain function, enhancing brain-computer interfaces, and improving cognitive assessments and personalized diagnostics. It paves the way for scalable applications in understanding neural representations across populations.
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
Computer Vision
Interpretability
- Introduces two new metrics for evaluating model generalization based on internal mechanisms.
- Dependency Depth Bias (DDB) quantifies reliance on deep versus shallow features for model selection.
- Circuit Shift Score (CSS) detects performance degradation under distribution shifts.
- Both metrics show improved correlation with OOD performance, outperforming existing methods.
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
Summary
This paper addresses the challenge of evaluating the generalization performance of machine learning models, particularly Vision Transformers, in scenarios where labeled data is scarce. The authors propose a novel approach that leverages the internal mechanisms of models, specifically their 'circuits', to derive reliable, label-free proxy metrics for generalization. They introduce two metrics: Dependency Depth Bias (DDB) for model selection before deployment, which assesses a model's reliance on deep versus shallow features, and Circuit Shift Score (CSS) for monitoring performance after deployment, which measures deviations in the model's circuit structure under distribution shifts. The study demonstrates that these metrics significantly improve the correlation with out-of-distribution (OOD) performance compared to traditional proxies, thereby providing a more robust framework for evaluating model reliability in real-world applications.
Methodology
The authors utilize circuit discovery techniques to extract causal interactions between internal representations of Vision Transformers. They analyze the structural patterns of these circuits to develop the DDB and CSS metrics, which are then validated across various tasks and datasets.
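A circuit-shift style monitor can be sketched as a distance between edge-importance profiles: one weight per circuit edge on reference data versus the same profile on incoming data. The cosine-distance form below is an illustrative assumption, not the paper's exact CSS definition.

```python
import numpy as np

def circuit_shift_score(edge_importance_ref, edge_importance_new):
    """Sketch of a circuit-shift style score: compare a model's
    edge-importance profile (one value per discovered circuit edge) on
    reference data against the profile on incoming data. A large shift
    flags silent degradation under distribution shift."""
    a = edge_importance_ref / np.linalg.norm(edge_importance_ref)
    b = edge_importance_new / np.linalg.norm(edge_importance_new)
    return 1.0 - float(a @ b)    # cosine distance in [0, 2]
```

The score needs no labels at monitoring time, which is what makes circuit-based proxies attractive when ground truth is scarce after deployment.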
Results
The proposed metrics, DDB and CSS, demonstrate a 13.4% and 34.1% improvement in correlation with OOD performance, respectively, compared to existing proxy metrics. Additionally, CSS achieves a ∼45% gain in detection F1 for early detection of silent failures in deployed models.
Implications
The findings suggest that understanding a model's internal workings can provide more reliable indicators of generalization performance, which is crucial for high-stakes applications. The proposed metrics can enhance model selection and monitoring processes, potentially leading to more robust AI systems in real-world scenarios.
Automating aggregation strategy selection in federated learning
Federated Learning
- Introduces an automated framework for selecting aggregation strategies in Federated Learning.
- Utilizes large language models for single-trial strategy inference and genetic search for multi-trial exploration.
- Demonstrates improved robustness and generalization under non-IID conditions.
- Reduces reliance on manual intervention in strategy selection.
Automating aggregation strategy selection in federated learning
Summary
This paper addresses the challenge of selecting appropriate aggregation strategies in Federated Learning (FL), which is crucial for effective model training without centralizing data. The authors propose an end-to-end framework that automates the selection of aggregation strategies based on the statistical heterogeneity of datasets and varying compute constraints. The framework operates in two modes: a single-trial mode utilizing large language models (LLMs) for strategy inference based on data characteristics, and a multi-trial mode employing a lightweight genetic search to explore alternative strategies under resource constraints. The extensive experiments conducted demonstrate that the proposed approach enhances robustness and generalization in non-IID conditions while minimizing the need for manual intervention. This work significantly contributes to making FL more accessible and adaptive by automating a critical design decision.
Methodology
The proposed framework operates in two modes: a single-trial mode that leverages large language models to infer suitable aggregation strategies based on user-provided or automatically detected data characteristics, and a multi-trial mode that employs a lightweight genetic search to efficiently explore alternative strategies within constrained budgets.
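The multi-trial mode can be pictured as a small evolutionary loop over candidate strategies. The sketch below is a minimal illustration under assumed names (`STRATEGIES`, and a placeholder `evaluate` standing in for a short federated training trial), not the paper's implementation:

```python
import random

# Toy sketch of the multi-trial mode: a tiny genetic search over FL aggregation
# strategies plus one hyperparameter. `evaluate` stands in for running a short
# federated trial; the strategy names and fitness surface are illustrative only.

STRATEGIES = ["fedavg", "fedprox", "fedadam", "scaffold"]

def evaluate(genome):
    strategy, mu = genome
    # Placeholder fitness: pretend fedprox with mu near 0.1 suits this non-IID split.
    return -abs(mu - 0.1) - (0.0 if strategy == "fedprox" else 0.5)

def mutate(genome, rng):
    strategy, mu = genome
    if rng.random() < 0.5:
        strategy = rng.choice(STRATEGIES)
    mu = min(1.0, max(0.0, mu + rng.gauss(0, 0.05)))
    return (strategy, mu)

def genetic_search(budget=60, pop_size=6, seed=0):
    rng = random.Random(seed)
    pop = [(rng.choice(STRATEGIES), rng.random()) for _ in range(pop_size)]
    trials = pop_size
    while trials < budget:
        pop.sort(key=evaluate, reverse=True)
        parents = pop[: pop_size // 2]              # keep the fitter half
        children = [mutate(rng.choice(parents), rng) for _ in parents]
        pop = parents + children
        trials += len(children)
    return max(pop, key=evaluate)

best = genetic_search()
assert best[0] in STRATEGIES and 0.0 <= best[1] <= 1.0
```

The trial budget caps how many federated runs are spent, which is how the search stays "lightweight" under compute constraints.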
Results
The experiments show that the automated framework significantly enhances the performance of FL under non-IID conditions, improving robustness and generalization while reducing the need for manual strategy selection. The results indicate that the framework effectively adapts to varying levels of statistical heterogeneity and compute constraints.
Implications
This work has the potential to facilitate the practical deployment of Federated Learning by simplifying the aggregation strategy selection process, making it more accessible for practitioners without deep expertise in FL. It could lead to broader adoption of FL in various applications where data privacy and decentralization are critical.
ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification
Time Series
- Introduction of the ADAPT framework for many-to-one pre-training in time-series classification.
- Achieves state-of-the-art performance on 162 diverse time-series datasets.
- Utilizes average adaptive pooling for mixed-batch training, accommodating varying input dimensions.
- Addresses fundamental challenges in building generalist models for time-series data.
ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification
Summary
The paper introduces a novel pre-training paradigm for time-series classification called ADAPT, which addresses the challenges of generalizing models across diverse datasets. Traditional pre-training methods have struggled in many-to-one scenarios, where models are trained on multiple datasets simultaneously. The authors propose a framework that aligns the physical properties of time-series data, enabling mixed-batch pre-training despite variations in input sizes and channel dimensions. By training on 162 time-series classification datasets, the ADAPT framework achieves state-of-the-art performance, demonstrating its effectiveness in creating generalist foundation models for time-series analysis. The methodology includes the use of average adaptive pooling during data loading to facilitate mixed batch training, ensuring that models can process inputs of varying dimensions and modalities. This approach not only enhances model performance but also paves the way for future advancements in time-series-specific architectures.
Methodology
The authors developed the ADAPT framework, which employs average adaptive pooling to enable mixed-batch training of time-series data. This method allows the model to handle inputs of varying lengths and channel dimensions, facilitating the simultaneous training on multiple datasets.
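Average adaptive pooling is a standard operation (e.g. `torch.nn.AdaptiveAvgPool1d`); a minimal pure-Python version shows how it lets sequences of different lengths share one batch. How exactly ADAPT wires this into its data loader is our assumption:

```python
import math

# Sketch of how average adaptive pooling lets series of different lengths share
# one batch: every series is pooled to a fixed number of steps at load time.
# The binning mirrors torch.nn.AdaptiveAvgPool1d.

def adaptive_avg_pool(series, out_len):
    """Pool a 1-D sequence of arbitrary length down to out_len bins."""
    n = len(series)
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len
        end = math.ceil((i + 1) * n / out_len)
        window = series[start:end]
        pooled.append(sum(window) / len(window))
    return pooled

# A length-6 and a length-3 series both become length-3, so they can be batched.
assert adaptive_avg_pool([1, 1, 2, 2, 3, 3], 3) == [1.0, 2.0, 3.0]
assert adaptive_avg_pool([4, 5, 6], 3) == [4.0, 5.0, 6.0]
```

Because the output length is fixed regardless of the input length, datasets with wildly different sampling rates and durations can be stacked into a single mixed batch.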
Results
The ADAPT framework set new benchmarks in time-series classification, achieving state-of-the-art performance across 162 datasets. The results indicate significant improvements in model generalization and performance compared to traditional pre-training methods.
Implications
The ADAPT framework has the potential to revolutionize time-series analysis by enabling the development of generalist foundation models that can be applied to a wide range of real-world applications across various domains, including medical, financial, and environmental fields.
Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
Computer Vision
Efficient ML
Robotics
- Introduction of CPS-Prompt framework for efficient continual learning on edge devices.
- Utilization of Critical Patch Sampling (CPS) for effective token reduction.
- Implementation of Decoupled Prompt and Classifier Training (DPCT) to reduce backpropagation overhead.
- Demonstrated significant improvements in memory usage, training time, and energy efficiency.
Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
Summary
This paper presents CPS-Prompt, a novel framework for continual learning (CL) on edge devices that emphasizes training-time efficiency while maintaining high accuracy. Traditional prompt-based continual learning methods have focused primarily on accuracy or inference-time performance, often neglecting the memory and computational costs associated with on-device training. CPS-Prompt addresses these challenges by integrating two key components: Critical Patch Sampling (CPS) for task-aware token reduction and Decoupled Prompt and Classifier Training (DPCT) to minimize backpropagation overhead. The framework is designed to operate under the constraints of edge devices, which require efficient memory usage and computational resources. Experimental evaluations on three public benchmarks and real edge hardware demonstrate that CPS-Prompt achieves a 1.6× improvement in peak memory, training time, and energy efficiency compared to the CODA-Prompt baseline, while maintaining accuracy within 2% of the state-of-the-art C-Prompt. This work highlights the importance of optimizing training-time efficiency in continual learning systems deployed on resource-constrained devices.
Methodology
The CPS-Prompt framework employs two main strategies: Critical Patch Sampling (CPS) selects task-specific patches to reduce memory usage during training, while Decoupled Prompt and Classifier Training (DPCT) optimizes the training process by separating the learning of prompts and classifiers. This two-phase approach allows for efficient adaptation to new tasks while minimizing computational overhead.
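The token-reduction side of CPS can be sketched as a simple top-k selection over per-patch scores; the paper's task-aware scoring rule is not reproduced here, so the scores below are stand-ins:

```python
# Illustrative sketch of critical-patch style token reduction: score each image
# patch (here by a stand-in saliency value) and keep only the top fraction for
# training, shrinking the activations that must be stored for backpropagation.

def select_critical_patches(scores, keep_ratio=0.5):
    """Return indices of the highest-scoring patches, preserving spatial order."""
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
kept = select_critical_patches(scores, keep_ratio=0.5)
assert kept == [1, 3, 5]          # half the tokens survive training
```

Halving the token count roughly halves the attention activations held in memory during training, which is where the edge-device savings come from.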
Results
CPS-Prompt shows a 1.6× improvement in peak memory, training time, and energy efficiency over the CODA-Prompt baseline. It maintains accuracy within 2% of the C-Prompt, demonstrating its effectiveness in balancing efficiency and performance.
Implications
The findings suggest that CPS-Prompt can significantly enhance the feasibility of deploying continual learning systems on edge devices, making it suitable for applications in robotics, smart devices, and other resource-constrained environments where efficient on-device training is critical.
Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training
Generative Models
Efficient ML
Computer Vision
- Identifies the mismatch between data complexity and model readiness as a source of inefficiency in diffusion training.
- Introduces a semantic-aware image complexity metric that combines foreground dominance and typicality.
- Demonstrates significant improvements in generation quality metrics (IS and FID) with the proposed curriculum strategy.
- Establishes that the order of image complexity presentation is critical for performance gains.
Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training
Summary
The paper addresses inefficiencies in diffusion training caused by randomly initialized networks encountering a wide range of image complexities. The authors propose 'Data Warmup', a curriculum learning strategy that schedules training images from simple to complex based on a semantic-aware complexity metric. This metric evaluates images based on foreground dominance and typicality, allowing the model to build visual priors gradually. The method employs a temperature-controlled sampler to prioritize low-complexity images initially, transitioning to uniform sampling as training progresses. Experiments on ImageNet 256×256 with SiT backbones demonstrate that Data Warmup significantly improves Inception Score (IS) and Fréchet Inception Distance (FID), achieving baseline quality much faster than traditional methods. The authors also highlight that reversing the curriculum order leads to performance degradation, confirming that the simple-to-complex progression is crucial for the observed gains. The approach requires minimal preprocessing time and can be combined with existing training accelerators, making it a practical enhancement for diffusion model training.
Methodology
The authors developed a semantic-aware complexity metric that evaluates images based on two properties: foreground dominance and foreground typicality. This metric is computed offline using pretrained DINO-v2 features. A temperature-controlled sampler is then used to bias the training process towards simpler images initially, gradually transitioning to a uniform sampling strategy as training progresses.
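A minimal sketch of such a temperature-controlled sampler, assuming weights of the form exp(-complexity/T) with T raised over training (the paper's exact schedule may differ):

```python
import math

# Sketch of a temperature-controlled curriculum sampler: at low temperature,
# probability mass concentrates on low-complexity images; as T grows, the
# distribution approaches uniform sampling. The annealing rule is an assumption.

def sampling_probs(complexities, temperature):
    weights = [math.exp(-c / temperature) for c in complexities]
    total = sum(weights)
    return [w / total for w in weights]

complexities = [0.2, 0.5, 0.9]    # semantic-aware complexity scores

early = sampling_probs(complexities, temperature=0.1)
late = sampling_probs(complexities, temperature=100.0)

# Early: the simplest image dominates. Late: close to uniform sampling.
assert early[0] > 0.9
assert all(abs(p - 1 / 3) < 0.01 for p in late)
```

This single knob reproduces the paper's progression: a simple-to-complex ordering early on, fading smoothly into the standard uniform regime.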
Results
Data Warmup improved the Inception Score (IS) by up to 6.11 and the Fréchet Inception Distance (FID) by up to 3.41 on ImageNet 256×256 across various SiT scales. The method allowed models to reach baseline quality significantly faster, with performance degradation observed when the curriculum was reversed.
Implications
The findings suggest that curriculum learning strategies can be effectively applied to generative models, potentially leading to more efficient training processes in various applications of diffusion models. This approach may also inform future research on optimizing training methodologies in deep learning.
Decisions and Deployment: The Five-Year SAHELI Project (2020-2025) on Restless Multi-Armed Bandits for Improving Maternal and Child Health
Optimization
Reinforcement Learning
Theory
- Restless Multi-Armed Bandits effectively optimize limited public health interventions.
- Decision-focused learning enhances outcomes in predict-then-optimize scenarios.
- Long-term studies showed significant improvements in adherence to mHealth programs and maternal health behaviors.
Decisions and Deployment: The Five-Year SAHELI Project (2020-2025) on Restless Multi-Armed Bandits for Improving Maternal and Child Health
Summary
The SAHELI project, conducted from 2020 to 2025, aimed to improve maternal and child health behaviors through the application of Restless Multi-Armed Bandits (RMAB) in mobile health (mHealth) interventions. The project was a collaboration between AI researchers and ARMMAN, a leading NGO in maternal and child health in India. The RMAB framework was employed to optimize the allocation of limited health resources, specifically live service calls to beneficiaries of the mMitra program, which delivers preventive health information via automated voice messages. The project addressed the challenge of maintaining engagement among beneficiaries, who often faced logistical issues and a lack of understanding regarding the importance of continued participation. Through a two-stage learning approach, the SAHELI system was able to predict listenership patterns and develop a scheduling policy for live calls. The deployment of SAHELI from April 2022 positively impacted over 350,000 mothers, demonstrating significant improvements in program engagement and maternal health behaviors. The findings highlight the potential of AI-driven solutions in optimizing public health interventions, particularly in resource-constrained settings.
Methodology
The methodology involved the development of a Restless Multi-Armed Bandit model tailored for mHealth interventions, focusing on predicting beneficiary engagement and optimizing the scheduling of live service calls. The approach included two-stage learning and decision-focused learning techniques to enhance the effectiveness of resource allocation.
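At deployment time, an RMAB policy reduces to ranking arms by an index and spending the per-round budget on the top-ranked ones. The sketch below uses made-up priority scores as stand-ins for SAHELI's learned Whittle-style indices:

```python
# Simplified sketch of RMAB-style call scheduling: each beneficiary ("arm") has
# a priority score (a stand-in for a learned Whittle index), and the limited
# live-call budget goes to the highest-priority arms each week. The real SAHELI
# indices come from learned engagement-transition models; scores here are toy.

def schedule_calls(priorities, budget):
    """Return the ids of the `budget` arms with the highest priority."""
    ranked = sorted(priorities, key=priorities.get, reverse=True)
    return set(ranked[:budget])

priorities = {"b1": 0.12, "b2": 0.40, "b3": 0.05, "b4": 0.33}
assert schedule_calls(priorities, budget=2) == {"b2", "b4"}
```

The hard part the project solves is estimating those indices from listenership data; the allocation step itself stays this simple, which keeps the system auditable.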
Results
The SAHELI system led to sustained improvements in beneficiary engagement and statistically significant positive shifts in maternal health behaviors among participants. The deployment benefited over 350,000 mothers, demonstrating the effectiveness of AI-augmented outreach in public health.
Implications
The findings suggest that AI-driven optimization can significantly enhance the effectiveness of health interventions in low-resource settings, providing a blueprint for similar applications in global health initiatives. This approach can help allocate scarce resources more effectively, ultimately improving health outcomes for vulnerable populations.
How Does Machine Learning Manage Complexity?
Theory
- Machine learning models can effectively manage complexity through probabilistic outcomes.
- The paper abstracts machine learning to P/poly-computable distributions with polynomially-bounded max-entropy.
- A key theorem shows that learned distributions from cryptographic pseudorandom generators are close to uniform.
- The strength of machine learning models is derived from their ability to generate random guesses rather than specific answers.
How Does Machine Learning Manage Complexity?
Summary
In this paper, Lance Fortnow provides a computational complexity perspective on the capabilities of machine learning models, particularly their effectiveness in modeling complex systems. The author argues that machine learning models, when trained on data from computable distributions, can manage complexity through probabilistic outcomes. By abstracting the learning mechanisms, the paper models machine learning outputs as P/poly-computable distributions with polynomially-bounded max-entropy. The author illustrates that if a model produces a distribution that minimizes error against a distribution from a cryptographic pseudorandom generator, then this distribution must be close to uniform. The paper discusses the implications of using computable distributions for learning complex behaviors and presents a theorem demonstrating that the learned distribution is information-theoretically indistinguishable from the uniform distribution. The findings suggest that the strength of modern machine learning lies in its probabilistic nature and the constraints imposed by computable distributions, which allow for effective representation of complex behaviors.
Methodology
The author employs a theoretical framework that combines concepts from computational complexity, information theory, and cryptography. The paper abstracts machine learning mechanisms to model outputs as P/poly-computable distributions and analyzes the properties of these distributions, particularly focusing on Kullback-Leibler divergence and max-entropy.
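The central identity behind "close to uniform" can be checked numerically: the KL divergence of a distribution from uniform over n outcomes equals log n minus its entropy, so minimizing that divergence is the same as maximizing entropy:

```python
import math

# Worked check of the quantity the paper reasons about: the KL divergence of a
# learned distribution p from the uniform distribution U over n outcomes
# satisfies KL(p || U) = log(n) - H(p), so "close to uniform" means entropy
# close to maximal.

def kl_from_uniform(p):
    n = len(p)
    return sum(pi * math.log(pi * n) for pi in p if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]

assert abs(kl_from_uniform(uniform)) < 1e-12
# Identity: KL(p || U) = log n - H(p)
assert abs(kl_from_uniform(skewed) - (math.log(4) - entropy(skewed))) < 1e-12
```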
Results
The main result demonstrates that if a machine learning model produces a distribution that minimizes the Kullback-Leibler divergence from a distribution generated by a cryptographic pseudorandom generator, then this distribution is information-theoretically indistinguishable from a uniform distribution. This indicates that the learned distribution effectively captures the complexity of the original distribution.
Implications
The findings have implications for understanding the theoretical foundations of machine learning, particularly in how models can handle complex data distributions. This perspective can inform the design of more robust machine learning algorithms that leverage probabilistic reasoning and complexity management.
Production-Ready Automated ECU Calibration using Residual Reinforcement Learning
Reinforcement Learning
Optimization
Interpretability
- Introduces a residual reinforcement learning approach for automated ECU calibration.
- Demonstrates the methodology using a map-based air path controller in a HiL environment.
- Achieves faster calibration with minimal human intervention compared to traditional methods.
- Ensures explainability and safety in the calibration process.
Production-Ready Automated ECU Calibration using Residual Reinforcement Learning
Summary
This paper addresses the challenges of calibrating Electronic Control Units (ECUs) in modern vehicles, where traditional manual calibration methods are becoming impractical due to increasing complexity, tighter emission regulations, and shorter development cycles. The authors propose a novel approach using residual reinforcement learning (RL) to automate the calibration process while ensuring explainability and adherence to established automotive development principles. The methodology is demonstrated through a map-based air path controller on a hardware-in-the-loop (HiL) platform. The proposed system begins with a sub-optimal calibration map and efficiently converges to a solution that closely matches the reference calibration of a series ECU. The results indicate that this automated calibration process significantly reduces the time required for calibration and minimizes the need for human intervention, making it suitable for industrial applications. The paper outlines the methodology, the design of the automated calibration pipeline, and validates the performance of the approach under real conditions, highlighting its potential to improve initial calibrations while balancing emissions, fuel consumption, and performance demands.
Methodology
The methodology employs residual reinforcement learning to automate the calibration of ECUs. It involves creating a training agent that interacts with the environment to optimize control strategies while generating its own training data. The approach integrates established automotive development principles and focuses on deriving look-up tables from trained RL agents to ensure safety and explainability.
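The residual construction can be sketched in a few lines: the shipped artifact remains a plain look-up table, obtained by evaluating the engineer's base map plus the learned correction at each grid point. The grid, base values, and stand-in residual below are illustrative, not the paper's controller:

```python
# Sketch of the residual idea: the deployed calibration stays an explainable
# look-up table, equal to the hand-made base map plus corrections read out from
# the trained agent at each operating point. Values are illustrative.

def export_lookup_table(base_map, residual_policy, grid):
    """Bake base map + learned residual into a plain look-up table."""
    return {op: base_map[op] + residual_policy(op) for op in grid}

# Hypothetical base air-path map over (rpm, load) operating points.
base_map = {(1000, 0.2): 10.0, (2000, 0.5): 14.0, (3000, 0.8): 18.0}
residual = lambda op: 0.001 * op[0] - 1.0      # stand-in for the trained agent

table = export_lookup_table(base_map, residual, base_map)
assert table[(1000, 0.2)] == 10.0              # 10.0 + (1.0 - 1.0)
assert table[(3000, 0.8)] == 20.0              # 18.0 + (3.0 - 1.0)
```

Because the exported table has the same format as a manually calibrated one, it can be inspected, bounded, and flashed to a series ECU without carrying the RL agent into production.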
Results
The proposed methodology successfully converges to a calibration that closely resembles the reference calibration of a series ECU, demonstrating improved calibration quality in significantly less time. The automated process requires virtually no human intervention, making it a viable solution for industry applications.
Implications
This work has significant implications for the automotive industry, particularly in enhancing the efficiency of ECU calibration processes. By automating calibration with RL, manufacturers can meet stricter emission standards and customer expectations while reducing development time and costs.
Weighted Bayesian Conformal Prediction
Theory
- WBCP generalizes BQ-CP to importance-weighted settings, addressing distribution shifts.
- Theoretical results confirm the calibration consistency and improved coverage guarantees.
- Geographical BQ-CP offers spatial diagnostics, enhancing interpretability in spatial predictions.
- WBCP maintains coverage guarantees while providing richer uncertainty information.
Weighted Bayesian Conformal Prediction
Summary
This paper introduces Weighted Bayesian Conformal Prediction (WBCP), a novel framework that extends Bayesian Quadrature Conformal Prediction (BQ-CP) to handle distribution shifts through importance weighting. While BQ-CP provides data-conditional guarantees under the assumption of independent and identically distributed (i.i.d.) data, it fails to address scenarios where the calibration and test distributions differ. WBCP addresses this limitation by replacing the uniform Dirichlet distribution used in BQ-CP with a weighted Dirichlet distribution that incorporates effective sample sizes and importance weights. The authors prove several theoretical results, including the calibration consistency of the effective sample size, the decay rate of posterior standard deviation, and the extension of BQ-CP's stochastic dominance guarantee to weighted settings. Additionally, they instantiate WBCP for spatial prediction, termed Geographical BQ-CP, which provides interpretable diagnostics and maintains coverage guarantees while enriching uncertainty quantification. Experiments demonstrate that WBCP outperforms existing methods in providing richer uncertainty information in both synthetic and real-world datasets.
Methodology
The authors replace the uniform Dirichlet distribution in BQ-CP with a weighted Dirichlet distribution that incorporates effective sample sizes and importance weights. They derive theoretical results regarding calibration consistency, posterior concentration, and conditional coverage, and validate the approach through experiments on synthetic and real-world datasets.
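One standard way to derive an effective sample size from importance weights is Kish's formula, n_eff = (Σw)² / Σw²; whether WBCP uses exactly this form is an assumption, but it illustrates how a shifted calibration set effectively "shrinks":

```python
# Sketch of the importance-weighting ingredient: Kish's effective sample size,
# a common choice for the concentration parameter of a weighted Dirichlet.
# n_eff = (sum w_i)^2 / sum w_i^2; equal weights recover the plain count.

def effective_sample_size(weights):
    s = sum(weights)
    return s * s / sum(w * w for w in weights)

assert effective_sample_size([1.0] * 10) == 10.0        # i.i.d.: no shift
# A few dominant importance weights shrink the effective calibration set.
assert effective_sample_size([5.0, 5.0, 0.1, 0.1]) < 4.0
```

A smaller n_eff widens the posterior over coverage, which is exactly the extra caution a distribution shift should induce; this matches the paper's O(1/√n_eff) decay of the posterior standard deviation.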
Results
The paper presents four main theoretical results: (1) effective sample size as a unique concentration parameter, (2) posterior standard deviation decay rate of O(1/√n_eff), (3) extension of stochastic dominance guarantees to weighted settings, and (4) improved conditional coverage bounds. Experimental results show that WBCP maintains coverage guarantees while providing richer uncertainty information compared to existing methods.
Implications
WBCP has significant implications for applications requiring robust uncertainty quantification in the presence of distribution shifts, such as spatial prediction, domain adaptation, and high-stakes decision-making scenarios.
CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics
Generative Models
Graph Learning
Reinforcement Learning
- CausalVAE is proposed as a plug-in module for latent world models to enhance counterfactual dynamics.
- The integration of a structured causal disentanglement module allows for the identification of causal relationships among latent variables.
- A staged training strategy is introduced to stabilize sequential training and improve model interpretability.
- Significant performance gains were observed in counterfactual retrieval tasks, especially in physics-related benchmarks.
CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics
Summary
This paper introduces CausalVAE as a structural module that enhances latent world models by integrating causal representation learning to improve counterfactual dynamics. The authors argue that traditional world models, while effective in factual predictions, often fail to maintain robustness under distribution shifts and interventions due to entangled latent representations. By incorporating CausalVAE, the model can learn a directed acyclic graph (DAG) structure among latent variables, thereby enabling more reliable counterfactual reasoning. The proposed methodology includes a staged training strategy that first focuses on predictive dynamics before activating structural regularization. This approach allows for a more interpretable causal state space and enhances the model's performance in counterfactual tasks. The results demonstrate significant improvements in counterfactual retrieval, particularly on the Physics benchmark, where the model achieved a 102.5% increase in CF-H@1 metric compared to baseline models. The findings suggest that the integration of causal structures into world models can lead to better generalization and robustness in dynamic environments.
Methodology
The authors developed a causal structural module that integrates with existing world models, utilizing a CausalVAE causal layer to enforce a DAG structure among latent variables. They employed a staged training strategy that first learns predictive dynamics and then progressively incorporates structural regularization, anchored by alignment-only weak supervision to stabilize the training process.
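DAG constraints of this kind are typically enforced with a NOTEARS-style penalty h(A) = tr(exp(A∘A)) − d, which vanishes exactly when the weighted adjacency A is acyclic; a small sketch, assuming CausalVAE's constraint is of this family:

```python
import math

# Sketch of a NOTEARS-style acyclicity regularizer, the usual way DAG structure
# is enforced over latent variables: h(A) = tr(exp(A o A)) - d is zero iff the
# weighted adjacency A encodes a DAG. Whether CausalVAE uses exactly this form
# is an assumption; the penalty family is standard.

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def acyclicity_penalty(A):
    n = len(A)
    B = [[a * a for a in row] for row in A]          # Hadamard square
    term = [[float(i == j) for j in range(n)] for i in range(n)]  # B^0 = I
    trace_exp = float(n)                              # tr(B^0) = n
    for k in range(1, n + 1):                         # series is exact for DAGs
        term = matmul(term, B)
        trace_exp += sum(term[i][i] for i in range(n)) / math.factorial(k)
    return trace_exp - n

dag = [[0, 0.8], [0, 0]]          # z1 -> z2 only
cycle = [[0, 0.8], [0.5, 0]]      # z1 <-> z2

assert acyclicity_penalty(dag) == 0.0
assert acyclicity_penalty(cycle) > 0.0
```

Adding this penalty to the training loss lets ordinary gradient descent push the latent adjacency toward a DAG, which is what makes the "activate structural regularization later" staging possible.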
Results
The integration of CausalVAE led to substantial improvements in counterfactual retrieval metrics, with an average increase of 102.5% in CF-H@1 across various benchmarks. In a specific GNN-NLL setting, the CF-H@1 metric rose from 11.0 to 41.0, marking a 272.7% improvement. These results indicate enhanced robustness and interpretability in the learned causal structures.
Implications
The findings suggest that incorporating causal structures into world models can significantly enhance their ability to generalize and perform under varying conditions. This has potential applications in areas requiring robust decision-making and predictive modeling, such as robotics, autonomous systems, and complex dynamic environments.
Preference Redirection via Attention Concentration: An Attack on Computer Use Agents
Computer Vision
Multimodal
- Introduction of PRAC, a novel attack on CUAs that manipulates attention via adversarial image patches.
- Demonstration of PRAC's effectiveness in redirecting CUA preferences in online shopping scenarios.
- Highlighting the vulnerability of vision modalities in CUAs, which has been less explored compared to language modalities.
- Validation of the attack in realistic deployment settings, showing high success rates.
Preference Redirection via Attention Concentration: An Attack on Computer Use Agents
Summary
This paper introduces PRAC, a novel attack targeting Computer Use Agents (CUAs) that utilize Large Vision Language Models (LVLMs) for autonomous interactions in graphical user interfaces (GUIs). While previous research has focused on vulnerabilities in the language modality, this work highlights the overlooked risks associated with the vision modality. PRAC manipulates the internal preferences of a CUA by redirecting its attention to a stealthy adversarial patch on a product image in an online shopping context. The authors demonstrate that this attack can effectively influence the CUA's selection process, leading it to recommend a specific product manipulated by an adversary. The attack requires white-box access to the model but generalizes to fine-tuned versions, posing a significant threat as many companies develop CUAs based on open-weight models. The study emphasizes the need for enhanced security measures to protect against such subtle manipulations that could exploit benign user actions.
Methodology
The authors developed PRAC by optimizing a stealthy perturbation on a product image to concentrate the CUA's attention on the adversarial patch. They conducted experiments in a mock online shopping environment to validate the effectiveness of the attack, measuring its success in altering the CUA's product selection.
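The objective can be illustrated with plain softmax attention: because attention weights are a softmax over patch scores, pushing one patch's score up concentrates nearly all the mass on it. The numbers below are toy values, not outputs of an LVLM:

```python
import math

# Toy illustration of the attack's objective: attention over image patches is a
# softmax of query-key scores, so a perturbation that raises the adversarial
# patch's score drives the attention mass toward it. The real attack optimizes
# pixel perturbations through the model; these scores are illustrative.

def attention_weights(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

clean_scores = [1.0, 1.2, 0.9, 1.1]           # attention spread over patches
attacked = clean_scores[:]
attacked[3] += 4.0                             # perturbation boosts patch 3's score

before = attention_weights(clean_scores)
after = attention_weights(attacked)

assert max(before) < 0.5                       # no patch dominates before
assert after[3] > 0.9                          # concentrated on the adversarial patch
```

The softmax's exponential sensitivity is why a small, low-distortion score shift is enough to dominate the agent's downstream preference.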
Results
The results indicate that PRAC successfully redirected the CUA's attention towards the manipulated product image, leading to a significant change in the selection preference of the agent. The attack maintained a low perceptual distortion, ensuring that the manipulated image appeared benign to human users.
Implications
The findings suggest that CUAs are vulnerable to subtle adversarial manipulations, which could lead to harmful outcomes for users. This highlights the necessity for developing robust defenses against such attacks, particularly in applications involving automated decision-making in trusted environments.
Dead Weights, Live Signals: Feedforward Graphs of Frozen Language Models
NLP
Large Language Models
Graph Learning
- Introduces a feedforward graph architecture leveraging frozen LLMs for enhanced performance.
- Achieves strong benchmark results, outperforming single constituent models and parameter-matched classifiers.
- Demonstrates effective gradient flow through frozen model boundaries, enabling end-to-end training.
- Emergent selective routing behavior observed in the output node without explicit supervision.
Dead Weights, Live Signals: Feedforward Graphs of Frozen Language Models
Summary
This paper introduces a novel feedforward graph architecture that utilizes heterogeneous frozen large language models (LLMs) as computational nodes. These models communicate through a shared continuous latent space via learned linear projections. Building on previous findings regarding the geometric compatibility of independently trained LLM latent spaces, the authors extend this concept to create an end-to-end trainable multi-node graph. The architecture consists of three smaller frozen models that encode input into a shared latent space, which is then injected into two larger frozen models, culminating in a lightweight cross-attention output node. The system is designed to optimize projection matrices through backpropagation while keeping the majority of model parameters frozen. The proposed architecture demonstrates significant performance improvements on various benchmarks, showcasing the potential of aggregating knowledge from multiple models without requiring extensive retraining.
Methodology
The authors developed a feedforward graph architecture where multiple frozen LLMs serve as nodes. They utilized learned linear projections to facilitate communication between these nodes in a shared latent space. The architecture was trained end-to-end using backpropagation, optimizing only the projection matrices and a lightweight cross-attention output node while keeping the majority of model parameters frozen.
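The wiring can be sketched structurally: the frozen models act as fixed functions, and only the projection matrices into the shared latent space (plus the output head) would receive gradients. The toy "encoders", dimensions, and additive aggregation below are assumptions for illustration, not the paper's architecture:

```python
# Structural sketch of the graph: frozen models are fixed functions of the
# input, and learned linear projections map their outputs into one shared
# latent space. Only the projection matrices would be trained; the toy
# "encoders" below stand in for frozen LLMs of different widths.

def frozen_encoder_a(text):           # stands in for a small frozen LLM (dim 3)
    return [float(len(text)), float(text.count(" ")), 1.0]

def frozen_encoder_b(text):           # a second frozen LLM (dim 2)
    return [float(len(set(text))), 2.0]

def project(vec, W):                  # learned linear map into the shared space
    return [sum(w * v for w, v in zip(row, vec)) for row in W]

SHARED_DIM = 4
W_a = [[0.1] * 3 for _ in range(SHARED_DIM)]   # trainable projection
W_b = [[0.2] * 2 for _ in range(SHARED_DIM)]   # trainable projection

def shared_latent(text):
    za = project(frozen_encoder_a(text), W_a)
    zb = project(frozen_encoder_b(text), W_b)
    return [a + b for a, b in zip(za, zb)]      # simple aggregation, an assumption

z = shared_latent("dead weights live signals")
assert len(z) == SHARED_DIM                     # both encoders meet in one space
```

Because gradients only flow into `W_a` and `W_b`, training cost scales with the tiny projection layers rather than with the frozen models' billions of parameters.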
Results
The proposed architecture achieved 87.3% accuracy on the ARC-Challenge, 82.8% on OpenBookQA, and 67.2% on MMLU. These results surpassed the best single constituent model by 11.4, 6.2, and 1.2 percentage points, respectively, and outperformed parameter-matched learned classifiers by 9.1, 5.2, and 6.7 points.
Implications
This work suggests a new approach to leveraging existing frozen LLMs for improved task performance, highlighting the potential for efficient model composition and knowledge aggregation. The findings could influence future research on multi-agent systems and ensemble methods in NLP, as well as applications in specialized domains where smaller models excel.
GAN-based Domain Adaptation for Image-aware Layout Generation in Advertising Poster Design
Generative Models
Computer Vision
- Introduction of the CGL-Dataset for training image-aware layout generation models.
- Development of two GAN-based models: CGL-GAN and PDA-GAN, with the latter using pixel-level domain adaptation.
- Proposal of three novel content-aware metrics for evaluating layout generation quality.
- PDA-GAN outperforms CGL-GAN, achieving significant improvements across multiple evaluation metrics.
GAN-based Domain Adaptation for Image-aware Layout Generation in Advertising Poster Design
Summary
This paper presents a novel approach to generating image-aware graphic layouts for advertising posters using Generative Adversarial Networks (GANs). The authors introduce the Content-aware Graphic Layout Dataset (CGL-Dataset), which consists of 60,548 paired inpainted posters and 121,000 clean product images. The challenge lies in the domain gap created by inpainting artifacts, which can hinder the quality of layout generation. To address this, the authors propose two GAN-based models: CGL-GAN, which utilizes Gaussian blur to reduce the domain gap, and PDA-GAN, which incorporates unsupervised domain adaptation with a pixel-level discriminator to enhance layout generation based on the visual texture of input images. The paper also introduces three novel content-aware metrics to evaluate the models' performance in capturing the relationships between graphic elements and image content. Experimental results demonstrate that PDA-GAN achieves state-of-the-art performance, significantly improving the quality of generated layouts compared to CGL-GAN.
Methodology
The authors designed two GAN-based models to generate advertising poster layouts. CGL-GAN applies Gaussian blur to inpainted regions to reduce the domain gap, while PDA-GAN employs a pixel-level discriminator for unsupervised domain adaptation, allowing for fine-grained feature space alignment. The models were trained on the CGL-Dataset, and three new content-aware metrics were introduced for evaluation.
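The simpler of the two fixes, blurring inpainted regions so their texture stops separating the two domains, can be illustrated with a one-dimensional box blur standing in for the Gaussian blur:

```python
# Sketch of the domain-gap reduction in CGL-GAN: blurring the inpainted region
# so its artificial texture no longer distinguishes training posters from clean
# product images. A 1-D box blur stands in for the 2-D Gaussian blur here.

def box_blur(row, radius=1):
    n = len(row)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(row[lo:hi]) / (hi - lo))
    return out

# A sharp inpainting artifact (a spike) is smoothed toward its neighborhood.
row = [0.0, 0.0, 9.0, 0.0, 0.0]
blurred = box_blur(row)
assert blurred[2] == 3.0            # (0 + 9 + 0) / 3
assert blurred[1] == 3.0            # (0 + 0 + 9) / 3
```

PDA-GAN replaces this hand-crafted smoothing with a pixel-level discriminator that learns the alignment instead, which is why it outperforms the blur-based variant.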
Results
PDA-GAN achieved state-of-the-art performance, outperforming CGL-GAN with relative improvements of 6.21% in background complexity, 17.5% in subject occlusion degree, 14.5% in product occlusion degree, and 19.07% in the content-aware Fréchet Inception Distance (cFID) metric, leading to enhanced visual quality in generated layouts.
Implications
The findings suggest that GAN-based models, particularly with domain adaptation techniques, can significantly improve the quality of graphic layout generation in advertising design. This has potential applications in automated graphic design tools, enhancing the efficiency and creativity of poster creation.
Learning to Query History: Nonstationary Classification via Learned Retrieval
Time Series
- Introduces a learned retrieval mechanism for nonstationary classification.
- Reframes nonstationary classification as a time series prediction problem.
- Demonstrates improved robustness to distribution shifts compared to standard classifiers.
- Allows for the use of large historical data corpora without requiring them to fit in memory.
Summary
This paper addresses the challenge of nonstationarity in classification tasks, where models often struggle to generalize to new data distributions after training. The authors propose a novel approach that reframes nonstationary classification as a time series prediction problem. Instead of relying solely on the current input, the classifier is conditioned on a sequence of historical labeled examples that extends beyond the training cutoff. To efficiently manage large sequences of historical data, the authors introduce a learned discrete retrieval mechanism that samples relevant historical examples based on input-dependent queries. This mechanism is trained end-to-end with the classifier using a score-based gradient estimator, allowing the full corpus of historical data to remain on an arbitrary filesystem during training and deployment. The proposed method demonstrates improved robustness to distribution shifts in experiments conducted on synthetic benchmarks and the Amazon Reviews ’23 dataset, particularly in the electronics category. The results indicate that the system can effectively retrieve and utilize relevant historical context, enhancing classification performance without the need for retraining.
Methodology
The authors develop a system that uses a query generator to create input-dependent queries, which sample relevant historical examples from a large corpus. The historical data is associated with low-dimensional keys, and the system employs a hard attention-like retrieval mechanism to select relevant examples. The classifier is then trained to predict labels based on both the current input and the retrieved historical examples, optimizing the model's performance through an end-to-end training process.
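The hard-retrieval step can be sketched in a few lines. The key dimensionality, scoring rule, and corpus size below are illustrative assumptions; in the paper the query generator and keys are learned end-to-end with the classifier.

```python
import numpy as np

def retrieve(query, keys, rng):
    # Hard attention over the corpus: sample one example's index from a
    # softmax over query-key scores instead of averaging over all of them.
    scores = keys @ query
    scores -= scores.max()                      # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    idx = rng.choice(len(keys), p=probs)
    # Log-probability of the sampled index: the quantity a score-function
    # (REINFORCE-style) estimator would scale by the downstream loss.
    return idx, np.log(probs[idx])

rng = np.random.default_rng(0)
keys = rng.standard_normal((100, 8))    # low-dimensional keys, one per example
query = rng.standard_normal(8)          # would come from a learned query generator
idx, logp = retrieve(query, keys, rng)
```

Because only an index is sampled, the full corpus of historical examples can stay on disk, with just the selected items loaded per step.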
Results
Experiments show that the proposed method significantly enhances robustness to distribution shifts when compared to traditional classifiers. The system scales effectively with the length of historical data sequences, maintaining performance even when the full corpus exceeds available memory. The joint training approach successfully learns to retrieve and exploit relevant historical context, leading to improved classification accuracy.
Implications
This work has potential applications in various fields where nonstationary data is prevalent, such as fraud detection, policy violation detection, and compliance monitoring. The ability to leverage historical data without retraining models can lead to more adaptive and resilient classification systems in dynamic environments.
Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency
Computer Vision
Efficient ML
Multimodal
- Introduction of Kuramoto oscillatory Phase Encoding (KoPE) to Vision Transformers.
- KoPE enhances learning efficiency through synchronization of phase and rate representations.
- Demonstrated improvements in training, parameter, and data efficiency across multiple vision tasks.
- Facilitates attention learning and structural understanding in neural networks.
Summary
This paper introduces Kuramoto oscillatory Phase Encoding (KoPE), a novel approach that integrates an evolving phase state into Vision Transformers to enhance learning efficiency through neuro-inspired synchronization mechanisms. Unlike traditional deep learning architectures that primarily utilize activation values, KoPE incorporates phase representations that evolve according to Kuramoto dynamics, allowing for a joint representation of rate and phase. This method aims to address the limitations of current neural networks in efficiently learning structured representations from data. The authors demonstrate that KoPE significantly improves training, parameter, and data efficiency across various vision tasks, including semantic and panoptic segmentation, vision-language representation alignment, and few-shot abstract visual reasoning. Theoretical analyses and empirical results suggest that KoPE facilitates attention concentration, thereby enhancing the learning process. Overall, the study proposes a scalable, neuro-inspired mechanism that bridges the gap between biological neural dynamics and modern deep learning architectures.
Methodology
The authors developed KoPE by incorporating phase representations for each token in the Vision Transformer model, which evolve according to Kuramoto dynamics. This involves data-dependent coupling derived from token representations and integrating phases into the attention module through complex-form rotations, promoting synchronization dynamics that encourage structured learning.
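The underlying Kuramoto update is compact enough to show directly. The oscillator count, natural frequencies, and constant coupling below are toy choices; KoPE instead derives data-dependent coupling from token representations and folds the phases into attention.

```python
import numpy as np

def kuramoto_step(theta, omega, K, dt=0.01):
    # dtheta_i/dt = omega_i + (K / N) * sum_j sin(theta_j - theta_i)
    diff = theta[None, :] - theta[:, None]      # element (i, j) = theta_j - theta_i
    return theta + dt * (omega + K * np.sin(diff).mean(axis=1))

def order_parameter(theta):
    # r in [0, 1]; r -> 1 as the phases synchronize.
    return np.abs(np.exp(1j * theta).mean())

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 64)       # one phase per oscillator/token
omega = rng.normal(0.0, 0.1, 64)                # natural frequencies
r_before = order_parameter(theta)
for _ in range(2000):
    theta = kuramoto_step(theta, omega, K=2.0)
r_after = order_parameter(theta)
```

With coupling well above the critical value, the order parameter climbs from near zero toward one, which is the synchronization behavior the paper exploits for structured learning.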
Results
KoPE was shown to improve training efficiency, parameter efficiency, and data efficiency in various tasks. It outperformed traditional methods in semantic and panoptic segmentation, vision-language representation alignment, and few-shot abstract visual reasoning tasks. Theoretical and empirical analyses indicated that KoPE enhances attention concentration, leading to better learning outcomes.
Implications
The findings suggest that incorporating synchronization-based dynamics into neural architectures can significantly enhance learning efficiency and structured understanding in deep learning models. This approach may lead to advancements in various applications requiring efficient learning from limited data, such as computer vision and multimodal tasks.
Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning
Reinforcement Learning
Generative Models
Robotics
- Introduction of VGM2P, a flow-based policy learning framework for offline MARL.
- Utilization of global advantage values to guide agent collaboration.
- Implementation of classifier-free guidance MeanFlow for efficient action generation.
- Demonstrated comparable performance to advanced methods using only conditional behavior cloning.
Summary
The paper addresses the challenges of offline multi-agent reinforcement learning (MARL), particularly the trade-off between maximizing global returns and mitigating distribution shifts from offline data. The authors propose a novel framework called Value Guidance Multi-agent MeanFlow Policy (VGM2P), which enhances action generation efficiency through coefficient-insensitive conditional behavior cloning. VGM2P utilizes global advantage values to facilitate agent collaboration and employs classifier-free guidance MeanFlow to improve policy expressiveness and inference efficiency. The framework is designed to overcome the limitations of existing methods that rely on complex sampling processes, which can hinder training and inference efficiency. The authors demonstrate that VGM2P achieves performance comparable to state-of-the-art methods across various tasks with both discrete and continuous action spaces, even when trained solely via conditional behavior cloning.
Methodology
The proposed VGM2P framework treats optimal policy learning as conditional behavior cloning, integrating global advantage values into the training process. It employs classifier-free guidance MeanFlow to enhance the expressiveness of policies and improve action generation efficiency. The framework allows for one-step sampling based on preset conditions during decentralized execution, facilitating better exploration of learned policies.
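The classifier-free guidance combination at the heart of such samplers is a one-liner, sketched below with toy vectors; the simplified one-step rule stands in for MeanFlow's average-velocity sampling and is not the paper's exact parameterization.

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, w):
    # Classifier-free guidance: blend conditional and unconditional
    # velocity predictions; w > 1 amplifies the conditioning signal
    # (here, the global advantage value the policy is conditioned on).
    return v_uncond + w * (v_cond - v_uncond)

def one_step_sample(x0, v, t0=0.0, t1=1.0):
    # MeanFlow-style one-step sampling: move along the average velocity.
    return x0 + (t1 - t0) * v

v_cond = np.array([1.0, 2.0])      # toy velocity given the condition
v_uncond = np.array([0.5, 1.0])    # toy unconditional velocity
x0 = np.zeros(2)
action = one_step_sample(x0, cfg_velocity(v_cond, v_uncond, w=2.0))
```

At w = 1 the guided velocity reduces to the conditional prediction; larger w pushes generated actions further toward the conditioned behavior, which is what makes one-step decentralized execution cheap.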
Results
Experimental evaluations on various offline MARL benchmarks indicate that VGM2P achieves performance levels similar to existing advanced algorithms, demonstrating its effectiveness in both discrete and continuous action environments. The results highlight the framework's ability to efficiently generate actions while maintaining policy expressiveness.
Implications
The VGM2P framework has significant implications for real-world applications in multi-agent systems, such as multi-player strategy games, multi-robot control, and traffic management. Its efficiency and effectiveness in offline settings could lead to broader adoption of MARL techniques in scenarios where real-time interaction is costly or risky.
PolicyLong: Towards On-Policy Context Extension
NLP
Large Language Models
- PolicyLong proposes an on-policy framework for long-context training, addressing the off-policy gap in traditional methods.
- The iterative self-curriculum allows the model to continuously adapt its training data based on its evolving capabilities.
- Both positive contexts and hard negatives are derived from the current model's entropy landscape, enhancing learning efficiency.
- Experiments show significant performance improvements over baseline methods, especially with longer context lengths.
Summary
The paper introduces PolicyLong, a novel framework aimed at enhancing the training of large language models (LLMs) by addressing the limitations of static long-context data construction. Traditional methods synthesize long-context data using a fixed base model, leading to an off-policy gap as the model evolves during training. PolicyLong shifts this paradigm to an on-policy approach, where data construction is iteratively updated based on the current model's capabilities. This iterative process involves re-executing data screening steps—entropy computation, retrieval, and verification—at each training stage, allowing the model to adaptively refine its training distribution. The framework not only selects positive contexts but also generates hard negatives that align with the model's learning trajectory, creating an implicit self-curriculum. Experimental results demonstrate that PolicyLong significantly outperforms existing methods, particularly as context lengths increase, confirming the advantages of on-policy data evolution.
Methodology
The methodology involves a dynamic on-policy data construction process where the current model iteratively re-evaluates and updates the training data at multiple stages. This includes entropy computation to identify high-uncertainty positions, retrieval of semantically similar candidates, and verification of long-range dependencies. The framework also incorporates hard negative context selection that evolves with the model's learning, ensuring that the training distribution remains aligned with the model's capabilities.
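The entropy-screening signal is straightforward to sketch. The vocabulary size and logits below are toy values; in PolicyLong the logits would come from the current training checkpoint, which is exactly why the screening must be re-run at each stage.

```python
import numpy as np

def token_entropy(logits):
    # Per-position predictive entropy (in nats) from next-token logits.
    # High-entropy positions mark where the model is uncertain, the
    # screening signal recomputed with the current model at each stage.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

# Toy: position 0 is confident (peaked logits), position 1 is uncertain.
logits = np.array([[10.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0]])
H = token_entropy(logits)
```

Positions whose entropy stays high under the current model are the ones worth pairing with retrieved positive contexts and hard negatives.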
Results
Experiments conducted on datasets such as RULER, HELMET, and LongBench-v2 demonstrate that PolicyLong consistently outperforms baseline methods like EntropyLong and NExtLong. Notably, the performance gains increase with longer context lengths, with an observed improvement of +2.54 at 128K tokens on the RULER dataset, indicating the effectiveness of the on-policy approach.
Implications
The findings suggest that on-policy data construction can significantly enhance the training of large language models, potentially leading to better performance in tasks requiring long-context understanding. This approach may be applicable in various NLP tasks where context length is critical, paving the way for more effective training methodologies in the field.
Improving Semantic Uncertainty Quantification in Language Model Question-Answering via Token-Level Temperature Scaling
NLP
Large Language Models
- Systematic evaluation of semantic calibration and discrimination reveals limitations of existing methods.
- Optimized token-level temperature scaling significantly improves semantic UQ compared to fixed-temperature heuristics.
- The proposed method enhances both semantic calibration and discrimination in question-answering tasks.
- A principled approach to response selection based on semantic confidence distributions yields better results.
Summary
This paper addresses the critical issue of calibration in semantic uncertainty quantification (UQ) for language models (LMs), particularly in question-answering tasks. The authors highlight that previous research has primarily focused on discrimination without adequately considering calibration, leading to an incomplete understanding of uncertainty. They systematically evaluate both calibration and discrimination across various confidence measures, revealing that existing methods, especially fixed-temperature heuristics, result in poorly calibrated and weakly discriminative semantic confidence distributions. The authors propose a novel approach using optimized token-level temperature scaling, which they argue serves as a simple yet effective solution for improving semantic UQ. Their extensive evaluations demonstrate that this method consistently enhances semantic calibration, discrimination, and downstream entropy, outperforming both heuristic baselines and more complex recalibration techniques. The findings emphasize the importance of a principled approach to selecting final responses based on semantic confidence distributions, rather than relying on ad-hoc procedures. Overall, this work provides a comprehensive framework for improving the reliability of LMs in semantic tasks.
Methodology
The authors conducted a systematic evaluation of various semantic confidence measures to assess both calibration and discrimination. They introduced optimized token-level temperature scaling as a method for recalibrating semantic confidence distributions and compared its performance against fixed-temperature heuristics and other complex calibration methods. The evaluation included multiple language models and question-answering datasets to ensure robustness.
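The core idea of optimized temperature scaling can be shown with a small numpy sketch; the grid search and the synthetic "overconfident model" below are illustrative stand-ins for the paper's actual fitting procedure and data.

```python
import numpy as np

def nll(logits, labels, T):
    # Mean negative log-likelihood of labels under temperature-scaled logits.
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels, grid=np.linspace(0.1, 5.0, 200)):
    # Pick the temperature minimizing held-out NLL; a simple grid search
    # stands in for whatever optimizer is used in practice.
    return min(grid, key=lambda T: nll(logits, labels, T))

# Toy overconfident model: its logits are 3x sharper than the distribution
# the labels were drawn from, so the fitted temperature should land near 3.
rng = np.random.default_rng(0)
true_logits = rng.standard_normal((500, 10))
p = np.exp(true_logits) / np.exp(true_logits).sum(axis=1, keepdims=True)
labels = np.array([rng.choice(10, p=row) for row in p])
T_star = fit_temperature(3.0 * true_logits, labels)
```

The fitted temperature undoes the artificial sharpening, which is the calibration effect the paper argues propagates into better semantic confidence distributions; fixed-temperature heuristics skip this fitting step entirely.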
Results
The results indicated that optimized temperature scaling consistently improved semantic calibration and discrimination, leading to better performance in question-answering tasks. The method outperformed both heuristic baselines and more complex post-hoc calibration techniques, demonstrating its effectiveness in enhancing semantic uncertainty quantification.
Implications
The findings suggest that adopting optimized temperature scaling can significantly enhance the reliability of language models in semantic tasks, making them more trustworthy for applications in natural language generation and understanding. This work opens avenues for further research into calibration techniques in LMs, potentially leading to improved performance in various NLP applications.
Cluster Attention for Graph Machine Learning
Graph Learning
- Introduction of Cluster Attention (CLATT) as a new attention mechanism for graph learning.
- CLATT allows nodes to attend to other nodes within their clusters, enhancing receptive fields.
- Augmenting MPNNs and Graph Transformers with CLATT improves performance on diverse graph datasets.
- The method retains strong graph-structure-based inductive biases, crucial for GML tasks.
Summary
This paper introduces Cluster Attention (CLATT), a novel attention mechanism designed for Graph Machine Learning (GML) that addresses the limitations of existing models, particularly Message Passing Neural Networks (MPNNs) and Graph Transformers. MPNNs are effective but limited in their receptive field, as they only allow information exchange between neighboring nodes. On the other hand, Graph Transformers utilize global attention but lack the inductive biases provided by graph structures. CLATT enhances the receptive field by allowing nodes to attend to all other nodes within their respective clusters, identified using community detection algorithms. This approach retains the graph-structure-based inductive biases while enabling longer-range interactions among nodes. The authors demonstrate that augmenting both MPNNs and Graph Transformers with CLATT significantly improves performance across various real-world graph datasets, including those from the GraphLand benchmark. The paper provides a comprehensive analysis of the implementation of CLATT, its integration with existing models, and its effectiveness in capturing graph topology.
Methodology
The authors propose the Cluster Attention mechanism, which involves partitioning graph nodes into clusters using community detection algorithms. Each node can attend to all other nodes within its cluster, facilitating information exchange while preserving the graph's structural relationships. The implementation details include integrating CLATT with existing MPNNs and Graph Transformers to enhance their performance.
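The mechanism can be sketched as masked attention. This minimal single-head version uses identity projections (Q = K = V = X) for brevity, and the hard-coded cluster labels stand in for the output of a community detection algorithm such as Louvain.

```python
import numpy as np

def cluster_attention(X, clusters):
    # Single-head attention where node i attends only to nodes in its
    # own cluster; a real layer would use learned Q/K/V projections.
    scores = X @ X.T / np.sqrt(X.shape[1])
    same = clusters[:, None] == clusters[None, :]
    scores = np.where(same, scores, -np.inf)     # mask cross-cluster pairs
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ X, w

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))                  # 6 nodes, 4-dim features
clusters = np.array([0, 0, 0, 1, 1, 1])          # e.g. from a community detection run
out, weights = cluster_attention(X, clusters)
```

The attention weight matrix is block-diagonal by construction: longer-range than one-hop message passing, but still shaped by graph structure rather than fully global.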
Results
The experimental results indicate that models augmented with CLATT outperform standard MPNNs and Graph Transformers on a wide range of graph datasets, showcasing significant improvements in predictive accuracy and robustness in real-world applications.
Implications
The introduction of CLATT has potential implications for various applications in graph machine learning, including social network analysis, molecular structure prediction, and any domain where understanding complex relationships in graph-structured data is crucial. The method could lead to more effective models that leverage both local and global information in graphs.
Equivariant Efficient Joint Discrete and Continuous MeanFlow for Molecular Graph Generation
Generative Models
Graph Learning
- Introduction of Equivariant MeanFlow (EQUIMF) for joint modeling of discrete and continuous graph components.
- Development of a new discrete MeanFlow model that enables efficient few-step sampling.
- Implementation of synchronized MeanFlow dynamics with mutual conditioning for improved generation quality.
- EQUIMF shows superior performance in molecular generation benchmarks compared to existing methods.
Summary
This paper addresses the challenges in generative modeling of graph-structured data, particularly in molecular graph generation, which involves both discrete topology and continuous geometry. Existing methods often decouple these components, leading to inefficiencies and physically inconsistent outputs. The authors propose Equivariant MeanFlow (EQUIMF), a novel generative framework that integrates discrete and continuous components through synchronized MeanFlow dynamics. This framework utilizes a unified time bridge and mutual conditioning to enhance the generation process, allowing for efficient few-step sampling while maintaining physical consistency. The proposed discrete MeanFlow model features a simple parameterization that supports effective generation over discrete structures. The authors conduct extensive experiments, demonstrating that EQUIMF outperforms state-of-the-art diffusion and flow-matching methods in terms of generation quality, physical validity, and sampling efficiency, achieving nearly twice the speed of existing approaches.
Methodology
The authors propose a unified SE(3)-equivariant generative framework that models discrete graph structures and continuous 3D geometries through synchronized MeanFlow dynamics. This involves a new discrete MeanFlow formulation and a mutual conditioning mechanism that aligns the generation of structural and geometric representations. The framework is optimized using a joint loss function to ensure coherent generation.
Results
EQUIMF consistently outperforms prior methods in generation quality, physical validity, and sampling efficiency, achieving nearly twice the speed of state-of-the-art flow-matching and diffusion models in molecular generation tasks.
Implications
The proposed framework has significant implications for applications in chemistry, biology, and material science, where accurate and efficient molecular graph generation is crucial. It can enhance the design of new molecules and materials by providing reliable generative models.
KV Cache Offloading for Context-Intensive Tasks
NLP
Large Language Models
Efficient ML
- Introduction of the Text2JSON benchmark for evaluating context-intensive tasks.
- Significant performance degradation observed in existing KV-cache offloading methods.
- Identification of low-rank projection and unreliable landmarks as key issues affecting accuracy.
- Proposal of a simpler alternative strategy that improves performance across multiple LLMs.
Summary
This paper addresses the challenges posed by key-value (KV) cache in large language models (LLMs) when handling long-context inputs, particularly in context-intensive tasks that require extensive information retrieval. The authors introduce the Text2JSON benchmark, designed to evaluate the performance of KV-cache offloading techniques in scenarios where a significant amount of contextual information is necessary for accurate task completion. Through their evaluations on Text2JSON and other context-intensive tasks, they observe notable performance degradation in existing KV offloading methods, particularly with the Llama 3 and Qwen 3 models. The authors identify two primary reasons for this degradation: low-rank projection of keys and unreliable landmarks. They propose a simpler alternative strategy that enhances accuracy across various LLM families and benchmarks. The findings underscore the importance of rigorous evaluation of long-context compression techniques, especially in real-world applications that demand high contextual awareness.
Methodology
The authors systematically evaluate KV-cache offloading techniques across a range of benchmarks, including the newly introduced Text2JSON. They analyze the performance of modern LLMs (Llama 3 and Qwen 3) on context-intensive tasks and identify failure modes in existing offloading methods. The study includes a comparative analysis of accuracy and performance metrics to assess the effectiveness of their proposed alternative strategy.
Results
The evaluation reveals that existing KV-cache offloading techniques suffer from significant performance drops on context-intensive tasks. The proposed alternative strategy demonstrates improved accuracy across multiple LLM families and benchmarks, addressing the identified issues related to low-rank projections and unreliable landmarks.
Implications
The findings suggest that current KV-cache offloading methods may not be suitable for all types of tasks, particularly those requiring extensive context retrieval. This research emphasizes the need for improved techniques in handling long-context inputs in LLMs, which could enhance their applicability in real-world scenarios such as document translation, legal analysis, and complex problem-solving.
Tracking Adaptation Time: Metrics for Temporal Distribution Shift
Theory
Time Series
- Existing metrics fail to distinguish between adaptation lag and intrinsic data difficulty.
- Three new metrics are proposed to evaluate model adaptation under temporal distribution shifts.
- The study reveals that performance degradation may be misinterpreted as poor adaptation.
- Results indicate that the in-distribution vs. out-of-distribution (ID-OOD) accuracy gap often reflects adaptation lag rather than a lack of generalization.
Summary
This paper addresses the challenge of evaluating machine learning models under temporal distribution shifts, where data distributions evolve over time. Traditional metrics focus on average performance decline but do not adequately capture how models adapt to changing data. The authors propose three complementary metrics that differentiate between adaptation lag and intrinsic data difficulty, providing a more nuanced understanding of model behavior in dynamic environments. By applying these metrics to datasets from the Wild-Time benchmark, the authors demonstrate that performance degradation often reflects a temporal lag in adaptation rather than a complete failure to generalize. This insight encourages a reevaluation of how models are assessed under temporal shifts and highlights the importance of understanding the interplay between model adaptation and data complexity.
Methodology
The authors developed three complementary metrics designed to isolate the dynamics of model adaptation from the intrinsic difficulty of evolving data. They applied these metrics to datasets from the Wild-Time benchmark to analyze model performance over time, focusing on how adaptation patterns manifest in response to temporal distribution shifts.
Results
The application of the proposed metrics revealed that the observed ID-OOD accuracy gap is frequently due to adaptation lag rather than an inherent inability of models to generalize. This finding suggests that existing evaluations may misrepresent model performance under temporal shifts, emphasizing the need for metrics that capture temporal adaptation dynamics.
Implications
The proposed metrics can enhance the evaluation of machine learning models in real-world applications where data distributions change over time. By providing a clearer understanding of model adaptation, these metrics can inform the design of more robust adaptive learning algorithms, ultimately improving model performance in dynamic environments.
Multi-Turn Reasoning LLMs for Task Offloading in Mobile Edge Computing
Large Language Models
Reinforcement Learning
Optimization
- COMLLM integrates GRPO and LACS for effective task offloading in MEC.
- The framework captures long-term impacts of decisions on future system states.
- Achieves near-optimal latency and improved load-balancing fairness.
- Exhibits zero-shot scalability, generalizing to larger topologies without retraining.
Summary
This paper addresses the challenges of task offloading in Mobile Edge Computing (MEC) systems, particularly in the context of computation-intensive applications that demand low latency. Traditional methods, including heuristics and Deep Reinforcement Learning (DRL), struggle with dynamic environments due to their limited adaptability and generalization capabilities. The authors propose COMLLM, a novel generative framework that integrates Group Relative Policy Optimization (GRPO) with a Look-Ahead Collaborative Simulation (LACS) mechanism. This approach allows for multi-step Monte Carlo rollouts that model server queue dynamics and incorporate long-term impacts of decisions into the reward structure. The experimental results indicate that COMLLM achieves near-optimal latency and improved load-balancing fairness, demonstrating zero-shot scalability across different network topologies without requiring retraining. This positions COMLLM as a significant advancement over existing methods, including standard Supervised Fine-Tuning (SFT) and DRL, by effectively addressing the long-term dependencies and dynamic nature of MEC environments.
Methodology
The authors developed COMLLM, which combines GRPO with LACS to perform multi-step Monte Carlo rollouts. This allows the model to account for server queue dynamics and optimize decision-making by considering the long-term effects of current actions on future system states.
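A toy caricature of the look-ahead idea is sketched below: estimate each server's task-completion latency by Monte Carlo rollouts of its queue, then offload to the best one. The arrival/service model, rates, and rollout count are illustrative assumptions; in COMLLM the rollouts feed rewards to a GRPO-trained LLM policy rather than a greedy argmin.

```python
import numpy as np

def rollout_latency(backlog, task_size, mean_rate, rng):
    # One Monte Carlo rollout: serve the work queued ahead of our task
    # (plus the task itself) at a stochastic per-step rate, and return
    # the number of steps until our task completes.
    remaining = backlog + task_size
    steps = 0
    while remaining > 0:
        remaining -= rng.exponential(mean_rate)
        steps += 1
    return steps

def choose_server(backlogs, rates, task_size, n_rollouts, rng):
    # Offload to the server with the lowest average simulated latency.
    est = [np.mean([rollout_latency(b, task_size, r, rng)
                    for _ in range(n_rollouts)])
           for b, r in zip(backlogs, rates)]
    return int(np.argmin(est)), est

rng = np.random.default_rng(0)
backlogs = [50.0, 5.0, 30.0]   # work already queued at each edge server
rates = [10.0, 2.0, 10.0]      # mean work units served per step
best, est = choose_server(backlogs, rates, task_size=10.0, n_rollouts=200, rng=rng)
```

The short queue at a slow server loses to a longer queue at a fast server, which is exactly the kind of long-horizon effect a purely greedy, current-state heuristic misses.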
Results
Experimental evaluations show that COMLLM significantly outperforms existing methods, achieving near-optimal latency and enhanced load-balancing fairness. The framework's zero-shot scalability allows it to adapt to unseen network topologies without the need for retraining.
Implications
The proposed framework has the potential to improve task offloading strategies in MEC systems, making it suitable for various computation-intensive applications such as augmented reality and real-time video analysis. Its ability to generalize across different network configurations could lead to more efficient resource utilization in mobile networks.
SOLAR: Communication-Efficient Model Adaptation via Subspace-Oriented Latent Adapter Reparametrization
Efficient ML
Federated Learning
Large Language Models
- SOLAR reduces communication costs of PEFT methods by reparameterizing updates as linear combinations of basis vectors.
- The framework is model-agnostic and compatible with existing PEFT techniques, allowing for flexible integration.
- The method achieves up to 98% reduction in adapter sizes while preserving task performance.
- A theoretical analysis provides bounds on reconstruction error, ensuring reliability in performance.
Summary
The paper introduces SOLAR, a novel framework designed to enhance the efficiency of Parameter-Efficient Fine-Tuning (PEFT) methods by significantly reducing the communication and storage costs associated with model adaptation. SOLAR achieves this by reparameterizing PEFT updates as linear combinations of basis vectors derived from the foundation model's singular vectors, incorporating controlled random perturbations. This approach leverages the subspace similarity between the foundation model and task-specific updates, allowing for a compact representation of adapters that is model-agnostic and compatible with existing PEFT methods like LoRA and AdaLoRA. The authors provide a theoretical analysis bounding the reconstruction error, ensuring that the performance of the adapted models remains intact. Extensive experiments demonstrate that SOLAR can reduce adapter sizes by up to 98% while maintaining competitive accuracy across various language and vision tasks, making it a promising solution for deployment in resource-constrained environments such as edge devices and federated learning systems.
Methodology
SOLAR employs a three-step framework for post-hoc adapter compression: 1) constructing a basis pool from the foundation model's singular vectors with random perturbations, 2) selecting significant basis vectors based on a budget, and 3) reconstructing the adapter using only the selected coefficients and a random seed. This method does not require retraining and operates on already fine-tuned adapters.
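The three steps above can be sketched in numpy. Matrix sizes, the perturbation scale, and the rank-1 outer-product basis construction are illustrative assumptions, not the paper's exact parameterization; the point is that only a few coefficients, their indices, and a seed need to be transmitted.

```python
import numpy as np

def build_basis(W_base, seed, scale=0.01):
    # Step 1: basis pool from the base model's singular vectors, with
    # seed-controlled random perturbations (both sides can regenerate it).
    rng = np.random.default_rng(seed)
    U, _, Vt = np.linalg.svd(W_base, full_matrices=False)
    return np.stack([
        np.outer(U[:, i] + scale * rng.standard_normal(U.shape[0]),
                 Vt[i] + scale * rng.standard_normal(Vt.shape[1]))
        for i in range(U.shape[1])
    ])

def compress_adapter(delta_W, W_base, budget, seed=0):
    # Step 2: project the adapter onto the pool and keep only the
    # `budget` largest coefficients.
    basis = build_basis(W_base, seed)
    coeffs = np.tensordot(basis, delta_W, axes=([1, 2], [0, 1]))
    keep = np.argsort(-np.abs(coeffs))[:budget]
    return keep, coeffs[keep]       # all that needs transmitting (plus the seed)

def reconstruct_adapter(keep, coeffs, W_base, seed=0):
    # Step 3: the receiver rebuilds the same basis from the shared seed.
    basis = build_basis(W_base, seed)
    return np.tensordot(coeffs, basis[keep], axes=(0, 0))

rng = np.random.default_rng(1)
W_base = rng.standard_normal((32, 16))
U, _, Vt = np.linalg.svd(W_base, full_matrices=False)
# A fine-tuning update lying (almost) in the base model's subspace.
delta_W = 2.0 * np.outer(U[:, 0], Vt[0]) + 0.5 * np.outer(U[:, 1], Vt[1])
keep, coeffs = compress_adapter(delta_W, W_base, budget=2)
delta_hat = reconstruct_adapter(keep, coeffs, W_base)
rel_err = np.linalg.norm(delta_hat - delta_W) / np.linalg.norm(delta_W)
```

Because the update lies close to the base model's subspace, two coefficients recover it with small error, which is the subspace-similarity assumption SOLAR's compression rests on.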
Results
The experiments conducted on language and vision tasks using models such as LLaMA, GPT-2, and ViT show that SOLAR can reduce the size of adapters by up to 98% while maintaining competitive accuracy, demonstrating its effectiveness in reducing communication and storage overhead in distributed systems.
Implications
SOLAR's approach to reducing the size of model adaptations has significant implications for deploying large models in resource-constrained environments, such as edge devices and federated learning scenarios, where communication and storage efficiency are critical.
On the Price of Privacy for Language Identification and Generation
NLP
Large Language Models
Theory
- Approximate differential privacy incurs no statistical cost for language identification and generation tasks.
- Under pure differential privacy, error rates degrade by a factor of min{1,ε}.
- Language generation exhibits a tighter privacy-utility tradeoff compared to language identification.
- The study provides a complete characterization of the price of privacy for language learning tasks.
Read more
On the Price of Privacy for Language Identification and Generation
Summary
This paper investigates the cost of privacy in language identification and generation tasks within the framework of differential privacy (DP). The authors establish algorithms and matching lower bounds for both tasks in an agnostic statistical setting, revealing that the price of privacy is surprisingly mild: under approximate DP, privacy incurs no statistical penalty, while under pure DP, the error rates degrade by a controlled factor of min{1,ε}. The authors also compare the privacy-utility tradeoff between identification and generation, finding that generation benefits from a tighter tradeoff due to its structural advantages. The results provide a comprehensive characterization of the price of privacy in language learning, addressing a significant gap in the theoretical understanding of differentially private algorithms for these tasks.
Methodology
The authors utilize an agnostic statistical language learning model and develop algorithms for language identification and generation that incorporate differential privacy. They replace non-private algorithms with smooth score functions and thresholded prefix counts to manage sensitivity and ensure privacy. The algorithms are privatized using the exponential mechanism for pure DP and the Gaussian mechanism for approximate DP.
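The two privatization primitives named above can be sketched on a toy count query. The score values, sensitivity, and privacy parameters below are placeholder assumptions; only the mechanisms themselves are standard.

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, eps, delta, rng):
    # Approximate (eps, delta)-DP: classic calibration
    # sigma = sensitivity * sqrt(2 ln(1.25/delta)) / eps (valid for eps <= 1).
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return value + rng.normal(0.0, sigma)

def exponential_mechanism(scores, eps, sensitivity, rng):
    # Pure eps-DP selection: P(i) proportional to exp(eps * score_i / (2 * sensitivity)).
    logits = eps * np.asarray(scores, dtype=float) / (2 * sensitivity)
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(scores), p=probs)

rng = np.random.default_rng(0)
counts = np.array([120, 95, 7, 3])   # hypothetical prefix counts
noisy = gaussian_mechanism(counts[0], sensitivity=1, eps=0.5, delta=1e-5, rng=rng)
pick = exponential_mechanism(counts, eps=0.5, sensitivity=1, rng=rng)
print(noisy, pick)
```

The exponential mechanism concentrates on high-score candidates while remaining pure-DP, which is why thresholded prefix counts with bounded sensitivity make privatized selection feasible.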
Results
The paper establishes that under approximate (ε,δ)-DP, the error rates for both language identification and generation match non-private rates. Under pure ε-DP, the error rates degrade by a factor of min{1,ε}, with generation achieving optimal rates that match upper and lower bounds up to constants. The findings indicate that the cost of privacy is minimal, particularly for generation tasks.
Implications
The results suggest that it is feasible to train large language models on sensitive data without incurring significant performance penalties, thereby encouraging the adoption of differentially private methods in practical applications. This work also lays the groundwork for future research on privacy-preserving algorithms in language learning.
Bias-Constrained Diffusion Schedules for PDE Emulations: Reconstruction Error Minimization and Efficient Unrolled Training
Generative Models
Optimization
Time Series
- Introduction of the Reconstruction Exposure-Bias concept linking training and inference errors.
- Development of an Adaptive Noise Schedule to optimize reconstruction error and exposure bias.
- Proposal of a fast Proxy Unrolled Training method to enhance stability and reduce computational costs.
- Demonstrated improvements in accuracy and stability over existing diffusion and deterministic models.
Read more
Bias-Constrained Diffusion Schedules for PDE Emulations: Reconstruction Error Minimization and Efficient Unrolled Training
Summary
This paper addresses the limitations of Conditional Diffusion Models in emulating complex spatiotemporal dynamics, particularly in terms of reconstruction accuracy and computational efficiency. The authors identify the phenomenon of Diffusion Exposure-Bias, which arises from discrepancies between training and inference processes, leading to suboptimal reconstruction errors. They propose an Adaptive Noise Schedule framework that minimizes reconstruction error by dynamically constraining exposure bias during inference. Additionally, a Proxy Unrolled Training method is introduced to stabilize long-term rollouts without the computational burden of full Markov Chain sampling. The proposed methods demonstrate significant improvements in both short-term accuracy and long-term stability across various benchmarks, including fluid dynamics tasks such as forced Navier-Stokes and Kuramoto-Sivashinsky equations.
Methodology
The authors characterize the relationship between noise schedules and reconstruction error reduction, proposing an Adaptive Scheduling algorithm that minimizes inference reconstruction error while maintaining model stability. They also develop a Proxy Unrolled Training method that leverages the optimized noise schedule to enhance training efficiency and accuracy.
Results
The proposed methods lead to significant reductions in reconstruction error and improved stability, achieving improvements of multiple orders of magnitude in Fréchet Spectral Distance on Kolmogorov turbulent flow compared to baseline models.
Implications
The findings suggest that the proposed Adaptive Noise Schedule and Proxy Unrolled Training can enhance the performance of diffusion models in high-precision tasks, making them more competitive with deterministic approaches in fluid dynamics and potentially other spatiotemporal forecasting applications.
QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training–Inference Mismatch
Reinforcement Learning
Large Language Models
Efficient ML
- QaRL minimizes the training-inference mismatch in quantized rollouts for LLMs.
- Introduces TBPO to stabilize training by controlling updates within a trust region.
- Demonstrates significant performance improvements over traditional quantized rollout training.
- Achieves a 1.3x training speedup while maintaining high accuracy.
Read more
QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training–Inference Mismatch
Summary
The paper presents QaRL, a novel approach to reinforcement learning (RL) that addresses the inefficiencies and instabilities caused by the training-inference mismatch in large language models (LLMs). Traditional RL pipelines for LLMs are hindered by slow rollout generation, and applying quantization to accelerate decoding introduces a significant gap between the low precision used during rollouts and the full precision used for learning updates, destabilizing the optimization process. QaRL aligns the training-side forward pass with the quantized rollout to minimize this mismatch. Additionally, the authors identify a critical failure mode in quantized rollouts where long-form responses generate repetitive and garbled tokens, termed 'error tokens'. To combat this, they introduce Trust-Band Policy Optimization (TBPO), a sequence-level objective that employs dual clipping for negative samples, ensuring that updates remain within a trust region. Extensive experiments demonstrate that QaRL not only improves stability but also enhances performance, achieving an average math score of 51.2 on the Qwen3-30B-A3B model, closely matching the full-precision BF16 training score of 52.1 while providing a 1.3x speedup in training time.
Methodology
The authors propose a quantization-aware RL pipeline that aligns the training process with quantized rollouts. They implement TBPO, which utilizes dual clipping to manage negative samples and maintain updates within a defined trust region. The methodology involves extensive experimentation on various benchmarks, focusing on math problem-solving capabilities of the Qwen3-30B-A3B model.
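The dual-clipping idea can be illustrated with a hypothetical scalar sketch. The exact TBPO objective is the paper's; the clip thresholds and the sequence-level ratio below are invented for illustration, following the general dual-clip PPO pattern.

```python
import numpy as np

def tbpo_like_loss(logp_new, logp_old, advantage, clip=0.2, dual_clip=3.0):
    # Sequence-level importance ratio between current and rollout policies.
    ratio = np.exp(logp_new - logp_old)
    # Standard PPO-style clipped surrogate.
    surr = np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - clip, 1 + clip) * advantage)
    if advantage < 0:
        # Dual clipping for negative samples: bound the penalty so a huge
        # ratio cannot push the update outside the trust band.
        surr = np.maximum(surr, dual_clip * advantage)
    return -surr  # loss to minimize

loss_neg = tbpo_like_loss(logp_new=np.log(5.0), logp_old=0.0, advantage=-1.0)
loss_pos = tbpo_like_loss(logp_new=0.0, logp_old=0.0, advantage=1.0)
print(loss_neg, loss_pos)
```

With a negative advantage and a ratio of 5, the one-sided clip alone would pass a penalty of 5 through; the dual clip caps it at 3, keeping the update bounded.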
Results
QaRL outperforms traditional quantized rollout training, achieving an average math score of 51.2 compared to 45.7 for quantized rollouts and 52.1 for full precision BF16 training. The approach also provides a 1.3x speedup in training time, demonstrating both improved stability and efficiency.
Implications
The findings suggest that QaRL can significantly enhance the training efficiency and stability of LLMs in RL settings, making it a valuable approach for applications requiring rapid and reliable model updates, particularly in complex reasoning tasks.
Reinforcement Learning with LLM-Guided Action Spaces for Synthesizable Lead Optimization
Reinforcement Learning
Large Language Models
Optimization
- MOLREACT integrates reinforcement learning with LLMs to optimize lead compounds in drug discovery.
- The framework ensures synthesizability by using validated reaction templates for molecular modifications.
- A caching mechanism significantly reduces computational costs during the optimization process.
- MOLREACT outperforms existing methods in property optimization tasks while maintaining sample efficiency.
Read more
Reinforcement Learning with LLM-Guided Action Spaces for Synthesizable Lead Optimization
Summary
This paper presents MOLREACT, a novel framework for lead optimization in drug discovery that integrates reinforcement learning (RL) with large language models (LLMs) to ensure synthesizability of proposed molecular modifications. Traditional methods often fail to balance property optimization with synthetic feasibility, leading to chemically invalid structures. MOLREACT formulates lead optimization as a Markov Decision Process (MDP) over a synthesis-constrained action space defined by validated reaction templates. A tool-augmented LLM agent identifies reactive sites and proposes a targeted set of transformations, while a dedicated policy model, trained via Group Relative Policy Optimization (GRPO), selects actions to maximize long-term rewards. To enhance efficiency, a SMILES-based caching mechanism reduces the computational cost of LLM calls. The framework was evaluated on 13 property optimization tasks and one structure-based docking task, achieving an average Top-10 score of 0.563, outperforming the best synthesizable baseline by 10.4% and demonstrating superior sample efficiency on 10 of 14 tasks. The results indicate that both tool-augmented reaction proposals and trajectory-level policy optimization significantly contribute to the framework's success, producing molecules that improve desired properties while ensuring feasible synthetic pathways.
Methodology
MOLREACT formulates lead optimization as a Markov Decision Process (MDP) with a synthesis-constrained action space. It employs a tool-augmented LLM to dynamically propose feasible reactions based on validated templates and uses a dedicated policy model trained via Group Relative Policy Optimization (GRPO) to select actions that maximize long-term rewards. A SMILES-based caching mechanism is implemented to reduce the computational burden of LLM calls during exploration.
Results
MOLREACT achieved an average Top-10 score of 0.563 across 13 property optimization tasks and one structure-based docking task, outperforming the strongest synthesizable baseline by 10.4% in relative improvement. It demonstrated the best sample efficiency on 10 out of 14 tasks, confirming the effectiveness of its synthesis-aware approach.
Implications
The findings suggest that integrating LLMs with reinforcement learning can significantly enhance the efficiency and effectiveness of lead optimization in drug discovery. This approach could lead to more viable drug candidates by ensuring that molecular modifications are both property-enhancing and synthetically feasible, potentially accelerating the drug development process.
SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization
Large Language Models
Optimization
Efficient ML
- SAGE optimizes memory usage by replacing the second moment state of AdamW with a dimension-wise adaptive scale.
- The optimizer effectively addresses the unique challenges posed by high-variance gradients in embedding layers.
- SAGE achieves state-of-the-art perplexity results while significantly reducing memory requirements compared to existing methods.
Read more
SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization
Summary
The paper introduces SAGE (Sign Adaptive GradiEnt), a novel optimizer designed to address the memory bottleneck associated with the AdamW optimizer in the training of Large Language Models (LLMs). While AdamW is widely used for its stability, it requires memory states equivalent to twice the model size, which limits batch sizes and model scaling. Existing light-state optimizers, such as SinkGD, struggle with the high-variance gradients typical of embedding layers, necessitating a hybrid approach that reverts to AdamW, thus negating memory efficiency gains. SAGE overcomes this limitation by employing a single O(V d) moment state and a new O(d) dimension-wise adaptive scale that stabilizes high-variance gradients without the need for a second moment state. This design allows SAGE to achieve superior convergence and significantly reduce memory usage. The authors demonstrate that SAGE-based hybrids outperform existing optimizers in terms of test perplexity and memory footprint across models with up to 1.3 billion parameters.
Methodology
SAGE combines a Lion-style update direction with a new memory-efficient adaptive scale that tracks the mean absolute gradient. This design allows it to maintain stability in the presence of high-variance gradients typical of embedding layers, while only requiring a single moment state.
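One plausible reading of this update, as a hedged numpy sketch: a Lion-style sign direction from a single momentum state, plus a per-dimension O(d) scale tracking the mean absolute gradient. How exactly the scale enters the step is an assumption here (we damp dimensions whose gradients run hot relative to the average); the paper's precise rule may differ.

```python
import numpy as np

def sage_like_step(param, grad, m, s, lr=1e-2, beta=0.9, beta_s=0.99, eps=1e-8):
    # Single O(V*d) momentum state, Lion-style sign direction.
    m = beta * m + (1 - beta) * grad
    # O(d) dimension-wise scale tracking the mean absolute gradient.
    s = beta_s * s + (1 - beta_s) * np.abs(grad).mean(axis=0)
    # Assumption: damp dimensions whose gradients are large relative to average.
    scale = s.mean() / (s + eps)
    return param - lr * np.sign(m) * scale, m, s

# Toy V x d "embedding matrix" driven toward zero by a noisy gradient.
rng = np.random.default_rng(1)
V, d = 32, 8
param = rng.standard_normal((V, d))
m, s = np.zeros((V, d)), np.zeros(d)
for _ in range(200):
    grad = param + 0.1 * rng.standard_normal((V, d))   # noisy pull to zero
    param, m, s = sage_like_step(param, grad, m, s)
print(np.abs(param).mean())
```

The key memory contrast with AdamW is visible in the state shapes: `m` matches the parameter, but `s` is only a length-d vector rather than a second full-size moment.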
Results
SAGE-based hybrids demonstrated improved test perplexity over AdamW, Lion, and the SinkGD+Adam hybrid across models up to 1.3 billion parameters, while also achieving a significant reduction in optimizer state memory.
Implications
The development of SAGE could lead to more efficient training of large language models, enabling researchers and practitioners to scale models further without being constrained by memory limitations. This could facilitate advancements in NLP applications and model architectures.
The Impact of Dimensionality on the Stability of Node Embeddings
Graph Learning
- Dimensionality significantly affects the stability of node embeddings.
- Different embedding methods exhibit varied stability patterns with increasing dimensions.
- Higher dimensionality does not guarantee optimal performance in downstream tasks.
- The study highlights the trade-offs between stability, performance, and computational efficiency in graph representation learning.
Read more
The Impact of Dimensionality on the Stability of Node Embeddings
Summary
This paper investigates the influence of embedding dimensionality on the stability and performance of node embeddings generated by various methods. Previous studies have shown that node embeddings can yield different results even with identical parameters due to randomness in training. However, the effect of hyperparameters, particularly embedding dimension, on this instability has not been thoroughly analyzed. The authors systematically evaluate five popular node embedding techniques—ASNE, DGI, GraphSAGE, node2vec, and VERSE—across multiple datasets and varying dimensions. They assess stability from both representational and functional perspectives, alongside performance metrics for downstream tasks. The findings reveal that embedding stability is significantly affected by dimensionality, with different methods exhibiting varied stability patterns. Notably, while some methods like node2vec and ASNE show increased stability with higher dimensions, others do not follow this trend. Additionally, the study finds that maximum stability does not always correlate with optimal task performance, emphasizing the need for careful selection of embedding dimensions. The authors provide their experimental code to promote reproducibility.
Methodology
The authors conducted a systematic evaluation of five widely used node embedding methods (ASNE, DGI, GraphSAGE, node2vec, and VERSE) across multiple datasets and varying embedding dimensions. They assessed stability from both representational and functional perspectives, measuring the performance of downstream classifiers based on the generated embeddings.
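Representational stability across runs can be probed with a rotation-invariant proxy such as k-nearest-neighbour overlap; this generic sketch is an illustration of the kind of measurement involved, not the paper's exact metric.

```python
import numpy as np

def knn_overlap(emb_a, emb_b, k=5):
    # Fraction of shared k-nearest neighbours per node, averaged over nodes.
    def knn(emb):
        d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        return np.argsort(d, axis=1)[:, :k]
    na, nb = knn(emb_a), knn(emb_b)
    return np.mean([len(set(na[i]) & set(nb[i])) / k for i in range(len(emb_a))])

rng = np.random.default_rng(0)
base = rng.standard_normal((100, 16))
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
run_a = base                                              # "run 1"
run_b = base @ q + 0.01 * rng.standard_normal((100, 16))  # rotated + jittered "run 2"
ov = knn_overlap(run_a, run_b)                            # near 1: stable geometry
ov_indep = knn_overlap(run_a, rng.standard_normal((100, 16)))  # near chance
print(ov, ov_indep)
```

Because node embeddings are only defined up to rotation, neighbourhood-based measures like this separate genuine instability from harmless coordinate changes between runs.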
Results
The results indicate that the stability of node embeddings varies significantly with dimensionality, with methods like node2vec and ASNE becoming more stable at higher dimensions, while others do not show the same trend. Furthermore, the study found that maximum stability does not necessarily align with the best performance on downstream tasks.
Implications
These findings suggest that researchers and practitioners should carefully consider the choice of embedding dimensions in graph representation learning to balance stability and performance. The insights provided can inform the design of more robust node embedding techniques and enhance the effectiveness of applications in social networks, recommendation systems, and other graph-based tasks.
Approximation of the Basset force in the Maxey-Riley-Gatignol equations via universal differential equations
Theory
Optimization
Time Series
- Introduces a neural network-based approximation for the Basset force in MaRGE.
- Transforms complex integro-differential equations into solvable ordinary differential equations.
- Compares FNN and LSTM architectures for capturing historical effects in particle motion.
- Demonstrates the feasibility of using standard numerical solvers for the modified equations.
Read more
Approximation of the Basset force in the Maxey-Riley-Gatignol equations via universal differential equations
Summary
This paper addresses the challenges associated with the Basset force in the Maxey-Riley-Gatignol equations (MaRGE), which model the motion of spherical inertial particles in a fluid. The Basset force, an integral term representing historical effects, complicates the numerical solution of MaRGE, leading to its frequent neglect despite its significant impact on particle dynamics. The authors propose a novel approximation of the Basset force using universal differential equations (UDEs) and neural networks, transforming the integro-differential equations into a system of ordinary differential equations (ODEs) that can be solved using standard numerical methods like Runge-Kutta. The paper compares the performance of a feedforward neural network (FNN) and a long short-term memory (LSTM) network to effectively capture the memory effects inherent in the Basset force. The methodology involves generating training data from the full MaRGE equations using a numerical solver, followed by testing the proposed approximation in various flow fields. The results demonstrate that the neural network-based approximation significantly simplifies the computational complexity while maintaining accuracy in modeling particle trajectories.
Methodology
The authors utilize universal differential equations to approximate the Basset force, replacing the integral term with a neural network. They generate training data using a numerical solver for the full MaRGE equations and evaluate the performance of both feedforward neural networks and LSTM networks in capturing the historical effects of the Basset force.
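The UDE construction can be sketched with an untrained toy network standing in for the learned memory term. The dynamics, shapes, and weights below are placeholders; in the paper, the network is trained on trajectories from the full MaRGE equations.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Tiny untrained MLP standing in for the learned term that replaces the
# Basset history integral.
rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.standard_normal((8, 2)), np.zeros(8)
W2, b2 = 0.1 * rng.standard_normal((1, 8)), np.zeros(1)

def nn(z):
    return (W2 @ np.tanh(W1 @ z + b1) + b2)[0]

def rhs(t, y):
    x, v = y
    drag = -v                             # toy drag toward rest
    basset_approx = nn(np.array([x, v]))  # memory term as a plain state map
    return [v, drag + basset_approx]

# With the integral term replaced, a standard Runge-Kutta solver applies.
sol = solve_ivp(rhs, (0.0, 5.0), [1.0, 0.0], method="RK45", rtol=1e-6)
print(sol.y[:, -1])
```

The point of the sketch is structural: once the history integral becomes a function of the current state, the system is an ordinary ODE and off-the-shelf solvers such as RK45 apply directly.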
Results
The proposed neural network approximation effectively reduces the complexity of solving MaRGE, allowing for the use of standard numerical solvers. The performance of the FNN and LSTM models is assessed, showing that both architectures can accurately model the Basset force, with LSTM potentially offering better memory retention.
Implications
This work has significant implications for simulating the dynamics of inertial particles in various fluid environments, which is relevant in fields such as environmental science, industrial applications, and biomedical engineering. The ability to accurately model historical effects can enhance predictive capabilities in particle transport phenomena.
Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
Large Language Models
Efficient ML
- Flux Attention dynamically optimizes attention computation at the layer level, improving efficiency.
- The method integrates a lightweight Layer Router to adaptively assign layers to Full or Sparse Attention based on context.
- The approach achieves significant speed improvements in inference without sacrificing performance.
- Training is efficient, requiring only 12 hours on 8 GPUs, making it accessible for practical applications.
Read more
Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
Summary
The paper introduces Flux Attention, a novel context-aware hybrid attention mechanism designed to enhance the efficiency of large language models (LLMs) during inference, particularly in long-context scenarios. Traditional attention mechanisms, such as Full Attention (FA), suffer from quadratic computational complexity, which limits their scalability. Existing hybrid approaches that combine FA and Sparse Attention (SA) often use static allocation ratios, which do not adapt to the varying computational demands of different tasks. This paper addresses these limitations by proposing a dynamic layer-level routing mechanism that optimally assigns each layer to either FA or SA based on the input context. The authors integrate a lightweight Layer Router into pretrained LLMs, allowing for adaptive routing that maintains high information fidelity while ensuring efficient memory access. The training process is parameter-efficient, requiring only 12 hours on 8 GPUs. Experimental results demonstrate that Flux Attention significantly improves inference speed, achieving up to 2.8x speedup during the prefill phase and 2.0x during autoregressive decoding, while maintaining competitive performance across various benchmarks.
Methodology
The authors propose a context-aware framework called Flux Attention, which uses a lightweight Layer Router to dynamically route layers to either Full Attention (FA) or Sparse Attention (SA) based on the input context. This method preserves contiguous memory access and reduces computational load imbalances. The training involves freezing the backbone LLM parameters and updating only the Layer Router using a Gumbel-Softmax relaxation for differentiable routing.
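The Gumbel-Softmax relaxation used for differentiable routing can be sketched as follows; the router logits and temperature are illustrative values, not the paper's.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    # Relaxed one-hot sample: differentiable stand-in for a discrete choice.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))   # Gumbel(0,1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

rng = np.random.default_rng(0)
# Hypothetical router logits for one layer: [Full Attention, Sparse Attention].
logits = np.array([2.0, 0.5])
soft = gumbel_softmax(logits, tau=0.5, rng=rng)
route = "FA" if soft.argmax() == 0 else "SA"
print(soft, "->", route)
```

During training the soft probabilities let gradients flow into the router; at inference the argmax yields a hard FA/SA assignment per layer.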
Results
Flux Attention demonstrates substantial improvements in inference speed, achieving up to 2.8x speedup during the prefill phase and 2.0x during autoregressive decoding compared to baseline models. The method effectively adapts to varying task demands, maintaining high performance across multiple long-context and mathematical reasoning benchmarks.
Implications
The proposed Flux Attention framework has significant implications for the deployment of large language models in real-world applications, particularly in scenarios requiring efficient processing of long contexts. Its ability to dynamically adjust attention mechanisms based on task demands can enhance the usability and performance of LLMs in various domains, including document analysis and question answering.
Asymptotic-Preserving Neural Networks for Viscoelastic Parameter Identification in Multiscale Blood Flow Modeling
Theory
- Introduction of Asymptotic-Preserving Neural Networks for viscoelastic parameter identification.
- Integration of physical principles into the neural network training process.
- Use of non-invasive patient-specific data for estimating pressure waveforms.
- Demonstration of methodology effectiveness through synthetic and real patient data simulations.
Read more
Asymptotic-Preserving Neural Networks for Viscoelastic Parameter Identification in Multiscale Blood Flow Modeling
Summary
This paper presents a novel approach to identifying viscoelastic parameters in a one-dimensional multiscale blood flow model, which describes the viscoelastic properties of arterial walls. The authors introduce Asymptotic-Preserving Neural Networks (APNNs) that integrate the governing physical principles of the blood flow model into the learning process. This method allows for the reliable inference of viscoelastic parameters while simultaneously reconstructing the time-dependent evolution of blood vessel state variables. The APNN framework utilizes patient-specific data, such as cross-sectional area and velocity measurements from Doppler ultrasound, to estimate pressure waveforms in vascular segments where direct pressure measurements are not feasible. The effectiveness of the proposed methodology is demonstrated through various numerical simulations, including both synthetic and patient-specific scenarios, highlighting its practical applicability in cardiovascular modeling.
Methodology
The authors employ Asymptotic-Preserving Neural Networks (APNNs) that embed the governing equations of a viscoelastic blood flow model into the learning process. This approach allows the network to maintain consistency with the physical model while inferring parameters from available hemodynamic data.
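The idea of embedding the governing equations into the learning process can be illustrated with a toy physics residual; the relaxation ODE below is a stand-in for the actual viscoelastic blood-flow model, and the parameter theta is invented.

```python
import numpy as np

def physics_residual(u, t, theta):
    # Residual of the toy relaxation ODE du/dt = -u/theta (a stand-in for
    # the viscoelastic model); zero when the prediction obeys the physics.
    return np.gradient(u, t) + u / theta

t = np.linspace(0.0, 1.0, 200)
theta = 0.5
u_model = np.exp(-t / theta)   # prediction consistent with the physics
u_naive = np.cos(3 * t)        # prediction that ignores the physics
res_model = np.mean(physics_residual(u_model, t, theta) ** 2)
res_naive = np.mean(physics_residual(u_naive, t, theta) ** 2)
print(res_model, res_naive)
```

Adding such a residual to the data-fit loss is what keeps the network's reconstructions and inferred parameters (here, theta) consistent with the physical model even where measurements are unavailable.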
Results
The numerical simulations conducted show that the APNN framework effectively reconstructs pressure waveforms and identifies viscoelastic parameters in both synthetic datasets and real patient data, demonstrating improved accuracy over traditional methods that do not account for viscoelasticity.
Implications
The proposed methodology has significant implications for non-invasive cardiovascular diagnostics, enabling more accurate assessments of arterial mechanics and potentially improving patient outcomes through better monitoring and treatment strategies.
Learning is Forgetting: LLM Training As Lossy Compression
NLP
Large Language Models
Theory
- LLMs are conceptualized as instances of lossy compression, retaining only relevant information from training data.
- Pre-training dynamics of LLMs align with Information Bottleneck theory, showing a two-phase trajectory of representation expansion followed by compression.
- The optimality of a model's compression correlates significantly with its performance across multiple benchmarks.
- Different LLMs compress information differently, influenced by their training data and methodologies.
Read more
Learning is Forgetting: LLM Training As Lossy Compression
Summary
This paper presents a novel perspective on the training dynamics of large language models (LLMs), framing their learning process as a form of lossy compression. The authors argue that LLMs retain only the information relevant to their training objectives, leading to an optimal compression of the training data. They demonstrate that during pre-training, LLMs follow a trajectory that aligns with the Information Bottleneck theory, initially increasing mutual information with training data before compressing it. The study reveals that different LLMs exhibit varying compression characteristics based on their training data and methodologies. Importantly, the degree of optimal compression achieved by a model correlates significantly with its performance across various benchmarks, suggesting that the representational structure of LLMs can provide actionable insights into their capabilities. The findings contribute to a unified information-theoretic framework for understanding LLMs and their learning processes, with implications for both model interpretability and performance prediction.
Methodology
The authors employed an information-theoretic approach to analyze the pre-training dynamics of various LLMs, measuring mutual information between representations and inputs/outputs. They compared the compression characteristics of different models and correlated these with performance metrics across multiple benchmarks.
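In the simplest discrete case, the mutual-information quantities tracked in such analyses can be estimated with a histogram plug-in estimator; this generic sketch is not the authors' estimator, and the bin count is an arbitrary choice.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    # Histogram plug-in estimate of I(X;Y) in nats.
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])).sum())

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
mi_self = mutual_information(x, x)                             # high: fully dependent
mi_indep = mutual_information(x, rng.standard_normal(10_000))  # near zero
print(mi_self, mi_indep)
```

An Information Bottleneck trajectory is read off from two such curves over training: I(representation; input) first rises, then falls during the compression phase, while I(representation; output) stays high.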
Results
The study found that LLMs approach optimal compression during pre-training, with smaller models struggling to achieve meaningful compression. The degree of optimal compression was shown to predict performance on six benchmarks across different model families, with a strong correlation (r = 0.76, p < 0.001) between preference information and downstream performance.
Implications
The findings suggest that understanding LLMs through the lens of lossy compression can enhance interpretability and provide insights into their learning processes. This framework can be utilized to improve model design and performance prediction in various applications of LLMs.
Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees
Theory
Graph Learning
Efficient ML
- Lumbermark is a robust clustering algorithm that can handle varying cluster sizes, densities, and shapes.
- The algorithm utilizes mutual reachability distances to enhance data distribution and reduce noise impact.
- Lumbermark allows users to specify the number of clusters, unlike HDBSCAN.
- An open-source implementation is available for both Python and R.
Read more
Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees
Summary
The paper introduces Lumbermark, a novel divisive clustering algorithm designed to robustly detect clusters of varying sizes, densities, and shapes. The methodology involves iteratively removing large limbs from the dataset's minimum spanning tree (MST), constructed over mutual reachability distances. This approach smoothens the data distribution and mitigates the influence of low-density objects, such as noise points and outliers. Lumbermark serves as an alternative to the well-known HDBSCAN algorithm, allowing users to specify the desired number of clusters. The implementation of Lumbermark is made available as an open-source package for Python and R. The paper demonstrates that Lumbermark performs effectively on benchmark datasets, suggesting its utility for data scientists and practitioners across various domains.
Methodology
Lumbermark operates by constructing a mutual reachability minimum spanning tree (MST) from the dataset and iteratively chopping off large limbs connected by protruding segments. This process leverages mutual reachability distances to smooth the data distribution and diminish the effect of outliers and noise points.
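The mutual-reachability MST construction can be sketched with scipy. Note that the cut rule below (remove the globally heaviest edges to reach the requested cluster count) is the classic simple variant; Lumbermark's limb-removal rule is more refined.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import cdist

def mst_clusters(X, n_clusters, k=5):
    d = cdist(X, X)
    core = np.sort(d, axis=1)[:, k]            # k-th nearest-neighbour distance
    # Mutual reachability: max of the pairwise distance and both core distances.
    mreach = np.maximum(d, np.maximum(core[:, None], core[None, :]))
    mst = minimum_spanning_tree(mreach).toarray()
    edges = np.argwhere(mst > 0)
    order = np.argsort(-mst[mst > 0])          # heaviest edges first
    for i, j in edges[order][: n_clusters - 1]:
        mst[i, j] = 0.0                        # chop the edge
    _, labels = connected_components(mst, directed=False)
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
labels = mst_clusters(X, n_clusters=2)
print(sorted(set(labels)))
```

Taking the maximum with both points' core distances inflates distances in sparse regions, which is how mutual reachability blunts the effect of noise points on the tree.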
Results
The results show that Lumbermark outperforms the previous leading algorithm, Genie, in clustering performance on benchmark datasets. The algorithm successfully identifies clusters of varying characteristics while maintaining robustness against noise and outliers.
Implications
The introduction of Lumbermark has significant implications for clustering applications in diverse fields such as software testing, geology, manufacturing, and traffic management. Its ability to handle complex cluster structures and user-defined cluster counts makes it a valuable tool for data analysis and decision-making.
Bi-level Heterogeneous Learning for Time Series Foundation Models: A Federated Learning Approach
Time Series
Federated Learning
- Introduction of FedTRL, a federated learning method for bi-level heterogeneous learning in time series data.
- Mitigation of inter-domain and intra-domain conflicts through domain-adversarial optimization and prototype alignment.
- Demonstrated superior performance of FedTRL over centralized and federated TSFM baselines in forecasting tasks.
- Provision of a flexible and scalable approach for training TSFMs in heterogeneous environments.
Read more
Bi-level Heterogeneous Learning for Time Series Foundation Models: A Federated Learning Approach
Summary
This paper addresses the challenges of training Time Series Foundation Models (TSFMs) in heterogeneous environments, where data varies significantly across domains and tasks. The authors propose a novel federated learning method named FedTRL, which aims to mitigate the issues of inter-domain and intra-domain heterogeneity. Traditional mixed-batch training strategies often lead to gradient conflicts and poor representation quality due to the diverse nature of time series data. FedTRL employs a fine-grained learning approach that distills invariant knowledge while minimizing cross-domain interference. It incorporates domain-adversarial optimization and prototype alignment to ensure semantically consistent local representations, alongside a domain-aware aggregation strategy that balances invariance and alignment. The effectiveness of FedTRL is demonstrated through extensive experiments, showing that it consistently outperforms both centralized and federated TSFM baselines in various forecasting tasks, including zero-shot performance. This work provides a scalable pathway for training TSFMs from scratch in decentralized settings, addressing the limitations of existing methods.
Methodology
The authors developed FedTRL, which integrates two main components: (1) a domain-adversarial optimization objective combined with prototype alignment to enforce semantically consistent local representations, and (2) a domain-aware aggregation strategy that adaptively balances the updates from different domains, addressing both inter-domain and intra-domain heterogeneity.
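As a rough illustration of the prototype-alignment component (the exact objective, function name, and weighting below are assumptions for illustration, not details taken from the paper), a prototype-alignment loss can be as simple as pulling each local embedding toward a shared class prototype:

```python
import numpy as np

def prototype_alignment_loss(embeddings, labels, prototypes):
    """Mean squared distance between each local embedding and the shared
    prototype of its class; pulling embeddings toward common anchors is
    one simple way to keep local representations semantically consistent."""
    diffs = embeddings - prototypes[labels]
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

# toy check: embeddings sitting exactly on their prototypes give zero loss
prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 1, 1])
embeddings = prototypes[labels]
zero_loss = prototype_alignment_loss(embeddings, labels, prototypes)
```

In a federated round, each client would minimize this alongside its task loss, with the server aggregating both model updates and prototypes.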
Results
FedTRL outperformed centralized TSFMs in both zero-shot point forecasting and probabilistic forecasting across multiple benchmarks, demonstrating its effectiveness in handling bi-level heterogeneity in time series data.
Implications
The proposed method has significant implications for various decision-making domains reliant on time series forecasting, such as finance, energy, and urban computing, by enabling more accurate and robust forecasting models that can be trained in decentralized settings.
Cognitive-Causal Multi-Task Learning with Psychological State Conditioning for Assistive Driving Perception
Multimodal
- CauPsi models the hierarchical dependencies among multiple driving-related tasks, enhancing inter-task information transfer.
- The framework introduces a Causal Task Chain for soft-label propagation, reflecting cognitive processes in driving.
- CTPC integrates psychological state signals into all tasks, addressing the impact of driver emotions on performance.
- CauPsi achieves state-of-the-art accuracy on the AIDE dataset with a compact model size.
Read more
Cognitive-Causal Multi-Task Learning with Psychological State Conditioning for Assistive Driving Perception
Summary
This paper introduces CauPsi, a novel cognitive-causal multi-task learning framework designed for advanced driver assistance systems (ADAS). The framework addresses the limitations of existing methods that treat recognition tasks as independent, failing to leverage the cognitive causal structure inherent in driving behavior. CauPsi incorporates two main mechanisms: a Causal Task Chain that facilitates the propagation of task predictions through learnable prototype embeddings, and Cross-Task Psychological Conditioning (CTPC) that estimates psychological state signals from driver facial expressions and body posture, integrating these signals into all tasks. The framework is evaluated on the AIDE dataset, achieving a mean accuracy of 82.71% with only 5.05 million parameters, outperforming previous methods by 1.0%, particularly in Driver Emotion Recognition (DER) and Driver Behavior Recognition (DBR). Ablation studies confirm the independent contributions of each component, and the psychological state signal demonstrates systematic task-label-dependent patterns without requiring explicit annotations.
Methodology
CauPsi employs a Causal Task Chain for task prediction propagation and Cross-Task Psychological Conditioning (CTPC) to incorporate psychological state signals. The model uses a frozen MobileNetV3-Small backbone and implements bidirectional Cross-View Attention to enhance feature extraction from both driver and environmental perspectives.
Results
CauPsi achieved a mean accuracy of 82.71% on the AIDE dataset, surpassing the previous best method by 1.0%. Notable improvements were observed in Driver Emotion Recognition (+3.65%) and Driver Behavior Recognition (+7.53%). The framework maintained a low parameter count of 5.05 million, demonstrating efficiency alongside performance.
Implications
The proposed framework has significant implications for the development of more effective ADAS by improving the understanding of driver behavior and emotional states. It can enhance safety and performance in autonomous driving systems by integrating cognitive and psychological factors into decision-making processes.
Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?
Theory
Optimization
- Pareto regret in stochastic MO-MABs is governed by the maximum sub-optimality gap g†.
- The proposed algorithm achieves optimal Pareto regret of O(K log T/g†).
- The algorithm utilizes a two-layer uncertainty quantification strategy for effective exploration and exploitation.
- Numerical experiments validate the algorithm's performance, showing lower finite-horizon Pareto regret compared to existing methods.
Read more
Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?
Summary
This paper investigates the complexity of stochastic multi-objective bandits (MO-MABs) compared to single-objective bandits, particularly focusing on the concept of Pareto regret. The authors address a critical question in the field: whether optimizing multi-objective bandits is inherently more difficult than single-objective bandits due to the added complexity of multiple reward dimensions. They provide a comprehensive analysis showing that in the stochastic setting, Pareto regret is governed by the maximum sub-optimality gap, denoted as g†, leading to a finite-time Pareto regret bound of O(K log T/g†). The authors introduce a novel algorithm that achieves this optimal regret bound by employing a two-layer uncertainty quantification approach, combining a top-two racing strategy for arm selection with an uncertainty-greedy rule for dimension selection. Through extensive numerical experiments, the proposed algorithm demonstrates significant improvements over benchmark methods, confirming the theoretical findings and showcasing its efficiency in managing the complexities of multi-objective optimization.
Methodology
The authors develop a width-guided first-certification algorithm that operates under the assumption of a unique winner for each objective. The algorithm employs upper and lower confidence bound estimators to quantify uncertainty across both arms and objectives. It integrates a top-two racing strategy for arm selection and an uncertainty-greedy rule for dimension selection, balancing exploration and exploitation effectively.
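A minimal sketch of how the two-layer uncertainty quantification might look, assuming standard sqrt(2 log t / n) confidence radii and simple tie-breaking (both assumptions, not details from the paper):

```python
import math

def confidence_bounds(means, counts, t):
    """Per-(arm, objective) UCB/LCB with a sqrt(2 log t / n) radius."""
    ucb, lcb = [], []
    for arm_means, n in zip(means, counts):
        r = math.sqrt(2.0 * math.log(t) / n)
        ucb.append([m + r for m in arm_means])
        lcb.append([m - r for m in arm_means])
    return ucb, lcb

def select_arm(means, counts, t):
    """Uncertainty-greedy objective choice, then top-two racing by UCB."""
    ucb, lcb = confidence_bounds(means, counts, t)
    n_arms, n_objs = len(means), len(means[0])
    # pick the objective with the largest total confidence width
    d = max(range(n_objs),
            key=lambda j: sum(ucb[i][j] - lcb[i][j] for i in range(n_arms)))
    # race the two arms with the highest UCB in that objective
    leader, challenger = sorted(range(n_arms),
                                key=lambda i: ucb[i][d], reverse=True)[:2]
    # play the less-sampled of the two
    return leader if counts[leader] <= counts[challenger] else challenger

means = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # per-arm means, 2 objectives
counts = [10, 10, 1]                          # arm 2 is barely explored
chosen = select_arm(means, counts, t=100)
```

With arm 2 nearly unexplored, its wide confidence interval makes it the top-UCB racer, so the sketch plays it next.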
Results
The paper establishes a finite-time upper bound for Pareto regret of O(K log T/g†) and an asymptotic lower bound of Ω(K log T/g†), indicating that the proposed algorithm is optimal. The numerical experiments demonstrate that the width-guided first-certification policy consistently outperforms the Pareto UCB1 method, achieving lower finite-horizon Pareto regret across various configurations.
Implications
The results have significant implications for applications in fields requiring multi-objective optimization, such as clinical trials, recommendation systems, and resource allocation problems. The findings suggest that efficient algorithms can be developed for complex decision-making scenarios without incurring excessive computational costs.
Distributed Interpretability and Control for Large Language Models
Large Language Models
Interpretability
Efficient ML
- Introduces a tensor-parallel inference architecture for LLMs that supports activation-level interpretability and steering.
- Achieves up to 7x reduction in activation memory and up to 41x increase in throughput compared to baseline methods.
- Enables real-time behavioral steering with stable, monotonic output modifications without additional passes.
- Demonstrates effectiveness across multiple large models, including LLaMA and Qwen.
Read more
Distributed Interpretability and Control for Large Language Models
Summary
This paper addresses the challenges of interpretability and control in large language models (LLMs) that require multi-GPU setups. The authors propose a novel framework that integrates activation-level interpretability and steering directly into the tensor-parallel inference path, allowing for efficient analysis and control of models with up to 70 billion parameters. The system employs a single-pass architecture that captures activations without additional forward passes, significantly reducing memory usage and increasing throughput. The implementation demonstrates a reduction in activation memory by up to 7x and an increase in throughput by up to 41x compared to existing methods. The authors also introduce steering vectors that enable controllable output modifications without fine-tuning or additional computational overhead. The results indicate that their approach supports real-time behavioral control and full-layer interpretability, making it feasible to analyze and steer large models effectively in practical applications.
Methodology
The authors developed a distributed, single-pass framework that captures activations during standard autoregressive decoding. This method integrates logit lens analysis and activation steering directly into the tensor-parallel execution path, allowing for efficient memory usage and real-time control without the need for multiple forward passes or full vocabulary projections.
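Activation steering itself reduces to a one-line shift of hidden states; the sketch below is a generic single-node version, not the paper's tensor-parallel implementation (function name, shapes, and the unit-normalisation are illustrative):

```python
import numpy as np

def apply_steering(hidden, direction, alpha):
    """Shift hidden states along a unit-normalised steering direction.
    In the distributed setting this shift would be applied inside the
    layer's forward pass, on each shard's slice of the activation."""
    v = direction / np.linalg.norm(direction)
    return hidden + alpha * v

hidden = np.zeros((4, 8))            # (tokens, hidden_dim) toy activations
direction = np.ones(8)
steered = apply_steering(hidden, direction, alpha=2.0)
```

Scaling `alpha` up or down is what produces the monotonic output shifts measured by the steerability slope.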
Results
The proposed system demonstrated a significant reduction in activation memory (up to 7x) and an increase in throughput (up to 41x) compared to existing implementations. The steering vectors achieved a mean steerability slope of 0.702 across evaluated datasets, indicating reliable and controllable output shifts without fine-tuning.
Implications
The findings suggest that the proposed framework can enhance the interpretability and controllability of large language models in various applications, including scientific analysis, healthcare, and interactive decision-making. This work paves the way for unified systems that integrate interpretability and control, making advanced LLMs more accessible for research and production.
How to sketch a learning algorithm
Theory
Interpretability
Efficient ML
- Introduces a data deletion scheme that predicts model behavior after data exclusion with prediction error O(ε).
- The scheme's precomputation and prediction are computationally efficient, only slightly slower than traditional methods.
- The concept of 'stability' is central to the methodology, allowing for robust predictions despite data exclusion.
- Experiments with microgpt support the stability assumption and its applicability to powerful models.
Read more
How to sketch a learning algorithm
Summary
This paper addresses the critical question of how training data influences AI model behavior, focusing on the data deletion problem. The authors introduce a novel data deletion scheme that allows for accurate predictions of model outputs when a subset of training data is excluded. This scheme operates with a prediction error that diminishes to ε, while maintaining computational efficiency—both the precomputation and prediction algorithms are only poly(1/ε) factors slower than standard training and inference. The core assumption of the proposed method is termed 'stability,' which quantifies the sensitivity of model outputs to changes in training data. The authors validate this assumption through experiments with microgpt, demonstrating its compatibility with powerful AI models. The methodology involves locally sketching an arithmetic circuit using higher-order derivatives computed via forward-mode automatic differentiation. The results indicate that the proposed scheme can efficiently handle data deletion, providing significant insights into model interpretability and data influence.
Methodology
The authors develop a data deletion scheme consisting of two algorithms: a precomputation algorithm that trains the model and stores auxiliary information, and a prediction algorithm that uses this information to estimate model outputs after excluding specific training data. The methodology relies on the assumption of stability, which assesses the sensitivity of model outputs to changes in training data. The technical implementation involves locally sketching arithmetic circuits using higher-order derivatives computed via forward-mode automatic differentiation.
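The forward-mode differentiation underlying the sketching step can be illustrated with dual numbers; this toy version handles only first-order directional derivatives, whereas the paper's scheme uses higher-order derivatives:

```python
class Dual:
    """Minimal forward-mode AD value: (val, dot) pairs propagate a
    directional derivative through arithmetic in a single forward pass."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def directional_derivative(f, x, v):
    """d/dt f(x + t*v) at t = 0, computed without a backward pass."""
    out = f([Dual(xi, vi) for xi, vi in zip(x, v)])
    return out.dot
```

Derivatives in a deletion direction like this are the raw material for Taylor-style predictions of how outputs change when training points are excluded.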
Results
The proposed data deletion scheme achieves a prediction error of O(ε) for a constant number of deletions, with stronger stability assumptions allowing for bounded prediction error for arbitrary deletions. The computational overhead for precomputation and prediction remains manageable, making the approach practical for real-world applications.
Implications
This work has significant implications for model interpretability, privacy, and understanding the influence of training data on AI models. It provides a framework for efficiently assessing the impact of specific data points on model behavior, which is crucial for applications in sensitive areas such as healthcare and finance.
Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making
Robotics
Interpretability
Reinforcement Learning
- Introduces an event-centric framework for world modeling in autonomous agents.
- Utilizes memory-augmented retrieval for decision-making based on prior experiences.
- Ensures interpretability and consistency with physical constraints in decision-making.
- Demonstrates effectiveness in real-time UAV flight scenarios.
Read more
Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making
Summary
This paper presents an innovative framework for decision-making in autonomous agents operating in dynamic environments, particularly focusing on Unmanned Aerial Vehicles (UAVs). The proposed event-centric world modeling framework utilizes memory-augmented retrieval to enhance decision-making processes. By representing the environment as a structured set of semantic events, the framework encodes these events into a permutation-invariant latent representation. Decision-making is facilitated through a knowledge bank of prior experiences, where each entry links an event representation to a corresponding maneuver. The final action is computed as a weighted combination of retrieved solutions, promoting transparency and interpretability in decision-making. The integration of physics-informed knowledge ensures that selected maneuvers align with observed system dynamics. Experimental evaluations in UAV flight scenarios demonstrate that the framework operates effectively within real-time control constraints while maintaining consistent and interpretable behavior.
Methodology
The methodology involves four core stages: perception and feature extraction from sensory inputs, event abstraction into a structured event list, compression into a latent event code, and retrieval-based decision-making using a physics-informed knowledge bank. The decision-making process leverages a similarity function to compute a weighted combination of stored solutions, enabling interpretable and structured decision-making.
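The retrieval-based decision step can be sketched as a softmax-weighted combination over stored event codes; the cosine similarity and temperature here are illustrative choices, not necessarily the paper's similarity function:

```python
import numpy as np

def retrieve_action(query, keys, actions, temperature=0.05):
    """Softmax over cosine similarities between the latent event code and
    stored event codes, then a weighted combination of stored maneuvers."""
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    weights = np.exp(k @ q / temperature)
    weights /= weights.sum()
    return weights @ actions

keys = np.array([[1.0, 0.0], [0.0, 1.0]])    # stored event codes
actions = np.array([[1.0], [-1.0]])          # associated maneuvers
action = retrieve_action(np.array([1.0, 0.0]), keys, actions)
```

Because the output is an explicit weighted vote over named past experiences, each decision can be traced back to the entries that produced it, which is the transparency property the summary emphasizes.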
Results
The experimental evaluation shows that the proposed framework successfully operates under real-time constraints while providing interpretable decision-making. The integration of physics-informed knowledge enhances the consistency of maneuvers with the dynamics of the environment.
Implications
This framework has significant implications for the development of autonomous systems in safety-critical applications, such as UAVs, where reliable and interpretable decision-making is essential. It can potentially be applied to various domains requiring dynamic environment modeling and real-time decision-making.
Bi-Lipschitz Autoencoder With Injectivity Guarantee
Theory
Optimization
Generative Models
- Introduces the Bi-Lipschitz Autoencoder (BLAE) to address non-injectivity in autoencoders.
- Proposes an injective regularization scheme to eliminate local minima during optimization.
- Implements a bi-Lipschitz relaxation to ensure geometric preservation and robustness to distribution shifts.
- Demonstrates superior performance of BLAE over existing methods in preserving manifold structure.
Read more
Bi-Lipschitz Autoencoder With Injectivity Guarantee
Summary
This paper addresses the limitations of traditional autoencoders in preserving manifold geometry during dimensionality reduction, particularly focusing on the issue of non-injective mappings that lead to poor convergence and distorted latent representations. The authors propose the Bi-Lipschitz Autoencoder (BLAE), which incorporates two main innovations: an injective regularization scheme that utilizes a separation criterion to avoid local minima, and a bi-Lipschitz relaxation that maintains geometric properties while being robust to data distribution shifts. The paper formalizes the concept of admissible regularization and provides sufficient conditions for its application, ensuring that the imposed geometric properties are satisfied. Empirical evaluations demonstrate that BLAE outperforms existing methods across various datasets, effectively preserving manifold structure and exhibiting resilience to sampling sparsity and distribution changes.
Methodology
The authors develop the Bi-Lipschitz Autoencoder (BLAE) by combining injective and bi-Lipschitz regularization techniques. The injective regularization is based on a separation criterion that ensures the encoder's mapping is injective, while the bi-Lipschitz relaxation allows for efficient dimensionality reduction without compromising geometric integrity. The framework is validated through theoretical analysis and extensive empirical testing on multiple datasets.
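One plausible form of a bi-Lipschitz penalty, shown below purely for illustration (the paper's exact regularizer and separation criterion may differ), penalizes pairwise-distance ratios that leave the band [1/L, L]:

```python
import numpy as np

def bilipschitz_penalty(x, z, L=2.0):
    """Penalise encoder distance ratios ||z_i - z_j|| / ||x_i - x_j||
    outside [1/L, L]: the map should neither collapse nor stretch the
    data manifold beyond the allowed band."""
    penalty = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = np.linalg.norm(x[i] - x[j])
            if dx == 0.0:
                continue
            ratio = np.linalg.norm(z[i] - z[j]) / dx
            penalty += max(0.0, ratio - L) ** 2 + max(0.0, 1.0 / L - ratio) ** 2
    return penalty

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
identity_pen = bilipschitz_penalty(points, points)            # all ratios are 1
collapse_pen = bilipschitz_penalty(points, np.zeros_like(points))
```

An identity embedding incurs no penalty, while a collapsed (non-injective) embedding is penalized on every pair, which is the failure mode the injectivity guarantee targets.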
Results
The experimental results indicate that BLAE consistently outperforms traditional autoencoders and other advanced variants in terms of preserving the intrinsic structure of data manifolds. The method shows significant robustness against distribution shifts and sparse sampling conditions, confirming its effectiveness in various practical scenarios.
Implications
The findings suggest that BLAE can be effectively utilized in applications requiring reliable dimensionality reduction and visualization of high-dimensional data, such as in computer vision and data analysis tasks. The framework's robustness to distributional changes also makes it suitable for dynamic environments where data characteristics may evolve over time.
From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures
Graph Learning
- Introduces a graph-based anomaly detection system for microservices.
- Utilizes unsupervised learning with GCN-GAE for structural representation.
- Achieves high precision and low false positive rates in anomaly detection.
- Identifies the limitations of traditional load tests in simulating real traffic.
Read more
From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures
Summary
This paper presents a novel graph-based anomaly detection system designed for microservice architectures, specifically targeting the challenges faced by Prime Video during high-traffic events. Traditional load tests often fail to accurately simulate real-world traffic patterns, leading to potential service failures. The authors propose an unsupervised approach utilizing Graph Convolutional Networks (GCN) and Graph Autoencoders (GAE) to learn structural representations of service interactions at a minute-level resolution. By comparing embeddings from load tests with those from actual event traffic, the system identifies anomalies in service behavior. The methodology includes multi-snapshot training to enhance scalability and cosine similarity for anomaly scoring. The results demonstrate a high precision of 96% and a low false positive rate of 0.08%, although recall is limited at 58%. The study also introduces a synthetic anomaly injection framework for controlled evaluation, providing insights into detection lead time and operational needs. Overall, the proposed system bridges the gap between simulated and real-world service behavior, offering a foundation for broader applications in microservice ecosystems.
Methodology
The authors developed an unsupervised anomaly detection system based on structural graph representations using Graph Convolutional Networks and Graph Autoencoders. The system learns from directed, weighted service interaction graphs and employs cosine similarity for anomaly scoring. Multi-snapshot training is utilized to enhance scalability and performance.
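The anomaly-scoring step reduces to comparing per-service embeddings; a minimal version of the cosine-similarity score described above (names and shapes are illustrative):

```python
import numpy as np

def anomaly_scores(baseline, live):
    """Per-service score: 1 - cosine similarity between the embedding
    learned from load-test graphs and the one from live-traffic graphs."""
    b = baseline / np.linalg.norm(baseline, axis=1, keepdims=True)
    v = live / np.linalg.norm(live, axis=1, keepdims=True)
    return 1.0 - np.sum(b * v, axis=1)

baseline = np.array([[1.0, 0.0], [0.0, 1.0]])
live = np.array([[1.0, 0.0], [0.0, -1.0]])   # second service has drifted
scores = anomaly_scores(baseline, live)
```

Services whose live embedding matches the load-test embedding score near 0; a structural divergence pushes the score toward 2, and thresholding these scores yields the precision/recall trade-off reported above.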
Results
The system achieved a precision of 96% and a false positive rate of 0.08%. However, the recall was limited to 58% under conservative assumptions. The synthetic anomaly injection framework demonstrated practical utility and provided insights into the operational aspects of anomaly detection.
Implications
This research has significant implications for improving the reliability and performance of microservice architectures, particularly in high-traffic scenarios. The findings can enhance incident response capabilities and inform the design of more effective load testing strategies, ultimately improving customer experience and operational efficiency.
Spectral Edge Dynamics Reveal Functional Modes of Learning
Theory
Interpretability
Optimization
- Identification of a spectral edge that distinguishes grokking from non-grokking regimes.
- Standard interpretability methods fail to capture the spectral edge, indicating its non-localized nature.
- Functional modes exhibit structured behavior in symmetry-adapted bases, revealing harmonic structures for simpler tasks.
- Complex tasks require richer functional descriptions beyond simple harmonic bases.
Read more
Spectral Edge Dynamics Reveal Functional Modes of Learning
Summary
This paper investigates the training dynamics of transformer models during 'grokking', the phenomenon in which test performance improves abruptly long after training accuracy has saturated. The author employs spectral analysis of weight updates to identify a 'spectral edge,' a set of leading directions that effectively distinguishes grokking from non-grokking behaviors across various tasks. The study reveals that these leading directions do not correspond to localized structures in parameter or feature space, but rather represent functional modes that indicate how the model's input-output behavior changes. The analysis shows that for tasks with symmetry, such as addition and multiplication, the spectral edge simplifies to a single dominant Fourier mode when expressed in the correct mathematical basis. In contrast, more complex tasks like quadratic functions exhibit a richer structure that cannot be captured by simple harmonic representations. The findings also highlight that multitask training can enhance the compositional structure of learned functions, suggesting that neural networks can reuse functional primitives across different tasks. Overall, the paper proposes a shift in understanding learning dynamics, emphasizing the discovery of low-dimensional functional subspaces rather than localized circuits or features.
Methodology
The study employs spectral analysis of weight updates in transformer models trained on modular arithmetic tasks. It uses singular value decomposition to identify the spectral edge and analyzes the functional content of leading directions in both parameter and input-output spaces.
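The core computation, an SVD of the weight update between two checkpoints, can be sketched directly (the function name `spectral_edge` and the fixed `k` are illustrative):

```python
import numpy as np

def spectral_edge(w_before, w_after, k=2):
    """Top-k singular directions of the weight update between two
    checkpoints -- the low-dimensional 'edge' tracked during training."""
    delta = w_after - w_before
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :k], s[:k], vt[:k]

# a rank-1 update has a single nonzero singular value
update = np.outer(np.array([3.0, 0.0, 0.0]), np.array([0.0, 1.0]))
u, s, vt = spectral_edge(np.zeros((3, 2)), update)
```

Tracking how much of each update's norm the leading singular directions capture is one simple way to watch a low-dimensional edge emerge during grokking.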
Results
The spectral edge consistently emerges during grokking, revealing low-dimensional functional subspaces that govern learning. For symmetry-aligned tasks, the edge simplifies to dominant Fourier modes, while more complex tasks exhibit structured but non-harmonic behavior. Additionally, multitask training enhances the alignment of functional modes across tasks.
Implications
The findings suggest that understanding the dynamics of learning in neural networks can benefit from focusing on functional modes rather than localized features. This perspective could influence the design of models and training strategies to promote functional reuse and efficiency in learning.
Reinforcement Learning with Reward Machines for Sleep Control in Mobile Networks
Reinforcement Learning
Optimization
- Introduces a reinforcement learning framework that incorporates reward machines for sleep control in mobile networks.
- Addresses the non-Markovian nature of QoS constraints by tracking historical performance through RMs.
- Balances immediate energy savings with long-term QoS impacts, enhancing energy efficiency in telecommunications.
- Demonstrates scalability and adaptability in complex network environments with varying traffic patterns.
Read more
Reinforcement Learning with Reward Machines for Sleep Control in Mobile Networks
Summary
This paper addresses the challenge of optimizing sleep control mechanisms in mobile networks to enhance energy efficiency while maintaining quality of service (QoS). As mobile networks become denser, the energy consumption of radio base stations (RBSs) increases significantly. The authors propose a novel approach that combines reinforcement learning (RL) with reward machines (RMs) to make informed decisions about which radio units (RUs) to put to sleep, when, and for how long. The proposed method balances immediate energy savings against long-term QoS impacts, specifically focusing on time-averaged packet drop rates for deadline-constrained traffic and minimum throughput guarantees for constant-rate users. The challenge lies in the non-Markovian nature of the effective reward, which depends on historical performance rather than just the current state. By utilizing RMs, the framework maintains an abstract state that tracks QoS constraint violations over time, transforming the problem into a Markovian one. This allows for efficient policy learning that accommodates multi-slot commitments and diverse traffic patterns, making it suitable for dynamic wireless environments.
Methodology
The authors employ a reinforcement learning approach augmented with reward machines to optimize sleep control decisions in mobile networks. The reward machines serve as a finite-state automaton that tracks the history of QoS constraint violations, allowing the system to maintain an abstract state that incorporates historical performance. This transformation enables the problem to be treated as Markovian, facilitating efficient policy learning that can handle long-term commitments and diverse QoS requirements.
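A toy reward machine for this setting might track recent QoS violations as its abstract state; the transition rule and penalty below are invented for illustration, not taken from the paper:

```python
class RewardMachine:
    """Toy reward machine: the automaton state counts recent QoS-violating
    windows, so the reward depends on (state, event) rather than only the
    current observation -- restoring the Markov property."""
    def __init__(self, max_violations=2):
        self.state = 0
        self.max_violations = max_violations

    def step(self, energy_saved, qos_violated):
        if qos_violated:
            self.state = min(self.state + 1, self.max_violations)
        else:
            self.state = max(self.state - 1, 0)
        # heavy penalty once the violation budget is exhausted
        penalty = 10.0 if self.state == self.max_violations else 0.0
        return energy_saved - penalty

rm = RewardMachine()
r1 = rm.step(1.0, qos_violated=False)   # healthy window
r2 = rm.step(1.0, qos_violated=True)    # first violation: tolerated
r3 = rm.step(1.0, qos_violated=True)    # budget exhausted: penalised
```

The RL agent then learns over the product of network state and automaton state, so the same sleep action can be rewarded or penalised depending on the violation history.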
Results
The proposed framework successfully balances energy efficiency and QoS in mobile networks, demonstrating improved performance in managing sleep control of radio units. The integration of reward machines allows for effective learning of policies that adapt to varying traffic conditions and QoS constraints, leading to significant energy savings without compromising service quality.
Implications
The findings suggest that the proposed RL and RM framework can be effectively applied to optimize energy management in next-generation mobile networks, potentially leading to more sustainable telecommunications infrastructure. This approach can also be extended to other domains where historical performance impacts decision-making under constraints.
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
NLP
Large Language Models
Efficient ML
- Introduces the Master Key Hypothesis for cross-model capability transfer.
- Proposes Unlock, a training-free and label-free framework for capability extraction and transfer.
- Demonstrates significant performance improvements in reasoning tasks without retraining.
- Shows that capability transfer can approach the performance of post-trained models.
Read more
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
Summary
This paper introduces the Master Key Hypothesis, which posits that model capabilities can be represented as directions in a low-dimensional latent subspace, allowing for the transfer of these capabilities across different models without the need for retraining. The authors propose a novel framework called Unlock, which is training-free and label-free, to extract capability directions by contrasting activations between models with and without specific capabilities. This direction is then aligned with a target model through a low-rank linear transformation and applied during inference to elicit desired behaviors. The framework is evaluated on reasoning tasks, specifically Chain-of-Thought (CoT) and mathematical reasoning, demonstrating significant performance improvements across various model scales. For instance, transferring CoT reasoning from a larger model to a smaller one resulted in a 12.1% accuracy increase on mathematical tasks. The findings suggest that the success of capability transfer is contingent on the pre-trained capabilities of the models involved, and that the proposed method can effectively amplify latent capabilities, thereby reducing redundancy in model training and enhancing modular reuse of existing capabilities.
Methodology
The Unlock framework consists of three main stages: extracting a capability direction (Master Key) from a source model by contrasting activations, estimating a low-rank linear transformation to align this direction with a target model, and applying the transferred direction at inference time to elicit the desired behavior. This process is entirely training-free and label-free, relying solely on forward passes through the models.
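The three stages can be sketched with difference-of-means extraction and a least-squares alignment map; both are plausible instantiations of the description above, not confirmed details of Unlock:

```python
import numpy as np

def capability_direction(acts_with, acts_without):
    """Contrast mean activations between a model that exhibits the
    capability and one that does not."""
    d = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return d / np.linalg.norm(d)

def align_direction(source_acts, target_acts, direction):
    """Fit a least-squares linear map from source to target activation
    space, then push the capability direction through it."""
    w, *_ = np.linalg.lstsq(source_acts, target_acts, rcond=None)
    aligned = direction @ w
    return aligned / np.linalg.norm(aligned)

# toy spaces: target activations are a scaled copy of the source's
src = np.eye(3)
tgt = 2.0 * np.eye(3)
key = capability_direction(np.array([[2.0, 0.0, 0.0]]),
                           np.array([[0.0, 0.0, 0.0]]))
aligned = align_direction(src, tgt, key)
```

The aligned direction would then be added to the target model's activations at inference time, in the spirit of the steering step described above.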
Results
The experiments revealed substantial improvements in reasoning capabilities, such as a 12.1% accuracy gain in mathematical reasoning when transferring capabilities from a larger model to a smaller one. For example, transferring CoT reasoning from Qwen1.5-14B to Qwen1.5-7B improved accuracy from 9.2% to 56.0%, closely matching the performance of a 7B instruction-tuned model with CoT prompting.
Implications
The proposed method has the potential to significantly reduce training costs and time by enabling the reuse of capabilities across models without the need for extensive retraining. This could accelerate the development of new models and facilitate more efficient use of existing model capabilities.
Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions
Reinforcement Learning
Efficient ML
Robotics
- Introduction of the ANDROID COACH framework for online RL in Android agents.
- Shift from Single State Single Action to Single State Multiple Actions paradigm to enhance exploration.
- Utilization of a critic for action value estimation to reduce emulator interaction overhead.
- Integration of a process reward model for improved supervision of agent actions.
Read more
Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions
Summary
The paper introduces ANDROID COACH, a novel framework aimed at enhancing the efficiency of online reinforcement learning (RL) for Android agents. Traditional methods rely on a Single State Single Action (SSSA) paradigm, which limits exploration and leads to inefficiencies due to high emulator latency. ANDROID COACH shifts to a Single State Multiple Actions (SSMA) approach, allowing agents to sample and evaluate multiple actions for a single state without incurring additional emulator costs. This is achieved through an online learning critic that estimates action values, complemented by a process reward model and a group-wise advantage estimator. The proposed method significantly improves training efficiency and success rates in benchmark environments, demonstrating its potential to optimize RL for GUI agents.
Methodology
ANDROID COACH employs an actor-critic framework that utilizes the SSMA paradigm. It allows the agent to sample multiple actions for a single state and evaluates these actions using a critic that estimates their values. The critic is updated based on online rollout data, and a novel advantage estimation method, Actor-Critic Leave-one-out (ACLOO), is introduced to guide the agent's learning. The framework also incorporates a process reward model to credit intermediate steps within trajectories.
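The leave-one-out baseline behind ACLOO can be shown in a few lines (the paper's exact estimator, which mixes in critic values, may differ):

```python
def leave_one_out_advantages(rewards):
    """Advantage of each sampled action relative to the mean reward of the
    other actions drawn for the same state (a leave-one-out baseline)."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# three actions sampled for one state; only the first succeeded
adv = leave_one_out_advantages([1.0, 0.0, 0.0])
```

Because the baseline excludes the action being scored, the estimate stays unbiased while still using every action sampled for the same state, which is what makes the SSMA paradigm cheap relative to one-action-per-state rollouts.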
Results
The experiments conducted on AndroidLab and AndroidWorld benchmarks show that ANDROID COACH achieves a 7.5% and 8.3% improvement in success rates compared to the UI-TARS-1.5-7B baseline. Additionally, it demonstrates 1.4 times higher training efficiency than traditional SSSA methods like PPO and GRPO at matched success rates.
Implications
The ANDROID COACH framework has the potential to significantly enhance the training efficiency of RL agents in interactive environments, particularly for GUI applications. Its approach could be applied to various domains requiring efficient online learning and exploration, potentially leading to more robust and capable AI agents.
Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference
Large Language Models
Efficient ML
- Introduces the concept of activation budget for expert activations in MoE models.
- Proposes Alloc-MoE, a unified framework optimizing budget allocation at both layer and token levels.
- Alloc-L utilizes sensitivity profiling and dynamic programming for optimal layer-level activation allocation.
- Alloc-T dynamically reallocates expert activations based on routing scores to enhance performance.
Read more
Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference
Summary
The paper introduces Alloc-MoE, a novel framework designed to optimize expert activation allocation in Mixture-of-Experts (MoE) models under a constrained activation budget. MoE architectures are known for their sparse activation mechanism, which enhances the scalability of large language models. However, the high number of expert activations can lead to significant latency during inference, particularly in resource-constrained environments. The authors propose the concept of an 'activation budget' to manage the number of expert activations effectively. Alloc-MoE operates at both the layer and token levels, employing two strategies: Alloc-L for layer-level allocation, which uses sensitivity profiling and dynamic programming to determine optimal activation distribution across layers, and Alloc-T for token-level redistribution based on routing scores. This dual approach minimizes performance degradation while maintaining efficiency. Experimental results demonstrate that Alloc-MoE can achieve substantial speedups in inference latency without compromising model performance, particularly on the DeepSeek-V2-Lite model, where it achieves 1.15× speedup in prefill and 1.34× in decode with only half the original activation budget.
Methodology
The methodology involves two main components: Alloc-L and Alloc-T. Alloc-L focuses on layer-level expert activation allocation by profiling layer sensitivity and solving an optimization problem using dynamic programming. Alloc-T redistributes expert activations at the token level based on routing scores, ensuring that tokens with less concentrated routing distributions receive more expert activations, all while adhering to the defined activation budget.
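As a sketch of the token-level idea, assuming routing concentration is measured by the entropy of each token's softmaxed routing scores (the paper's exact concentration measure and rounding scheme are not reproduced here):

```python
import numpy as np

def token_level_allocation(routing_scores, total_budget):
    # Alloc-T-style sketch: tokens whose routing distribution is less
    # concentrated (higher entropy) receive more expert activations,
    # subject to the overall activation budget.
    logits = np.asarray(routing_scores, dtype=float)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)            # softmax over experts
    ent = -(p * np.log(p + 1e-12)).sum(axis=1)   # routing entropy per token
    share = ent / ent.sum()
    # proportional allocation, rounded down, with at least one expert each
    alloc = np.maximum(1, np.floor(share * total_budget)).astype(int)
    # hand any leftover budget to the highest-entropy tokens
    for i in np.argsort(-ent):
        if alloc.sum() >= total_budget:
            break
        alloc[i] += 1
    return alloc
```

A token with a sharply peaked routing distribution is already well served by its top expert, so spending extra activations on it yields little benefit compared to a token whose scores are spread across many experts.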
Results
The experiments conducted across multiple MoE models show that Alloc-MoE effectively maintains model performance under a constrained activation budget. Specifically, on the DeepSeek-V2-Lite model, Alloc-MoE achieves a 1.15× speedup in prefill and a 1.34× speedup in decode when using only half of the original expert activation budget, demonstrating its efficiency and effectiveness.
Implications
The findings suggest that Alloc-MoE can be applied in real-world scenarios where resource constraints are critical, such as mobile or edge computing environments. The framework can enhance the deployment of large language models by reducing latency while preserving accuracy, making it suitable for applications requiring efficient inference.
The Rhetoric of Machine Learning
Theory
- Machine learning is inherently rhetorical, aiming to persuade rather than merely report facts.
- The concept of 'manipulation as a service' highlights the commercial use of ML technologies.
- Viewing ML through the lens of rhetoric can provide fresh insights and stimulate new discussions.
- The paper critiques the notion of ML as a purely objective technology, emphasizing its socio-technical implications.
Read more
The Rhetoric of Machine Learning
Summary
In 'The Rhetoric of Machine Learning', Robert C. Williamson explores the intersection of machine learning (ML) and rhetoric, arguing that ML is not a neutral tool for building objective world models but is inherently rhetorical in nature. He posits that ML systems are designed to persuade users, often presenting their outputs as factual while manipulating perceptions. The paper discusses the concept of 'manipulation as a service' as a prevalent business model leveraging ML technologies. Williamson emphasizes the importance of understanding ML through a rhetorical lens, suggesting that this perspective can open new avenues for inquiry and discussion. He critiques the traditional views of ML as purely mathematical or technical, advocating for a socio-technical approach that considers the implications of ML in society. The paper is intended to stimulate dialogue rather than provide definitive answers, drawing on a diverse literature to challenge existing narratives about ML and its applications.
Methodology
The paper employs a rhetorical analysis framework to examine machine learning technologies, drawing on literature from various fields including philosophy, science, and economics. It synthesizes theoretical insights to challenge conventional views of ML.
Results
Williamson concludes that machine learning systems are not just technical tools but are embedded in socio-technical contexts that influence their interpretation and use. This perspective encourages a critical examination of how ML outputs are accepted and utilized in practice.
Implications
The findings suggest that stakeholders in ML development and deployment should be aware of the persuasive nature of these technologies. This awareness can lead to more responsible use and a better understanding of the societal impacts of ML systems.
Mathematical analysis of one-layer neural network with fixed biases, a new activation function and other observations
Theory
- Rigorous proof of convergence for one-hidden-layer neural networks with fixed biases using L2 loss and gradient descent.
- Introduction of a new activation function, FReX, which maintains convergence properties similar to ReLU.
- Establishment of the spectral bias property for the learning process.
- Discussion on the representability of functions and the uniqueness of parameterization in the proposed models.
Read more
Mathematical analysis of one-layer neural network with fixed biases, a new activation function and other observations
Summary
This paper presents a mathematical analysis of a simple one-hidden-layer neural network utilizing ReLU activation functions with fixed biases, focusing on one-dimensional input and output. The authors rigorously prove the convergence of the learning process using the L2 squared loss function and gradient descent, establishing the spectral bias property for the learning process. The analysis highlights the necessary properties that activation functions should possess and explores the relationship between the spectrum of certain operators and the learning process. A new activation function, the full-wave rectified exponential function (FReX), is proposed, which also demonstrates convergence in gradient descent. The study emphasizes that the ReLU activation function is effective due to its role as a fundamental solution to the one-dimensional Laplacian, and it conjectures that activation functions that are fundamental solutions to second-order differential operators may yield networks with favorable representability and convergence properties. The paper aims to provide insights into the basic mechanisms of neural networks, with a view towards extending the analysis to higher-dimensional models in future work.
Methodology
The authors analyze both continuous and discrete versions of a one-hidden-layer neural network model. They employ mathematical techniques from functional analysis and partial differential equations to rigorously prove convergence and spectral properties. The analysis includes the derivation of properties of the ReLU activation function and the proposed FReX activation function.
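A minimal sketch of the discrete setting, simplified so that only the output weights are trained (an assumption, not the paper's exact parameterization); with fixed-bias ReLU features the L2 objective is convex, so plain gradient descent converges:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def train_fixed_bias_net(x, y, num_hidden=20, lr=0.1, steps=3000):
    # Fixed biases on a grid over the data range; only the output
    # weights a_j of f(x) = sum_j a_j * relu(x - b_j) are trained,
    # so the L2 objective is convex in a and gradient descent converges.
    b = np.linspace(x.min(), x.max(), num_hidden)
    phi = relu(x[:, None] - b[None, :])          # (n, num_hidden) features
    a = np.zeros(num_hidden)
    n = len(x)
    for _ in range(steps):
        resid = phi @ a - y                      # f(x_i) - y_i
        a -= lr * (2.0 / n) * (phi.T @ resid)    # gradient of the mean L2 loss
    return a, b

# fit one period of a sine on [0, 1]
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x)
a, b = train_fixed_bias_net(x, y)
mse = np.mean((relu(x[:, None] - b[None, :]) @ a - y) ** 2)
```

The spectral bias discussed above shows up here as well: the smooth, low-frequency components of the target are fitted in far fewer gradient steps than the high-frequency residual.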
Results
The paper demonstrates that learning with the standard L2 loss function converges to a unique minimum for the analyzed models. It also proves the spectral bias property and establishes that the proposed FReX activation function retains similar convergence characteristics as the ReLU function.
Implications
The findings suggest that the structure and properties of activation functions are crucial for the convergence and representability of neural networks. The proposed FReX function may offer a new avenue for improving neural network performance, particularly in scenarios where traditional activation functions may fall short.
An Imperfect Verifier is Good Enough: Learning with Noisy Rewards
Reinforcement Learning
Large Language Models
- RLVR is robust to noisy rewards, with noise rates up to 15% showing minimal impact on performance.
- Imperfect verification does not fundamentally hinder RLVR effectiveness.
- Precision in verification is more critical than recall.
- Diminishing returns are observed when improving verifier accuracy beyond a certain point.
Read more
An Imperfect Verifier is Good Enough: Learning with Noisy Rewards
Summary
This paper investigates the robustness of Reinforcement Learning with Verifiable Rewards (RLVR) in the presence of noisy rewards, particularly focusing on the accuracy required from verifiers during training. The authors introduce noise into RL training in the domains of code generation and scientific reasoning, revealing that noise rates up to 15% do not significantly affect peak validation accuracy compared to a clean baseline. The study encompasses various model families (Qwen3, GLM4, Llama 3.1) and sizes (4B to 9B), demonstrating that RLVR can tolerate imperfect verification without a fundamental performance drop. The findings suggest that practitioners should prioritize moderate accuracy with high precision over striving for perfect verification, as the engineering effort to enhance verifier accuracy beyond a certain threshold yields diminishing returns. This work contributes to the understanding of how RLVR can be effectively applied in real-world scenarios where verifiers are often imperfect.
Methodology
The authors conducted experiments by introducing controlled and realistic noise patterns into the RL training process, including patterns that mimic rubric-based rewards in coding tasks. They evaluated the performance of various model families and sizes under different noise conditions to assess the impact on validation accuracy.
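The controlled-noise setup can be sketched as symmetric flipping of a binary verifier reward, an assumption standing in for the paper's full range of noise patterns:

```python
import numpy as np

def noisy_verifier_rewards(true_rewards, noise_rate, rng=None):
    # Flip each binary 0/1 reward with probability `noise_rate`,
    # simulating an imperfect verifier during RLVR training.
    rng = np.random.default_rng(rng)
    r = np.asarray(true_rewards, dtype=int)
    flips = rng.random(r.shape) < noise_rate
    return np.where(flips, 1 - r, r)
```

At a 15% noise rate, roughly one reward in seven is wrong, which is the regime the paper reports RLVR tolerating with minimal loss in peak validation accuracy.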
Results
The experiments revealed that noise rates up to 15% resulted in peak validation accuracy within 2 percentage points of the clean baseline, indicating strong robustness of RLVR to noise. The results were consistent across different noise types and model configurations, supporting the conclusion that an imperfect verifier is sufficient for effective training.
Implications
These findings suggest that RLVR can be effectively utilized in practical applications where verifiers are not perfect, such as in semi-verifiable domains like finance and law. The insights on prioritizing precision over perfect accuracy can guide practitioners in designing more efficient verification systems.
Benchmark Shadows: Data Alignment, Parameter Footprints, and Generalization in Large Language Models
Large Language Models
NLP
Multimodal
- Introduces a regime-centric framework linking data distribution to learning dynamics in LLMs.
- Demonstrates that benchmark-aligned data improves narrow metrics but limits broader capabilities.
- Shows that coverage-expanding data leads to better generalization and parameter adaptation.
- Presents parameter-space diagnostics to reveal structural signatures of training regimes.
Read more
Benchmark Shadows: Data Alignment, Parameter Footprints, and Generalization in Large Language Models
Summary
This paper investigates the relationship between data distribution and model performance in large language models (LLMs), particularly focusing on the discrepancy between benchmark scores and broader capabilities. The authors hypothesize that different training regimes, influenced by data distribution, affect how models develop their capabilities. They introduce a regime-centric framework that distinguishes between benchmark-aligned data, which enhances narrow evaluation metrics but limits broader representational development, and coverage-expanding data, which promotes better generalization through more distributed parameter adaptation. Through controlled experiments on a text-only decoder model and subsequent tests on multimodal systems, the authors reveal that the structure of training data plays a crucial role in shaping learning dynamics. They also propose parameter-space diagnostics based on spectral and rank analyses to identify distinct structural signatures associated with different training regimes. The findings suggest that benchmark performance alone is insufficient to characterize model capability, emphasizing the importance of data distribution in influencing learning outcomes.
Methodology
The authors conducted controlled experiments using a text-only decoder model to isolate the effects of different data distributions under fixed training conditions. They employed parameter-space diagnostics, including spectral and rank analyses, to characterize how different training regimes influence model representations across layers. The findings were validated through tests on multimodal systems to assess the broader applicability of the results.
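The paper's specific diagnostics are not detailed here; one standard spectral measure in this spirit is the entropy-based effective rank of a weight-update matrix (in the sense of Roy and Vetterli), sketched below:

```python
import numpy as np

def effective_rank(delta_w, eps=1e-12):
    # Entropy-based effective rank: exponential of the Shannon entropy
    # of the normalized singular-value distribution. Low values mean the
    # update is concentrated in a few directions (narrow adaptation);
    # high values mean distributed adaptation.
    s = np.linalg.svd(np.asarray(delta_w, dtype=float), compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))
```

Applied per layer to the difference between fine-tuned and base weights, such a measure would distinguish a benchmark-aligned regime (updates concentrated in few directions) from a coverage-expanding one (updates spread across many directions).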
Results
The study found that benchmark-aligned data enhances narrow evaluation metrics but restricts broader representational development, while coverage-expanding data fosters improved generalization and more distributed parameter adaptation. Similar patterns were observed across various open-source model families, indicating that these effects are not limited to controlled settings. The case study on prompt repetition illustrated that not all data artifacts induce regime shifts, underscoring the importance of semantic structure.
Implications
The findings suggest that model training should consider the distribution and composition of data to enhance generalization and robustness. This has implications for the design of training datasets and the evaluation of model capabilities, indicating that relying solely on benchmark scores may not provide a complete picture of a model's performance.
Tree-of-Evidence: Efficient 'System 2' Search for Faithful Multimodal Grounding
Multimodal
Interpretability
Optimization
- Tree-of-Evidence (ToE) improves interpretability of Large Multimodal Models (LMMs) by framing it as a discrete optimization problem.
- ToE employs lightweight Evidence Bottlenecks and a beam search strategy to identify compact evidence sets for model predictions.
- The method retains over 98% of the full model's predictive performance while significantly reducing the number of evidence units needed.
- ToE adapts its search strategy based on the ambiguity of the evidence, effectively integrating both time-series and textual data.
Read more
Tree-of-Evidence: Efficient 'System 2' Search for Faithful Multimodal Grounding
Summary
The paper introduces Tree-of-Evidence (ToE), an innovative inference-time search algorithm designed to enhance the interpretability of Large Multimodal Models (LMMs) in high-stakes domains like healthcare. Traditional interpretability methods often fail to accurately represent the decision-making processes of these models, particularly when integrating diverse data types such as time-series and text. ToE addresses this challenge by framing interpretability as a discrete optimization problem. It utilizes lightweight Evidence Bottlenecks to score groups of data and employs a beam search strategy to identify the minimal evidence set necessary to replicate the model's predictions. The authors evaluate ToE across six tasks involving three datasets, demonstrating that it can produce auditable evidence traces while maintaining predictive performance. The results indicate that ToE retains over 98% of the full-model AUROC with as few as five evidence units, achieving higher decision agreement and lower probability fidelity error compared to existing methods. Qualitative analyses reveal that ToE adapts its search strategy based on the nature of the evidence, effectively using vital signs for straightforward cases and incorporating textual data when physiological signals are ambiguous. This approach provides a practical mechanism for auditing multimodal models, revealing the specific evidence units that support each prediction.
Methodology
The Tree-of-Evidence (ToE) framework consists of three phases: training modality-specific classifiers, training lightweight Evidence Bottlenecks to score evidence units, and performing a beam search at inference time to construct a compact evidence set. The search optimizes for decision agreement, probability stability, and evidence sparsity, allowing for a structured exploration of multimodal data.
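The beam-search phase can be sketched generically, assuming a caller-supplied `score_fn` that combines decision agreement with a sparsity penalty (the `score_fn` and names below are illustrative, not the paper's API):

```python
def beam_search_evidence(units, score_fn, beam_width=3, max_size=5):
    # Grow candidate evidence subsets one unit at a time, keeping the
    # `beam_width` best under `score_fn` at each depth, and return the
    # best-scoring subset seen at any size up to `max_size`.
    beam = [frozenset()]
    best = (score_fn(frozenset()), frozenset())
    for _ in range(max_size):
        candidates = set()
        for subset in beam:
            for u in units:
                if u not in subset:
                    candidates.add(subset | {u})
        if not candidates:
            break
        ranked = sorted(candidates, key=score_fn, reverse=True)
        beam = ranked[:beam_width]
        if score_fn(beam[0]) > best[0]:
            best = (score_fn(beam[0]), beam[0])
    return best[1]
```

Because the sparsity penalty enters through `score_fn`, the search naturally stops growing the evidence set once an additional unit no longer improves agreement enough to pay for its inclusion.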
Results
ToE was evaluated on six tasks across three datasets, including clinical prediction tasks on MIMIC-IV and eICU, as well as non-clinical fault detection on LEMMA-RCA. The results showed that ToE achieved over 98% AUROC retention with as few as five evidence units, demonstrating higher decision agreement and lower probability fidelity error compared to other interpretability methods.
Implications
The Tree-of-Evidence framework has significant implications for the deployment of multimodal models in high-stakes environments, such as healthcare, where interpretability and auditability are crucial. By providing clear evidence traces, ToE enhances the trustworthiness of model predictions and facilitates better decision-making by domain experts.
Latent Structure of Affective Representations in Large Language Models
NLP
Large Language Models
Interpretability
- LLMs learn coherent latent representations of affective emotions that align with psychological valence-arousal models.
- The representations exhibit modest nonlinearity, challenging the purely linear representation hypothesis.
- The geometry of these representations can be leveraged to quantify uncertainty in emotion recognition tasks.
- Complementary evidence includes parallels with human neural data and causal steering experiments for emotional valence.
Read more
Latent Structure of Affective Representations in Large Language Models
Summary
This paper investigates the latent structure of affective representations in large language models (LLMs), focusing on how these models encode emotions. The authors highlight the significance of understanding the geometric organization of emotions, which can enhance model transparency and safety. They employ geometric data analysis tools to explore how LLMs, specifically Gemma-2-9B, Mistral-7B, and LLaMA-3-70B-Instruct, develop coherent representations of emotions that align with established psychological models of valence and arousal. The study reveals that while these representations exhibit some nonlinearity, they can still be approximated linearly, supporting the linear representation hypothesis. Additionally, the authors demonstrate that the learned latent spaces can be utilized to quantify uncertainty in emotion processing tasks, suggesting practical applications for model interpretability and safety. The findings contribute to the ongoing debate about the geometry of representations in LLMs and provide insights into how these models may parallel human emotional understanding.
Methodology
The authors employed geometric data analysis tools, including classical multidimensional scaling (MDS) and Isometric Feature Mapping (Isomap), to recover latent space representations from LLM embeddings. They designed text-based emotion classification tasks to probe the organization of affective dimensions within the models.
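Classical MDS, one of the two tools named above, can be sketched from a pairwise distance matrix (a generic implementation, not the authors' pipeline):

```python
import numpy as np

def classical_mds(dist_matrix, n_components=2):
    # Classical (Torgerson) MDS: double-center the squared distances to
    # recover a Gram matrix, then embed along its top eigenvectors.
    d2 = np.asarray(dist_matrix, dtype=float) ** 2
    n = d2.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    gram = -0.5 * j @ d2 @ j
    evals, evecs = np.linalg.eigh(gram)              # ascending eigenvalues
    idx = np.argsort(evals)[::-1][:n_components]
    lam = np.clip(evals[idx], 0.0, None)             # guard tiny negatives
    return evecs[:, idx] * np.sqrt(lam)
```

Run on distances between emotion-word embeddings, a two-component solution of this kind is what would reveal a planar valence-arousal layout if one is present; Isomap differs only in replacing raw distances with geodesic distances along a neighborhood graph.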
Results
The study found that LLMs develop coherent internal representations of emotions consistent with established psychological models. Evidence of modest nonlinearity in the geometry of these representations was observed, and the structured representation spaces were shown to be useful for quantifying predictive uncertainty in emotion recognition tasks.
Implications
The findings suggest that understanding the latent structure of affective representations in LLMs can enhance model interpretability and safety, providing insights into how these models may mirror human emotional processing. This has potential applications in affective computing and AI safety measures.