AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
70
Papers today
8h
Update frequency
7
Days of history
Learning Retrieval Models with Sparse Autoencoders
NLP
Large Language Models
Efficient ML
- Introduction of SPLARE, a new LSR method utilizing Sparse Autoencoders.
- SAEs provide semantically structured, expressive, and language-agnostic features for retrieval.
- SPLARE outperforms traditional vocabulary-based LSR methods in multilingual and out-of-domain settings.
- The SPLARE-7B model achieves competitive results on the MMTEB benchmark.
Summary
This paper introduces SPLARE, a novel method for Learned Sparse Retrieval (LSR) that leverages Sparse Autoencoders (SAEs) to create interpretable and efficient retrieval models. Unlike traditional LSR approaches that rely on vocabulary-based representations, SPLARE utilizes SAEs to produce high-dimensional sparse representations that are semantically structured and language-agnostic. The authors argue that SAEs are particularly well-suited for LSR, as they can effectively encode queries and documents into sparse vectors over a latent vocabulary space. The paper presents a systematic evaluation of SPLARE across various benchmarks, demonstrating its superiority over existing vocabulary-based LSR methods, especially in multilingual and out-of-domain contexts. The SPLARE-7B model achieves top results on the MMTEB multilingual retrieval tasks and is capable of supporting over 100 languages through cross-lingual transfer. Additionally, a lighter 2B-parameter variant is introduced, showcasing the potential for efficient deployment in real-world applications.
Methodology
The authors propose SPLARE, which integrates pre-trained Sparse Autoencoders into the LSR framework. The method involves fine-tuning a Large Language Model (LLM) while keeping the SAE frozen, allowing for the generation of sparse latent representations of input tokens. These representations are aggregated into a single sparse vector using a pooling mechanism similar to SPLADE.
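The pooling step described above can be sketched in a few lines. This is a hedged numpy illustration assuming SPLADE-style log-saturated max pooling (as the text states "similar to SPLADE"); the function and variable names are hypothetical, not the authors' code:

```python
import numpy as np

def sae_sparse_pool(token_latents):
    """Aggregate per-token SAE latent activations into one sparse
    query/document vector via SPLADE-style log-saturated max pooling.
    token_latents: (num_tokens, latent_dim), non-negative activations."""
    saturated = np.log1p(np.maximum(token_latents, 0.0))
    return saturated.max(axis=0)

# Toy input: 3 tokens over an 8-dimensional latent vocabulary.
latents = np.zeros((3, 8))
latents[0, 1] = 2.0   # token 0 fires latent feature 1
latents[2, 5] = 0.5   # token 2 fires latent feature 5
doc_vec = sae_sparse_pool(latents)
```

The resulting vector stays sparse (only features some token activated are nonzero), which is what makes inverted-index retrieval over the latent vocabulary feasible.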
Results
SPLARE consistently outperforms existing vocabulary-based LSR models across a comprehensive set of benchmarks, particularly excelling in multilingual and out-of-domain retrieval tasks. The SPLARE-7B model achieves top results on the MMTEB retrieval tasks, demonstrating its effectiveness in producing generalizable sparse latent embeddings.
Implications
The findings suggest that using Sparse Autoencoders for retrieval tasks can enhance the interpretability and efficiency of search systems, particularly in multilingual contexts. SPLARE's architecture could be applied to various domains requiring robust retrieval performance, potentially influencing the design of future retrieval models.
Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis
Theory
- Statistically consistent methods for LNL often underperform compared to empirical approaches despite theoretical guarantees.
- Providing a perfect noise transition matrix does not resolve the performance issues of noise correction methods.
- The failure of noise correction is attributed to deeper limitations in the corrected objective rather than just T estimation.
- A comprehensive analysis reveals three levels of understanding: macroscopic convergence, microscopic dynamics, and information-theoretic limits.
Summary
This paper investigates the persistent performance gap between statistically consistent methods for Learning with Noisy Labels (LNL) and empirical approaches. The authors challenge the common belief that the failure of noise correction methods is primarily due to difficulties in estimating the noise transition matrix (T). They conduct experiments using a perfect oracle transition matrix to isolate the core mechanisms of noise correction methods. Surprisingly, even under ideal conditions, these methods exhibit performance collapse during training, indicating that the issue is not merely an estimation problem but rather a deeper flaw in the corrected objective itself. The authors provide a unified analysis linking macroscopic convergence states, microscopic optimization dynamics, and information-theoretic limits, ultimately offering insights into why ideal noise correction fails and suggesting directions for developing more reliable LNL methods.
Methodology
The authors conducted controlled experiments using a perfect oracle transition matrix to evaluate the performance of noise correction methods. They employed a minimalistic experimental setup to isolate the core mechanisms and analyzed the results through macroscopic and microscopic perspectives, as well as an information-theoretic lens.
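The oracle setting can be made concrete with Forward Correction, the method named in the results: the clean-label softmax is pushed through a known transition matrix before the loss. A minimal sketch (the 3-class matrix and noise rate below are illustrative, not the paper's setup):

```python
import numpy as np

def forward_corrected_nll(logits, noisy_label, T):
    """Forward Correction with a known (oracle) transition matrix T,
    where T[i, j] = P(noisy label j | clean label i): the clean-label
    softmax is mapped through T before taking the negative log-likelihood."""
    p_clean = np.exp(logits - logits.max())
    p_clean /= p_clean.sum()
    p_noisy = T.T @ p_clean            # predicted distribution over noisy labels
    return -np.log(p_noisy[noisy_label])

# Symmetric 30% label noise over 3 classes (illustrative oracle T).
T = np.full((3, 3), 0.15) + 0.55 * np.eye(3)
loss = forward_corrected_nll(np.array([3.0, 0.0, 0.0]), 0, T)
```

With T equal to the identity, this reduces to the standard cross-entropy loss; the paper's point is that even with the true T plugged in, training on this corrected objective still collapses.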
Results
The experiments demonstrated that even with a perfect transition matrix, noise correction methods like Forward Correction exhibited a rise-and-fall dynamic in performance, ultimately converging to poor results similar to uncorrected training. This finding indicates that the failure of these methods is not solely due to T estimation issues but reflects fundamental limitations in the noise correction objectives.
Implications
The insights from this study could guide the development of more robust and reliable methods for handling noisy labels in machine learning, potentially improving model generalization and performance in real-world applications where label noise is prevalent.
Do Diffusion Models Dream of Electric Planes? Discrete and Continuous Simulation-Based Inference for Aircraft Design
Generative Models
Optimization
Robotics
- Introduction of a hierarchical sampling approach using two diffusion models for eVTOL design.
- MixeDiT enables joint sampling of discrete and continuous parameters, improving design flexibility.
- MaskeDiT supports inference over variable-dimensional design spaces, addressing challenges in traditional SBI.
- First comprehensive application of SBI to a realistic, large-scale aerospace design problem with 144 topologies and up to 136 parameters.
Summary
This paper presents a novel approach to conceptual aircraft design, specifically for electric vertical take-off and landing (eVTOL) aircraft, using simulation-based inference (SBI). The authors introduce a hierarchical probabilistic model that employs two diffusion models to sample from a complex design space comprising both discrete topologies and continuous parameters. The first model, the Mixed Diffusion Transformer (MixeDiT), facilitates joint sampling of discrete configurations and continuous parameters, while the second model, the Masked Diffusion Transformer (MaskeDiT), samples parameters conditioned on the selected topology. This dual-model approach allows for efficient exploration of the design space, addressing challenges such as variable-dimensional designs and the need for high-dimensional posterior distributions. The authors validate their methodology using the SUAVE tool for conceptual aircraft analysis, demonstrating that their approach can rediscover known design trends and significantly accelerate the design generation process. The paper contributes a comprehensive dataset of 30,276 eVTOL designs, enhancing the existing open-source simulation resources for aerospace design.
Methodology
The authors utilize a hierarchical probabilistic model consisting of two diffusion models: MixeDiT for joint sampling of discrete topologies and continuous parameters, and MaskeDiT for sampling parameters conditioned on the selected topology. The designs are represented as vectorizable tree structures, allowing for efficient exploration of the design space. The methodology is validated using the SUAVE tool for quantitative and qualitative analyses.
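One way to make "variable-dimensional design spaces" concrete is a fixed-length parameter vector plus a binary mask of active slots per topology, which is the kind of representation a masked diffusion model can condition on. This is a hypothetical sketch — the topology names and slot counts are invented for illustration, not taken from the paper:

```python
import numpy as np

# Every topology shares one fixed-length parameter vector; a binary mask
# marks which slots are active for that topology (names are illustrative).
MAX_PARAMS = 6
TOPOLOGY_SLOTS = {
    "multirotor":  [0, 1, 2, 3],        # 4 active parameters
    "lift_cruise": [0, 1, 2, 3, 4, 5],  # 6 active parameters
}

def masked_design(topology, params):
    """Embed a variable-length parameter list into the shared fixed-size
    vector, returning the padded vector and its activity mask."""
    vec = np.zeros(MAX_PARAMS)
    mask = np.zeros(MAX_PARAMS)
    slots = TOPOLOGY_SLOTS[topology]
    vec[slots] = params
    mask[slots] = 1.0
    return vec, mask

vec, mask = masked_design("multirotor", [0.4, 1.2, -0.3, 0.9])
```

Inactive slots stay zeroed and masked out, so one network can handle all 144 topologies without a separate model per dimensionality.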
Results
The MixeDiT-MaskeDiT architecture successfully samples a full eVTOL design, rediscovering known trends in aircraft design and significantly accelerating the design generation process. The evaluation demonstrates the model's effectiveness in capturing desired performance metrics and adhering to physical laws governing aircraft design.
Implications
This work has significant implications for the aerospace industry, particularly in the design of eVTOL aircraft. The proposed methodology can streamline the design process, reduce the time and resources required for simulations, and enhance the exploration of innovative design configurations. Additionally, the comprehensive dataset provided can serve as a valuable resource for future research in aircraft design and simulation-based inference.
PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization
Generative Models
Optimization
Robotics
- Introduction of PhysMoDPO, a framework for optimizing humanoid motion generation.
- Integration of a Whole-Body Controller into the training pipeline for improved physical compliance.
- Use of physics-based and task-specific rewards for effective optimization.
- Demonstrated improvements in physical realism and task performance in simulations.
Summary
The paper presents PhysMoDPO, a novel framework for generating physically plausible humanoid motions using a Direct Preference Optimization approach. Building on recent advancements in text-conditioned human motion generation through diffusion models, the authors address the challenge of ensuring that motions generated by these models remain compliant with physical constraints when executed in robotics applications. Traditional methods often rely on hand-crafted heuristics that can lead to significant deviations from the intended motion. In contrast, PhysMoDPO integrates a Whole-Body Controller (WBC) into the training pipeline, optimizing the diffusion model to produce outputs that are both physically realistic and aligned with textual instructions. The framework employs physics-based and task-specific rewards to guide the optimization process, ensuring that generated motions are feasible under dynamic conditions. Extensive experiments demonstrate that PhysMoDPO consistently enhances physical realism and task performance in simulated environments, and it successfully transfers to real-world applications on a G1 humanoid robot without requiring additional motion refinement. This work highlights the potential of combining generative models with physics-guided optimization to advance humanoid robotics.
Methodology
PhysMoDPO employs a Direct Preference Optimization framework that integrates a Whole-Body Controller (WBC) into the training of diffusion models. It generates multiple candidate motions based on input conditions, evaluates them using physics-based and task-specific rewards, and optimizes the model to ensure outputs are both physically feasible and aligned with the intended motion.
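The generic Direct Preference Optimization objective behind this pipeline can be written in one function. This sketch shows only the standard DPO loss on a (preferred, rejected) pair — in PhysMoDPO the pair would be ranked by the physics-based and task-specific rewards after WBC execution, and beta here is an illustrative default:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization on one (preferred, rejected) pair
    of candidate motions: increase the policy's log-probability ratio on
    the reward-preferred sample relative to a frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference on both samples the margin is zero and the loss is log 2; widening the ratio in favor of the preferred motion drives it down.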
Results
The experiments show that PhysMoDPO significantly improves physical realism and task-related metrics in simulated environments. Additionally, the framework allows for zero-shot motion transfer to the Unitree G1 humanoid robot, demonstrating its effectiveness in real-world applications.
Implications
The findings suggest that combining generative models with physics-guided optimization can enhance the reliability of humanoid motion generation for robotics, paving the way for more advanced applications in animation, virtual avatars, and human-robot interaction.
LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing
Large Language Models
Efficient ML
- Introduces a novel expert replacing paradigm to reduce redundancy in MoE models.
- LightMoE framework enhances expert selection and recovery strategies.
- Achieves competitive performance with significant compression ratios.
- Demonstrates improvements over existing expert compression methods.
Summary
The paper introduces LightMoE, a novel framework aimed at reducing the memory demands of Mixture-of-Experts (MoE) models in Large Language Models (LLMs) by employing an expert replacing strategy. Traditional MoE models face challenges due to high memory usage from numerous expert modules, which limits their deployment. Existing compression techniques like pruning and merging often lead to irreversible knowledge loss or high training costs. LightMoE addresses these issues by replacing redundant experts with parameter-efficient modules while maintaining model performance through a low-cost recovery process. The framework incorporates adaptive expert selection, hierarchical expert construction, and an annealed recovery strategy. Experimental results demonstrate that LightMoE achieves performance comparable to LoRA fine-tuning at a 30% compression ratio and outperforms existing methods with a 50% compression rate, showing an average performance improvement of 5.6% across five diverse tasks. This indicates that LightMoE effectively balances memory efficiency, training efficiency, and model performance, making it a promising approach for deploying MoE models in practical applications.
Methodology
The methodology involves three key stages: (1) adaptive thresholding for selecting less important experts, (2) hierarchical construction of shared bases with task-specific low-rank adaptation parameters, and (3) an annealed recovery process that gradually integrates original experts into the new structure during fine-tuning.
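Stage (2) — a shared base plus task-specific low-rank parameters in place of a full expert — can be sketched as follows. The dimensions, initialization, and names are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4   # hidden width and low-rank width (illustrative sizes)

B_shared = rng.standard_normal((d, d)) / np.sqrt(d)  # base shared by replaced experts

def replaced_expert(x, U, V):
    """Parameter-efficient stand-in for a redundant expert: the shared
    base plus a task-specific low-rank correction U @ V."""
    return x @ (B_shared + U @ V)

U = 0.01 * rng.standard_normal((d, r))
V = 0.01 * rng.standard_normal((r, d))
y = replaced_expert(rng.standard_normal((1, d)), U, V)

full_params_per_expert = d * d          # a dense expert weight matrix
lowrank_params_per_expert = 2 * d * r   # only U and V are stored per expert
```

The memory argument is visible in the last two lines: beyond the one shared base, each replaced expert costs 2dr parameters instead of d², which is where the 30-50% compression headroom comes from.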
Results
LightMoE matches the performance of LoRA fine-tuning at a 30% compression ratio and outperforms existing methods by an average of 5.6% at a 50% compression rate across five diverse tasks, demonstrating its effectiveness in maintaining model performance while reducing memory usage.
Implications
The findings suggest that LightMoE can facilitate the deployment of large-scale MoE models in resource-constrained environments, potentially broadening the applicability of LLMs in real-world scenarios where memory efficiency is critical.
Scaling Laws and Pathologies of Single-Layer PINNs: Network Width and PDE Nonlinearity
Theory
Optimization
- Identifies dual optimization failures in Single-Layer PINNs: baseline and compounding pathologies.
- Demonstrates that scaling behavior is governed by a complex, non-separable relationship rather than a simple power law.
- Highlights spectral bias as a significant factor hindering the learning of high-frequency solution components.
- Proposes a systematic methodology for measuring scaling effects in PINNs across different PDE types.
Summary
This paper investigates the empirical scaling laws governing Single-Layer Physics-Informed Neural Networks (PINNs) when applied to canonical nonlinear partial differential equations (PDEs). The author identifies two primary optimization pathologies: a baseline pathology where the solution error does not decrease with increasing network width, and a compounding pathology where this issue worsens with nonlinearity. The study reveals that a simple separable power law is inadequate to describe the scaling behavior, which is instead governed by a complex, non-separable relationship. The findings suggest that the primary bottleneck in training PINNs is optimization rather than approximation capacity, particularly due to the phenomenon of spectral bias, where networks struggle to learn high-frequency components of solutions that become more pronounced with increased nonlinearity. The paper proposes a methodology to empirically measure these scaling effects and tests a hypothesis regarding the relationship between network width, nonlinearity, and optimization challenges across various canonical PDEs.
Methodology
The study employs a systematic approach to analyze the scaling laws of Single-Layer PINNs by minimizing a total loss function that incorporates mean squared residuals for the PDE, boundary conditions, and initial conditions. The experiments involve training on a suite of canonical PDEs with varying nonlinearity, using the Adam optimizer over 25,000 epochs.
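The total loss described above has a standard composite form. A minimal sketch, verified on the trivial ODE u'(x) = 1 with u(0) = 0 (the weights, finite-difference residual, and function names are illustrative — the study uses automatic differentiation on real PDEs):

```python
import numpy as np

def pinn_total_loss(u, residual_fn, bc_target, ic_target,
                    pts_int, pts_bc, pts_ic, w_pde=1.0, w_bc=1.0, w_ic=1.0):
    """Composite PINN objective: mean squared PDE residual at interior
    collocation points plus boundary- and initial-condition penalties."""
    l_pde = np.mean(residual_fn(u, pts_int) ** 2)
    l_bc = np.mean((u(pts_bc) - bc_target(pts_bc)) ** 2)
    l_ic = np.mean((u(pts_ic) - ic_target(pts_ic)) ** 2)
    return w_pde * l_pde + w_bc * l_bc + w_ic * l_ic

# Sanity check: the exact solution u(x) = x should drive the loss to ~0.
h = 1e-4
residual = lambda u, x: (u(x + h) - u(x - h)) / (2 * h) - 1.0  # u' - 1
u_exact = lambda x: x
loss = pinn_total_loss(u_exact, residual, lambda x: x, lambda x: x,
                       np.linspace(0.1, 0.9, 9), np.array([0.0, 1.0]),
                       np.array([0.0]))
```

The pathologies in the paper concern minimizing exactly this kind of objective with Adam as width grows, not the loss definition itself.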
Results
The results indicate that the scaling exponent for the network width does not conform to theoretical expectations (α ≠ 0.5) and that the relationship between width and nonlinearity is non-separable. The empirical measurements across different PDEs confirm the presence of optimization pathologies that hinder performance, particularly in complex nonlinear scenarios.
Implications
The findings have significant implications for the design and training of PINNs, suggesting that optimization strategies need to be adapted to account for the unique challenges posed by nonlinearity and network architecture. This could lead to improved methodologies for solving complex physical systems modeled by PDEs.
Level Up: Defining and Exploiting Transitional Problems for Curriculum Learning
Theory
Efficient ML
Interpretability
- Introduces a method for identifying transitional problems that mark competence boundaries in machine learning models.
- Demonstrates that a curriculum based on these transitional problems significantly improves learning efficiency.
- Establishes a direct measure of problem difficulty relative to model competence, enhancing interpretability.
- Validates the approach through experiments in chess and mathematics, outperforming traditional training strategies.
Summary
This paper addresses the challenges of curriculum learning in machine learning, which involves ordering training examples to enhance learning efficiency. The authors critique existing methods that either rely on static difficulty measures or dynamic approaches that are computationally expensive. They propose a novel framework for identifying 'transitional problems'—specific instances that serve as benchmarks between competence levels of a model. By measuring the difficulty of problems relative to the model's ability, the authors create a learner-specific curriculum that progresses from easier to harder problems. The effectiveness of this approach is demonstrated through experiments in chess and mathematics, showing that training on these transitional problems leads to superior learning outcomes compared to traditional methods. The findings suggest that this curriculum structure not only improves sample efficiency but also aligns closely with human learning paradigms, offering a more interpretable and principled approach to curriculum learning.
Methodology
The authors define a series of models with increasing performance levels and identify transitional problems that can be solved by models at or above a certain competence level. They employ empirical evaluations to characterize these problems and construct a curriculum that reflects human learning levels, avoiding the computational costs of dynamic curricula.
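The definition above — a problem solved by models at or above some competence level but not below — can be sketched as a small predicate. This is one monotonicity reading of "transitional", with hypothetical inputs; the paper's exact criterion may differ:

```python
def transition_level(solved_by_level):
    """Return the lowest competence level that solves the problem,
    provided every stronger level also solves it; otherwise None
    (the problem is then not 'transitional' under this reading).
    solved_by_level: dict mapping model level -> bool (solved?)."""
    levels = sorted(solved_by_level)
    first = next((l for l in levels if solved_by_level[l]), None)
    if first is not None and all(solved_by_level[l] for l in levels if l > first):
        return first
    return None
```

A problem first solved at level 2 and by everything stronger marks the level-1/level-2 competence boundary; a non-monotone solve pattern is rejected.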
Results
The experiments conducted in chess and mathematics reveal that the proposed level-up curriculum based on transitional problems leads to better performance than other data ordering strategies, including i.i.d. sampling. The identified transitional problems correlate strongly with human measures of difficulty, indicating the method's effectiveness and interpretability.
Implications
This research has potential implications for improving training methodologies in various machine learning applications, particularly in domains where structured learning paths can enhance model performance. It suggests a shift towards more learner-specific curricula that can adapt to the evolving capabilities of models.
TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
Large Language Models
Efficient ML
Optimization
- Introduction of hindsight-optimal reasoning length (HORL) for determining optimal exit points in reasoning tasks.
- Development of TERMINATOR, an inference-time early-exit algorithm that reduces unnecessary computation in LRMs.
- Creation of a novel dataset for optimal reasoning lengths based on the first logical arrival of answers.
- Significant reductions in CoT lengths (14%-55%) while outperforming existing methods across multiple datasets.
Summary
The paper introduces TERMINATOR, an innovative early-exit strategy for Large Reasoning Models (LRMs) that addresses the issue of overthinking during Chain-of-Thought (CoT) reasoning. LRMs are known for their ability to perform complex reasoning tasks but often generate excessive intermediate tokens, leading to inefficient computation. The authors propose a novel concept called hindsight-optimal reasoning length (HORL), which identifies the optimal point at which an LRM can exit its reasoning process without sacrificing accuracy. By leveraging the first logical arrival of the final answer, the authors create a dataset to train TERMINATOR, which significantly reduces CoT lengths by 14% to 55% across four challenging datasets: MATH-500, AIME 2025, HumanEval, and GPQA. This approach not only enhances computational efficiency but also maintains or improves performance compared to existing state-of-the-art methods.
Methodology
The authors designed TERMINATOR as a binary probe classifier that predicts whether to exit the reasoning process at each CoT token. By analyzing the token-level confidence and usage distribution, they identified the first logical arrival of the final answer, which served as a signal for early-exiting. A novel dataset was constructed to train this classifier, allowing for effective inference-time early-exit decisions.
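The labeling signal — the first logical arrival of the final answer — can be sketched as follows. This assumes "arrival" means the earliest step whose intermediate answer matches the final answer and never changes afterwards; the helper and the toy answer sequence are hypothetical:

```python
def first_logical_arrival(step_answers):
    """Index of the earliest reasoning step whose intermediate answer
    equals the final answer and stays equal from then on -- used here
    as the hindsight-optimal exit point for labeling probe data."""
    final = step_answers[-1]
    for i in range(len(step_answers)):
        if all(a == final for a in step_answers[i:]):
            return i
    return len(step_answers) - 1

answers = ["3", "5", "5", "5"]        # hypothetical per-step answers
cut = first_logical_arrival(answers)  # probe target: exit from this step on
labels = ["continue"] * cut + ["exit"] * (len(answers) - cut)
```

Everything after the cut is labeled "exit", giving the binary probe its per-token supervision.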
Results
TERMINATOR achieved a reduction in CoT lengths ranging from 14% to 55% across four practical datasets, demonstrating its effectiveness in optimizing reasoning processes. The method outperformed current state-of-the-art approaches, showcasing significant improvements in computational efficiency while maintaining high accuracy.
Implications
The findings suggest that implementing early-exit strategies like TERMINATOR can lead to more efficient use of computational resources in LRMs, making them more practical for real-world applications. This approach could be particularly beneficial in scenarios where computational costs are critical, such as in large-scale deployments of AI systems.
SciDesignBench: Benchmarking and Improving Language Models for Scientific Inverse Design
NLP
Large Language Models
Reinforcement Learning
- Introduction of SciDesignBench, a benchmark with 520 tasks across 14 scientific domains for evaluating language models in inverse design.
- Demonstration that existing language models struggle with one-turn de novo design, achieving only 29% success.
- Long-horizon feedback utilization is a distinct capability, with different models excelling in various task settings.
- Introduction of RLSF, a simulator-feedback training recipe that improves model performance in scientific design tasks.
Summary
The paper introduces SciDesignBench, a comprehensive benchmark designed to evaluate and enhance the capabilities of language models in solving scientific inverse design problems. Inverse design involves finding a suitable input that achieves a desired output, which is a challenging task due to the complexity of the design space. SciDesignBench consists of 520 simulator-grounded tasks across 14 scientific domains, categorized into various evaluation settings such as single-shot design and long-horizon refinement. The authors evaluate seven advanced language models, revealing that while these models possess substantial scientific knowledge, they struggle with one-turn de novo design tasks, achieving only 29% success. The study highlights the importance of simulator feedback, which varies in effectiveness depending on the task's complexity and the model's capabilities. Additionally, the authors propose a novel training approach, RLSF, which utilizes simulator feedback to improve model performance. The results demonstrate that a model trained with RLSF significantly enhances success rates in specific scientific design domains, showcasing the potential of simulator-backed training for advancing language model capabilities in scientific applications.
Methodology
The authors developed SciDesignBench, comprising 520 tasks with a frozen forward oracle for validation. They evaluated seven frontier language models across five evaluation modes. The RLSF training approach involved fine-tuning a model with simulator-generated (goal, design) pairs followed by applying Group Relative Policy Optimization using simulator feedback as the reward signal.
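The core of Group Relative Policy Optimization is replacing a learned value function with group-normalized rewards. A minimal sketch of that advantage computation, with illustrative reward values (the rollout and policy-update machinery around it is omitted):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each sampled design's simulator
    reward against its own group's mean and standard deviation, so no
    learned value function is required."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

adv = group_relative_advantages([0.2, 0.9, 0.4, 0.9])  # illustrative rewards
```

Designs the simulator scores above the group mean get positive advantage and are reinforced; below-mean designs are pushed down, which is how simulator feedback becomes the reward signal.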
Results
The evaluation revealed that the best zero-shot model achieved only 29% success in one-turn de novo design. The introduction of simulator feedback improved performance, with RLSF-tuned models showing an increase in success rates by 8-17 percentage points across three domains. Specific improvements included a rise from 30% to 41% in ADMET optimization and from 42% to 59% in molecular docking.
Implications
The findings suggest that while frontier language models have significant scientific knowledge, they require further development to effectively tackle inverse design problems. The RLSF training approach could serve as a foundation for enhancing model capabilities in scientific applications, potentially leading to more effective designs in fields such as drug discovery and materials science.
Robust Self-Training with Closed-loop Label Correction for Learning from Noisy Labels
Theory
Optimization
Efficient ML
- Introduces a self-training framework that synergistically optimizes a classifier and a label correction function.
- Provides theoretical guarantees for model stability during noise correction.
- Achieves state-of-the-art performance on CIFAR and Clothing1M datasets with reduced training time.
- Utilizes intermediate feature representations for richer information transfer.
Summary
This paper addresses the challenge of training deep neural networks with noisy labels, which often leads to performance degradation. The authors propose a novel self-training label correction framework that utilizes decoupled bilevel optimization, allowing a classifier and a neural correction function to co-evolve. By leveraging a small clean dataset, the method employs noisy posterior simulation and intermediate features to transfer ground-truth knowledge, forming a closed-loop feedback system that mitigates error amplification. Theoretical guarantees are provided to support the stability of the approach. Extensive experiments on benchmark datasets such as CIFAR and Clothing1M demonstrate that the proposed method achieves state-of-the-art performance while reducing training time, showcasing its practical applicability in scenarios with limited clean data. The framework enhances the utilization of noisy samples and avoids the high computational costs associated with existing methods.
Methodology
The proposed method employs a self-training framework where a classifier and a neural correction function co-optimize using clean labels to guide the correction of noisy labels. It utilizes noisy posterior simulation and intermediate features to facilitate the transfer of ground-truth knowledge, forming a closed-loop feedback system that prevents error amplification.
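The flavor of label correction can be illustrated with a toy stand-in: blend the noisy one-hot label with the classifier's posterior. In the paper the correction function is a neural network co-optimized with the classifier via bilevel optimization; the fixed trust weight below is purely illustrative:

```python
import numpy as np

def corrected_target(noisy_onehot, model_probs, alpha=0.5):
    """Toy correction: convex blend of the noisy one-hot label and the
    classifier posterior, renormalized to a distribution. The paper
    *learns* this mapping jointly; alpha is an illustrative fixed weight."""
    t = alpha * model_probs + (1.0 - alpha) * noisy_onehot
    return t / t.sum()

noisy = np.array([0.0, 1.0, 0.0])   # noisy label says class 1
probs = np.array([0.8, 0.1, 0.1])   # classifier disagrees
target = corrected_target(noisy, probs)
```

The closed-loop aspect is that the corrected targets retrain the classifier, whose improved posterior then feeds back into the correction, with the small clean set anchoring the loop against error amplification.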
Results
The experiments conducted on CIFAR and Clothing1M datasets show that the proposed method not only achieves significant performance improvements over existing state-of-the-art methods but also reduces training time, validating its effectiveness in large-scale scenarios with limited clean data.
Implications
The findings suggest that the proposed framework can be effectively applied in real-world scenarios where obtaining clean labels is challenging, enhancing the robustness of models trained on noisy datasets. This has potential applications in various domains, including image classification and other machine learning tasks where label noise is prevalent.
A Kolmogorov-Arnold Surrogate Model for Chemical Equilibria: Application to Solid Solutions
Efficient ML
- Introduction of Kolmogorov-Arnold networks as a surrogate model for chemical equilibria.
- First application of data-driven models to co-precipitation with radionuclide incorporation.
- Significant reduction in prediction errors compared to traditional multilayer perceptrons.
- Demonstration of KANs' effectiveness in handling complex thermodynamic systems.
Summary
This paper addresses the computational challenges associated with geochemical solvers in reactive transport simulations, which often require extensive chemical calculations. The authors propose a novel surrogate model based on Kolmogorov-Arnold networks (KANs), which utilize learnable spline-based functions instead of traditional fixed activation functions. This approach aims to enhance accuracy while reducing the number of trainable parameters. The study begins by training the KAN surrogate model on a benchmark cement system and subsequently applies it to the geological disposal of nuclear waste, specifically focusing on the solubility of radionuclide-bearing solids. The paper is notable for being the first to explore co-precipitation with radionuclide incorporation using data-driven surrogate models, addressing varying levels of thermodynamic complexity in solid solutions. The results demonstrate that KANs significantly outperform multilayer perceptrons (MLPs), achieving a 62% reduction in absolute error and a 59% reduction in relative error on the cement benchmark. For the binary and ternary solid solution models, KANs maintain median prediction errors around 1 × 10⁻³, marking a promising step towards accelerating reactive transport simulations and improving safety assessments for deep geological waste repositories.
Methodology
The authors trained a Kolmogorov-Arnold network surrogate model on a benchmark cement system and applied it to radionuclide-bearing solid solubility calculations. The model replaces fixed activation functions with learnable spline-based functions to improve accuracy and reduce parameters. Performance was compared against multilayer perceptrons using error metrics.
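The defining ingredient — a learnable 1-D function on each edge instead of a fixed activation — can be sketched with a piecewise-linear spline. The real model uses B-spline bases, and the knot grid and coefficients below are illustrative, not trained values:

```python
import numpy as np

def edge_spline(x, knots, coeffs):
    """One KAN edge: a learnable 1-D activation, represented here as a
    piecewise-linear spline over fixed knots (B-splines in the real model)."""
    return np.interp(x, knots, coeffs)

knots = np.linspace(-1.0, 1.0, 5)
coeffs = np.array([0.0, 0.5, 0.0, -0.5, 0.0])   # would be learned by training
y = edge_spline(np.array([-0.5, -0.25, 0.5]), knots, coeffs)
```

Training adjusts the coefficients, so each edge shapes its own nonlinearity — the source of the accuracy-per-parameter advantage over fixed-activation MLPs reported above.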
Results
The Kolmogorov-Arnold network outperformed multilayer perceptrons, achieving a 62% reduction in absolute error and a 59% reduction in relative error on the cement benchmark. For the binary and ternary solid solution models, KANs maintained median prediction errors near 1 × 10⁻³, indicating high accuracy.
Implications
The findings suggest that KANs can significantly speed up reactive transport simulations, which is crucial for safety assessments in deep geological waste repositories. This advancement could lead to more efficient modeling of geochemical processes in various subsurface applications.
Massive Redundancy in Gradient Transport Enables Sparse Online Learning
Theory
Efficient ML
Time Series
- Only 6% of the recurrent Jacobian paths are needed to recover 84% of full RTRL's adaptation ability.
- The redundancy in gradient transport is robust across various tasks and architectures, including LSTMs and transformers.
- Sparse RTRL is more numerically stable than full RTRL in chaotic dynamics, while still effective in non-chaotic tasks.
- The findings suggest that the optimization of Jacobian propagation may not be necessary due to inherent redundancy.
Summary
This paper investigates the efficiency of Real-Time Recurrent Learning (RTRL) in online learning scenarios, particularly focusing on the redundancy in the recurrent Jacobian. The author demonstrates that only a small fraction (6%) of the Jacobian paths is necessary to recover a significant portion (84±6%) of the adaptation ability of full RTRL, even as network sizes increase. This finding suggests that the gradient transport in recurrent neural networks (RNNs) is highly redundant, allowing for sparse propagation methods that are computationally cheaper and more stable. The study extends its findings to various architectures, including LSTMs and transformers, and shows that the redundancy persists across different tasks and even in real primate neural data. The results indicate that careful optimization of Jacobian propagation may be unnecessary, as the structural redundancy can be leveraged for efficient online learning.
Methodology
The study employs empirical analysis to assess the effectiveness of sparse Jacobian propagation in RNNs. By randomly selecting a small subset of Jacobian paths, the author evaluates the recovery of adaptation ability across various tasks and architectures. The analysis includes spectral properties of the Jacobian and performance comparisons with full RTRL in both chaotic and non-chaotic scenarios.
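The central manipulation, restricting gradient transport to a random subset of Jacobian paths, can be sketched as follows. This is a toy numpy illustration under one plausible reading (retaining k rows of the recurrent Jacobian), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 64, 4                               # hidden size, retained paths

# Recurrent Jacobian of a toy RNN step (illustrative random values).
J = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))

# Retain k randomly chosen rows of the Jacobian; zero out the rest.
keep = rng.choice(n, size=k, replace=False)
mask = np.zeros((n, n))
mask[keep, :] = 1.0
J_sparse = J * mask

# One influence-matrix update S_t = J S_{t-1} + immediate term
# (identity used as a stand-in for the immediate parameter Jacobian).
S_prev = np.eye(n)
S_full = J @ S_prev + np.eye(n)
S_sparse = J_sparse @ S_prev + np.eye(n)
```

The sparse update costs a factor of roughly n/k less per step, which is what makes the reported 4-of-64 result practically interesting.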
Results
The results show that propagating through just 4 out of 64 Jacobian paths (6%) recovers 78-84% of the adaptation ability of full RTRL. This recovery is consistent across different network sizes and architectures, with k=4 being effective from n=64 to n=256. Sparse RTRL outperforms full RTRL in terms of numerical stability on chaotic tasks, and the redundancy extends to LSTMs and transformers, where sparse gradient transport also shows improved performance.
Implications
The findings suggest that neural network architectures can be designed with more efficient gradient transport mechanisms, reducing computational costs in online learning scenarios. This could lead to advancements in real-time applications such as robotics and adaptive systems, where computational efficiency is crucial. Additionally, the insights into redundancy may inform future research on optimizing learning algorithms across various domains.
Mamba-3: Improved Sequence Modeling using State Space Principles
Large Language Models
Efficient ML
NLP
- Mamba-3 achieves improved inference efficiency and model quality compared to Transformer-based models.
- Introduces a complex-valued state update rule for enhanced state tracking capabilities.
- Utilizes a MIMO formulation to improve computational efficiency during decoding.
- Demonstrates significant accuracy gains in downstream language modeling tasks.
Summary
Mamba-3 introduces significant advancements in sequence modeling by leveraging state space model (SSM) principles to enhance inference efficiency and model quality. The paper identifies the limitations of existing Transformer-based models, particularly their quadratic compute and linear memory requirements, which hinder their practical deployment. To address these issues, Mamba-3 incorporates three methodological innovations: (1) a more expressive recurrence derived from SSM discretization, (2) a complex-valued state update rule that improves state tracking capabilities, and (3) a multi-input, multi-output (MIMO) formulation that enhances performance without increasing decode latency. These improvements lead to substantial gains in various tasks, including retrieval and state tracking, while maintaining lower memory usage compared to its predecessor, Mamba-2. The model demonstrates a notable increase in downstream accuracy and perplexity performance, showcasing its potential to push the performance-efficiency frontier in large language models.
Methodology
The methodology of Mamba-3 is centered around three core innovations: (1) an expressive recurrence derived from SSM discretization, (2) a complex-valued state update rule for better state tracking, and (3) a MIMO approach to state updates that enhances computational efficiency. These innovations are grounded in an inference-first perspective, focusing on optimizing model performance during deployment.
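A toy version of a complex-valued diagonal state-space recurrence illustrates why innovation (2) helps: a complex pole lets the state rotate as well as decay. This sketch is generic SSM machinery under assumed names and shapes, not Mamba-3's actual parameterization:

```python
import numpy as np

def complex_ssm_scan(x, log_r, theta, B, C):
    """Diagonal complex-valued linear recurrence
        h_t = a * h_{t-1} + B * x_t,  a = exp(log_r + i*theta).
    The rotation component of the pole is what supports state
    tracking beyond a purely real-valued decay."""
    a = np.exp(log_r + 1j * theta)            # (d_state,) complex poles
    h = np.zeros_like(a)
    ys = []
    for x_t in x:                             # x: (T,) scalar inputs
        h = a * h + B * x_t
        ys.append(np.real(C @ h))             # real read-out
    return np.array(ys)

d = 8
rng = np.random.default_rng(1)
y = complex_ssm_scan(
    x=rng.normal(size=32),
    log_r=-0.1 * np.ones(d),                  # |a| < 1: stable decay
    theta=rng.uniform(0.0, np.pi, size=d),    # rotation frequencies
    B=rng.normal(size=d),
    C=rng.normal(size=d),
)
```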
Results
At the 1.5B scale, Mamba-3 improves average downstream accuracy by 0.6 percentage points over Gated DeltaNet, with the MIMO variant achieving an additional 1.2 points for a total gain of 1.8 points. Mamba-3 also matches the perplexity of Mamba-2 while using half of its state size, demonstrating its efficiency and effectiveness in various tasks.
Implications
The advancements presented in Mamba-3 have significant implications for the deployment of large language models, particularly in scenarios requiring efficient inference and high model quality. The model's ability to maintain performance while reducing memory usage could facilitate broader applications in real-time AI systems and large-scale deployments.
Unlearning-based sliding window for continual learning under concept drift
Theory
Efficient ML
Time Series
- Introduces UIL, a framework that utilizes machine unlearning for efficient continual learning under concept drift.
- Demonstrates that unlearning outdated data followed by incremental adaptation can mimic the performance of full retraining with lower computational costs.
- Empirical results indicate that the proposed method is competitive with existing sliding-window approaches in various drift scenarios.
Summary
This paper addresses the challenge of continual learning in nonstationary environments where concept drift occurs. Traditional machine learning models often assume a stationary data distribution, which is not the case in many real-world applications. The authors propose a novel approach that combines machine unlearning with a sliding window mechanism to efficiently manage the influence of outdated data while adapting to new information. The proposed framework, UIL (Unlearned and Iteratively trained cLassifier), allows for targeted forgetting of obsolete data without the need for complete retraining of the model. This method significantly reduces computational costs associated with standard sliding-window retraining techniques. The authors provide a theoretical analysis demonstrating that their approach can approximate the predictive performance of retraining from scratch while being more efficient. Empirical results from experiments on image classification tasks under various drift scenarios show that the UIL framework performs competitively against traditional methods, highlighting its effectiveness in adapting to evolving data distributions.
Methodology
The authors develop the UIL framework, which employs machine unlearning to remove the influence of outdated samples from a trained model. This is followed by an incremental update with new data, allowing for efficient adaptation without complete retraining. The methodology includes a theoretical analysis of the unlearning process and empirical validation through experiments on image classification benchmarks.
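The unlearn-then-adapt loop can be caricatured with gradient-based approximate unlearning on a logistic model: ascend the loss on the outdated batch, then descend on the new one. This is a hypothetical sketch of the sliding-window idea, not the paper's unlearning operator:

```python
import numpy as np

def grad_logreg(w, X, y):
    """Gradient of the mean logistic loss, labels y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def slide_window(w, old_X, old_y, new_X, new_y, lr=0.1, steps=20):
    """Approximate retraining on the shifted window: first ascend the
    loss on the outdated batch (approximate unlearning), then descend
    on the newly arrived batch (incremental adaptation)."""
    for _ in range(steps):
        w = w + lr * grad_logreg(w, old_X, old_y)   # unlearn old data
    for _ in range(steps):
        w = w - lr * grad_logreg(w, new_X, new_y)   # learn new data
    return w

rng = np.random.default_rng(0)
old_X = rng.normal(size=(50, 4))
old_y = rng.integers(0, 2, size=50).astype(float)
new_X = rng.normal(size=(50, 4))
new_y = (new_X[:, 0] > 0).astype(float)
w = slide_window(np.zeros(4), old_X, old_y, new_X, new_y)
```

The appeal is that both passes touch only the two boundary batches, whereas full retraining would revisit the entire window.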
Results
The experiments conducted on image classification tasks demonstrate that the UIL framework effectively manages concept drift, showing competitive performance compared to traditional sliding-window retraining methods. The results indicate that the proposed approach is not only effective but also computationally efficient, making it a viable alternative for continual learning in nonstationary environments.
Implications
The findings suggest that integrating machine unlearning into continual learning frameworks can enhance the adaptability of models in dynamic environments, potentially benefiting applications in fields such as autonomous systems, real-time analytics, and any domain where data distributions change over time.
Linear Predictability of Attention Heads in Large Language Models
Large Language Models
Efficient ML
Interpretability
- Pretrained LLMs exhibit strong linear predictability among attention heads, particularly in their QKV activations.
- This predictability emerges during pretraining and is absent at random initialization.
- The authors achieve significant KV cache compression by caching only reference heads and reconstructing others on-the-fly.
- Mean R² values indicate high fidelity in reconstructing target heads from reference heads, with values often exceeding 0.76.
Summary
This paper investigates the linear predictability of attention heads in pretrained large language models (LLMs), addressing the inefficiencies in KV cache during inference. The authors demonstrate that the Query, Key, and Value (QKV) vectors of attention heads can often be reconstructed as linear combinations of a small number of peer heads within the same layer. This predictability is not present at random initialization but emerges during pretraining, with a significant increase in intra-layer alignment of Key projection subspaces. The study shows that using only a few reference heads can achieve high fidelity in reconstructing target heads, with mean R² values indicating strong predictability. The authors propose a practical application of this finding by compressing the KV cache, allowing for a 2× reduction in memory usage while maintaining acceptable accuracy trade-offs across various benchmarks. The results highlight the potential for optimizing LLM performance without architectural changes, emphasizing the learned nature of these relationships during training.
Methodology
The authors conducted experiments on various pretrained LLMs (Llama-3.1 8B, Falcon3-10B, OLMo-2-7B, and Qwen3-32B) to analyze the linear relationships among attention head activations. They tracked the emergence of predictability during pretraining using intermediate checkpoints and established a theoretical lower bound for predictability at initialization. The practical application involved compressing the KV cache by storing only reference heads and reconstructing target heads using learned linear maps.
Results
The study found that 2-5 reference heads could reconstruct many target heads with high fidelity, achieving mean R² values of approximately 0.76 for Keys on the C4 dataset and frequently exceeding 0.85 on GSM8K. The proposed KV cache compression method resulted in a 2× reduction in memory usage, with model-dependent accuracy trade-offs ranging from 4.5 to 9.9 percentage points across different benchmarks.
Implications
The findings suggest that understanding the linear predictability of attention heads can lead to more efficient LLM architectures and inference processes. By reducing the memory footprint of KV caches, the proposed method can enhance the performance of LLMs in real-time applications without requiring significant architectural changes.
Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression
Federated Learning
Efficient ML
Theory
- FOUL introduces a two-stage framework for efficient Federated Unlearning.
- The learning-to-unlearn stage prepares the model for unlearning by encoding key features.
- On-server unlearning preserves privacy and reduces computational overhead.
- A new metric, time-to-forget, quantifies the speed of unlearning effectiveness.
Summary
This paper addresses the challenges of Federated Unlearning (FUL), which aims to remove specific clients' data contributions from a trained Federated Learning model to ensure data privacy and compliance with regulations. The authors propose a novel framework called Federated On-server Unlearning (FOUL), which consists of two main stages: a learning-to-unlearn phase that identifies and encodes key features associated with clients to be forgotten, and an on-server knowledge aggregation phase that performs unlearning without requiring access to client data. This approach enhances communication efficiency and preserves privacy. The authors introduce a new evaluation metric, 'time-to-forget', to measure the speed of achieving optimal unlearning performance. Experimental results demonstrate that FOUL outperforms traditional retraining methods in various unlearning scenarios, achieving competitive results with significantly reduced time-to-forget while maintaining low computational and communication costs.
Methodology
The FOUL framework consists of two stages: (1) Learning-to-Unlearn (L2U), which disentangles the feature extractor into causal and non-causal sub-networks to capture domain-invariant and domain-specific information, respectively; and (2) On-server knowledge aggregation, which aligns gradients from retain clients while conflicting with those from forget clients to facilitate effective unlearning without accessing client data.
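One hedged reading of the on-server gradient conflict mechanism, projecting the shared component out of the retain gradient and explicitly ascending the forget clients' objective, can be sketched as follows (the projection and the `lam` weighting are assumptions, not the paper's exact update):

```python
import numpy as np

def server_unlearn_update(retain_grads, forget_grads, lam=1.0):
    """Combine client gradients on the server: follow the averaged
    retain gradient with the forget direction projected out, plus an
    explicit ascent term on the forget clients' objective."""
    g_r = np.mean(retain_grads, axis=0)
    g_f = np.mean(forget_grads, axis=0)
    g_f_unit = g_f / (np.linalg.norm(g_f) + 1e-12)
    g_r_orth = g_r - (g_r @ g_f_unit) * g_f_unit   # conflict mitigation
    return g_r_orth - lam * g_f                     # conflict expression

rng = np.random.default_rng(0)
retain_grads = rng.normal(size=(5, 10))   # one gradient per retain client
forget_grads = rng.normal(size=(3, 10))   # one gradient per forget client
update = server_unlearn_update(retain_grads, forget_grads)
```

Everything happens on already-uploaded gradients, which is why no further client computation or data access is needed during unlearning.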
Results
Extensive experiments on three datasets reveal that FOUL outperforms nine existing FUL methods, achieving better unlearning performance with a significantly reduced time-to-forget, while also maintaining low communication and computation costs.
Implications
The proposed FOUL framework has significant implications for privacy-preserving machine learning applications, particularly in contexts where compliance with data protection regulations is critical. It enables efficient unlearning processes in federated learning settings, making it more feasible for real-world applications.
Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group
Theory
- Introduces a new method for solving the steerable kernel constraint in equivariant CNNs.
- Provides explicit bases for different symmetry groups and tensor types.
- Eliminates the need for complex computations involving Clebsch-Gordan coefficients.
- Demonstrates the method's applicability across various symmetry groups.
Summary
This paper presents a novel approach to addressing the steerable kernel constraint in the design of steerable equivariant convolutional neural networks (CNNs). The author introduces explicit real and complex bases that can be utilized for various symmetry groups and feature maps of arbitrary tensor types. A significant advantage of this method is its ability to avoid the computational complexities associated with Clebsch-Gordan coefficients, instead working directly with the representations of input and output feature maps. The proposed strategy involves identifying a basis of kernels that adhere to a simpler invariance condition at a specific point, followed by steering it to any arbitrary point using the steerability equation. This approach is elaborated with minimal technical jargon to ensure accessibility for a broader audience. The paper also discusses examples across different symmetry groups, including SO(2), O(2), SO(3), O(3), and the Lorentz group SO(1, 3), illustrating the application of the proposed method in various contexts, including the treatment of massive and massless particles.
Methodology
The methodology involves finding a basis of kernels that satisfy a simplified invariance condition at a specific point, which is then steered to any arbitrary point using the steerability equation. This approach leverages the representations of feature maps rather than relying on traditional computational techniques.
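In the standard notation of the steerable-CNN literature, the constraint and the paper's two-step strategy read:

```latex
% Steerability constraint on the kernel k, for all group elements g:
k(g \cdot x) = \rho_{\mathrm{out}}(g)\, k(x)\, \rho_{\mathrm{in}}(g)^{-1}
% Step 1: solve the simpler invariance condition at a reference point x_0,
% where h ranges only over the stabilizer H of x_0:
k(x_0) = \rho_{\mathrm{out}}(h)\, k(x_0)\, \rho_{\mathrm{in}}(h)^{-1},
\qquad h \in H = \{\, g \in G : g \cdot x_0 = x_0 \,\}
% Step 2: steer the resulting basis to any point x = g \cdot x_0
% using the constraint itself:
k(g \cdot x_0) = \rho_{\mathrm{out}}(g)\, k(x_0)\, \rho_{\mathrm{in}}(g)^{-1}
```

Because H is smaller than G, the invariance condition in step 1 is far easier to solve than the full constraint, and step 2 extends the solution everywhere for free.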
Results
The paper successfully demonstrates the construction of explicit bases for steerable kernels applicable to various symmetry groups, showcasing the method's effectiveness through examples. The results indicate that this approach simplifies the design of equivariant CNNs while maintaining their performance advantages.
Implications
The findings have significant implications for the design of CNNs that incorporate symmetry, potentially enhancing their performance in tasks requiring equivariance, such as predicting physical phenomena in molecular systems or other applications in physics and engineering.
PLUME: Building a Network-Native Foundation Model for Wireless Traces via Protocol-Aware Tokenization
Time Series
Efficient ML
Interpretability
- Plume is a 140M-parameter model tailored for 802.11 wireless traces, utilizing structured PDML dissections.
- The protocol-aware tokenizer significantly reduces sequence length and increases information density compared to BPE.
- Plume achieves high accuracy in next-packet prediction and anomaly detection, outperforming larger models in efficiency.
- The model supports on-premises deployment, enhancing privacy and enabling real-time analysis.
Summary
The paper introduces Plume, a compact 140M-parameter foundation model specifically designed for analyzing 802.11 wireless packet traces. Unlike traditional models that treat packet data as flat strings, Plume employs a protocol-aware tokenizer that respects the structured nature of wireless data, including layered headers and timing gaps. This tokenizer generates significantly shorter sequences with higher information density compared to standard methods like Byte Pair Encoding (BPE). Trained on a curated dataset, Plume demonstrates impressive performance, achieving 74-97% accuracy in next-packet prediction across various failure categories and an area under the receiver operating characteristic curve (AUROC) of at least 0.99 for zero-shot anomaly detection. The model's efficiency is highlighted by its ability to perform comparably to larger models like Claude Opus and GPT-5 while being over 600 times smaller, making it suitable for on-premises deployment and privacy-preserving analysis. The paper emphasizes the importance of structured data representation and quality in training, proposing a proactive data capture strategy to enhance model performance. Plume is designed to be integrated into multi-agent workflows, providing structured insights for root cause analysis in network environments.
Methodology
The authors developed Plume by employing a protocol-aware tokenizer that splits data along the structured field tree of PDML exports, incorporates gap tokens for timing, and normalizes identifiers. The model was trained on a curated dataset using a GPT-style auto-regressive objective, focusing on the unique characteristics of wireless packet data. A proactive intelligent capture strategy was also implemented to ensure high-quality training data.
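A toy illustration of the tokenization scheme: one log-bucketed gap token followed by one token per dissected field, with identifiers normalized. The field names follow common 802.11 dissection conventions, but they and the bucketing formula are assumptions, not Plume's actual vocabulary:

```python
import math

def tokenize_packet(fields, gap_us):
    """fields: list of (layer, field, value) from a dissection tree.
    Emits one gap token (log-bucketed inter-arrival time) followed by
    one token per field, with address identifiers normalized."""
    tokens = [f"<gap:{min(int(math.log2(gap_us + 1)), 20)}>"]
    for layer, field, value in fields:
        if field in ("sa", "da", "bssid"):     # normalize MAC addresses
            value = "<addr>"
        tokens.append(f"{layer}.{field}={value}")
    return tokens

toks = tokenize_packet(
    [("wlan", "fc.type", "2"), ("wlan", "sa", "aa:bb:cc:dd:ee:ff")],
    gap_us=1500,
)
# toks == ['<gap:10>', 'wlan.fc.type=2', 'wlan.sa=<addr>']
```

One token per structured field is what yields the shorter, denser sequences compared with BPE over flat packet strings.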
Results
Plume achieved 74-97% accuracy in next-packet prediction across five real-world failure categories and an AUROC of at least 0.99 for zero-shot anomaly detection. It demonstrated comparable performance to larger models while being over 600 times smaller, processing approximately 200 packets per second on a single GPU.
Implications
Plume's design allows for efficient, privacy-preserving analysis of wireless network data, making it suitable for real-time root cause analysis in various network environments. Its compact size and high performance enable deployment in resource-constrained settings, potentially transforming how network issues are diagnosed and resolved.
Synthetic Data Generation for Brain-Computer Interfaces: Overview, Benchmarking, and Future Directions
Generative Models
Time Series
- Synthetic data generation can alleviate data scarcity in BCIs.
- The paper categorizes generative algorithms into four methodological types.
- Benchmarking of existing algorithms across various BCI paradigms is conducted.
- Challenges such as data heterogeneity and privacy concerns are discussed.
Summary
This paper provides a comprehensive review of synthetic data generation techniques for brain-computer interfaces (BCIs), addressing the challenges posed by limited, heterogeneous, and privacy-sensitive neural recordings. The authors categorize existing generative algorithms into four types: knowledge-based, feature-based, model-based, and translation-based approaches. They benchmark these algorithms across four BCI paradigms, including motor imagery and epileptic seizure detection, to objectively compare their performance. The paper highlights the importance of generating physiologically plausible brain signals to enhance model capacity and mitigate data scarcity. Furthermore, it discusses the potential and challenges of current generation approaches, emphasizing the need for accurate, data-efficient, and privacy-aware BCI systems. A public benchmark codebase is provided to facilitate further research in this area.
Methodology
The authors systematically review and categorize existing brain signal generation techniques into four types. They conduct benchmark experiments to evaluate the performance of these techniques across different BCI paradigms, using established evaluation metrics.
Results
The benchmarking results provide an objective comparison of the performance of various synthetic data generation approaches, revealing strengths and weaknesses in their applicability to different BCI scenarios.
Implications
The findings suggest that improved synthetic data generation techniques could enhance the robustness and generalization of BCI systems, leading to better performance in real-world applications. This could have significant implications for neurorehabilitation, medical diagnostics, and user interface design.
CrossADR: enhancing adverse drug reactions prediction for combination pharmacotherapy with cross-layer feature integration and cross-level associative learning
Graph Learning
- CrossADR improves ADR prediction accuracy for combination pharmacotherapy.
- Utilizes a gated-residual-flow graph neural network for feature integration.
- Introduces a learnable ADR embedding space to capture dynamic biological correlations.
- Evaluated on a comprehensive dataset with 1,376 drugs and 946,000 combinations.
Summary
The paper presents CrossADR, a novel hierarchical framework aimed at improving the prediction of adverse drug reactions (ADRs) in combination pharmacotherapy. The challenge of accurately predicting ADRs arises from the complexity of physiological responses and the vast search space of drug combinations. Existing graph-based architectures often fail to integrate multi-scale biological information effectively and rely on fixed association matrices, limiting their predictive capabilities. CrossADR addresses these issues by employing a gated-residual-flow graph neural network that fuses multi-scale molecular features and utilizes a learnable ADR embedding space to dynamically capture biological correlations across 15 organ systems. The framework is evaluated on the newly constructed CrossADR-Dataset, which includes 1,376 drugs and 946,000 unique combinations. The results demonstrate that CrossADR achieves state-of-the-art performance across 80 experimental scenarios, providing high-resolution insights into drug-related protein-protein interactions and pathways. This work represents a significant advancement in the integration of cross-scale biomedical information and offers a robust tool for enhancing clinical decision-making regarding drug safety.
Methodology
The CrossADR framework employs a hierarchical architecture that integrates cross-layer feature integration and cross-level associative learning. It utilizes a gated-residual-flow graph neural network to fuse multi-scale molecular features and a learnable ADR embedding space to capture organ-level information dynamically. The model incorporates drug-dependent attention scores and bi-directional cross-attention mechanisms, moving away from fixed association matrices to enhance generalization across diverse datasets.
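The gated-residual-flow idea, a learned gate interpolating per feature between a layer's new message and the carried-over representation, can be sketched as below. This is a generic gating pattern; the paper's module is more elaborate and all names here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_residual_fuse(h_prev, h_new, Wg, bg):
    """A learned gate decides, per feature, how much of the new
    layer's message to admit and how much of the previous node
    representation to carry through unchanged."""
    g = sigmoid(np.concatenate([h_prev, h_new]) @ Wg + bg)
    return g * h_new + (1.0 - g) * h_prev

rng = np.random.default_rng(0)
d = 8
h_prev = rng.normal(size=d)
h_new = rng.normal(size=d)
fused = gated_residual_fuse(h_prev, h_new,
                            Wg=0.1 * rng.normal(size=(2 * d, d)),
                            bg=np.zeros(d))
```

Because the output is a per-feature convex combination, shallow (atom-level) and deep (substructure-level) features can coexist in the fused representation instead of the deepest layer overwriting the rest.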
Results
CrossADR consistently outperformed state-of-the-art deep learning models and traditional machine learning methods across 80 distinct experimental scenarios. The evaluation demonstrated its effectiveness in predicting ADRs and provided insights into drug-related protein-protein interactions and pathways. Ablation studies confirmed the significance of the architectural innovations, particularly the gated modules and multi-scale fusion.
Implications
CrossADR has the potential to significantly enhance clinical decision-making by providing accurate predictions of ADRs in combination pharmacotherapy. Its ability to integrate multi-scale biological information can aid in drug development and improve patient safety in clinical settings.
Swap-guided Preference Learning for Personalized Reinforcement Learning from Human Feedback
Reinforcement Learning
- Introduces Swap-guided Preference Learning (SPL) to enhance personalization in RLHF.
- Addresses the issue of posterior collapse in Variational Preference Learning (VPL).
- Utilizes fictitious swap annotators to improve user-specific latent variable encoding.
- Implements three key components: swap-guided base regularization, P-IAF, and adaptive latent conditioning.
Summary
This paper addresses the limitations of Reinforcement Learning from Human Feedback (RLHF), which typically relies on a single universal reward function that fails to capture the diversity of human preferences. The authors introduce Swap-guided Preference Learning (SPL), a novel framework that mitigates the issue of posterior collapse observed in Variational Preference Learning (VPL). SPL employs fictitious swap annotators to enhance the encoding of user-specific latent variables, thereby allowing for a more personalized alignment of AI systems with human values. The framework incorporates three main components: swap-guided base regularization, Preferential Inverse Autoregressive Flow (P-IAF), and adaptive latent conditioning. These innovations collectively improve the representation of user preferences and reduce the risk of collapsing to a single reward model. Experiments demonstrate that SPL effectively enriches user-specific latents and enhances preference prediction, thereby promoting fairness and accuracy in AI systems aligned with diverse human values.
Methodology
The SPL framework leverages structural properties of preference data through guiding mechanisms and a Preferential Inverse Autoregressive Flow (P-IAF). It introduces regularization techniques to encourage mirrored characteristics in latent space and dynamically adjusts the contribution of latent variables to reward predictions.
Results
Experiments show that SPL significantly mitigates posterior collapse, enriches user-specific latent representations, and improves the accuracy of preference predictions compared to traditional RLHF and VPL methods.
Implications
The findings suggest that SPL can lead to more equitable AI systems by better capturing the diversity of human preferences, thus enhancing the alignment of AI behaviors with varied human values. This has potential applications in fields requiring personalized AI interactions, such as natural language processing and recommendation systems.
Translational Gaps in Graph Transformers for Longitudinal EHR Prediction: A Critical Appraisal of GT-BEHRT
Graph Learning
Time Series
Interpretability
- GT-BEHRT improves EHR representation by encoding intra-visit relationships.
- The framework achieved high discrimination metrics but revealed significant translational gaps.
- Key gaps include lack of calibration analysis and incomplete fairness auditing.
- The study emphasizes the need for comprehensive evaluation before clinical deployment.
Summary
This paper critically evaluates GT-BEHRT, a graph-transformer framework designed for predicting outcomes in longitudinal electronic health records (EHRs). While transformer architectures have improved predictive modeling through self-supervised pretraining, many implementations overlook the intra-visit relational structure by representing clinical encounters as unordered code sets. GT-BEHRT addresses this by encoding visit-level relationships while maintaining long-range temporal dependencies. The study assesses GT-BEHRT's performance on heart failure prediction using data from the All of Us Research Program and MIMIC-IV. A systematic appraisal across seven dimensions, including representation fidelity and fairness auditing, reveals six translational gaps that hinder clinical applicability, such as the absence of calibration analysis and incomplete fairness auditing. Despite achieving high discrimination metrics (AUROC 94.37, AUPRC 73.96, F1 64.70), the findings suggest that further evaluation is needed to ensure GT-BEHRT's readiness for clinical deployment. The paper concludes that while GT-BEHRT represents a significant advancement in EHR representation learning, it requires calibration-aware evaluation, formal fairness auditing, and transparent cohort reporting to credibly inform clinical decision-making.
Methodology
The authors conducted a systematic appraisal of GT-BEHRT across seven evaluation dimensions, including representation design, pretraining strategy, cohort construction, metric sufficiency, fairness auditing, reproducibility infrastructure, and deployment feasibility. The assessment was anchored to TRIPOD guidelines and contemporary machine learning fairness frameworks.
Results
GT-BEHRT achieved an AUROC of 94.37 ± 0.20, AUPRC of 73.96 ± 0.83, and F1 score of 64.70 ± 0.85 for heart failure prediction within 365 days. However, six translational gaps were identified that limit its clinical applicability.
Implications
The findings highlight the need for improved evaluation frameworks in medical AI, particularly in calibration, fairness, and deployment feasibility, to ensure that advanced predictive models can be effectively integrated into clinical practice.
Maximizing Incremental Information Entropy for Contrastive Learning
Computer Vision
Theory
Efficient ML
- Introduces a new theoretical framework for contrastive learning focusing on incremental information entropy.
- Proposes a dual optimization strategy involving a learnable transformation and an encoder regularizer.
- Demonstrates improved performance in small-batch settings across multiple datasets.
- Provides a plug-and-play capability for enhancing existing self-supervised models.
Summary
This paper introduces Incremental-Entropy Contrastive Learning (IE-CL), a novel framework designed to enhance contrastive learning by maximizing the incremental information entropy between augmented views while maintaining semantic consistency. The authors identify the encoder as an information bottleneck and propose a dual optimization strategy that includes a learnable transformation for entropy generation and an encoder regularizer for entropy preservation. This approach addresses the limitations of existing contrastive learning methods that rely on static augmentations and large batch sizes, which can impose rigid constraints on learning dynamics. The proposed framework is empirically validated on datasets such as CIFAR-10/100, STL-10, and ImageNet, demonstrating significant performance improvements, particularly in small-batch settings. The core components of IE-CL can also be integrated into existing self-supervised models, bridging theoretical insights with practical applications in representation learning.
Methodology
The methodology involves a theoretical framework that identifies the encoder as an information bottleneck. The authors propose the Sample Augmentation Incremental Block (SAIB) to generate entropy at the input level and an encoder regularizer to preserve this entropy during encoding. The optimization process incorporates a Kullback-Leibler divergence constraint to balance entropic expansion and semantic consistency.
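A hypothetical scalar version of the trade-off, entropy expansion on the augmented view held near the original view by a KL penalty, might look like this (the objective form and `lam` are assumptions for illustration, not the paper's loss):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ie_objective(z_orig, z_aug, lam=0.1):
    """Reward entropy in the augmented view's feature distribution
    (entropic expansion) while a KL term to the original view keeps
    the two views semantically close."""
    p, q = softmax(z_orig), softmax(z_aug)
    entropy = -np.sum(q * np.log(q + 1e-12), axis=-1).mean()
    kl = np.sum(q * (np.log(q + 1e-12) - np.log(p + 1e-12)), axis=-1).mean()
    return -entropy + lam * kl    # minimized: maximize entropy, bound drift

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 32))
loss = ie_objective(z, z + 0.1 * rng.normal(size=(16, 32)))
```

The `lam`-weighted KL term plays the role of the Kullback-Leibler constraint described above: without it, maximizing entropy alone would push the augmented view arbitrarily far from the original.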
Results
The experiments conducted on CIFAR-10/100, STL-10, and ImageNet show that IE-CL consistently outperforms existing contrastive learning methods, especially in scenarios with small batch sizes. The results indicate that the proposed framework effectively enhances representation learning by maximizing conditional entropy gain while ensuring semantic coherence.
Implications
The findings suggest that IE-CL can significantly advance self-supervised representation learning by providing a more flexible and principled approach to contrastive learning. This framework could lead to more efficient training processes and better performance in various applications, including computer vision tasks where labeled data is scarce.
Continual Fine-Tuning with Provably Accurate and Parameter-Free Task Retrieval
Theory
Efficient ML
NLP
- Introduces PROTEUS, a parameter-free task retrieval framework for continual fine-tuning.
- Provides theoretical guarantees linking retrieval error to clustering properties of task representations.
- Combines adaptive knowledge transfer with a clustering-based retrieval mechanism.
- Demonstrates significant performance improvements over existing continual learning methods.
Read more
Continual Fine-Tuning with Provably Accurate and Parameter-Free Task Retrieval
Summary
This paper addresses the challenge of continual fine-tuning in machine learning, where models must adapt to new tasks sequentially while retaining knowledge from previous tasks without access to their data. Existing methods either adapt input prompts at test time, risking forgetting, or use fixed input embeddings, sacrificing adaptability. The authors propose PROTEUS, a parameter-adaptation method that combines the strengths of both approaches: it uses input embeddings adaptively at inference without having to continually learn a retrieval function. The framework includes a theoretical analysis of retrieval error rates based on clustering structures of task-specific representation patterns, ensuring low retrieval error and improved performance. The method incorporates an adaptive module composition strategy for task-specific updates and a clustering-based retrieval mechanism to enhance representation adaptability. Extensive experiments demonstrate that PROTEUS significantly outperforms existing state-of-the-art methods, particularly in scenarios with substantial shifts in task semantics.
Methodology
The authors develop a theoretical framework for analyzing retrieval error rates in parameter-free, signature-based representation retrieval methods. They implement an adaptive knowledge transfer mechanism that enhances cluster separation for task representations, allowing for more effective retrieval of task-specific adaptation modules during inference. The method leverages clustering structures to ensure low retrieval error and improve predictive performance.
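The clustering-based, parameter-free retrieval step can be pictured as a nearest-centroid lookup over task signatures: the better separated the task clusters, the lower the retrieval error. The class below is an illustrative sketch, not the paper's API:

```python
import numpy as np

class TaskRetriever:
    """Parameter-free retrieval sketch: each task is summarized by the
    centroid of its representation signatures; at inference, the nearest
    centroid selects that task's adaptation module."""
    def __init__(self):
        self.centroids = []  # one signature centroid per task seen so far
    def add_task(self, signatures):
        # No retrieval function is trained: adding a task just stores a centroid.
        self.centroids.append(np.asarray(signatures).mean(axis=0))
    def retrieve(self, signature):
        # Return the index of the task whose centroid is nearest; this index
        # would select the corresponding task-specific adaptation module.
        c = np.stack(self.centroids)
        return int(np.argmin(np.linalg.norm(c - signature, axis=1)))
```

Because nothing is learned at retrieval time, adding or revisiting tasks never updates shared parameters, which is what removes the forgetting risk the paper describes.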
Results
The experiments show that PROTEUS achieves a provably low retrieval error rate and outperforms existing continual learning methods, particularly in scenarios with significant semantic shifts across tasks. The adaptive components of the framework work synergistically to enhance both retrieval accuracy and predictive performance.
Implications
The findings suggest that PROTEUS can be applied in various continual learning scenarios, particularly in environments where tasks evolve over time and data from previous tasks is unavailable. This has potential applications in fields such as natural language processing, computer vision, and any domain requiring adaptive learning from sequential data.
MONET: Modeling and Optimization of neural NEtwork Training from Edge to Data Centers
Optimization
Efficient ML
- Introduction of MONET, a framework for modeling neural network training on heterogeneous dataflow accelerators.
- Demonstration of MONET's capabilities through case studies on ResNet-18 and GPT-2.
- Exploration of the training design space and optimization of layer-fusion configurations.
- Application of a genetic algorithm to solve activation checkpointing challenges.
Read more
MONET: Modeling and Optimization of neural NEtwork Training from Edge to Data Centers
Summary
The paper presents MONET, a novel framework aimed at modeling and optimizing the training of neural networks across various hardware configurations, from edge devices to data centers. While significant advancements have been made in optimizing neural network inference, the training phase remains less explored, despite its critical resource demands. MONET extends the existing Stream framework, which focuses on inference, to accommodate the unique challenges of training, such as memory constraints and backpropagation complexity. The authors demonstrate MONET's capabilities through case studies involving ResNet-18 and a small GPT-2 model, showcasing its effectiveness in exploring the training design space and identifying optimal hardware architectures. The framework also addresses complex issues like layer-fusion configurations and activation checkpointing, employing a genetic algorithm to optimize these processes. The findings emphasize the necessity for a comprehensive approach to hardware-software co-design to enhance the efficiency and scalability of deep learning deployments.
Methodology
The authors extend the Stream framework to create MONET, which models training workloads by considering factors such as neural network architecture, hyperparameters, and hardware characteristics. They utilize case studies to illustrate the framework's application and employ a genetic algorithm for optimizing activation checkpointing.
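The genetic-algorithm step can be illustrated with a toy version that picks which layers to checkpoint (drop the activation and recompute it in the backward pass) under a peak-memory budget. The cost model and GA operators here are simplified assumptions, not MONET's actual formulation:

```python
import random

def ga_checkpointing(act_mem, recompute_cost, mem_budget,
                     pop_size=30, generations=60, seed=0):
    """Toy GA for activation checkpointing: a gene of 1 means the layer's
    activation is dropped and recomputed. Minimizes total recompute time
    subject to an activation-memory budget."""
    rng = random.Random(seed)
    n = len(act_mem)

    def cost(mask):
        mem = sum(m for m, g in zip(act_mem, mask) if g == 0)   # stored activations
        time = sum(c for c, g in zip(recompute_cost, mask) if g == 1)
        return time + 1e6 * max(0, mem - mem_budget)            # penalize infeasible

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[:pop_size // 2]                         # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]                           # one-point crossover
            child[rng.randrange(n)] ^= 1                        # bit-flip mutation
            children.append(child)
        pop = survivors + children
    best = min(pop, key=cost)
    return best, cost(best)
```

Even this toy version shows why a GA suits the problem: the search space is combinatorial (one bit per layer) and the objective is non-differentiable, so gradient-based optimization does not apply.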
Results
MONET successfully models the training workflows of ResNet-18 and GPT-2, revealing better hardware configurations and trade-offs in activation checkpointing. The results indicate significant differences in energy and latency distributions between training and inference, underscoring the need for training-specific evaluations.
Implications
The development of MONET has the potential to improve the efficiency of neural network training across various deployment scenarios, leading to reduced costs and environmental impact. It can be particularly beneficial for applications in edge computing and large-scale data centers, where resource optimization is critical.
Is the reconstruction loss culprit? An attempt to outperform JEPA
Theory
Generative Models
Time Series
- JEPA-style predictive learning is generally more robust to noise than reconstruction-based autoencoders.
- Autoencoder performance is heavily influenced by objective asymmetries and bottleneck effects.
- Gated predictive autoencoders can effectively select predictable components, improving stability and performance.
- The study underscores the necessity of rigorous evaluation methods in representation learning research.
Read more
Is the reconstruction loss culprit? An attempt to outperform JEPA
Summary
This paper investigates the effectiveness of JEPA-style predictive representation learning compared to reconstruction-based autoencoders within a controlled linear dynamical system, termed the 'TV-series' testbed. Initial findings suggest that JEPA exhibits greater robustness to noise than autoencoders. However, further analysis reveals that the performance of autoencoders is significantly affected by asymmetries in objectives and bottleneck effects, as evidenced by PCA baselines. To address these issues, the authors propose gated predictive autoencoders that can learn to select predictable components, emulating the feature-selection advantages seen in over-parameterized PCA. The results indicate that this gated model remains stable across varying noise levels and either matches or surpasses the performance of JEPA. The study emphasizes the importance of thorough evaluation in representation learning experiments, highlighting that conclusions can be misleading without proper diagnostics and baseline comparisons.
Methodology
The authors conducted a series of targeted comparisons and diagnostic experiments using a controlled linear dynamical system. They added latent-space prediction terms to autoencoder objectives, tested latent-dynamics modeling objectives, performed ablations to isolate the role of reconstruction, and utilized classical linear baselines like PCA for contextualization.
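One way to picture the gated predictive objective: per-dimension gates decide which latent components must be one-step predictable, and closing a gate carries a small cost, so gates stay open only where the dynamics are actually predictable. A minimal numpy sketch under those assumptions (the paper's exact parameterization may differ):

```python
import numpy as np

def gated_prediction_loss(z_t, z_next, W_pred, gate, lam=0.1):
    """Illustrative gated predictive objective: gates select which latent
    dimensions must be predictable under the linear dynamics W_pred;
    closing a gate costs lam per dimension."""
    g = np.asarray(gate, dtype=float)
    pred_err = ((g * (z_t @ W_pred - z_next)) ** 2).mean()  # gated one-step error
    return pred_err + lam * (1.0 - g).mean()                # closing gates is not free
```

With one predictable dimension and one pure-noise dimension, the loss is minimized by gating the noise out, which mirrors the feature-selection behavior the paper attributes to over-parameterized PCA.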
Results
The gated predictive autoencoder outperformed traditional autoencoders and matched or exceeded the performance of JEPA across various noise levels. The findings revealed that representation learning outcomes can be significantly affected by the choice of evaluation metrics and experimental design.
Implications
This research has implications for the design of representation learning frameworks, particularly in environments where noise and variability are present. It suggests that predictive objectives may offer advantages over reconstruction objectives in certain contexts, potentially influencing future developments in world modeling and representation learning.
PDE-SSM: A Spectral State Space Approach to Spatial Mixing in Diffusion Transformers
Computer Vision
Generative Models
Efficient ML
- PDE-SSM replaces self-attention with a learnable PDE, improving spatial inductive bias.
- The computational complexity of PDE-SSM is O(N log N), significantly more efficient than O(N^2) for self-attention.
- PDE-SSM-DiT achieves comparable or superior performance to existing diffusion transformers while reducing compute.
- The approach provides a principled generalization of state space models to multi-dimensional spatial data.
Read more
PDE-SSM: A Spectral State Space Approach to Spatial Mixing in Diffusion Transformers
Summary
The paper introduces PDE-SSM, a novel architectural block that replaces the traditional self-attention mechanism in vision transformers with a learnable convection-diffusion-reaction partial differential equation (PDE). This approach aims to address the limitations of self-attention, particularly its quadratic computational cost and lack of spatial inductive bias. By modeling information flow through physically grounded dynamics, PDE-SSM captures spatial relationships more effectively. The authors integrate this block into a flow-matching generative model, resulting in the PDE-based Diffusion Transformer (PDE-SSM-DiT). The empirical results demonstrate that PDE-SSM-DiT matches or surpasses the performance of existing state-of-the-art diffusion transformers while significantly reducing computational requirements. The findings suggest that multi-dimensional PDE operators can serve as an efficient and inductive-bias-rich foundation for future vision models, analogous to the success of state space models in 1D settings.
Methodology
The authors propose the PDE-SSM architectural block, which models the evolution of a hidden state using a learnable PDE. The solution to this PDE is computed in the Fourier domain, allowing for efficient global coupling of information with near-linear complexity. This block is integrated into a standard transformer architecture to create the PDE-SSM-DiT model.
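The Fourier-domain mechanism can be illustrated with a fixed (non-learnable) special case: the 2D heat equation on a periodic grid, solved exactly per frequency. The FFT gives global spatial coupling in O(N log N), the same reason PDE-SSM avoids the O(N²) cost of attention; PDE-SSM additionally learns the PDE coefficients, which this sketch does not:

```python
import numpy as np

def spectral_diffusion_step(x, nu=0.1, dt=1.0):
    """One exact step of the heat equation u_t = nu * Laplace(u) on a
    periodic 2D grid, computed in the Fourier domain: each spatial
    frequency decays independently by exp(-nu * |k|^2 * dt)."""
    h, w = x.shape
    ky = np.fft.fftfreq(h) * 2 * np.pi
    kx = np.fft.fftfreq(w) * 2 * np.pi
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2
    x_hat = np.fft.fft2(x)
    x_hat *= np.exp(-nu * k2 * dt)       # damp every nonzero frequency
    return np.real(np.fft.ifft2(x_hat))
```

Two properties are easy to check: the zero-frequency (mean) component is preserved, and all other frequencies are damped, so the field smooths out globally in a single step.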
Results
PDE-SSM-DiT demonstrates performance that matches or exceeds that of state-of-the-art diffusion transformers, while also achieving substantial reductions in computational costs. The results indicate that the PDE-based approach effectively captures spatial relationships in generative modeling tasks.
Implications
The introduction of PDE-SSM could lead to more efficient and effective vision models that leverage spatial dynamics, potentially transforming approaches in generative modeling and other areas of computer vision.
CASHomon Sets: Efficient Rashomon Sets Across Multiple Model Classes and their Hyperparameters
Optimization
Interpretability
Efficient ML
- Introduction of CASHomon sets, extending Rashomon sets to multiple model classes.
- Development of TruVaRImp, an active learning algorithm for efficient level set estimation.
- Empirical results show TruVaRImp outperforms traditional methods in identifying CASHomon set members.
- Analysis reveals significant variability in feature importance across model classes.
Read more
CASHomon Sets: Efficient Rashomon Sets Across Multiple Model Classes and their Hyperparameters
Summary
This paper introduces the concept of CASHomon sets, which extend the traditional Rashomon sets to encompass multiple model classes and their hyperparameters in the context of combined algorithm selection and hyperparameter optimization (CASH). The authors propose TruVaRImp, a model-based active learning algorithm designed for level set estimation (LSE) with an implicit threshold, providing theoretical guarantees for its convergence. Through empirical evaluations on both synthetic and real-world datasets, TruVaRImp demonstrates superior performance in identifying members of CASHomon sets compared to naive sampling, Bayesian optimization, and other baseline methods. The study also highlights the variability in predictive multiplicity and feature importance across different model classes, challenging the conventional reliance on single model interpretations in applied machine learning.
Methodology
The authors developed TruVaRImp, a model-based active learning algorithm that estimates level sets with an implicit threshold. This approach is framed within the context of hyperparameter optimization and algorithm selection, allowing for efficient identification of well-performing models across various classes. The methodology includes theoretical accuracy guarantees and empirical comparisons against established baselines.
Results
TruVaRImp effectively identifies CASHomon set members, achieving competitive or superior results compared to naive sampling, AutoML pipelines, and other level set estimation methods across nine datasets. The analysis of predictive multiplicity indicates that the capacity of CASHomon sets can differ from traditional Rashomon sets, and feature importance values vary significantly across model classes.
Implications
The findings suggest that machine learning practitioners should consider multiple model classes and their associated interpretations rather than relying on a single model. This approach can enhance model selection based on domain knowledge and user preferences, potentially leading to more interpretable and reliable machine learning applications.
RXNRECer Enables Fine-grained Enzymatic Function Annotation through Active Learning and Protein Language Models
NLP
Large Language Models
Interpretability
- RXNRECer directly predicts enzyme-catalyzed reactions, bypassing the limitations of EC number reliance.
- The framework integrates protein language modeling and active learning for improved annotation accuracy.
- Significant performance improvements were observed over traditional EC-based methods.
- RXNRECer supports scalable applications in proteome-wide annotation and enzyme promiscuity identification.
Read more
RXNRECer Enables Fine-grained Enzymatic Function Annotation through Active Learning and Protein Language Models
Summary
The paper presents RXNRECer, a novel transformer-based ensemble framework designed to enhance the annotation of enzymatic functions by directly predicting enzyme-catalyzed reactions without relying on the traditional Enzyme Commission (EC) numbers. The authors identify the limitations of existing methods that depend on EC numbers, which introduce ambiguities and inconsistencies due to their many-to-many mappings and frequent updates. RXNRECer integrates protein language modeling and active learning to capture both high-level sequence semantics and fine-grained transformation patterns. The framework's performance is validated through extensive evaluations on curated datasets, demonstrating significant improvements over six EC-based baselines, achieving gains of 16.54% in F1 score and 15.43% in accuracy. Additionally, RXNRECer facilitates scalable proteome-wide reaction annotation, enhances specificity in refining reaction schemas, and systematically annotates previously uncurated proteins. The incorporation of large language models also provides interpretable prediction rationales, making RXNRECer a robust solution for enzyme function prediction with broad applications in enzyme research and industrial contexts.
Methodology
RXNRECer employs a transformer-based dynamic ensemble learning framework that combines protein language modeling to capture sequence semantics, an active learning strategy for efficient fine-tuning, and a dynamic ensemble module that emphasizes direct reaction-level predictions while incorporating EC-based and PLM similarity signals for robustness.
Results
The evaluations showed that RXNRECer outperformed traditional multiple sequence alignment (MSA)-based approaches, EC-based tools, and recent protein language model (PLM)-based methods, achieving a 16.54% increase in F1 score and a 15.43% increase in accuracy on curated datasets.
Implications
The capabilities of RXNRECer suggest significant potential for advancing enzyme function annotation in various biological and industrial applications, including metabolic engineering, biosynthetic pathway design, and biopharmaceutical development.
M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
NLP
Large Language Models
Efficient ML
- M2RNN introduces matrix-valued hidden states and non-linear transitions for improved language modeling.
- The architecture overcomes limitations of linear RNNs, particularly in state tracking and long-context retrieval.
- Empirical results demonstrate significant performance gains over existing models with smaller state sizes.
- Hybrid M2RNN models outperform traditional attention-based architectures in long-context tasks.
Read more
M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
Summary
The paper introduces the Matrix-to-Matrix RNN (M2RNN), a novel architecture designed to enhance language modeling capabilities by utilizing matrix-valued hidden states and non-linear state transitions. The authors argue that while Transformers excel in parallel processing, they are constrained by their computational complexity, which limits their ability to handle tasks requiring greater expressiveness, such as entity tracking and code execution. The M2RNN architecture addresses the limitations of both linear and non-linear RNNs by expanding state sizes, which improves language modeling performance and enables efficient use of tensor cores. Empirical results show that M2RNN achieves perfect state tracking generalization at unseen sequence lengths, outperforming existing models like Gated DeltaNet hybrids by 0.4–0.5 perplexity points while using smaller state sizes. Additionally, integrating even a single M2RNN layer into existing architectures significantly boosts accuracy with minimal impact on training throughput. The findings suggest that non-linear RNN layers can serve as effective components for scalable language models, particularly in hybrid settings that combine recurrent and attention mechanisms.
Methodology
The authors developed the M2RNN architecture, which employs matrix-valued states and non-linear transitions to enhance expressiveness. They conducted empirical evaluations comparing M2RNN with existing models, focusing on language modeling performance, state tracking, and long-context retrieval capabilities. The experiments included hybrid architectures that interleave recurrent and attention layers to assess performance improvements.
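The core idea of a matrix-valued, non-linear recurrence can be sketched as follows; the update rule and names are illustrative stand-ins, not the paper's equations:

```python
import numpy as np

def m2rnn_step(H, x, W_in, W_left, W_right):
    """One step of a non-linear RNN with a matrix-valued hidden state H:
    the previous state is mixed on both sides (left/right matrix products,
    which map well to tensor cores) and the token is written in as a
    rank-1 outer product, then squashed by tanh."""
    write = np.outer(W_in @ x, x)                 # inject token into the state
    return np.tanh(W_left @ H @ W_right + write)  # non-linear transition
```

The matrix state holds d² values for O(d²) parameters per transition, which is the state-size expansion the paper argues is needed for state tracking, while the non-linearity is what separates it from linear RNN/SSM layers.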
Results
M2RNN achieved superior language modeling performance, with empirical results indicating a reduction in perplexity by 0.4–0.5 points compared to Gated DeltaNet hybrids. The model demonstrated perfect state tracking generalization at sequence lengths not encountered during training and outperformed state-of-the-art hybrid linear attention architectures by up to 8 points on long-context tasks.
Implications
The introduction of M2RNN suggests a promising direction for developing more expressive and efficient language models. Its ability to improve state tracking and long-context retrieval could enhance applications in natural language processing, particularly in tasks requiring complex reasoning and context management.
Robust and Computationally Efficient Linear Contextual Bandits under Adversarial Corruption and Heavy-Tailed Noise
Theory
Efficient ML
Optimization
- Introduces a new algorithm (CR-Hvt-UCB) that is robust to both adversarial corruption and heavy-tailed noise.
- Achieves computational efficiency with O(1) per-round updates, contrasting with existing methods that are computationally expensive.
- Establishes regret bounds that are applicable even when the noise moment bounds and total corruption are unknown.
- Generalizes existing results under finite-variance assumptions, providing a more flexible framework for real-world applications.
Read more
Robust and Computationally Efficient Linear Contextual Bandits under Adversarial Corruption and Heavy-Tailed Noise
Summary
This paper addresses the challenges of linear contextual bandits in the presence of adversarial corruption and heavy-tailed noise, which are common in real-world applications. Existing algorithms typically assume finite variance, leading to computational inefficiencies and limitations in robustness. The authors propose a novel algorithm, corruption-robust heavy-tailed UCB (CR-Hvt-UCB), which operates under a bounded (1 + ϵ)-moment condition for noise. This algorithm achieves O(1) computational cost per round, significantly improving efficiency compared to prior methods that incur O(t log T) costs. The paper establishes regret bounds that account for both the noise moment and the total corruption, demonstrating that the algorithm can maintain sublinear regret even without prior knowledge of these parameters. The results show that the proposed method not only generalizes existing finite-variance approaches but also matches or improves upon the best-known rates for linear contextual bandits with heavy-tailed noise when no corruption is present.
Methodology
The authors utilize online mirror descent (OMD) to update the algorithm based on a Huber-based loss function. This approach allows for adaptive scaling of the loss to manage the influence of adversarial corruption and heavy-tailed noise, ensuring both robustness and computational efficiency.
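The robustness mechanism is easiest to see in the Huber gradient itself: the residual's influence on the update is clipped, so a single corrupted or heavy-tailed reward can only move the estimate a bounded amount. The sketch below uses a plain gradient step as a simplified stand-in for the paper's mirror-descent update:

```python
import numpy as np

def huber_grad(residual, tau):
    """Gradient of the Huber loss: linear inside [-tau, tau], clipped
    outside, capping the influence of heavy-tailed or corrupted rewards."""
    return np.clip(residual, -tau, tau)

def huber_update(theta, x, reward, tau=1.0, lr=0.05):
    """One O(1)-per-round online update of a linear reward model."""
    residual = float(x @ theta - reward)
    return theta - lr * huber_grad(residual, tau) * x
```

On clean data the update behaves like ordinary least-mean-squares once residuals fall inside the clipping band; on a corrupted round, the parameter shift is at most lr * tau * ||x||, regardless of how large the corruption is.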
Results
The proposed algorithm achieves regret bounds that scale with the square root of the cumulative squared (1 + ϵ)-moment bounds of the noise, plus a linear term related to the total amount of corruption. When ϵ = 1, the results align with existing guarantees for finite-variance noise. The algorithm remains effective even without knowledge of the noise moment bounds or corruption levels.
Implications
The findings suggest that the CR-Hvt-UCB algorithm can be effectively applied in various domains where adversarial corruption and heavy-tailed noise are prevalent, such as online advertising, recommendation systems, and finance. Its computational efficiency makes it suitable for large-scale and real-time applications.
Competition-Aware CPC Forecasting with Near-Market Coverage
Time Series
Graph Learning
Optimization
- CPC forecasting is reframed as a problem of partial competition observability.
- Observable proxies for competition, including semantic, behavioral, and geographic signals, enhance forecasting accuracy.
- Competition-aware forecasting shows significant improvements in stability and error profiles, especially for high-CPC, high-volatility keywords.
- The methodology is validated using a large-scale dataset from the car-rental sector, demonstrating practical applicability.
Read more
Competition-Aware CPC Forecasting with Near-Market Coverage
Summary
This paper addresses the challenge of forecasting cost-per-click (CPC) in paid search advertising, which is influenced by a competitive landscape that is only partially observable. The authors utilize Google Ads auction logs from a concentrated car-rental market to forecast weekly CPC for 1,811 keyword series. They introduce a novel approach to approximate latent competition through observable proxies, including semantic neighborhoods, behavioral neighborhoods via Dynamic Time Warping (DTW), and geographic-intent covariates. These proxies are evaluated both as standalone covariates and as relational priors in spatiotemporal graph forecasters. The results indicate that competition-aware augmentation significantly improves forecasting stability and accuracy, particularly at medium and longer horizons where competitive dynamics are more volatile. The study emphasizes the importance of understanding competition in CPC forecasting and demonstrates that the proposed methods provide a scalable solution to enhance forecasting in auction-driven markets.
Methodology
The authors construct competition proxies using pretrained transformer-based representations for semantic neighborhoods, Dynamic Time Warping for behavioral neighborhoods, and geographic-intent covariates. They benchmark these proxies against various models, including classical statistical methods, time-series foundation models, and spatiotemporal graph neural networks, evaluating their effectiveness in forecasting CPC.
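The behavioral neighborhoods rest on DTW distances between CPC series; a textbook O(n·m) dynamic-programming implementation is enough to define nearest behavioral neighbors for a keyword:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1D series:
    D[i, j] is the cheapest alignment cost of a[:i] and b[:j], allowing
    stretches and compressions along the time axis."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

The point of DTW here is that two keywords with the same CPC pattern shifted or stretched in time (e.g. the same seasonal bidding ramp starting a week apart) still come out as close neighbors, where plain Euclidean distance would not.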
Results
The study finds that incorporating competition-aware proxies leads to improved forecasting accuracy and stability, particularly at medium and longer time horizons. The most substantial gains are observed in high-CPC and high-volatility keyword regimes, highlighting the economic importance of accurate forecasting in these areas.
Implications
The findings suggest that advertisers can enhance their CPC forecasting strategies by integrating competition-aware methods, leading to better budget management and decision-making in auction-driven markets. This approach could be applied to other sectors where competitive dynamics are similarly complex and partially observable.
L2GTX: From Local to Global Time Series Explanations
Time Series
- L2GTX is a fully model-agnostic method for generating global explanations in time series classification.
- The method aggregates local explanations from a selective set of instances to form class-wise global insights.
- L2GTX effectively reduces redundancy in explanations by clustering local temporal event primitives.
- The approach maintains high fidelity in global explanations, as evidenced by stable mean local surrogate fidelity (R²) across datasets.
Read more
L2GTX: From Local to Global Time Series Explanations
Summary
The paper introduces L2GTX, a model-agnostic method designed to generate class-wise global explanations for time series classification by aggregating local explanations derived from a representative set of time series instances. The authors identify three main limitations in existing time series explanation methods: the inadequacy of model-agnostic XAI methods for time series data, the underexplored nature of global explanation synthesis for time series, and the model-specificity of existing global approaches. L2GTX addresses these challenges by first extracting local explanations using LOMATCE, which identifies parameterized temporal event primitives such as trends and local extrema. These local explanations are clustered and consolidated to reduce redundancy, and an instance-cluster importance matrix is constructed to assess global relevance. The method then selects a diverse subset of instances that maximizes coverage of influential clusters under a user-defined budget. The selected events are aggregated to form concise, class-wise global explanations that summarize key attributes of the temporal patterns. Experimental results on six benchmark datasets demonstrate that L2GTX produces interpretable global explanations while maintaining stable global faithfulness across varying levels of explanation consolidation.
Methodology
L2GTX employs a two-step process: first, it extracts local explanations using LOMATCE, which identifies significant temporal event primitives. These local explanations are clustered to reduce redundancy, and an instance-cluster importance matrix is created to evaluate global relevance. Under a defined instance selection budget, L2GTX selects representative instances that maximize coverage of influential clusters, and aggregates the relevant events into concise global explanations.
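The budgeted instance-selection step is a max-coverage problem, for which greedy selection is the standard sketch; the `importance[i][c]` layout below (instance i scored on cluster c) is an assumption about how the importance matrix is stored, not the paper's exact interface:

```python
def select_instances(importance, budget):
    """Greedy budgeted max-coverage: each round picks the instance that
    adds the most not-yet-covered cluster importance, stopping at the
    budget or when nothing new can be covered."""
    selected, covered = [], set()
    candidates = list(range(len(importance)))
    for _ in range(budget):
        if not candidates:
            break
        def gain(i):
            return sum(w for c, w in enumerate(importance[i])
                       if c not in covered and w > 0)
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break
        selected.append(best)
        covered.update(c for c, w in enumerate(importance[best]) if w > 0)
        candidates.remove(best)
    return selected
```

Greedy selection is the natural choice here because max-coverage is NP-hard but the greedy rule carries a classical (1 - 1/e) approximation guarantee, so a small instance budget can still cover most of the influential clusters.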
Results
The experiments conducted on six benchmark time series datasets show that L2GTX is capable of producing compact and interpretable global explanations while ensuring stable global faithfulness, as measured by mean local surrogate fidelity (R²), across different levels of explanation consolidation.
Implications
L2GTX has significant implications for enhancing the interpretability of time series classifiers in various domains, including finance, healthcare, and sensor monitoring. By providing clear and structured explanations, it can improve trust in AI systems, facilitate error analysis, and support regulatory compliance in critical applications.
Federated Learning of Binary Neural Networks: Enabling Low-Cost Inference
Federated Learning
Efficient ML
- FedBNN achieves lower runtime computation and memory complexity compared to traditional real-valued models.
- The framework utilizes binary weights to significantly reduce model size and inference time.
- Comprehensive evaluations show FedBNN's competitive performance against state-of-the-art federated learning methods.
- FedBNN is designed to operate efficiently under various data heterogeneity settings.
Read more
Federated Learning of Binary Neural Networks: Enabling Low-Cost Inference
Summary
This paper introduces FedBNN, a novel framework for Federated Learning (FL) that focuses on training Binary Neural Networks (BNNs) to address the challenges of computational and memory efficiency during inference on resource-constrained edge devices. Traditional deep neural networks (DNNs) are often too resource-intensive for edge deployment, leading to a need for models that optimize both memory footprint and computational efficiency. FedBNN directly learns binary representations during local training, encoding weights as single bits (+1, -1) rather than 32-bit floats, which significantly reduces the model size and runtime computational demands. The authors conduct extensive evaluations across multiple benchmark datasets, including FMNIST, SVHN, CIFAR-10, TinyImageNet, and FEMNIST, comparing FedBNN against state-of-the-art federated methods such as FedAvg, FedBAT, and FedMUD. The results demonstrate that FedBNN not only reduces resource consumption but also maintains performance comparable to existing federated methods using real-valued models, thereby enabling practical deployment of federated vision systems.
Methodology
FedBNN employs a federated learning strategy inspired by FedAvg, training a rotated binary neural network with binary weights while maintaining the same parameter count as its real-valued counterparts. The framework learns binary representations directly during local training, which enhances efficiency in terms of memory and computational resources.
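The weight-binarization idea can be sketched as sign-plus-scale quantization combined with plain FedAvg over the latent real-valued weights. The mean-magnitude scaling below is a common BNN trick, not necessarily the paper's exact rotation-based variant:

```python
import numpy as np

def binarize(w):
    """Forward-pass binarization: sign(w) scaled by the mean magnitude.
    During training the real-valued w is kept and updated (typically via a
    straight-through estimator); only the binarized copy is used at inference."""
    return np.abs(w).mean() * np.sign(w)

def fedavg(client_weights, client_sizes):
    """Standard FedAvg: size-weighted average of the clients' latent
    real-valued weights."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))
```

Storing one sign bit per weight (plus a scale per tensor) is where the roughly 32x memory reduction over float weights comes from, and sign-valued weights reduce multiply-accumulates to additions at inference.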
Results
The evaluations reveal that FedBNN significantly reduces resource consumption, achieving similar performance levels to existing federated methods that utilize real-valued models. The analysis of runtime complexity indicates substantial advantages in computational cost and memory consumption.
Implications
The development of FedBNN has significant implications for deploying federated learning in resource-constrained environments, particularly in applications such as mobile devices and IoT systems where efficient inference is critical. This framework can facilitate privacy-preserving collaborative training while ensuring that models remain lightweight and efficient.
Exact Federated Continual Unlearning for Ridge Heads on Frozen Foundation Models
Federated Learning
- Introduces a method for exact federated continual unlearning using ridge regression heads on frozen foundation models.
- Develops a communication-efficient protocol that supports continual add/delete requests without retraining.
- Proves deterministic exactness and invariance properties of the proposed method.
- Demonstrates experimental validation across multiple benchmarks with high accuracy and low computational cost.

Exact Federated Continual Unlearning for Ridge Heads on Frozen Foundation Models
Summary
This paper addresses the challenge of exact federated continual unlearning in the context of frozen foundation models with ridge regression heads. The authors highlight the necessity for models to support the 'right to be forgotten' by enabling the removal of specific training samples without the need for complete retraining. Existing methods in federated unlearning often rely on approximate techniques, which can be costly and imprecise. The authors propose a novel protocol that leverages the analytic structure of ridge regression, allowing for efficient updates based on two sufficient statistics: the feature Gram matrix and the feature-label moment. This approach enables the server to maintain an exact model that is equivalent to centralized retraining after each add/delete request, using fixed-size messages for communication. The paper presents two server-side implementations that ensure numerical stability and efficiency. Experimental results demonstrate that both implementations achieve high accuracy, matching centralized retraining with minimal computational cost, thus providing a practical solution for federated continual unlearning.
Methodology
The authors formalize federated continual unlearning for frozen foundation models with ridge heads, deriving a protocol that maintains sufficient statistics for model updates. They present two server-side variants: one that solves a symmetric positive definite system for exact updates and another that uses Sherman-Morrison-Woodbury updates for incremental tracking. The method ensures that the model remains equivalent to centralized retraining after each request.
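The add/delete mechanics reduce to rank-one updates of the two sufficient statistics. The minimal pure-Python sketch below (hypothetical names; the paper's protocol additionally covers the communication scheme and the Sherman-Morrison-Woodbury variant) shows why unlearning is exact: deleting a sample subtracts precisely the contribution its addition made.

```python
class RidgeStats:
    """Sufficient statistics for a ridge head: the feature Gram matrix
    G = X^T X and the feature-label moment b = X^T y."""

    def __init__(self, d):
        self.G = [[0.0] * d for _ in range(d)]
        self.b = [0.0] * d

    def add(self, x, y):
        # Rank-one update for an added sample (x, y).
        for i in range(len(x)):
            self.b[i] += y * x[i]
            for j in range(len(x)):
                self.G[i][j] += x[i] * x[j]

    def delete(self, x, y):
        # Exact unlearning: subtract the same rank-one contribution,
        # leaving the statistics identical to never having seen (x, y).
        for i in range(len(x)):
            self.b[i] -= y * x[i]
            for j in range(len(x)):
                self.G[i][j] -= x[i] * x[j]
```

After each request the server would solve (G + λI)w = b; because add/delete is exact on (G, b), the resulting head matches centralized retraining.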
Results
The experimental results show that both server-side implementations match centralized ridge retraining to within 10^-9 relative Frobenius error across various benchmarks (CIFAR-10, CIFAR-100, FeMNIST, Sentiment140) and efficiently handle streams of deletion requests at significantly lower costs compared to existing federated learning baselines.
Implications
This work provides a robust framework for implementing federated learning systems that comply with privacy regulations, particularly in sensitive domains like healthcare and finance. The exact unlearning capability enhances user trust and data governance by allowing for precise data removal.
ADV-0: Closed-Loop Min-Max Adversarial Training for Long-Tail Robustness in Autonomous Driving
Robotics
Optimization
Theory
- ADV-0 is the first closed-loop training framework for long-tail problems in autonomous driving.
- It couples adversarial generation and policy optimization in an end-to-end manner.
- The framework employs a preference-based solution to the zero-sum game, ensuring stability and efficiency.
- Theoretically, ADV-0 guarantees convergence to a Nash Equilibrium and certified performance bounds.
ADV-0: Closed-Loop Min-Max Adversarial Training for Long-Tail Robustness in Autonomous Driving
Summary
The paper introduces ADV-0, a novel closed-loop min-max adversarial training framework aimed at enhancing the robustness of autonomous driving systems against long-tail scenarios, which are rare yet critical for safety. Traditional adversarial training methods often separate scenario generation from policy optimization, leading to misalignment of objectives and failure to adapt to evolving policies. ADV-0 addresses these issues by framing the interaction between the driving policy (defender) and the adversarial agent (attacker) as a zero-sum Markov game, where the attacker’s utility is directly aligned with the defender’s objectives. This alignment allows for the identification of an optimal adversary distribution. The authors propose an online iterative preference learning algorithm to facilitate dynamic adversary evolution, enabling the attacker to continuously adapt to the defender's shifting vulnerabilities. Theoretical guarantees are provided, showing that ADV-0 converges to a Nash Equilibrium and maximizes a certified lower bound on real-world performance. Empirical results demonstrate that ADV-0 effectively reveals diverse safety-critical failures and significantly improves the generalizability of learned policies and motion planners against unseen long-tail risks.
Methodology
The authors model the autonomous driving task as a Markov Decision Process (MDP) and utilize a closed-loop min-max optimization framework. They propose an online iterative preference learning algorithm that allows the adversary to adaptively evolve based on the defender's performance, thus maintaining alignment between the two players' objectives.
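One way to picture the attacker's adaptive evolution is a multiplicative-weights-style reweighting of scenario probabilities toward the defender's current failures; this is an illustrative stand-in under assumed names, not the paper's exact preference-learning algorithm.

```python
import math

def adapt_adversary(probs, defender_losses, lr=1.0):
    # Zero-sum alignment: the attacker's utility on a scenario is the
    # defender's loss there, so scenarios where the current policy fails
    # gain probability mass (an exponentiated, preference-style update).
    weights = [p * math.exp(lr * loss) for p, loss in zip(probs, defender_losses)]
    z = sum(weights)
    return [w / z for w in weights]
```

Alternating this attacker update with defender policy optimization is the closed loop the framework analyzes for convergence to a Nash Equilibrium.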
Results
The experiments demonstrate that ADV-0 effectively exposes a wide range of safety-critical failure scenarios and significantly enhances the generalizability of both learned policies and motion planners when faced with unseen long-tail risks.
Implications
The findings suggest that ADV-0 could be instrumental in developing more robust autonomous driving systems capable of handling rare but critical scenarios, thereby improving safety in real-world applications.
Visualizing Critic Match Loss Landscapes for Interpretation of Online Reinforcement Learning Control Algorithms
Reinforcement Learning
Interpretability
Optimization
- Introduces a visualization method for analyzing critic match loss landscapes in online RL.
- Constructs a three-dimensional loss surface and two-dimensional optimization path to characterize critic learning behavior.
- Demonstrates the method using ADHDP on cart-pole and spacecraft control tasks.
- Provides quantitative indices for structured comparison of training outcomes.
Visualizing Critic Match Loss Landscapes for Interpretation of Online Reinforcement Learning Control Algorithms
Summary
This paper addresses the challenges of interpreting online reinforcement learning (RL) algorithms, particularly those utilizing an actor-critic structure. The authors propose a novel method for visualizing the critic match loss landscape, which aids in understanding the optimization process of the critic neural network. By projecting recorded critic parameter trajectories onto a low-dimensional linear subspace, the method generates a three-dimensional loss surface and a two-dimensional optimization path that characterize the critic's learning behavior. The study introduces quantitative landscape indices and a normalized system performance index, facilitating structured comparisons across different training outcomes. The proposed framework is demonstrated using the Action-Dependent Heuristic Dynamic Programming (ADHDP) algorithm on cart-pole and spacecraft attitude control tasks. The results reveal distinct landscape characteristics associated with stable convergence and unstable learning, thus providing insights into the optimization behavior of the critic in online RL.
Methodology
The authors construct a critic match loss landscape by projecting critic parameter trajectories onto a low-dimensional subspace. They evaluate the critic match loss over a parameter grid using fixed reference state samples and temporal-difference targets, resulting in a visual representation of the optimization process.
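A simple way to realize such a projection is to span a plane with two displacement directions taken from the recorded trajectory and Gram-Schmidt orthonormalize them; the paper's subspace construction may differ, and the helper below is only a sketch.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_trajectory(thetas):
    # Span a 2-D plane with two displacement directions of the recorded
    # critic parameter vectors, orthonormalize them, and project every
    # checkpoint onto that plane to obtain the 2-D optimization path.
    base = thetas[0]
    d1 = [a - b for a, b in zip(thetas[-1], base)]
    d2 = [a - b for a, b in zip(thetas[len(thetas) // 2], base)]
    n1 = math.sqrt(_dot(d1, d1)) or 1.0
    e1 = [x / n1 for x in d1]
    d2 = [x - _dot(d2, e1) * e for x, e in zip(d2, e1)]  # remove e1 component
    n2 = math.sqrt(_dot(d2, d2)) or 1.0
    e2 = [x / n2 for x in d2]
    return [(_dot([a - b for a, b in zip(t, base)], e1),
             _dot([a - b for a, b in zip(t, base)], e2)) for t in thetas]
```

The loss surface would then be evaluated on a grid in the (e1, e2) coordinates using the fixed reference states and temporal-difference targets.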
Results
The proposed visualization method successfully illustrates the learning behavior of the critic network, highlighting differences in landscape characteristics that correlate with stable and unstable learning outcomes. The comparative analyses across various projection methods and training stages demonstrate the effectiveness of the approach in interpreting critic optimization.
Implications
This framework enhances the interpretability of online reinforcement learning algorithms, allowing researchers and practitioners to better understand the underlying mechanisms of critic optimization. It can be applied in dynamic control scenarios where system parameters frequently change, improving the robustness and adaptability of RL systems.
Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection
Theory
- Introduction of the Budget-Sensitive Discovery Score (BSDS) as a formally verified evaluation metric.
- Discovery Quality Score (DQS) provides a single summary statistic that prevents budget cherry-picking.
- Case study shows that LLMs do not outperform existing machine learning models in drug discovery candidate selection.
- The framework is applicable to various domains beyond drug discovery, including safety triage and clinical trials.
Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection
Summary
This paper addresses the lack of a principled, budget-aware evaluation framework for AI-guided scientific selection, particularly in drug discovery. The authors introduce the Budget-Sensitive Discovery Score (BSDS), a formally verified metric that penalizes false discoveries and excessive abstention at each budget level. The BSDS is supported by 20 theorems verified using the Lean 4 proof assistant, ensuring its robustness. The paper also presents the Discovery Quality Score (DQS), a budget-averaged summary statistic that prevents inflation through cherry-picking budgets. The authors conduct a case study to evaluate the marginal value of large language models (LLMs) in drug discovery candidate selection, comparing 39 proposers, including mechanistic models and various LLM configurations. The findings reveal that a simple Random Forest-based proposer outperforms all LLM configurations, indicating that LLMs do not provide additional value over existing trained classifiers. The proposed framework is applicable to any domain involving budget-constrained candidate selection, such as materials screening and clinical trial site selection.
Methodology
The authors developed the BSDS metric, which incorporates penalties for false discoveries and excessive abstention, and verified its properties using the Lean 4 proof assistant. They evaluated 39 different proposers, including mechanistic models and LLM configurations, using SMILES representations on MoleculeNet datasets under various experimental conditions.
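To make the shape of such a metric concrete, here is one plausible form that penalizes false discoveries and excessive abstention at a given budget, together with its budget average; the paper's Lean-verified definitions may weight these terms differently.

```python
def bsds(selected, is_hit, budget, fd_pen=1.0, abstain_pen=0.5):
    # Score one selection at one budget: reward true discoveries,
    # penalize false ones, and penalize leaving budget unused.
    hits = sum(1 for i in selected if is_hit[i])
    false = len(selected) - hits
    abstained = budget - len(selected)
    return (hits - fd_pen * false - abstain_pen * abstained) / budget

def dqs(selections_by_budget, is_hit):
    # Budget-averaged summary: averaging over budgets prevents a
    # proposer from cherry-picking the single budget it looks best at.
    scores = [bsds(sel, is_hit, b) for b, sel in sorted(selections_by_budget.items())]
    return sum(scores) / len(scores)
```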
Results
The Random Forest-based Greedy-ML proposer achieved the best DQS of -0.046, outperforming all LLM configurations. No LLM surpassed the Greedy-ML baseline in either zero-shot or few-shot evaluations. The proposer hierarchy was found to generalize across multiple MoleculeNet benchmarks.
Implications
The proposed BSDS framework can significantly improve the evaluation of AI-guided selection processes in scientific discovery, ensuring that budget constraints and error costs are appropriately considered. This has implications for various fields, including drug discovery, materials screening, and autonomous vehicle safety.
Overcoming the Modality Gap in Context-Aided Forecasting
Time Series
Multimodal
- Introduced a semi-synthetic methodology for generating verifiably useful contexts from time series datasets.
- Created CAF-7M, a large corpus of 7 million context-augmented time series windows with a verified test set.
- Demonstrated that the proposed methodology enables effective context utilization and real-world transfer.
- Showed that dataset quality is a primary bottleneck in context-aided forecasting, rather than model architecture.
Overcoming the Modality Gap in Context-Aided Forecasting
Summary
This paper addresses the challenges in context-aided forecasting (CAF), where multimodal models often underperform compared to unimodal models. The authors hypothesize that this performance gap is due to poor context quality in existing datasets. To tackle this issue, they propose a semi-synthetic data augmentation method that generates high-quality contexts that are both descriptive of temporal dynamics and complementary to numerical histories. This methodology allows for the creation of a large-scale dataset, CAF-7M, consisting of 7 million context-augmented time series windows, including a rigorously verified test set. The authors demonstrate that their approach enables effective context utilization and transfers well to real-world applications. Their results indicate that dataset quality is a more significant bottleneck in CAF than architectural limitations, suggesting that improving context quality can enhance forecasting performance significantly.
Methodology
The authors developed a two-phase context-generation methodology: the generation phase uses a Large Language Model (LLM) to create plausible scenarios explaining differences in time series dynamics, while the verification phase assesses the usefulness of generated contexts by checking if they improve forecasting accuracy. This process ensures that only contexts that enhance predictive performance are retained.
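The verification phase amounts to a filter: a generated context is kept only if conditioning the forecaster on it actually lowers error on the window's target. A schematic version, with a hypothetical `forecast_error` callable standing in for the forecaster:

```python
def verify_contexts(windows, forecast_error):
    # windows: (history, candidate_context, target) triples.
    # Keep a context only when it demonstrably helps: forecasting with
    # the context must beat forecasting from the numeric history alone.
    kept = []
    for history, context, target in windows:
        with_ctx = forecast_error(history, target, context)
        without_ctx = forecast_error(history, target, None)
        if with_ctx < without_ctx:
            kept.append((history, context, target))
    return kept
```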
Results
The multimodal architecture DoubleCast, trained on the CAF-7M dataset, effectively leveraged context, achieving performance comparable to state-of-the-art models while being more cost-effective. It consistently outperformed its unimodal counterpart, Chronos, and demonstrated superior performance on the real-world ChatTime benchmark.
Implications
The findings suggest that enhancing the quality of contextual information in forecasting can lead to significant improvements in predictive accuracy. This has implications for various industries that rely on time series forecasting, such as finance, supply chain management, and operational planning.
Trust-Region Noise Search for Black-Box Alignment of Diffusion and Flow Models
Generative Models
Optimization
- Introduction of Trust-Region Search (TRS) for black-box optimization of noise samples in generative models.
- TRS achieves a balance between exploration and exploitation, enhancing adaptability to various tasks.
- Significant improvements in sample quality for text-to-image, molecule, and protein design tasks compared to existing methods.
- Minimal hyperparameter tuning required, making TRS versatile across different generative architectures.
Trust-Region Noise Search for Black-Box Alignment of Diffusion and Flow Models
Summary
This paper introduces a novel approach called Trust-Region Search (TRS) for optimizing noise samples in diffusion and flow models to align them with target rewards during inference. Traditional methods often rely on differentiable or computationally efficient reward models, which can be limiting. TRS treats the generative and reward models as black boxes, focusing solely on optimizing the source noise. This method strikes a balance between global exploration and local exploitation, making it adaptable to various generative settings with minimal hyperparameter tuning. The authors evaluate TRS across multiple tasks, including text-to-image generation, molecule design, and protein design, demonstrating that it significantly improves sample quality compared to existing methods. The results indicate that TRS can effectively handle expensive reward functions while maintaining high performance with limited computational resources.
Methodology
The TRS approach begins by exploring multiple seed noise samples, which are iteratively refined through local perturbations. The perturbations are adaptively controlled based on observed reward values, ensuring that the search remains within the data manifold, thus producing stable and high-quality samples. This method is inspired by Bayesian optimization techniques and is designed to work with black-box models without requiring internal modifications.
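Schematically, the search alternates global seeding with locally adaptive perturbation of the current best noise; the step rule and constants below are illustrative assumptions rather than the paper's exact procedure.

```python
import random

def trust_region_search(reward, dim, n_seeds=4, iters=30, sigma=0.5):
    # Global phase: sample several seed noise vectors and keep the best.
    seeds = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_seeds)]
    best = max(seeds, key=reward)
    best_r = reward(best)
    # Local phase: perturb within a trust region whose radius adapts to
    # observed rewards -- expand on improvement, contract otherwise.
    for _ in range(iters):
        cand = [z + random.gauss(0.0, sigma) for z in best]
        r = reward(cand)
        if r > best_r:
            best, best_r = cand, r
            sigma *= 1.1
        else:
            sigma *= 0.9
    return best, best_r
```

Both the generative model and the reward are consulted only through function evaluations, which is what makes the method black-box.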
Results
The evaluation of TRS across various tasks showed that it produced significantly more aligned and higher-quality samples than existing search heuristics and full-noise sequence search baselines, all while operating under the same computational budgets. The results highlighted improvements in aesthetic scores, property alignment, and designability metrics across the evaluated domains.
Implications
The TRS method has the potential to enhance the performance of generative models in practical applications, such as high-fidelity image synthesis, drug discovery, and protein engineering. Its adaptability and efficiency could lead to broader adoption in scenarios where precise control over generated outputs is required without extensive retraining.
Scribe Verification in Chinese manuscripts using Siamese, Triplet, and Vision Transformer Neural Networks
Computer Vision
- Introduces a unified framework for scribe verification using deep learning models.
- Compares Siamese, Triplet, and Vision Transformer architectures for handwriting analysis.
- Demonstrates that the MobileNetV3+ Custom Siamese model achieves superior performance.
- Utilizes two diverse datasets to evaluate model effectiveness in scribe verification.
Scribe Verification in Chinese manuscripts using Siamese, Triplet, and Vision Transformer Neural Networks
Summary
This paper investigates deep learning models for the task of scribe verification in Chinese manuscripts, aiming to determine whether two manuscript fragments were authored by the same individual. The study utilizes two datasets: the Tsinghua Bamboo Slips Dataset, which contains digitized images of ancient Chinese manuscripts, and a curated subset of the Multi-Attribute Chinese Calligraphy Dataset (MCCD), focusing on calligraphers with a substantial number of samples. The authors implement and compare various neural architectures, including Siamese networks, Triplet networks, and Vision Transformers (ViTs), to learn discriminative embeddings of handwriting images. The experimental results indicate that the MobileNetV3+ Custom Siamese model, trained with contrastive loss, achieves the highest or second-highest accuracy and area under the Receiver Operating Characteristic Curve (ROC AUC) across both datasets. The paper establishes a unified PyTorch-based framework for training and evaluating scribe verification models, incorporating dynamic sampling mechanisms and standardized evaluation protocols. This work provides a comprehensive comparison of different architectures, offering insights into their performance and trade-offs in the context of metric learning for scribe verification.
Methodology
The authors implemented Siamese networks trained with contrastive loss, Triplet networks optimized with triplet loss, and Vision Transformer models to learn discriminative embeddings from handwriting images. They established a PyTorch-based training and evaluation framework that includes dynamic pair- and triplet-sampling mechanisms, consistent preprocessing, and ROC-based evaluation protocols.
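The contrastive objective driving the Siamese branches has a standard closed form; a dependency-free sketch over a precomputed embedding distance:

```python
def contrastive_loss(distance, same_scribe, margin=1.0):
    # Siamese training signal: pull same-scribe pairs together
    # (quadratic in distance) and push different-scribe pairs at least
    # `margin` apart (hinge, then squared).
    if same_scribe:
        return distance ** 2
    return max(0.0, margin - distance) ** 2
```

The triplet variant instead compares an anchor-positive distance against an anchor-negative distance under a margin.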
Results
The MobileNetV3+ Custom Siamese model exhibited the best or second-best performance in terms of overall accuracy and ROC AUC on both the Tsinghua Bamboo Slips Dataset and the MCCD. The systematic evaluation of various architectures under identical conditions provided valuable insights into their effectiveness for scribe verification.
Implications
The findings of this study have significant implications for the fields of digital humanities and historical document analysis, enabling more accurate classification and authentication of ancient texts. The developed framework can be utilized for future research in scribe verification and could be adapted for other handwriting analysis tasks.
Representation Learning for Spatiotemporal Physical Systems
Theory
Time Series
Optimization
- Traditional machine learning methods for spatiotemporal systems focus on next-frame prediction, leading to high computational costs and error accumulation.
- The paper emphasizes the importance of downstream tasks, such as physical parameter estimation, as a measure of representation quality.
- Joint Embedding Predictive Architectures (JEPAs) outperform traditional pixel-level prediction methods in learning useful representations for scientific tasks.
- The study evaluates methods on three physical systems, demonstrating the effectiveness of latent space learning.
Representation Learning for Spatiotemporal Physical Systems
Summary
This paper addresses the limitations of traditional machine learning approaches in modeling spatiotemporal physical systems, which often focus on next-frame prediction and suffer from high computational costs and error compounding. The authors propose a shift in focus towards downstream scientific tasks, such as estimating governing physical parameters, which provide a clearer measure of the physical relevance of learned representations. They evaluate the effectiveness of self-supervised learning methods, particularly Joint Embedding Predictive Architectures (JEPAs), against traditional pixel-level prediction methods. The study reveals that JEPAs, which operate in a latent space rather than pixel space, consistently outperform pixel-based methods in terms of physical parameter estimation accuracy. The authors conduct experiments on three representative physical systems—active matter, shear flow, and Rayleigh-Bénard convection—to validate their findings, demonstrating that latent prediction models yield embeddings that better capture the underlying physical dynamics. This work highlights the potential of self-supervised learning paradigms in enhancing the understanding of complex physical systems through improved representation learning.
Methodology
The authors introduce and compare two self-supervised learning paradigms: Joint Embedding Predictive Architectures (JEPAs) and masked autoencoding. JEPAs focus on predicting representations in a latent space, while masked autoencoders aim to minimize pixel-level reconstruction error. The models are evaluated based on their ability to estimate physical parameters in three spatiotemporal systems governed by partial differential equations (PDEs).
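The contrast between the two paradigms is visible in their objectives: JEPAs score predictions in latent space, masked autoencoders in pixel space. A minimal sketch over flat vectors:

```python
def jepa_loss(pred_latent, target_latent):
    # Latent prediction: match the *representation* of the target frame,
    # so the model need not spend capacity on pixel-level detail.
    n = len(pred_latent)
    return sum((p - t) ** 2 for p, t in zip(pred_latent, target_latent)) / n

def masked_ae_loss(recon, pixels, mask):
    # Masked autoencoding: reconstruct the masked pixels themselves.
    errs = [(r - p) ** 2 for r, p, m in zip(recon, pixels, mask) if m]
    return sum(errs) / len(errs)
```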
Results
The results indicate that JEPAs consistently achieve lower prediction errors for physical parameters compared to traditional pixel-level methods. This suggests that embeddings learned through latent prediction models are more effective in capturing the essential dynamics of the physical systems under study.
Implications
The findings suggest that adopting self-supervised learning approaches, particularly those that leverage latent space representations, can enhance the modeling of complex physical systems. This has potential applications in various scientific fields, including physics, engineering, and environmental science, where accurate modeling of dynamic systems is crucial.
LaPro-DTA: Latent Dual-View Drug Representations and Salient Protein Feature Extraction for Generalizable Drug–Target Affinity Prediction
Graph Learning
- Introduces a latent dual-view drug representation to mitigate overfitting and enhance generalization.
- Employs a salient protein feature extraction strategy to improve the identification of relevant bioactive regions.
- Utilizes a cross-view multi-head attention mechanism for comprehensive interaction modeling.
- Achieves state-of-the-art performance on benchmark datasets, particularly in unseen-drug scenarios.
LaPro-DTA: Latent Dual-View Drug Representations and Salient Protein Feature Extraction for Generalizable Drug–Target Affinity Prediction
Summary
The paper presents LaPro-DTA, a novel framework aimed at improving drug-target affinity (DTA) prediction, particularly in cold-start scenarios where unseen drugs or targets are involved. Traditional methods often suffer from overfitting and information loss, leading to poor generalization. LaPro-DTA addresses these issues through a latent dual-view drug representation mechanism that combines an instance-level view for capturing fine-grained substructures with a distribution-level view for generalizing chemical scaffolds. This dual approach helps the model learn transferable structural rules instead of memorizing specific training samples. Additionally, the framework employs a salient protein feature extraction strategy using pattern-aware top-k pooling, which filters out irrelevant background noise and isolates significant bioactive regions. A cross-view multi-head attention mechanism is then utilized to fuse these refined features, enhancing the modeling of drug-target interactions. The authors conducted extensive experiments on benchmark datasets, demonstrating that LaPro-DTA significantly outperforms existing state-of-the-art methods, achieving an 8% reduction in mean squared error on the Davis dataset under challenging unseen-drug conditions, while also providing interpretable insights into binding mechanisms.
Methodology
LaPro-DTA consists of three main components: a latent dual-view drug representation mechanism that captures both instance-level and distribution-level features, a salient protein feature extraction strategy using pattern-aware top-k pooling to filter out irrelevant data, and a cross-view multi-head attention mechanism for integrating these features to predict drug-target affinity.
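The salient-feature step can be pictured as saliency-scored top-k pooling over residue features; the saliency scoring itself (the "pattern-aware" part) is assumed to be given here.

```python
def topk_pool(scores, features, k):
    # Keep only the k residues with the highest saliency scores,
    # preserving their original sequence order, and drop the rest
    # as background noise.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:k])
    return [features[i] for i in keep]
```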
Results
The framework was evaluated on standard benchmark datasets, showing a significant improvement in performance, with an 8% reduction in mean squared error on the Davis dataset in unseen-drug settings, outperforming existing methods and providing enhanced interpretability.
Implications
LaPro-DTA has the potential to accelerate drug discovery by providing more reliable predictions in scenarios where traditional methods fail, particularly in identifying affinities for novel drugs and targets. Its interpretability can also aid researchers in understanding the underlying mechanisms of drug-target interactions.
RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse
Large Language Models
Efficient ML
NLP
- RelayCaching enables efficient reuse of KV caches in multi-agent LLM systems.
- The method achieves over 80% KV cache reuse and reduces TTFT by up to 4.7 times.
- RelayCaching maintains accuracy comparable to full prefilling with minimal overhead.
- The approach systematically characterizes KV deviations, allowing targeted rectification.
RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse
Summary
The paper introduces RelayCaching, a novel method aimed at optimizing the collaboration of multi-agent large language models (LLMs) by reusing key-value (KV) caches during the decoding phase. Traditional multi-agent systems face significant challenges due to redundant computations during the prefill phase, which arise from the cascading nature of outputs from one agent serving as inputs for another. This redundancy leads to increased memory usage and longer time-to-first-token (TTFT). Existing methods for optimizing KV cache reuse either compromise accuracy or have low reuse rates due to rigid constraints. RelayCaching addresses these issues by allowing the direct reuse of KV caches from previous agents, leveraging the observation that KV caches for identical content remain consistent across phases, with deviations being sparse and localized. The method employs a layer-range profiler to identify critical layers for rectification and a token selector to determine which tokens need adjustment. This approach maintains high accuracy while achieving over 80% KV cache reuse and reducing TTFT by up to 4.7 times compared to standard methods, all with minimal accuracy degradation. The experiments conducted across various collaborative LLM tasks demonstrate the effectiveness of RelayCaching in enhancing efficiency without sacrificing performance.
Methodology
RelayCaching is a training-free inference method that reuses decoding KV caches from previous agents during the prefill phase. It includes a layer-range profiler to identify critical layers for rectification and a token selector that combines deviation-based and influence-based selection to pinpoint tokens needing adjustment. This selective recomputation allows for high accuracy while minimizing computational overhead.
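The deviation-based half of the token selector can be sketched as follows, assuming per-token deviation estimates are already available from the profiled layers (in practice they come from the layer-range profiler, not a full recomputation, which would defeat the purpose):

```python
def tokens_to_rectify(deviation, budget):
    # Reuse the previous agent's KV cache wholesale, then pick the few
    # tokens whose cached entries deviate most for selective
    # recomputation, capped by a fixed budget.
    order = sorted(range(len(deviation)), key=lambda i: deviation[i], reverse=True)
    return sorted(order[:budget])
```

Because deviations are sparse and localized, a small budget recovers most of the accuracy of full prefilling.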
Results
The experiments show that RelayCaching achieves over 80% reuse of KV caches, significantly reduces TTFT by up to 4.7 times compared to standard pipelines, and maintains generation quality with negligible accuracy degradation across diverse collaborative LLM tasks.
Implications
RelayCaching has the potential to enhance the efficiency of multi-agent LLM systems, making them more scalable and practical for complex AI tasks. This could lead to faster response times and reduced resource consumption in applications such as software development, scientific discovery, and other collaborative workflows involving LLMs.
Beyond Attention: True Adaptive World Models via Spherical Kernel Operator
Theory
NLP
Reinforcement Learning
- Critiques the limitations of conventional world model approaches that rely on latent space projections.
- Introduces the Spherical Kernel Operator (SKO) as a replacement for traditional attention mechanisms.
- SKO utilizes localized ultraspherical polynomials to achieve direct function approximation without saturation.
- Empirical results show that SKO accelerates convergence and improves performance in autoregressive language modeling.
Beyond Attention: True Adaptive World Models via Spherical Kernel Operator
Summary
This paper critiques the conventional approach to world model-based AI, which relies on projecting high-dimensional observations into parameterized latent spaces. The author argues that this method is flawed as it merely shifts the manifold learning problem into the latent space, leading to inefficiencies when the underlying data distribution changes. The paper introduces the Spherical Kernel Operator (SKO), a new framework that replaces traditional attention mechanisms with a more robust 'inner world model.' SKO projects the data manifold onto a hypersphere and employs localized ultraspherical polynomials for direct integral reconstruction of target functions. This innovative approach circumvents the saturation phenomenon associated with positive kernel estimators, allowing for improved predictive capacity that is less affected by the curse of dimensionality. Empirical evaluations demonstrate that SKO significantly accelerates convergence and outperforms standard attention mechanisms in autoregressive language modeling tasks. Ultimately, the paper posits that SKO provides a mathematically sound basis for constructing true world models in AI, reflecting the cognitive processes of biological agents.
Methodology
The paper formulates the Spherical Kernel Operator (SKO) by projecting data onto a hypersphere and employing localized ultraspherical polynomials for integral reconstruction of target functions. This approach avoids the saturation phenomenon typical of positive kernel estimators, allowing for more effective learning of transition dynamics without bias from observation frequency.
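The mathematical ingredients are concrete: project inputs onto the unit hypersphere and evaluate ultraspherical (Gegenbauer) polynomials, computable by a three-term recurrence. A sketch of those two pieces (the paper's localized construction builds on them):

```python
import math

def project_to_sphere(v):
    # Map a feature vector onto the unit hypersphere.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def gegenbauer(n, alpha, x):
    # Ultraspherical polynomial C_n^alpha(x) via the standard
    # three-term recurrence.
    if n == 0:
        return 1.0
    prev, cur = 1.0, 2.0 * alpha * x
    for k in range(2, n + 1):
        prev, cur = cur, (2.0 * x * (k + alpha - 1.0) * cur
                          - (k + 2.0 * alpha - 2.0) * prev) / k
    return cur
```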
Results
Empirical evaluations indicate that SKO significantly accelerates convergence rates and outperforms standard attention baselines in autoregressive language modeling tasks, demonstrating its effectiveness as a predictive mechanism in dynamic environments.
Implications
The development of SKO could lead to more adaptive and efficient AI systems capable of better understanding and interacting with complex environments, potentially enhancing applications in various fields such as robotics, reinforcement learning, and cognitive modeling.
BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning
Computer Vision
Efficient ML
Optimization
- BoSS is the first scalable oracle strategy for batch active learning applicable to large datasets and complex deep neural networks.
- The method combines an ensemble of selection strategies with a performance-based selection approach.
- BoSS significantly outperforms existing oracle strategies under comparable computational constraints.
- Current state-of-the-art active learning strategies do not consistently achieve oracle performance, indicating room for improvement.
Read more
BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning
Summary
The paper introduces BoSS (Best-of-Strategies Selector), a novel oracle strategy designed to enhance deep active learning (AL) by effectively selecting valuable instances for annotation. Traditional AL strategies often lack robustness across various models, datasets, and annotation budgets, leading to suboptimal performance. BoSS addresses these limitations by constructing a pool of candidate batches through an ensemble of selection strategies and selecting the batch that maximizes performance gain. This approach allows BoSS to scale effectively to large datasets and complex deep neural networks. The authors demonstrate that BoSS outperforms existing oracle strategies and highlight the performance gap between state-of-the-art AL strategies and oracle performance, particularly in large-scale datasets with multiple classes. The findings suggest that employing an ensemble-based approach can mitigate the inconsistent performance of AL strategies, paving the way for future developments in this area.
Methodology
BoSS constructs a diverse pool of candidate batches using an ensemble of selection strategies. It then evaluates these batches based on their expected performance improvement, selecting the one that yields the highest gain. To enhance efficiency, BoSS freezes the pretrained backbone of the model and only retrains the final layer during the selection process, allowing it to operate effectively in large-scale deep learning scenarios.
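The batch-construction loop above can be sketched in a few lines; here the `strategies` are hypothetical ranking callables and `estimate_gain` stands in for the frozen-backbone, final-layer retraining step described in the paper (a minimal sketch under those assumptions, not the authors' implementation):

```python
def boss_select(unlabeled, strategies, estimate_gain, batch_size):
    """Best-of-Strategies selection step (toy sketch).

    Each strategy ranks the unlabeled pool and contributes one candidate
    batch; the batch with the highest estimated performance gain wins.
    In the paper, `estimate_gain` corresponds to retraining only the final
    layer on top of a frozen pretrained backbone.
    """
    candidates = []
    for strategy in strategies:
        ranked = strategy(unlabeled)          # strategy-specific ordering
        candidates.append(ranked[:batch_size])
    # Oracle step: keep the candidate batch that maximizes expected gain.
    return max(candidates, key=estimate_gain)
```

With two toy "strategies" (rank by descending vs. ascending score) and the batch sum as a stand-in gain estimate, the selector returns the highest-scoring batch.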
Results
The experiments conducted demonstrate that BoSS outperforms existing oracle strategies and reveals that current state-of-the-art AL strategies fall short of achieving oracle-level performance, especially in complex datasets. The results indicate that no single AL strategy consistently dominates across all cycles, suggesting the effectiveness of an ensemble-based approach.
Implications
The introduction of BoSS has significant implications for the field of active learning, particularly in scenarios where annotation costs are high and model performance is critical. By providing a scalable and effective oracle strategy, BoSS can facilitate more efficient data annotation processes and improve the overall performance of machine learning models in various applications.
From Stochastic Answers to Verifiable Reasoning: Interpretable Decision-Making with LLM-Generated Code
NLP
Large Language Models
Interpretability
- LLMs are reframed as code generators rather than rule evaluators, enhancing interpretability and reducing costs.
- Automated statistical validation methods are introduced to filter low-quality rules without human intervention.
- A cluster-based gap analysis method is developed to identify and refine decision logic for underperforming founder subpopulations.
- The proposed framework achieves competitive performance on VCBench, outperforming existing LLMs while ensuring interpretability.
Read more
From Stochastic Answers to Verifiable Reasoning: Interpretable Decision-Making with LLM-Generated Code
Summary
This paper addresses the challenges of using large language models (LLMs) for high-stakes decision-making, particularly in the context of venture capital founder screening. Traditional LLM approaches often rely on per-sample evaluations, leading to high costs and stochastic outputs. The authors propose a novel framework that treats LLMs as code generators, producing executable decision logic that is deterministic and interpretable. By generating Python predicates based on structured data attributes of founders, the framework eliminates the need for repeated LLM queries, significantly reducing costs and risks of hallucination. The authors also introduce automated statistical validation methods and a cluster-based gap analysis to refine decision logic iteratively. The framework is tested on VCBench, a benchmark dataset for founder success prediction, achieving notable performance metrics while maintaining full interpretability of the decision-making process.
Methodology
The authors prompt an LLM to generate Python predicates over structured founder attributes, validate these rules using precision lift, binomial significance testing, and coverage filtering. They also apply cluster-based gap analysis to iteratively refine the decision logic without requiring human annotation.
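The three validation filters can be sketched as follows; the thresholds (`min_lift`, `alpha`, `min_coverage`) are illustrative assumptions, not values taken from the paper:

```python
from math import comb

def binom_p_value(successes, n, p0):
    """One-sided binomial tail P(X >= successes) under base rate p0."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(successes, n + 1))

def validate_rule(predicate, samples, labels, base_rate,
                  min_lift=1.5, alpha=0.05, min_coverage=0.01):
    """Keep an LLM-generated rule only if it fires often enough (coverage),
    lifts precision over the base rate (precision lift), and the lift is
    statistically significant (binomial test)."""
    fired = [(s, y) for s, y in zip(samples, labels) if predicate(s)]
    n = len(fired)
    if n == 0 or n / len(samples) < min_coverage:
        return False                              # coverage filter
    precision = sum(y for _, y in fired) / n
    if precision < min_lift * base_rate:
        return False                              # precision-lift filter
    return binom_p_value(sum(y for _, y in fired), n, base_rate) < alpha
```

A rule that perfectly isolates positives passes all three filters, while a trivial always-true rule fails the lift check.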
Results
On the VCBench dataset, the proposed framework achieved 37.5% precision with an F0.5 score of 25.0%, surpassing GPT-4o's 30.0% precision while remaining competitive on F0.5 (25.7% for GPT-4o). The results demonstrate that the approach maintains full interpretability while delivering competitive predictive performance.
Implications
This framework has significant implications for decision-making in fields requiring high interpretability, such as venture capital, healthcare, and finance. It enables organizations to leverage LLMs for automated decision-making while ensuring transparency and reproducibility.
Sobolev–Ricci Curvature
Graph Learning
Theory
Efficient ML
- Introduction of Sobolev–Ricci Curvature (SRC) for graph structures.
- Efficient evaluation of SRC using a tree-metric structure.
- SRC recovers Ollivier–Ricci curvature on length-measure trees.
- SRC vanishes in the Dirac limit, aligning with zero-curvature cases.
Read more
Sobolev–Ricci Curvature
Summary
The paper introduces Sobolev–Ricci Curvature (SRC), a novel graph-based Ricci curvature derived from Sobolev transport geometry. SRC is designed to efficiently evaluate curvature on graphs, particularly in large-scale settings, by utilizing a tree-metric structure that simplifies computations compared to traditional methods like Ollivier–Ricci curvature (ORC). The authors establish two key consistency properties: SRC recovers ORC on trees with length measures and vanishes in the Dirac limit, aligning with the flat case of measure-theoretic Ricci curvature. The paper demonstrates SRC's utility through two applications: Sobolev–Ricci Flow (SRF), which updates edge lengths based on SRC, and SRC–MANL, which integrates SRC into a manifold-oriented edge pruning process. These applications highlight SRC's potential as a scalable and geometrically consistent tool for graph transformation and analysis.
Methodology
The authors develop SRC by leveraging Sobolev transport geometry, which allows for closed-form evaluations of curvature on tree structures. This approach circumvents the computational bottleneck associated with solving optimal transport problems for each edge, enabling efficient curvature calculations. The paper also explores the theoretical properties of SRC and its relationship with existing curvature measures.
Results
The paper establishes that SRC effectively captures the geometric properties of graphs while maintaining computational efficiency. The two proposed applications, Sobolev–Ricci Flow and SRC–MANL, demonstrate SRC's practical utility in curvature-driven graph transformations and edge pruning, showcasing its scalability and consistency with classical curvature measures.
Implications
SRC has significant implications for graph learning and analysis, particularly in scenarios requiring scalable methods for curvature evaluation. Its applications in graph reweighting and edge pruning can enhance the performance of various machine learning tasks involving graph-structured data.
Knowledge, Rules and Their Embeddings: Two Paths towards Neuro-Symbolic JEPA
Theory
Interpretability
Multimodal
- Introduction of a bidirectional neuro-symbolic framework (RiJEPA) that merges neural networks with symbolic logic.
- Utilization of Energy-Based Constraints (EBC) to shape the latent space and improve out-of-distribution generalization.
- Development of continuous rule discovery methods that bypass traditional combinatorial search limitations.
- Empirical success in achieving zero-shot logical accuracy in various applications, including clinical settings.
Read more
Knowledge, Rules and Their Embeddings: Two Paths towards Neuro-Symbolic JEPA
Summary
This paper addresses the limitations of modern self-supervised predictive architectures, which excel at capturing complex statistical correlations but struggle with integrating verifiable human logic. The authors propose a bidirectional neuro-symbolic framework called Rule-informed Joint-Embedding Predictive Architectures (RiJEPA) that combines the strengths of neural networks and rule-based systems. The first direction involves injecting structured inductive biases into JEPA training using Energy-Based Constraints (EBC) and a multi-modal dual-encoder architecture, reshaping the representation manifold to replace arbitrary correlations with logical basins. The second direction relaxes rigid symbolic rules into a continuous, differentiable logic, allowing for new paradigms in rule generation through gradient-guided Langevin diffusion. The empirical evaluations demonstrate the framework's effectiveness in both synthetic simulations and a clinical use case, achieving 100% zero-shot logical accuracy and providing transparent geometric justifications for predictions. This work establishes a foundation for robust, generative, and interpretable neuro-symbolic representation learning.
Methodology
The authors employ a multi-modal dual-encoder architecture to integrate structured inductive biases into the training of Joint-Embedding Predictive Architectures (JEPA). They introduce Energy-Based Constraints (EBC) to reshape the representation manifold and utilize gradient-guided Langevin diffusion for continuous rule discovery, allowing for generative inference and rule generation.
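Gradient-guided Langevin sampling of the kind used here for rule discovery can be sketched as unadjusted Langevin dynamics on a scalar energy (a toy one-dimensional version under standard Langevin assumptions, not the paper's implementation):

```python
import math
import random

def langevin_sample(grad_energy, x0, step=0.01, n_steps=5000, seed=0):
    """Unadjusted Langevin dynamics:
        x <- x - step * dE/dx + sqrt(2 * step) * N(0, 1)
    Returns the full trajectory so callers can discard burn-in; samples
    concentrate in low-energy basins of E.
    """
    rng = random.Random(seed)
    x, traj = x0, []
    noise_scale = math.sqrt(2 * step)
    for _ in range(n_steps):
        x = x - step * grad_energy(x) + noise_scale * rng.gauss(0, 1)
        traj.append(x)
    return traj
```

For a quadratic energy E(x) = (x - 2)^2, the chain drifts to and hovers around the minimum at x = 2.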
Results
The proposed RiJEPA framework demonstrated 100% zero-shot logical accuracy in empirical evaluations, effectively clustering its manifold during pretraining. The approach showed significant improvements in handling out-of-distribution data and provided interpretable predictions through geometric justification.
Implications
This framework has the potential to enhance various applications in AI by providing robust, interpretable models that can integrate human-like reasoning into machine learning systems. It could be particularly beneficial in fields requiring high levels of interpretability and reliability, such as healthcare and autonomous systems.
Sampling-guided exploration of active feature selection policies
Reinforcement Learning
Optimization
Efficient ML
- Introduces a heuristic-based strategy for exploring feature combinations in larger datasets.
- Implements a post-fit regularization strategy to reduce decision complexity.
- Demonstrates improved performance over existing methods in accuracy and policy complexity.
- Addresses the limitations of previous reinforcement learning approaches in feature selection.
Read more
Sampling-guided exploration of active feature selection policies
Summary
This paper addresses the challenge of selecting appropriate features for machine learning predictive models, particularly in scenarios where feature acquisition costs are a concern. The authors build upon their previous work that utilized a reinforcement learning approach to recommend which modality to acquire next, optimizing the information-to-cost ratio based on instance-specific information. They reformulate the problem as a Markov Decision Process (MDP) to handle the evolving dimensionality of acquired features without relying on data imputation. The current work expands this framework to accommodate larger datasets by implementing a heuristic-based strategy that prioritizes promising feature combinations. Additionally, a post-fit regularization strategy is introduced to minimize the number of feature combinations, resulting in more compact decision sequences. The proposed method is evaluated on four binary classification datasets, including one with high-dimensional variables, demonstrating superior performance compared to state-of-the-art methods in terms of both accuracy and policy complexity.
Methodology
The authors formulated the feature selection problem as a Markov Decision Process (MDP) to manage the dynamic nature of acquired features. They employed reinforcement learning to derive optimal acquisition policies, focusing on maximizing the information-to-cost ratio. A heuristic-based strategy was developed to efficiently explore feature combinations, and a post-fit regularization technique was introduced to streamline decision-making.
Results
The proposed method outperformed state-of-the-art feature selection techniques across four binary classification datasets, achieving higher accuracy and reduced policy complexity. The largest dataset tested contained 56 features and 4500 samples, showcasing the method's scalability and effectiveness in handling high-dimensional data.
Implications
The findings suggest that the proposed approach can significantly enhance feature selection processes in machine learning, particularly in fields like medical diagnosis where feature acquisition costs are critical. This could lead to more efficient and cost-effective data acquisition strategies, ultimately improving predictive model performance.
Optimize Wider, Not Deeper: Consensus Aggregation for Policy Optimization
Reinforcement Learning
Optimization
Theory
- Introduces a Fisher-geometric decomposition of PPO updates into signal and waste, explaining the optimization-depth dilemma.
- Proposes CAPO, which aggregates multiple PPO replicates to improve policy optimization by focusing on width rather than depth.
- Demonstrates that consensus in the natural parameter space yields better KL-penalized surrogate performance and tighter trust-region compliance than Euclidean averaging.
- Empirical results show CAPO outperforms PPO and deeper baselines by significant margins across continuous control tasks.
Read more
Optimize Wider, Not Deeper: Consensus Aggregation for Policy Optimization
Summary
This paper addresses the limitations of Proximal Policy Optimization (PPO) in reinforcement learning, particularly the issue of path-dependent noise that arises from multiple epochs of clipped stochastic gradient descent (SGD). The authors utilize Fisher information geometry to decompose policy updates into two components: signal, which represents the natural gradient projection, and waste, which is the Fisher-orthogonal residual that does not contribute to first-order surrogate improvement. The study reveals that while the signal saturates quickly, the waste accumulates with additional epochs, leading to diminishing returns in optimization depth. To counteract this, the authors propose a new algorithm called Consensus Aggregation for Policy Optimization (CAPO), which emphasizes optimizing wider rather than deeper. CAPO runs K independent PPO replicates on the same batch, differing only in minibatch shuffling, and aggregates the results into a consensus. The paper explores aggregation in both Euclidean and natural parameter spaces, demonstrating that the consensus in the natural parameter space achieves superior KL-penalized surrogate performance and tighter trust region compliance compared to the mean expert. Empirical results show that CAPO significantly outperforms traditional PPO and deeper baselines across various continuous control tasks, achieving improvements of up to 8.6 times under fixed sample budgets.
Methodology
The authors employ a Fisher information geometric approach to analyze PPO updates, decomposing them into signal and waste components. They introduce the CAPO algorithm, which runs multiple independent PPO optimizations on the same batch and aggregates the results in both Euclidean and natural parameter spaces. The performance of CAPO is validated through experiments on continuous control tasks using the Gymnasium framework.
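The width-over-depth idea can be sketched as follows. This shows the Euclidean consensus (a plain parameter average over K replicates that differ only in their shuffling seed); the paper's preferred variant aggregates in the natural parameter space instead, and all names here are illustrative:

```python
import random

def capo_update(params, optimize_replicate, k=4, seed=0):
    """Consensus Aggregation sketch: run K independent optimizations of the
    same batch, each with its own minibatch-shuffling RNG, then average the
    resulting parameter vectors (Euclidean consensus)."""
    replicas = []
    for i in range(k):
        rng = random.Random(seed + i)      # per-replicate shuffling seed
        replicas.append(optimize_replicate(list(params), rng))
    dim = len(params)
    return [sum(r[j] for r in replicas) / k for j in range(dim)]
```

Averaging over replicates cancels much of the path-dependent noise ("waste") that any single optimization run accumulates.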
Results
CAPO outperforms traditional PPO and deeper baselines by up to 8.6 times across five out of six continuous control tasks under fixed sample budgets. The method also demonstrates a reduction in waste by 2-17% across all tasks, with the natural parameter space aggregation achieving a 46% waste reduction in high-dimensional tasks like Humanoid.
Implications
The findings suggest that optimizing policy updates by focusing on width rather than depth can lead to more efficient reinforcement learning algorithms. This approach could be applied to various domains in reinforcement learning, particularly in continuous control tasks, enhancing the performance of agents without requiring additional environment interactions.
Generalization and Memorization in Rectified Flow
Generative Models
Theory
Efficient ML
- Introduction of three test statistics for Membership Inference Attacks tailored for Rectified Flow models.
- Significant performance improvements in MIA metrics through complexity calibration.
- Discovery of a temporal pattern in memorization dynamics, with peak susceptibility at the midpoint of integration.
- Proposed substitution of uniform sampling with a Symmetric Exponential distribution to reduce memorization risks.
Read more
Generalization and Memorization in Rectified Flow
Summary
This paper investigates the memorization behaviors of Rectified Flow (RF) models, which are prominent in generative modeling for image synthesis. While existing research focuses on generation quality, the authors highlight the need to understand how RF models memorize training data. They introduce three test statistics for Membership Inference Attacks (MIA), culminating in a complexity-calibrated metric (Tmc_cal) that effectively distinguishes between image spatial complexity and genuine memorization signals. This calibration leads to significant improvements in attack performance metrics, including a 15% increase in AUC and a 45% boost in the privacy-critical TPR@1%FPR. The authors also reveal a unique temporal pattern in memorization dynamics, showing that susceptibility to MIA peaks at the midpoint of integration during standard uniform temporal training. To mitigate memorization risks, they propose replacing uniform timestep sampling with a Symmetric Exponential distribution, which effectively reduces exposure to vulnerable intermediate timesteps while maintaining generative fidelity. The findings are validated across three datasets: CIFAR-10, SVHN, and TinyImageNet.
Methodology
The authors developed three test statistics (Tnaive, Tmc, Tmc_cal) to assess the memorization risk of RF models through Membership Inference Attacks. They mathematically justified the peak susceptibility to MIA at the midpoint of integration and proposed a novel sampling strategy to mitigate memorization risks. Experiments were conducted on CIFAR-10, SVHN, and TinyImageNet datasets to validate their findings.
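The proposed replacement for uniform timestep sampling might look like the following; `rate` is an assumed shape parameter, and the exact parameterization of the Symmetric Exponential distribution is a guess for illustration:

```python
import math
import random

def symmetric_exponential_t(rng, rate=4.0):
    """Sample a timestep in [0, 1] whose density concentrates near the
    endpoints and thins out around the vulnerable midpoint t = 0.5."""
    # Inverse-CDF sample of Exp(rate) truncated to [0, 0.5].
    u = rng.random()
    cap = 1.0 - math.exp(-rate * 0.5)
    x = -math.log(1.0 - u * cap) / rate
    # Mirror to either half with equal probability -> symmetric about 0.5.
    return x if rng.random() < 0.5 else 1.0 - x
```

Compared with uniform sampling (which puts 20% of draws in [0.4, 0.6]), this scheme visits the midpoint band far less often while still covering the whole interval.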
Results
The proposed complexity-calibrated metric (Tmc_cal) improved MIA performance metrics significantly, with an increase of up to 15% in AUC and 45% in TPR@1%FPR. The analysis revealed that the model's memorization risk peaks at the midpoint of the integration process, and the new sampling strategy effectively reduced this risk while preserving the quality of generated images.
Implications
The findings have important implications for the design of generative models, particularly in addressing privacy concerns related to data memorization. The proposed methodologies can be applied to enhance the robustness of generative models against membership inference attacks, making them more secure for practical applications.
RESQ: A Unified Framework for REliability- and Security Enhancement of Quantized Deep Neural Networks
Computer Vision
Efficient ML
Robotics
- Introduces a unified framework (RESQ) for enhancing both fault and attack resilience in quantized DNNs.
- Demonstrates significant improvements in resilience metrics without sacrificing accuracy.
- Reveals the asymmetric relationship between fault resilience and adversarial robustness.
- Validates the framework across multiple architectures and datasets, showcasing its general applicability.
Read more
RESQ: A Unified Framework for REliability- and Security Enhancement of Quantized Deep Neural Networks
Summary
This paper introduces RESQ, a unified three-stage framework aimed at enhancing the reliability and security of quantized deep neural networks (DNNs). The framework addresses two critical vulnerabilities: adversarial attacks and hardware-induced faults, which are particularly concerning in safety-critical applications. The first stage of RESQ focuses on improving attack resilience through fine-tuning that reduces sensitivity to input perturbations. The second stage enhances fault resilience by employing fault-aware fine-tuning under simulated bit-flip conditions. Finally, a lightweight post-training adjustment integrates quantization to improve efficiency while further mitigating fault sensitivity without compromising attack resilience. The authors conducted experiments on various architectures, including ResNet18, VGG16, EfficientNet, and Swin-Tiny, across datasets such as CIFAR-10, CIFAR-100, and GTSRB. The results demonstrate significant improvements in both attack resilience (up to 10.35%) and fault resilience (up to 12.47%) while maintaining competitive accuracy in quantized networks. Notably, the study reveals an asymmetric interaction between fault and attack resilience, indicating that while enhancing fault resilience can improve adversarial robustness, the reverse is not necessarily true.
Methodology
The RESQ framework consists of three sequential stages: (1) fine-tuning for attack resilience using Bit Plane Feature Consistency, (2) fault-aware fine-tuning to enhance fault resilience, and (3) a post-training adjustment to integrate quantization while maintaining efficiency and resilience.
Results
The experiments showed consistent improvements in attack resilience by up to 10.35% and fault resilience by up to 12.47% across various DNN architectures and datasets, while preserving competitive accuracy in quantized models.
Implications
The findings suggest that RESQ can be effectively utilized in safety-critical applications such as autonomous vehicles and industrial robotics, where both reliability and security are paramount. The framework's ability to enhance resilience against both adversarial attacks and hardware faults makes it a valuable contribution to the field of dependable AI.
Chunk-Guided Q-Learning
Reinforcement Learning
- CGQ mitigates TD error accumulation by regularizing a single-step critic with a chunk-based critic.
- Theoretical results show that CGQ achieves tighter critic optimality bounds than existing methods.
- Empirical evaluations indicate CGQ outperforms both single-step and action-chunked TD methods on long-horizon tasks.
- CGQ retains fine-grained value propagation while providing stability through chunk-based backups.
Read more
Chunk-Guided Q-Learning
Summary
The paper introduces Chunk-Guided Q-Learning (CGQ), a novel approach in offline reinforcement learning (RL) that addresses the challenges of bootstrapping error accumulation in single-step temporal-difference (TD) learning. Traditional single-step TD methods can suffer from compounding errors, especially in long-horizon tasks with sparse rewards. Action-chunked TD methods mitigate this issue by utilizing multiple-step backups but can lead to suboptimal policies due to their open-loop execution assumption. CGQ combines the benefits of both approaches by guiding a fine-grained single-step critic with a chunk-based critic trained through temporally extended backups. This regularization reduces compounding errors while maintaining fine-grained value propagation. The authors provide theoretical guarantees that CGQ achieves tighter critic optimality bounds compared to single-step or action-chunked TD methods alone. Empirical results demonstrate that CGQ outperforms both traditional single-step and action-chunked methods on long-horizon OG-Bench tasks, showcasing its effectiveness in improving performance in offline RL settings.
Methodology
The methodology involves augmenting standard single-step TD learning with a regularization mechanism that guides the critic towards a chunk-based critic. This is achieved by training the chunk-based critic using temporally extended backups and integrating its insights into the single-step critic's updates. The paper also discusses the theoretical foundations of CGQ, establishing its optimality bounds and providing an intuitive understanding of its advantages over traditional methods.
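A scalar sketch of the regularized objective, assuming a weight `alpha` on the chunk-guidance term (a hypothetical hyperparameter, not taken from the paper):

```python
def cgq_loss(q_sa, q_next, reward, q_chunk_sa, gamma=0.99, alpha=0.5):
    """Chunk-Guided Q-Learning objective (toy scalar version).

    q_sa       : single-step critic's estimate Q(s, a)
    q_next     : bootstrap value for the next state
    q_chunk_sa : chunk-based critic's estimate from temporally extended backups
    """
    td_target = reward + gamma * q_next            # single-step backup
    td_term = (q_sa - td_target) ** 2
    chunk_term = (q_sa - q_chunk_sa) ** 2          # pull toward chunk critic
    return td_term + alpha * chunk_term
```

When the single-step critic already matches both its TD target and the chunk critic, the loss is zero; otherwise the chunk term regularizes the fine-grained critic toward the more stable multi-step estimate.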
Results
CGQ demonstrated superior performance on long-horizon OG-Bench tasks, consistently outperforming both single-step and action-chunked TD methods. The empirical results validate the theoretical claims regarding the tighter optimality bounds and the effectiveness of the proposed regularization approach.
Implications
The implications of this work suggest that CGQ can be a valuable method for improving offline RL performance, particularly in scenarios with long-horizon tasks and sparse rewards. Its ability to balance stability and fine-grained learning could lead to advancements in various applications of reinforcement learning, including robotics and decision-making systems.
ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training
Optimization
Efficient ML
- ZO-SAM integrates zero-order optimization into SAM to improve sparse training efficiency.
- The method reduces computational costs by halving the backpropagation requirements.
- ZO-SAM stabilizes training and accelerates convergence by reducing gradient variance.
- Models trained with ZO-SAM show improved robustness under distribution shifts.
Read more
ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training
Summary
The paper introduces Zero-Order Sharpness-Aware Minimization (ZO-SAM), a novel optimization framework designed to enhance the efficiency of sparse training in deep learning models. Sparse neural networks are advantageous due to their reduced computational costs and memory requirements, making them suitable for resource-constrained environments. However, existing sparse training methods often suffer from chaotic and noisy gradient signals, which hinder convergence and generalization, especially at high sparsity levels. ZO-SAM addresses these challenges by integrating zero-order optimization into the Sharpness-Aware Minimization (SAM) approach. This integration allows ZO-SAM to utilize zero-order gradient estimations during perturbation while maintaining first-order gradients for parameter updates, effectively halving the backpropagation computational cost compared to traditional SAM. The method stabilizes the training process, reduces gradient variance, and accelerates convergence, particularly in sparse training scenarios. Experimental results demonstrate that models trained with ZO-SAM not only achieve improved accuracy but also exhibit enhanced robustness under distribution shifts, making it a practical solution for real-world applications.
Methodology
The authors propose ZO-SAM, which employs zero-order gradient estimations during the perturbation step of SAM while retaining first-order gradients for subsequent updates. This selective integration allows for efficient optimization without the high computational overhead associated with traditional SAM.
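The zero-order perturbation step can be sketched with a random-direction finite-difference estimator; this is a generic two-sided estimator with Gaussian probe directions, assumed here for illustration rather than taken from the paper:

```python
import random

def zo_gradient(loss, params, eps=1e-3, n_dirs=8, seed=0):
    """Zero-order gradient estimate via random-direction central differences.

    Each probe needs only two forward loss evaluations and no backprop,
    which is what lets the SAM perturbation step skip a backward pass.
    """
    rng = random.Random(seed)
    d = len(params)
    grad = [0.0] * d
    for _ in range(n_dirs):
        u = [rng.gauss(0, 1) for _ in range(d)]
        plus = [p + eps * ui for p, ui in zip(params, u)]
        minus = [p - eps * ui for p, ui in zip(params, u)]
        scale = (loss(plus) - loss(minus)) / (2 * eps * n_dirs)
        for j in range(d):
            grad[j] += scale * u[j]
    return grad
```

On a simple quadratic loss the estimate converges to the true gradient as the number of probe directions grows.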
Results
The experiments reveal that ZO-SAM significantly decreases gradient variance compared to traditional SGD, especially at high sparsity levels (90% and 95%). It surpasses existing sparse training techniques in terms of accuracy and achieves performance comparable to SAM while reducing computational demands.
Implications
ZO-SAM has the potential to make sparse neural networks more practical for deployment in resource-constrained environments, such as edge devices and mobile applications. Its ability to stabilize training and improve generalization could lead to broader adoption of sparse models in real-world applications.
3DTCR: A Physics-Based Generative Framework for Vortex-Following 3D Reconstruction to Improve Tropical Cyclone Intensity Forecasting
Generative Models
Time Series
Optimization
- 3DTCR combines physical constraints with generative AI for improved TC intensity forecasting.
- The framework utilizes conditional Flow Matching and two-stage transfer learning for vortex-following reconstruction.
- 3DTCR significantly outperforms existing high-resolution forecasting systems in TC intensity prediction.
- The model reduces RMSE of maximum wind speed by 36.5% compared to its FuXi inputs.
Read more
3DTCR: A Physics-Based Generative Framework for Vortex-Following 3D Reconstruction to Improve Tropical Cyclone Intensity Forecasting
Summary
The paper presents 3DTCR, a novel physics-based generative framework aimed at enhancing tropical cyclone (TC) intensity forecasting through improved three-dimensional (3D) structure reconstruction. Current forecasting methods, including numerical and AI-based models, struggle to accurately represent the complex inner-core dynamics of TCs, particularly under extreme conditions. While high-resolution simulations can capture these features, they are computationally expensive and impractical for operational use. The 3DTCR framework integrates physical constraints with generative AI techniques, specifically using conditional Flow Matching (CFM) optimized via latent domain adaptation and two-stage transfer learning. This approach allows for region-adaptive vortex-following reconstruction, addressing the limitations of low-resolution targets and over-smoothed forecasts. The framework was trained on a six-year dataset with 3-km resolution, demonstrating significant improvements in the representation of TC inner-core structures and intensity. Results indicate that 3DTCR outperforms the ECMWF high-resolution forecasting system in predicting TC intensity across various lead times, achieving a 36.5% reduction in RMSE of maximum wind speed at 10 meters compared to its FuXi inputs. These findings suggest that 3DTCR offers a promising and efficient method for resolving fine-scale TC structures at a lower computational cost, potentially transforming operational TC intensity forecasting.
Methodology
The 3DTCR framework employs conditional Flow Matching (CFM) for vortex-following reconstruction, enhanced through latent domain adaptation and two-stage transfer learning. It is trained on a high-resolution dataset derived from a six-year moving-domain WRF simulation, allowing for adaptive reconstruction of TC structures.
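Conditional Flow Matching training pairs under the common straight-path parameterization can be sketched as follows (the paper's conditioning and domain adaptation are richer than this toy version, which only illustrates the CFM regression target):

```python
import random

def cfm_pair(x0, x1, rng):
    """Build one Conditional Flow Matching training example.

    Draw t ~ U[0, 1), interpolate x_t on the straight path from x0 (noise)
    to x1 (data), and regress the constant velocity x1 - x0 at (x_t, t).
    `rng` is a random.Random instance.
    """
    t = rng.random()
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return t, x_t, target_v
```

A model trained to predict `target_v` from `(x_t, t)` can then be integrated from noise to a sample at inference time.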
Results
3DTCR demonstrates superior performance in TC intensity forecasting, outperforming the ECMWF high-resolution forecasting system at nearly all lead times up to 5 days. It achieves a 36.5% reduction in RMSE of maximum wind speed at 10 meters compared to its FuXi inputs, indicating a significant improvement in forecasting accuracy.
Implications
The 3DTCR framework has the potential to revolutionize operational TC intensity forecasting by providing more accurate representations of TC inner-core structures, which are crucial for disaster mitigation and public safety. Its efficiency may also facilitate broader applications in meteorological modeling and climate research.
IGU-LoRA: Adaptive Rank Allocation via Integrated Gradients and Uncertainty-Aware Scoring
NLP
Large Language Models
Efficient ML
- IGU-LoRA adapts rank allocation based on layer importance, improving upon static rank methods.
- The use of Integrated Gradients allows for more stable and globally informed importance estimates.
- An uncertainty-aware scoring mechanism enhances the robustness of rank selection.
- The method shows superior performance across multiple NLP tasks and architectures.
Read more
IGU-LoRA: Adaptive Rank Allocation via Integrated Gradients and Uncertainty-Aware Scoring
Summary
The paper introduces IGU-LoRA, an innovative approach to parameter-efficient fine-tuning (PEFT) for large language models (LLMs). Traditional methods like Low-Rank Adaptation (LoRA) enforce a uniform rank across layers, which does not account for the varying importance of different layers. IGU-LoRA addresses this by employing Integrated Gradients (IG) to compute within-layer sensitivities, which are then aggregated into layer-level scores for adaptive rank allocation. Additionally, it incorporates an uncertainty-aware mechanism that utilizes exponential moving averages to stabilize updates and improve rank selection. The authors provide theoretical guarantees regarding the error bounds of their method and demonstrate its effectiveness across various tasks and architectures. The results indicate that IGU-LoRA consistently outperforms existing PEFT methods while maintaining comparable memory usage and decoding latency, thereby enhancing downstream accuracy and robustness.
Methodology
IGU-LoRA computes within-layer Integrated Gradients to assess parameter importance, aggregates these into layer-level scores for rank allocation, and applies an uncertainty-aware scheme using exponential moving averages to mitigate noise in updates. Theoretical analysis includes error bounds for the IG estimator and stability guarantees for the scoring mechanism.
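In spirit, the allocation loop can be sketched on a toy quadratic loss. Everything below is an illustrative stand-in, not the paper's implementation: real IGU-LoRA would differentiate the LLM's fine-tuning loss with respect to LoRA parameters, and the aggregation and rank rules here are simplified guesses.

```python
def loss_grad(params, weights):
    # Toy loss: sum_l w_l * sum_i p_i^2, so dL/dp_i = 2 * w_l * p_i.
    return [[2.0 * w * p for p in layer] for layer, w in zip(params, weights)]

def integrated_gradients(params, weights, steps=32):
    """Riemann-sum approximation of IG along the straight path alpha*params."""
    acc = [[0.0] * len(layer) for layer in params]
    for s in range(1, steps + 1):
        alpha = s / steps
        scaled = [[alpha * p for p in layer] for layer in params]
        grads = loss_grad(scaled, weights)
        for l in range(len(params)):
            acc[l] = [a + g for a, g in zip(acc[l], grads[l])]
    # IG_i = p_i * mean_s dL/dp_i(alpha_s * params); magnitude as importance.
    return [[abs(p * a / steps) for p, a in zip(layer, al)]
            for layer, al in zip(params, acc)]

def layer_scores(ig):
    """Aggregate within-layer IG attributions into one score per layer."""
    return [sum(s) / len(s) for s in ig]

def ema_update(prev, new, beta=0.9):
    """Uncertainty-aware smoothing: exponential moving average of scores."""
    return [beta * p + (1 - beta) * n for p, n in zip(prev, new)]

def allocate_ranks(scores, total_rank):
    """Distribute a rank budget proportionally to smoothed layer scores."""
    z = sum(scores)
    return [max(1, round(total_rank * s / z)) for s in scores]
```

On the toy loss the IG attribution of each parameter is exact (it recovers w_l * p_i^2 up to the Riemann-sum error), so layers with larger effective parameters receive larger ranks.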
Results
IGU-LoRA outperformed strong PEFT baselines such as LoRA, AdaLoRA, and DoRA across various datasets (e.g., BoolQ, GSM8K, GLUE) and model architectures (e.g., RoBERTa-large, Llama-2-7B). It achieved improved accuracy while matching the memory footprint and decoding latency of these methods.
Implications
The findings suggest that IGU-LoRA can significantly enhance the efficiency and effectiveness of fine-tuning large language models, making it a valuable tool for adapting LLMs to specific tasks without incurring high computational costs. This could lead to broader applications of LLMs in real-world scenarios where resource constraints are a concern.
Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference
NLP
Large Language Models
Efficient ML
- OATS improves tool selection accuracy without increasing latency or resource costs.
- The method interpolates tool embeddings based on historical success, enhancing performance.
- Two learned extensions were evaluated, with mixed results depending on data density.
- The approach maintains a strict latency budget suitable for high-throughput environments.
Read more
Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference
Summary
The paper introduces Outcome-Aware Tool Selection (OATS), a novel approach for semantic routers in large language model (LLM) inference gateways. Semantic routers are essential for selecting tools in the request path, where latency is critical. OATS enhances tool selection by interpolating tool embeddings towards the centroid of successful queries, which is done offline, ensuring no additional parameters, latency, or GPU costs during serving. The method was evaluated on two datasets: MetaTool and ToolBench, showing significant improvements in NDCG@5 scores. The authors also explored two learned extensions: a small MLP re-ranker and a contrastive adapter, finding that the MLP re-ranker may underperform when outcome data is sparse. The study emphasizes starting with the zero-cost OATS refinement and adding learned components only when data density permits, all while maintaining a single-digit millisecond CPU budget for tool selection.
Methodology
The OATS method involves offline interpolation of tool embeddings based on historical query outcomes, refining embeddings towards successful query centroids. Two extensions were tested: a small MLP re-ranker that uses outcome-derived features and a contrastive adapter that reshapes the embedding space through hard-negative contrastive learning.
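The offline refinement step reduces to a simple interpolation. The sketch below assumes plain cosine-similarity routing and a hypothetical interpolation weight `lam`; the paper's exact update rule and similarity function may differ.

```python
import math

def centroid(vectors):
    d = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(d)]

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def refine_tool_embedding(tool_emb, successful_queries, lam=0.3):
    """OATS-style offline refinement: pull the tool embedding a fraction
    lam toward the centroid of queries this tool handled successfully."""
    c = centroid(successful_queries)
    return [(1 - lam) * t + lam * ci for t, ci in zip(tool_emb, c)]

def route(query_emb, tool_embs):
    """Serving path: plain nearest-embedding lookup, unchanged by OATS,
    so no extra parameters or latency are added at request time."""
    return max(range(len(tool_embs)), key=lambda i: cosine(query_emb, tool_embs[i]))
```

Because `refine_tool_embedding` runs offline, the serving-time router is identical before and after refinement; only the stored embeddings change.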
Results
On the MetaTool dataset, OATS improved NDCG@5 from 0.869 to 0.940, and on ToolBench, it increased from 0.834 to 0.848. The MLP re-ranker matched or degraded baseline performance when outcome data were sparse, while the contrastive adapter provided comparable gains.
Implications
The findings suggest that OATS can significantly enhance tool selection in semantic routers, making it a viable solution for applications requiring low-latency decision-making in LLM inference. The strategic integration of learned components based on data density can optimize performance further.
High-Probability Bounds for SGD under the Polyak-Lojasiewicz Condition with Markovian Noise
Optimization
Theory
- First uniform-in-time high-probability bound for SGD under the PŁ condition with Markovian noise.
- Allows noise magnitude to grow with function value, enabling analysis of practical sampling strategies.
- Establishes decay rates for both high-probability and expected suboptimality gaps.
- Introduces a novel proof technique using the Poisson equation and probabilistic induction.
Read more
High-Probability Bounds for SGD under the Polyak-Lojasiewicz Condition with Markovian Noise
Summary
This paper presents the first uniform-in-time high-probability bounds for Stochastic Gradient Descent (SGD) under the Polyak-Łojasiewicz (PŁ) condition, specifically when the gradient noise includes both Markovian and martingale difference components. This work broadens the applicability of finite-time guarantees in SGD, which is crucial for various machine learning and deep learning models. The authors allow the noise magnitude to increase with the function value, facilitating the analysis of practical sampling strategies. They establish a high-probability guarantee for the suboptimality gap, demonstrating that it decays as O(t_mix · log(k/δ) / k) with probability at least 1 − δ. The authors also derive a matching expected decay rate of O(t_mix / k) for the suboptimality. The proof technique employs the Poisson equation to manage Markovian noise and a probabilistic induction argument to address the absence of almost-sure bounds. The framework is applied to three practical optimization problems: token-based decentralized linear regression, supervised learning with subsampling for privacy amplification, and online system identification.
Methodology
The authors utilize a combination of probabilistic techniques, including the Poisson equation to handle Markovian noise and a probabilistic induction argument to derive uniform-in-time high-probability bounds. They analyze the SGD iterates under the PŁ condition while allowing for a general ABC condition that accommodates noise growth with the loss function.
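A minimal numerical illustration of the setting, assuming a 1-D quadratic objective (which satisfies the PŁ condition with μ = 1) and a two-state sticky Markov chain as the noise source. This is a toy stand-in for the paper's general framework; the chain, step sizes, and constants are invented for illustration.

```python
import random

random.seed(0)

def run_sgd(x0, steps=4000):
    """SGD on f(x) = x^2 / 2 (PL condition with mu = 1) where the stochastic
    gradient is a*x and the multiplier a follows a two-state Markov chain on
    {0.5, 1.5}. Its stationary mean is 1, so the gradient is unbiased only
    on average over the chain, not conditionally at each step."""
    x, a = x0, 0.5
    for k in range(steps):
        if random.random() < 0.1:   # sticky chain: switch states w.p. 0.1
            a = 2.0 - a
        eta = 1.0 / (k + 2)          # diminishing step size
        x -= eta * a * x             # stochastic gradient g_k = a_k * x_k
    return x

def suboptimality(x):
    """f(x) - f(x*) for f(x) = x^2 / 2 with minimizer x* = 0."""
    return 0.5 * x * x
```

Even though no single step uses an unbiased gradient, the iterate still converges, consistent with the O(t_mix / k) expected rate: each multiplicative factor 1 − η_k a_k stays in (0, 1), so the suboptimality gap shrinks at a polynomial rate in k.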
Results
The paper establishes a high-probability bound on the suboptimality gap of SGD iterates, showing a decay rate of O(t_mix · log(k/δ) / k) with probability at least 1 − δ. Additionally, it provides a matching expected decay rate of O(t_mix / k) for all k ≥ 0.
Implications
The findings have significant implications for optimization in machine learning, particularly in scenarios involving decentralized optimization, privacy-preserving learning, and reinforcement learning. The results can enhance the theoretical understanding and practical application of SGD in various complex settings.
Multifidelity Surrogate Modeling of Depressurized Loss of Forced Cooling in High-temperature Gas Reactors
Optimization
Efficient ML
Theory
- Multifidelity surrogate models can effectively reduce computational costs in nuclear reactor transient analysis.
- Models trained on dominant input variables outperform those using the full set of inputs.
- Two-fidelity configurations often yield better performance than three-fidelity setups at similar computational costs.
- Multifidelity Gaussian processes demonstrated the best overall performance among the methods evaluated.
Read more
Multifidelity Surrogate Modeling of Depressurized Loss of Forced Cooling in High-temperature Gas Reactors
Summary
This paper presents a study on the use of multifidelity surrogate models to predict the time to onset of natural circulation (ONC) and the temperature after ONC during depressurized loss of forced cooling (DLOFC) transients in high-temperature gas reactors (HTGRs). High-fidelity computational fluid dynamics (CFD) simulations are typically used for such analyses but are computationally expensive, especially when exploring large parameter spaces. The authors developed a CFD model in Ansys Fluent to generate 1000 simulation samples at varying fidelity levels, systematically coarsening the high-fidelity mesh to create low and medium-fidelity datasets. Several machine learning approaches, including multifidelity Gaussian processes and various neural network architectures, were evaluated. The study found that models trained on dominant inputs identified through sensitivity analysis outperformed those trained on the full input set. Additionally, two-fidelity configurations generally matched or exceeded the performance of three-fidelity counterparts at equivalent computational costs. Among the methods tested, multifidelity Gaussian processes showed the most robust performance, achieving excellent prediction metrics for both ONC timing and temperature, while neural networks provided comparable accuracy with lower training times. The findings indicate that multifidelity surrogate models can significantly reduce the computational burden of reactor transient analysis while maintaining accuracy in predicting critical safety metrics.
Methodology
The authors developed a CFD model to generate simulation samples at different fidelity levels. They evaluated various multifidelity machine learning methods, including Gaussian processes and neural networks, and validated these models on analytical benchmark functions before applying them to the ONC dataset. Sensitivity analysis was conducted to identify dominant input variables that significantly impact ONC timing.
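A two-fidelity surrogate in the Kennedy–O'Hagan linear-correction style can be sketched as follows. This uses the standard Forrester benchmark pair in place of the paper's CFD data, and ordinary least squares in place of a multifidelity Gaussian process, so it illustrates the structure (cheap low-fidelity evaluations everywhere, a handful of expensive high-fidelity samples to calibrate the correction) rather than the authors' models.

```python
import math

def f_hi(x):   # expensive "high-fidelity" truth (Forrester function)
    return (6 * x - 2) ** 2 * math.sin(12 * x - 4)

def f_lo(x):   # cheap, systematically biased low-fidelity approximation
    return 0.5 * f_hi(x) + 10 * (x - 0.5) - 5

def solve3(A, b):
    """Gaussian elimination with partial pivoting for a 3x3 system."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(3):
        piv = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, 3):
            f = M[r][i] / M[i][i]
            M[r] = [mr - f * mi for mr, mi in zip(M[r], M[i])]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (M[i][3] - sum(M[i][j] * x[j] for j in range(i + 1, 3))) / M[i][i]
    return x

def fit_two_fidelity(xs_hi):
    """Fit f_hi(x) ~ rho * f_lo(x) + a + b*x from a handful of high-fidelity
    samples: a linear scale-plus-discrepancy correction of the cheap model."""
    rows = [[f_lo(x), 1.0, x] for x in xs_hi]
    y = [f_hi(x) for x in xs_hi]
    # Normal equations (R^T R) theta = R^T y for the 3-parameter model.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    rho, a, bc = solve3(A, b)
    return lambda x: rho * f_lo(x) + a + bc * x
```

For this benchmark pair the high-fidelity function is exactly a linear correction of the low-fidelity one, so four high-fidelity samples recover it everywhere; real ONC data would leave a residual that a Gaussian process models probabilistically.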
Results
The study revealed that models utilizing dominant inputs consistently outperformed those using all inputs. Two-fidelity models showed comparable or superior performance to three-fidelity models at the same computational cost. The multifidelity Gaussian process approach provided the most robust predictions for both ONC timing and temperature, while neural networks offered similar accuracy with lower training times.
Implications
The findings suggest that multifidelity surrogate modeling can enhance the efficiency of safety analyses in nuclear reactor operations, enabling more extensive sensitivity analyses and design optimizations without the prohibitive computational costs associated with high-fidelity simulations.
Anterior's Approach to Fairness Evaluation of Automated Prior Authorization System
Theory
- Proposes a fairness evaluation framework based on model error rates rather than approval outcomes.
- Utilizes a large dataset of human-reviewed prior authorization cases to assess demographic consistency.
- Demonstrates that model error rates are consistent across most demographics, with some limitations in subgroup analysis.
- Highlights the complexity of prior authorization processes and the need for rigorous fairness assessments.
Read more
Anterior's Approach to Fairness Evaluation of Automated Prior Authorization System
Summary
The paper addresses the challenges of evaluating fairness in automated prior authorization (PA) systems, which are increasingly used in healthcare to determine medical necessity for insurance coverage. Traditional fairness metrics based on approval rates are deemed inappropriate due to legitimate clinical variations across demographic groups. Instead, the authors propose a novel fairness evaluation framework that focuses on model error rates, assessing how consistently the automated system performs across different demographics. The study utilizes a dataset of 7,166 human-reviewed cases across 27 medical necessity guidelines, employing a combination of error-rate comparisons, tolerance-band analysis, statistical power evaluation, and logistic regression. The findings indicate that model error rates are consistent across most demographic groups, with confidence intervals remaining within a predefined tolerance band. However, for race/ethnicity, the results are inconclusive due to limited subgroup sample sizes. This work represents a significant step towards aligning fairness evaluations in healthcare AI with regulatory standards and clinical realities.
Methodology
The authors employed a multi-layered statistical methodology that included error-rate comparisons, tolerance-band analysis with a ±5 percentage-point margin, statistical power evaluation, and logistic regression to assess model performance across demographic groups.
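The tolerance-band check can be sketched as follows, assuming Wald confidence intervals and synthetic counts. The paper's full machinery also includes power evaluation and logistic regression, which are omitted here.

```python
import math

def error_rate_ci(errors, n, z=1.96):
    """Point estimate and Wald 95% CI for a group's model error rate."""
    p = errors / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, p - half, p + half

def within_tolerance_band(groups, overall_rate, band=0.05):
    """Check each group's error-rate CI against a +/-5 percentage-point
    band around the overall error rate; a CI that escapes the band means
    consistency cannot be concluded for that group."""
    verdicts = {}
    for name, (errors, n) in groups.items():
        _, lo, hi = error_rate_ci(errors, n)
        verdicts[name] = (lo >= overall_rate - band) and (hi <= overall_rate + band)
    return verdicts
```

Note how a small subgroup fails the check even when its point estimate is close to the overall rate: the wide interval, not the estimate itself, makes the evidence inconclusive, mirroring the paper's race/ethnicity finding.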
Results
The evaluation revealed consistent model error rates across most demographic groups, with confidence intervals falling within the predefined tolerance band. However, for race/ethnicity, the analysis showed limited sample sizes leading to inconclusive evidence, indicating the need for further investigation in this area.
Implications
This framework can guide the development and evaluation of automated decision-support systems in healthcare, ensuring that they operate fairly across diverse populations. It also sets a precedent for regulatory accountability in AI applications within administrative healthcare processes.
Ultra-Early Prediction of Tipping Points: Integrating Dynamical Measures with Reservoir Computing
Time Series
- Introduces a novel framework (RCDyM) for predicting tipping points in complex dynamical systems.
- Integrates reservoir computing with dynamical measures to analyze time series data without requiring system parameters.
- Demonstrates ultra-early prediction capabilities by extrapolating trends in dynamical measures.
- Validated through rigorous theoretical analysis and extensive numerical evaluations on synthetic and real-world datasets.
Read more
Ultra-Early Prediction of Tipping Points: Integrating Dynamical Measures with Reservoir Computing
Summary
This paper addresses the challenge of predicting tipping points in complex dynamical systems (CDSs), which can experience sudden and irreversible regime changes due to minor environmental shifts. The authors propose a model-free framework called the RC-based dynamical measure (RCDyM) method, which integrates reservoir computing (RC) with dynamical measures to analyze observational time series data. The framework operates in two stages: first, it employs RC to learn local dynamics from segmented observational data; second, it detects early warning signals of tipping points by analyzing the learned dynamics through key dynamical measures such as the dominant eigenvalue of the Jacobian matrix and the maximum Lyapunov exponent. The authors demonstrate that when these measures show trend-like patterns, they can be extrapolated to predict tipping points significantly in advance. The method is rigorously analyzed and validated through extensive numerical evaluations on synthetic systems and real-world datasets, including the Atlantic Meridional Overturning Circulation. Results indicate that the RCDyM method outperforms existing baseline methods in terms of dynamical interpretability, prediction stability, robustness, and ultra-early prediction capability.
Methodology
The RCDyM method consists of two main stages: first, it utilizes reservoir computing to learn local dynamics from observational data segmented into windows. Second, it analyzes the learned dynamics using dynamical measures to detect early warning signals of tipping points. The method does not require prior knowledge of system parameters, making it applicable to a wide range of real-world scenarios.
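The second stage can be illustrated in one dimension by replacing the reservoir with a plain windowed AR(1) fit, whose coefficient plays the role of the dominant Jacobian eigenvalue. All dynamics and constants below are synthetic; the real method learns the local dynamics with reservoir computing and tracks several measures, not just one.

```python
import random

random.seed(1)

def simulate(T=750, a0=0.2, drift=1e-3, noise=0.05):
    """1-D proxy for a system approaching a tipping point:
    x_{t+1} = a(t) * x_t + noise, with a(t) drifting toward the critical
    value 1, at which the fixed point loses stability."""
    x, xs = 0.0, []
    for t in range(T):
        x = (a0 + drift * t) * x + random.gauss(0.0, noise)
        xs.append(x)
    return xs

def local_eigenvalue(xs, start, width):
    """OLS estimate of the local linear map x_{t+1} ~ a * x_t: a 1-D
    stand-in for the dominant eigenvalue of the learned Jacobian."""
    num = sum(xs[t] * xs[t + 1] for t in range(start, start + width - 1))
    den = sum(xs[t] ** 2 for t in range(start, start + width - 1))
    return num / den

def predict_tipping(xs, width=150):
    """Fit a line to the eigenvalue trend over windows and extrapolate
    to the critical value a = 1, well before the transition occurs."""
    pts = [(s + width / 2.0, local_eigenvalue(xs, s, width))
           for s in range(0, len(xs) - width + 1, width)]
    n = len(pts)
    mt = sum(t for t, _ in pts) / n
    ma = sum(a for _, a in pts) / n
    slope = (sum((t - mt) * (a - ma) for t, a in pts)
             / sum((t - mt) ** 2 for t, _ in pts))
    return mt + (1.0 - ma) / slope
```

With the drift above, the true coefficient crosses 1 at t = 800, while only the first 750 observations are used, so the prediction is made ahead of the transition, the "ultra-early" regime the paper targets.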
Results
The experimental results show that the RCDyM method significantly outperforms baseline methods in predicting tipping points across various scenarios, including equilibria, periodic cycles, and chaotic dynamics. It successfully predicts critical transitions and bifurcations in real-world complex dynamical systems, demonstrating its broad applicability and practical relevance.
Implications
The proposed framework has significant implications for fields such as climate science, ecology, and economics, where early prediction of tipping points can inform decision-making and risk management. Its ability to operate without detailed system parameters enhances its applicability to real-world scenarios.
Not All Latent Spaces Are Flat: Hyperbolic Concept Control
Generative Models
Computer Vision
Multimodal
- Introduction of Hyperbolic Control (HyCon) for T2I models to enhance concept manipulation.
- Utilization of hyperbolic geometry to achieve smoother and more predictable semantic transitions.
- Integration with existing generative models via a lightweight adapter without the need for retraining.
- Demonstration of state-of-the-art results across multiple safety benchmarks and T2I backbones.
Read more
Not All Latent Spaces Are Flat: Hyperbolic Concept Control
Summary
This paper addresses the challenge of controlling text-to-image (T2I) models to prevent the generation of unsafe content. Traditional methods utilize Euclidean adjustments to manipulate text embeddings, which can lead to unpredictable outcomes. The authors propose a novel framework called Hyperbolic Control (HyCon), which employs hyperbolic geometry to enable more stable and expressive manipulation of concepts within a semantically aligned hyperbolic representation space. HyCon integrates with existing generative models through a lightweight adapter, allowing for smooth semantic transitions and hierarchical organization of concepts. The framework is evaluated across multiple diffusion model backbones and demonstrates superior performance in safety benchmarks, highlighting its effectiveness in both retrieval and generative tasks. The findings suggest that hyperbolic steering offers a practical solution for enhancing the reliability of T2I generation.
Methodology
The authors developed HyCon, a hyperbolic concept control framework that operates in a hyperbolic embedding space. This framework leverages a state-of-the-art hyperbolic text encoder and employs parallel transport for concept manipulation. A lightweight logarithmic adapter is used to connect hyperbolic embeddings to the conditioning spaces of pretrained diffusion models, allowing for effective control without retraining.
Results
HyCon achieved state-of-the-art results across four safety benchmarks and demonstrated superior performance in both retrieval and generative tasks when compared to traditional Euclidean-based steering methods. The results indicate that hyperbolic steering provides more reliable and interpretable control over the generation process.
Implications
The findings suggest that hyperbolic geometry can significantly improve the safety and reliability of generative models, particularly in applications where content generation needs to be carefully controlled. This approach could be beneficial in various domains, including creative industries, content moderation, and any field requiring nuanced control over generated outputs.
MR-GNF: Multi-Resolution Graph Neural Forecasting on Ellipsoidal Meshes for Efficient Regional Weather Prediction
Graph Learning
Time Series
Efficient ML
- Introduction of MR-GNF, a lightweight and efficient model for regional weather forecasting.
- Utilization of a tri-band ellipsoidal mesh for boundary-free cross-scale coupling.
- Implementation of an axial graph-attention network for implicit 3D coupling.
- Achieves competitive forecasting skill with significantly lower computational costs.
Read more
MR-GNF: Multi-Resolution Graph Neural Forecasting on Ellipsoidal Meshes for Efficient Regional Weather Prediction
Summary
The paper introduces Multi-Resolution Graph Neural Forecasting (MR-GNF), a novel framework designed for efficient short-term regional weather prediction using a multi-scale graph representation of the Earth. Traditional numerical weather prediction (NWP) methods are computationally expensive, particularly for frequent updates in high-resolution regions. MR-GNF addresses this challenge by employing a tri-band ellipsoidal mesh that integrates a 0.25° region of interest with a 0.5° context belt and a 1.0° outer domain, allowing for continuous cross-scale message passing without the need for explicit nested boundaries. The model utilizes an axial graph-attention network that combines vertical self-attention across pressure levels with horizontal graph attention across surface nodes, achieving implicit 3D structure representation with only 1.6 million parameters. Trained on 40 years of ERA5 reanalysis data, MR-GNF demonstrates the ability to produce stable forecasts for near-surface temperature, wind, and precipitation over the UK-Ireland sector, achieving comparable performance to heavier regional AI systems while maintaining physical consistency across scales. The model's training requires less than 80 GPU-hours, showcasing its efficiency and potential for practical applications in early-warning systems and renewable energy forecasting.
Methodology
The MR-GNF framework employs a tri-band ellipsoidal mesh design that allows for cross-scale message passing without explicit nested boundaries. It features an axial graph-attention network that alternates between vertical self-attention across pressure levels and horizontal graph attention across surface nodes, enabling efficient representation of 3D atmospheric dynamics. The model is trained on extensive ERA5 reanalysis data, incorporating various atmospheric variables and static features.
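The alternating axial pattern can be sketched with single-head dot-product attention. No learned projections, toy dimensions, and a hand-written adjacency list; this is a structural illustration only, not the model's architecture.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query."""
    scale = 1.0 / math.sqrt(len(query))
    w = softmax([dot(query, k) * scale for k in keys])
    d = len(values[0])
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(d)]

def axial_attention(x, adj):
    """x[level][node] is a feature vector; adj[node] lists mesh neighbors.
    Pass 1: vertical self-attention across pressure levels, per node.
    Pass 2: horizontal graph attention over mesh neighbors, per level."""
    L, N = len(x), len(x[0])
    v = [[None] * N for _ in range(L)]
    for n in range(N):
        col = [x[l][n] for l in range(L)]   # one vertical column
        for l in range(L):
            v[l][n] = attend(col[l], col, col)
    out = [[None] * N for _ in range(L)]
    for l in range(L):
        for n in range(N):
            nb = [n] + adj[n]               # node attends to itself + neighbors
            ks = [v[l][m] for m in nb]
            out[l][n] = attend(v[l][n], ks, ks)
    return out
```

Splitting attention into a vertical and a horizontal pass keeps the cost linear in levels × nodes while still letting information flow through the full 3D column-plus-mesh structure, which is how the model stays at 1.6 million parameters.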
Results
MR-GNF successfully produces +6 hour one-step predictions and +24 hour autoregressive rollouts, achieving performance metrics comparable to heavier regional AI systems while using only 1.6 million parameters and requiring less than 80 GPU-hours for training.
Implications
The findings suggest that MR-GNF can serve as a practical alternative to traditional NWP methods, offering a scalable and efficient approach for high-resolution weather forecasting. This could significantly enhance the capabilities of early-warning systems and improve renewable energy forecasting, making it a valuable tool for regional stakeholders.
From Garbage to Gold: A Data-Architectural Theory of Predictive Robustness
Theory
- Predictive robustness arises from the synergy between data architecture and model capacity, not just data cleanliness.
- High-dimensional, error-prone predictors can effectively mitigate noise in predictive modeling.
- Informative collinearity enhances model reliability and convergence efficiency.
- Proactive Data-Centric AI strategies can optimize predictor selection for better robustness.
Read more
From Garbage to Gold: A Data-Architectural Theory of Predictive Robustness
Summary
This paper addresses the paradox in tabular machine learning where high-dimensional, collinear, and error-prone data can still yield state-of-the-art model performance, contradicting the 'Garbage In, Garbage Out' principle. The authors propose a data-architectural theory of predictive robustness, emphasizing that robustness is not solely dependent on data cleanliness but rather on the interplay between data architecture and model capacity. They introduce concepts such as 'Predictor Error' and 'Structural Uncertainty' to differentiate types of noise in predictor space. The paper argues that high-dimensional, error-prone predictors can asymptotically mitigate both types of noise, while low-dimensional predictors are limited by structural uncertainty. The authors advocate for 'Informative Collinearity' as beneficial for model reliability and convergence efficiency. They also present 'Proactive Data-Centric AI' as a strategy for efficient predictor selection and discuss the implications of systematic error regimes. The framework suggests a shift from traditional model transfer to methodology transfer, enabling learning from uncurated data streams, thus redefining data quality from item-level perfection to portfolio-level architecture.
Methodology
The authors synthesize principles from Information Theory, Latent Factor Models, and Psychometrics to develop their theoretical framework. They analyze predictor-space noise, differentiate between types of errors, and propose a proactive approach to data-centric AI for feature selection. The paper includes theoretical analysis and simulation studies to validate their claims.
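The core asymptotic claim, that many redundant, individually noisy predictors of a shared latent factor wash out predictor error, can be checked with a small simulation. The data and constants are synthetic; averaging stands in for the latent-factor models the paper invokes.

```python
import random

random.seed(7)

def make_panel(z, d, err=1.0):
    """d redundant, individually error-prone predictors of a latent factor z:
    'informative collinearity' -- all share the signal, errors are independent."""
    return [z + random.gauss(0.0, err) for _ in range(d)]

def mean(xs):
    return sum(xs) / len(xs)

def avg_abs_error(d, trials=300):
    """Mean absolute error of the panel average as an estimate of z = 1.0."""
    return mean([abs(mean(make_panel(1.0, d)) - 1.0) for _ in range(trials)])
```

One "garbage" predictor with unit noise misses the latent factor badly, while the average of 100 equally noisy predictors recovers it closely: error shrinks like 1/sqrt(d), which is the sense in which the high-dimensional, error-prone portfolio beats item-level cleanliness.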
Results
The findings demonstrate that leveraging high-dimensional predictors can overcome both predictor error and structural uncertainty, leading to improved predictive robustness. The paper also establishes the benefits of informative collinearity and outlines the operational boundaries for traditional data-centric AI approaches.
Implications
This research has significant implications for the development of machine learning systems that can effectively utilize uncurated data streams, particularly in enterprise environments. It encourages a shift in focus towards methodology transfer, which could enhance the adaptability and generalizability of machine learning models in dynamic data contexts.
DreamReader: An Interpretability Toolkit for Text-to-Image Models
Generative Models
Interpretability
Multimodal
- DreamReader provides a unified framework for interpretability in T2I diffusion models.
- Introduces novel intervention techniques such as LoReFT and classifier-guided gradient steering.
- Facilitates systematic analysis and intervention across different diffusion architectures.
- Demonstrates effective control over generated images through targeted interventions.
Read more
DreamReader: An Interpretability Toolkit for Text-to-Image Models
Summary
The paper introduces DreamReader, a comprehensive interpretability toolkit designed for text-to-image (T2I) diffusion models. Despite the rapid growth in T2I technologies, existing interpretability methods are often fragmented and limited to specific probing techniques. DreamReader aims to unify these methods by providing a model-agnostic framework that formalizes diffusion interpretability through composable representation operators. This toolkit includes novel intervention primitives such as representation fine-tuning (LoReFT), classifier-guided gradient steering, and component-level cross-model mapping. These innovations allow for lightweight white-box interventions on T2I models, drawing inspiration from interpretability techniques used in large language models (LLMs). The authors demonstrate the effectiveness of DreamReader through controlled experiments, including activation stitching between models and the application of LoReFT to guide activation units, successfully injecting target concepts into generated images. The framework is designed to facilitate reproducible large-scale analysis and is released as an open-source toolkit, promoting further research in T2I interpretability.
Methodology
The authors developed DreamReader as a model-agnostic abstraction layer that consolidates existing interpretability techniques for T2I models. They introduced three new intervention primitives and conducted controlled experiments to validate the framework's effectiveness in manipulating and understanding the internal workings of diffusion models.
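The LoReFT-style intervention used here has a compact closed form, h' = h + Rᵀ(W h + b − R h), which edits a hidden state only inside the low-rank subspace spanned by the rows of R. The sketch below applies it to plain vectors with toy shapes and hypothetical parameter values; in the toolkit it would act on diffusion-model activations.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def loreft_intervene(h, R, W, b):
    """LoReFT-style intervention h' = h + R^T (W h + b - R h).
    R (r x d, orthonormal rows) picks the subspace to edit; W, b define
    the target the projected representation is steered toward."""
    proj = matvec(R, h)                          # coordinates of h in the subspace
    target = [wi + bi for wi, bi in zip(matvec(W, h), b)]
    delta = [t - p for t, p in zip(target, proj)]
    d = len(h)
    # Lift the r-dimensional correction back through R^T.
    return [h[j] + sum(R[i][j] * delta[i] for i in range(len(R)))
            for j in range(d)]
```

Components of h orthogonal to the rows of R pass through untouched, which is what makes the intervention "lightweight": only an r-dimensional slice of the representation is rewritten.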
Results
The experiments showed that DreamReader could successfully perform activation stitching between different models and apply LoReFT to steer activation units, allowing for the reliable injection of target concepts into generated images. The framework's design supports reproducible analysis and demonstrates promising results in controlling T2I outputs.
Implications
DreamReader's toolkit has the potential to enhance the interpretability of T2I models, enabling researchers and practitioners to better understand and control the behavior of these systems. This could lead to improved reliability and reduced biases in generated outputs, making T2I models more trustworthy for various applications.
AI-Driven Predictive Maintenance with Real-Time Contextual Data Fusion for Connected Vehicles: A Multi-Dataset Evaluation
Multimodal
Time Series
Interpretability
- Introduction of a multi-source contextual fusion architecture for predictive maintenance.
- Demonstrated significant improvement in classification accuracy with the inclusion of contextual features.
- Achieved high performance on a real-world predictive maintenance dataset.
- Provided empirical evidence of model robustness against noise.
Read more
AI-Driven Predictive Maintenance with Real-Time Contextual Data Fusion for Connected Vehicles: A Multi-Dataset Evaluation
Summary
This paper presents a novel framework for predictive maintenance in connected vehicles, integrating internal diagnostic signals with external contextual data sourced through Vehicle-to-Everything (V2X) communication. The authors argue that traditional predictive maintenance systems, which rely solely on internal diagnostics, fail to account for external factors that significantly influence vehicle component degradation. The proposed framework processes data at the vehicle edge, allowing for low-latency inference and improved maintenance scheduling. The authors conducted a series of experiments, including a feature group ablation study that demonstrated a 2.6-point F1 improvement when contextual features were included. They benchmarked their classification pipeline on the AI4I 2020 Predictive Maintenance Dataset, achieving an AUC-ROC of 0.973. Additionally, a noise sensitivity analysis showed that the model maintained robust performance under varying levels of noise. The paper emphasizes the need for field validation on instrumented vehicles as the next step for practical deployment.
Methodology
The authors developed a contextual data fusion architecture that integrates vehicle-internal sensor data with external contextual signals from V2X communication. They conducted a feature group ablation study, benchmarked their model on the AI4I 2020 dataset, and performed a noise sensitivity analysis to evaluate model robustness. SHAP analysis was used for interpretability of feature importance.
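A feature-group ablation in miniature, using a nearest-centroid classifier on synthetic data in place of the paper's LightGBM pipeline. Feature names and effect sizes are invented for illustration; the point is only the shape of the experiment, training with and without the contextual group and comparing accuracy.

```python
import random

random.seed(3)

def make_sample(failing):
    """Synthetic record: one internal diagnostic feature plus one external
    contextual feature (e.g. a V2X-sourced environment signal); both are
    mildly informative about impending failure."""
    base = 1.2 if failing else 0.0
    internal = base + random.gauss(0.0, 1.0)
    context = base + random.gauss(0.0, 1.0)
    return [internal, context]

def centroid(rows):
    d = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(d)]

def accuracy(train, labels, test, test_labels, feats):
    """Nearest-centroid classifier restricted to feature subset `feats`:
    a stand-in for the paper's feature-group ablation study."""
    c0 = centroid([[r[j] for j in feats] for r, y in zip(train, labels) if y == 0])
    c1 = centroid([[r[j] for j in feats] for r, y in zip(train, labels) if y == 1])
    hits = 0
    for r, y in zip(test, test_labels):
        v = [r[j] for j in feats]
        d0 = sum((a - b) ** 2 for a, b in zip(v, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(v, c1))
        hits += int((d1 < d0) == (y == 1))
    return hits / len(test)
```

Running the classifier with and without the contextual feature group shows the same qualitative effect the paper reports: adding an independently informative external signal lifts classification performance over internal diagnostics alone.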
Results
The framework showed a 2.6-point F1 improvement when contextual features were included in the predictive maintenance model. The LightGBM model achieved an AUC-ROC of 0.973 on the AI4I 2020 dataset. The noise sensitivity analysis indicated that the model maintained a macro F1 score above 0.88 under clean to moderate noise levels, degrading to 0.74 at higher noise levels.
Implications
The proposed framework has the potential to enhance predictive maintenance strategies for connected vehicles, leading to reduced breakdowns and maintenance costs. The integration of contextual data could improve the accuracy of remaining useful life predictions, ultimately contributing to more reliable vehicle operation and maintenance scheduling.
As Language Models Scale, Low-order Linear Depth Dynamics Emerge
NLP
Large Language Models
Interpretability
- Low-order linear surrogates can accurately capture the depth dynamics of transformer models.
- The fidelity of these surrogates improves with the size of the language model.
- Linear surrogates enable more efficient multi-layer interventions compared to heuristic methods.
- The study reveals a systems-level regularity in the dynamics of scaling language models.
Read more
As Language Models Scale, Low-order Linear Depth Dynamics Emerge
Summary
This paper investigates the dynamics of transformer-based language models, particularly focusing on how their depth dynamics can be approximated by low-order linear surrogates as the models scale. The authors demonstrate that these linear surrogates can effectively reproduce the layerwise sensitivity profiles of the GPT-2-large model across various tasks, including toxicity detection, irony, hate speech, and sentiment analysis. They reveal a scaling principle where the accuracy of the linear surrogate improves with the model size, indicating that larger models allow for better linear approximations of their dynamics. This finding suggests that as language models grow in size, their local depth dynamics become increasingly tractable and predictable. The authors also propose that these linear surrogates can facilitate more efficient multi-layer interventions that require less energy than traditional heuristic methods. Overall, the study provides a systems-theoretic foundation for analyzing and controlling large language models, suggesting a shift in how these models can be studied and manipulated.
Methodology
The authors treat the depth of the transformer as discrete time and model the last-token hidden state as the system's state. They identify a low-dimensional linear surrogate that approximates the state propagation through transformer blocks, allowing for the prediction of how interventions at specific layers influence the final output. They validate this approach across various tasks and model sizes, comparing the performance of the linear surrogate against the full model.
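A drastically simplified version of the idea can be sketched by fitting a diagonal linear surrogate to synthetic "depth dynamics" and rolling it forward. The dynamics, dimensions, and constants below are invented; the paper fits a 32-dimensional surrogate to real transformer hidden states.

```python
import random

random.seed(5)

DEPTH, DIM = 12, 8

# Fixed per-coordinate coefficients playing the role of transformer blocks
# acting on the last-token hidden state (hypothetical values).
A_TRUE = [0.9 + 0.015 * j for j in range(DIM)]

def rollout(h0):
    """Propagate a hidden state through all 'layers', with a small cubic
    nonlinearity so the true dynamics are not exactly linear."""
    hs = [list(h0)]
    for _ in range(DEPTH):
        h = [a * x + 0.01 * x ** 3 for a, x in zip(A_TRUE, hs[-1])]
        hs.append(h)
    return hs

def fit_diagonal_surrogate(trajectories):
    """Least-squares fit of a diagonal linear map h_{l+1} ~ a_j * h_l:
    a deliberately low-order surrogate of the depth dynamics."""
    a = []
    for j in range(DIM):
        num = den = 0.0
        for hs in trajectories:
            for l in range(DEPTH):
                num += hs[l][j] * hs[l + 1][j]
                den += hs[l][j] ** 2
        a.append(num / den)
    return a

def surrogate_rollout(h0, a):
    """Predict the final layer by iterating the linear surrogate alone."""
    h = list(h0)
    for _ in range(DEPTH):
        h = [ai * x for ai, x in zip(a, h)]
    return h
```

Because the true blocks are only mildly nonlinear, the fitted diagonal map predicts the final hidden state from the initial one with small error, the toy analogue of the paper's finding that layerwise propagation becomes linearly tractable.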
Results
The study finds that a 32-dimensional linear surrogate can reproduce the layerwise sensitivity profiles of GPT-2-large with near-perfect accuracy. Additionally, the agreement between the surrogate and the full model improves monotonically with the size of the GPT-2 family. The linear surrogate also facilitates more effective intervention strategies that consume less energy than conventional methods.
Implications
The findings suggest that as language models scale, they not only become more capable but also more amenable to analysis through simpler, low-dimensional models. This could lead to more efficient methods for probing and intervening in large language models, enhancing their controllability and interpretability.
How Log-Barrier Helps Exploration in Policy Optimization
Reinforcement Learning
Optimization
Theory
- Introduction of LB-SGB, which ensures a minimum level of exploration in policy optimization.
- Theoretical guarantees for LB-SGB include O(ϵ⁻¹) sample complexity and convergence without unrealistic assumptions.
- Connection established between log-barrier regularization and Natural Policy Gradient, emphasizing the importance of Fisher information.
- Empirical results show LB-SGB's superior performance in convergence compared to SGB and NPG.
Read more
How Log-Barrier Helps Exploration in Policy Optimization
Summary
This paper addresses the limitations of the Stochastic Gradient Bandit (SGB) algorithm in policy optimization, particularly its lack of an explicit exploration mechanism. The authors propose a new algorithm, Log-Barrier Stochastic Gradient Bandit (LB-SGB), which incorporates a log-barrier regularization to enforce a minimum level of exploration. Theoretical analysis shows that LB-SGB matches the sample complexity of SGB while eliminating the need for unrealistic assumptions about the learning process. The paper also establishes a connection between log-barrier regularization and Natural Policy Gradient (NPG), highlighting the role of Fisher information in exploration. Numerical simulations validate the theoretical findings, demonstrating that LB-SGB outperforms both SGB and NPG in terms of convergence to the optimal policy, especially as the number of arms increases.
Methodology
The authors analyze the dynamics of policy optimization using the MAB framework and introduce LB-SGB, which incorporates log-barrier regularization. They provide theoretical proofs for convergence guarantees and sample complexity, and validate their findings through numerical simulations.
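A minimal sketch of the idea on a multi-armed bandit follows. The exact LB-SGB update is not given in this summary, so the form below is an assumption: a softmax-policy REINFORCE update augmented with the gradient of a log-barrier term λ·Σ_a log π(a), whose effect is to keep every arm's probability bounded away from zero. The step size, barrier strength, and reward noise are all illustrative choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def lb_sgb(means, steps=20000, eta=0.05, lam=0.01, seed=0):
    """Log-barrier stochastic gradient bandit (illustrative form).

    Ascends E[reward] + lam * sum_a log pi(a) over softmax logits theta;
    the barrier enforces a minimum level of exploration for every arm.
    """
    rng = np.random.default_rng(seed)
    K = len(means)
    theta = np.zeros(K)
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(K, p=pi)
        r = means[a] + 0.1 * rng.standard_normal()   # noisy reward
        onehot = np.eye(K)[a]
        grad_reward = r * (onehot - pi)              # REINFORCE estimator
        grad_barrier = lam * (np.ones(K) - K * pi)   # grad of sum_a log pi
        theta += eta * (grad_reward + grad_barrier)
    return softmax(theta)

pi = lb_sgb(np.array([0.2, 0.5, 0.9]))
print(pi)  # most mass on the best arm; no arm's probability collapses to zero
```

The barrier gradient follows from log π(a) = θ_a − logsumexp(θ), so ∂/∂θ_j Σ_a log π(a) = 1 − K·π_j, which pushes the policy toward uniform exactly when some arm's probability gets small.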
Results
LB-SGB achieves O(ϵ⁻¹) sample complexity, matching state-of-the-art algorithms, and still converges, at a slower O(ϵ⁻⁷) rate, even without assumptions on the sampling probability of the optimal arm. Empirical results indicate that LB-SGB converges to the optimal policy faster than SGB and NPG, with the gap widening as the number of arms increases.
Implications
The findings suggest that incorporating log-barrier regularization can significantly enhance exploration in policy optimization, potentially leading to more robust and efficient reinforcement learning algorithms. This could have applications in various domains where exploration-exploitation trade-offs are critical.
Directional Routing in Transformers
NLP
Large Language Models
Interpretability
- Directional routing enhances transformer efficiency with minimal parameter overhead.
- Routing is the dominant computational mechanism, crucial for factual recall and induction tasks.
- Disabling routing leads to a significant drop in model performance, while individual attention heads show redundancy.
- The model organizes into adaptive and fixed routing regimes, optimizing performance across layers.
Read more
Directional Routing in Transformers
Summary
This paper introduces a novel mechanism called directional routing for transformers, which allows each attention head to learn suppression directions controlled by a shared router. This mechanism incurs a minimal parameter cost of 3.9% and is integrated into a 433M-parameter transformer model. The study reveals that routing becomes the primary computational pathway of the model, significantly impacting factual recall and induction tasks. When routing is disabled, the model's performance collapses, indicating its critical role. The model self-organizes into two regimes: domain-adaptive routing in early layers and fixed syntactic pruning in later layers. The findings suggest that while the coordination mechanism of routing is essential, the individual components it manages are interchangeable. Overall, routing leads to a reduction in perplexity by 31-56% compared to the baseline model, although these improvements are not yet reflected in downstream benchmarks.
Methodology
The authors augment a standard transformer architecture by adding directional routing, which involves each attention head learning four unit-norm direction vectors and utilizing a shared MLP router to control suppression patterns in the output. The model is trained using a language modeling objective without auxiliary losses, and the impact of routing is analyzed through circuit analysis on factual recall and induction tasks.
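The routing mechanism can be sketched as follows. The summary does not fully specify the architecture, so the shapes and the way gates are applied are assumptions: each head holds four unit-norm direction vectors, a shared two-layer MLP router maps the token's residual state to per-head, per-direction gates in (0, 1), and each gate removes a fraction of the head output's component along its direction.

```python
import numpy as np

def l2_normalize(v, axis=-1):
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, n_heads, n_dirs, d_router = 64, 4, 4, 32

# Each head learns n_dirs unit-norm suppression directions (assumed shape).
directions = l2_normalize(rng.standard_normal((n_heads, n_dirs, d_model)))

# Shared two-layer MLP router: residual state -> per-head/direction gates.
W1 = rng.standard_normal((d_model, d_router)) / np.sqrt(d_model)
W2 = rng.standard_normal((d_router, n_heads * n_dirs)) / np.sqrt(d_router)

def route(head_outputs, h):
    """Suppress each head's output along router-gated directions.

    head_outputs: (n_heads, d_model) per-head attention outputs for a token
    h: (d_model,) the token's residual-stream state fed to the router
    """
    gates = 1 / (1 + np.exp(-(np.tanh(h @ W1) @ W2)))  # sigmoid, in (0, 1)
    gates = gates.reshape(n_heads, n_dirs)
    out = head_outputs.copy()
    for i in range(n_heads):
        for j in range(n_dirs):
            u = directions[i, j]
            # Remove a gated fraction of the component along direction u.
            out[i] -= gates[i, j] * (out[i] @ u) * u
    return out

heads = rng.standard_normal((n_heads, d_model))
routed = route(heads, rng.standard_normal(d_model))
print(routed.shape)  # (4, 64)
```

Because every gate lies in (0, 1), each suppression step can only shrink the head output's component along its direction, which matches the paper's framing of routing as learned, input-dependent pruning rather than additive steering.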
Results
The study finds that routing is essential for maintaining factual recall and induction accuracy, with disabling it collapsing performance to near-zero. The model exhibits a 31-56% reduction in perplexity compared to the baseline, and individual attention heads are shown to be largely interchangeable, with the routing mechanism being the critical component.
Implications
The findings suggest that directional routing could lead to more efficient transformer architectures with enhanced interpretability and performance. This mechanism may be applicable in various NLP tasks, improving model robustness and understanding of attention mechanisms.