AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 59
Update frequency: 8h
Days of history: 7
OPRD: On-Policy Representation Distillation
NLP
Large Language Models
Efficient ML
- OPRD shifts the focus of on-policy distillation from output probabilities to hidden-state representations.
- The method eliminates sampling variance in gradient estimation, leading to more stable training.
- OPRD exposes rich structural information from the teacher's intermediate layers, enhancing supervision.
- Empirical results show OPRD outperforms traditional OPD methods on competitive benchmarks.
Summary
The paper introduces On-Policy Representation Distillation (OPRD), a novel method for distilling knowledge from large language models (LLMs) that shifts the focus from output-space distillation to hidden-state representation alignment. Traditional on-policy distillation (OPD) methods primarily supervise student models by matching next-token probabilities, which leads to two significant limitations: high sampling variance in gradient estimation and loss of rich structural information from the teacher's intermediate hidden states. OPRD addresses these issues by aligning the student's intermediate representations with the teacher's across selected layers during on-policy rollouts, thus providing dense, deterministic supervision. The authors demonstrate that OPRD eliminates the sampling variance associated with OPD and exposes structural information that is otherwise discarded. Empirical results show that OPRD significantly improves performance on three competitive mathematics benchmarks, achieving faster training times and reduced memory usage compared to existing OPD methods. This work opens a new avenue for representation-level distillation in the on-policy regime, enhancing the efficiency and effectiveness of LLM training.
Methodology
OPRD aligns the student's intermediate hidden representations with those of the teacher across selected transformer layers and response positions using a normalized mean-squared error objective during on-policy rollouts. This method allows for deterministic supervision without the noise associated with sampling-based approaches.
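The objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how the error is normalized, so this version assumes each hidden vector is L2-normalized before the squared error is taken, and it assumes student and teacher share the same hidden width. The names `l2_normalize` and `normalized_mse` and the dictionary layout are hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-8):
    """Scale a hidden vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def normalized_mse(student_hidden, teacher_hidden, layers):
    """Mean squared error between normalized hidden states.

    student_hidden / teacher_hidden: dict mapping a layer index to a list of
    per-position hidden vectors (assumed layout, same width for both models).
    layers: the selected layer indices to supervise.
    """
    total, count = 0.0, 0
    for layer in layers:
        for s_vec, t_vec in zip(student_hidden[layer], teacher_hidden[layer]):
            s_hat = l2_normalize(s_vec)
            t_hat = l2_normalize(t_vec)
            total += sum((a - b) ** 2 for a, b in zip(s_hat, t_hat)) / len(s_hat)
            count += 1
    return total / count


# Toy hidden states for one selected layer and two response positions.
teacher = {4: [[1.0, 0.0], [0.0, 2.0]]}
student_scaled = {4: [[3.0, 0.0], [0.0, 6.0]]}      # same directions, larger norm
student_orthogonal = {4: [[0.0, 1.0], [1.0, 0.0]]}  # orthogonal directions

loss_scaled = normalized_mse(student_scaled, teacher, layers=[4])
loss_orthogonal = normalized_mse(student_orthogonal, teacher, layers=[4])
```

Because the loss is computed directly from hidden states of the sampled rollout, no token sampling enters the loss itself, which is the sense in which the summary calls the supervision deterministic.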
Results
OPRD closed the student-teacher performance gap on three mathematics benchmarks (AIME 2024, AIME 2025, AIMO), achieving a 2.7-point accuracy gain, 1.44× faster training, and up to 54% less transient memory usage compared to top-k OPD methods.
Implications
The introduction of OPRD could lead to more efficient training of large language models, particularly in applications requiring high accuracy and low resource consumption. It also opens new research directions in representation-level distillation techniques.
Generative Criticality in Large Language Model Temperature Scaling
NLP
Large Language Models
Theory
- Introduction of a statistical-field framework for analyzing LLM outputs.
- Identification of critical behavior in LLM text generation driven by temperature scaling.
- Validation of findings through intrinsic dimension estimation using the TwoNN method.
- Observations of susceptibility peaks and order parameter changes near a critical temperature.
Summary
This paper introduces a statistical-field framework to analyze text generated by large language models (LLMs), treating token embeddings as continuous spin variables on a one-dimensional chain. The authors define a susceptibility based on the connected two-point correlator and an order parameter derived from the ensemble-averaged embedding field. By varying the softmax temperature T, they observe a sharp peak in susceptibility near a critical temperature Tc, alongside a rapid change in the order parameter and a collapse onto a single semantic direction below Tc. The intrinsic dimension, estimated using the TwoNN method, corroborates these findings, revealing a minimum near Tc. The results are consistent across various model scales (Qwen3: 0.6B to 32B) and prompt categories, suggesting a critical behavior reminiscent of continuous phase transitions. However, the non-equilibrium nature of autoregressive generation indicates the need for further exploration. The proposed framework offers quantitative tools for examining the collective statistical structure of LLM outputs and hints at connections between decoding strategies and critical phenomena.
Methodology
The authors construct a statistical-mechanics framework for LLM-generated text, defining susceptibility and an order parameter. They analyze the behavior of these quantities as a function of the softmax temperature T, applying tools from statistical field theory and intrinsic dimension estimation to probe the critical structure of LLM outputs.
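The intrinsic-dimension step can be illustrated with a small sketch of the TwoNN idea: for each point, take the ratio of the distances to its second and first nearest neighbours, then fit a Pareto shape parameter to those ratios. This sketch uses the maximum-likelihood form of that fit, which is an assumption; the paper may use a different fitting procedure, and the function name `two_nn_dimension` is hypothetical. The brute-force neighbour search is only suitable for small samples.

```python
import math
import random


def two_nn_dimension(points):
    """Estimate intrinsic dimension from second/first nearest-neighbour ratios."""
    log_ratios = []
    for i, p in enumerate(points):
        # Brute-force distances to every other point (O(N^2) overall).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates, where the ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    # Maximum-likelihood estimate of the Pareto shape parameter.
    return len(log_ratios) / sum(log_ratios)


# Points on a straight line embedded in 3D should have dimension close to 1.
rng = random.Random(0)
line_points = [(t, 2.0 * t, -t) for t in (rng.random() for _ in range(400))]
line_dimension = two_nn_dimension(line_points)

# Points filling a flat square embedded in 3D should have dimension close to 2.
plane_points = [(rng.random(), rng.random(), 0.0) for _ in range(400)]
plane_dimension = two_nn_dimension(plane_points)
```

In the setting the summary describes, the points would be embeddings of generated text at a given temperature, and the estimate would be tracked as T varies.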
Results
The study finds a pronounced peak in susceptibility near a critical temperature Tc, with power-law scaling observed. The order parameter exhibits a rapid change around Tc, and the intrinsic dimension shows non-monotonic features, indicating critical behavior. These results are consistent across various model scales and prompt categories.
Implications
The findings suggest that understanding the critical behavior of LLM outputs can enhance the design of decoding strategies and improve the interpretability of generated text. The framework may also provide insights into the emergent properties of language models and their underlying statistical structures.
A Sliced-Wasserstein Framework on Correlation Matrices for EEG Decoding
Time Series
Theory
Efficient ML
- Introduction of a Sliced-Wasserstein framework for EEG decoding using correlation matrices.
- Development of Pullback Euclidean Metric Sliced Wasserstein (PEMSW) discrepancies.
- Demonstration of improved generalization in EEG decoding under distribution shifts.
- Low training overhead and no additional inference cost associated with the proposed method.
Summary
This paper presents a novel framework for EEG decoding using Sliced-Wasserstein (SW) discrepancies on correlation matrices. EEG signals, while providing high temporal resolution for brain activity monitoring, are often noisy and exhibit significant variability across subjects, leading to challenges in generalization. Traditional EEG decoding methods typically rely on covariance matrices, which can be sensitive to scaling. The authors propose using full-rank correlation matrices as a scale-invariant alternative, leveraging their ability to emphasize inter-channel dependencies. The framework introduces Pullback Euclidean Metric Sliced Wasserstein (PEMSW) discrepancies, which are applicable to manifolds endowed with Pullback Euclidean Metrics. Two specific instances of Correlation Sliced-Wasserstein (CorSW) discrepancies are developed based on the Off-Log Metric (OLM) and Log-Scaled Metric (LSM) for full-rank correlation matrices. The proposed method is integrated into a domain generalization framework for EEG decoding, demonstrating improved performance across three EEG datasets, particularly under distribution shifts, with minimal training overhead and no additional inference costs. The findings suggest that the CorSW discrepancies can effectively enhance the robustness of EEG decoding methods, making them more applicable in real-world scenarios.
Methodology
The authors propose a general framework for Sliced Wasserstein discrepancies on manifolds with Pullback Euclidean Metrics. They specifically develop two Correlation Sliced-Wasserstein discrepancies based on the Off-Log Metric and Log-Scaled Metric for full-rank correlation matrices. The methodology involves leveraging the geometric properties of these metrics to enhance the robustness of EEG decoding through domain generalization techniques.
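A rough sketch of the sliced-Wasserstein part follows. It assumes each correlation matrix has already been mapped to a Euclidean feature vector by the metric's pullback map (that map is not implemented here), so the discrepancy reduces to an ordinary Monte Carlo sliced 2-Wasserstein distance between two equal-size sets of vectors. The function name, the number of projections, and the equal-size restriction are illustrative choices, not details from the paper.

```python
import math
import random


def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Monte Carlo sliced 2-Wasserstein distance between two equal-size samples.

    xs, ys: lists of feature vectors, assumed to be pullback-map images of
    correlation matrices (the map itself is outside this sketch).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random unit direction by normalizing a Gaussian vector.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # Project both samples to 1D; sorting gives the optimal 1D coupling.
        px = sorted(sum(d * v for d, v in zip(direction, x)) for x in xs)
        py = sorted(sum(d * v for d, v in zip(direction, y)) for y in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return math.sqrt(total / n_projections)


# In one dimension, a pure shift by 1 gives a distance of 1.
shift_distance = sliced_wasserstein([[0.0], [1.0], [2.0]], [[1.0], [2.0], [3.0]])

# Two toy "domains" of 2D features.
domain_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
domain_b = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0]]
domain_gap = sliced_wasserstein(domain_a, domain_b)
```

Since a term like this would only be evaluated during training, it is consistent with the summary's statement that inference cost is unchanged.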
Results
Experiments conducted on three EEG datasets showed that the proposed CorSW discrepancies significantly improved generalization performance under distribution shifts, indicating that the method is effective in addressing the challenges posed by noise and variability in EEG signals. The approach also maintained low training overhead and did not incur additional costs during inference.
Implications
The findings suggest that the proposed Sliced-Wasserstein framework can enhance the robustness and reliability of EEG decoding methods, which could lead to improved applications in clinical settings, such as seizure detection and brain-computer interfaces. The approach may also inspire further research into the use of correlation matrices in other domains where scale-invariance is critical.
Maximising the Set-Piece Return: Optimising Football Corner Tactics with Graph Reinforcement Learning
Reinforcement Learning
Graph Learning
Optimization
- Introduces a reinforcement learning framework for optimizing football corner kick tactics.
- Formulates corner kick optimization as a Markov Decision Process (MDP) to facilitate reward-driven exploration.
- Utilizes Graph Neural Networks to capture the spatial dynamics of player positions.
- Demonstrates superior performance compared to traditional optimization methods on real-world data.
Summary
This paper presents a novel approach to optimizing football corner kick tactics using Graph Reinforcement Learning (Graph RL). Unlike traditional methods that focus on historical data and imitation of past actions, this work aims to discover new, generalizable player configurations and strategies. The authors formulate the corner kick optimization problem as a Markov Decision Process (MDP), where a central policy adjusts attacking player positions and velocities to maximize the Expected First Contact Shot (xFCS) probability. The methodology integrates Graph Neural Network (GNN) embeddings with deep reinforcement learning techniques, specifically Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO). The approach is evaluated on over 3,000 Premier League corner scenarios, demonstrating significant improvements over baseline optimization techniques such as Random Search and Simulated Annealing. The results indicate that the proposed method not only enhances tactical variations but also shifts the focus from historical imitation to reward-driven tactical discovery, providing actionable insights for coaches.
Methodology
The authors developed a Graph Reinforcement Learning method that integrates GNN embeddings with deep reinforcement learning algorithms (SAC and PPO) to learn a general policy for adjusting attacking player positions during corner kicks. The optimization problem is framed as an MDP, allowing the agent to explore various configurations to maximize the xFCS probability.
Results
The proposed method significantly outperformed baseline optimization techniques, achieving higher expected tactical yields on a dataset of over 3,000 Premier League corners. The approach maintained competitive performance even when traditional methods were given disproportionately larger search budgets.
Implications
This research has the potential to transform set-piece analysis in football by enabling coaches to discover novel tactical strategies that could enhance team performance. The framework could also be adapted for other sports or tactical scenarios where structured decision-making is critical.
DiffSlack: Learning under Nonlinear Inequality Constraints via Learnable Slack Variables
Robotics
Optimization
Theory
- DiffSlack reformulates nonlinear inequalities as equalities using learnable slack variables, enhancing constraint satisfaction in neural networks.
- The framework incorporates a differentiable projection layer that allows for end-to-end training while ensuring feasibility.
- A two-stage curriculum learning approach stabilizes training and improves performance on complex tasks.
- DiffSlack achieves superior results in vehicle path planning, outperforming traditional and learning-based methods.
Summary
The paper presents DiffSlack, a novel framework designed to address the challenges of enforcing nonlinear inequality constraints in neural networks, particularly in scenarios with multiple coupled constraints. Traditional methods often struggle with computational efficiency and structural limitations. DiffSlack introduces a differentiable projection layer that reformulates inequalities into equalities using learnable slack variables, which are integrated into the network's output. This approach allows for a data-driven initialization for the damped Gauss-Newton projection, ensuring that the raw predictions are mapped onto a feasible manifold while maintaining end-to-end differentiability. The authors implement a two-stage curriculum learning strategy to stabilize training, first using soft constraint losses and then activating the hard projection layer. The framework is evaluated in the context of vehicle path planning, where it successfully navigates 200 nonlinear inequality constraints related to safety and operational limits. The results indicate that DiffSlack outperforms existing methods in terms of planning success rates and geometric constraint satisfaction, demonstrating its practical applicability in engineering tasks.
Methodology
DiffSlack employs a differentiable projection layer that reformulates nonlinear inequality constraints into equalities with learnable slack variables. This allows the network to predict slack variables alongside task outputs, facilitating a data-driven initialization for the projection process. A two-stage curriculum learning strategy is utilized to stabilize training, first focusing on soft constraints and then activating hard projections for improved constraint satisfaction.
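The slack reformulation can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the paper's layer: it uses a single hypothetical constraint g(y) = y0^2 + y1^2 - 1 <= 0 (stay inside the unit disc), rewrites it as the equality g(y) + s^2 = 0 with a squared slack s, and applies damped Gauss-Newton steps to the stacked variable (y, s). The squared-slack form, the damping value, and the function names are assumptions; differentiation through the projection is not shown.

```python
def g(y):
    """Toy inequality constraint g(y) <= 0: the point lies in the unit disc."""
    return y[0] ** 2 + y[1] ** 2 - 1.0


def project(y, s, steps=50, damping=1e-3):
    """Damped Gauss-Newton iterations on h(y, s) = g(y) + s^2 = 0.

    y: raw prediction (2D point); s: predicted slack value.
    Returns the projected point and the final slack.
    """
    y = list(y)
    for _ in range(steps):
        h = g(y) + s * s
        # Jacobian of h with respect to (y0, y1, s).
        jac = [2.0 * y[0], 2.0 * y[1], 2.0 * s]
        # Minimum-norm step; with one constraint, J J^T is a scalar.
        scale = h / (sum(j * j for j in jac) + damping)
        y[0] -= scale * jac[0]
        y[1] -= scale * jac[1]
        s -= scale * jac[2]
    return y, s


# An infeasible raw prediction is pulled back onto the feasible set.
y_proj, s_proj = project([2.0, 0.0], s=0.5)

# A feasible prediction with a consistent slack is left essentially unchanged.
y_same, s_same = project([0.3, 0.4], s=0.75 ** 0.5)
```

The second call shows why a well-predicted slack can serve as a good initialization: when the network's slack already satisfies the equality, the projection has nothing to correct.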
Results
In experiments focused on vehicle path planning, DiffSlack demonstrated a higher planning success rate and better geometric constraint satisfaction compared to existing learning-based baselines. The framework was tested against 200 nonlinear inequality constraints, showing significant improvements both in simulation environments such as CARLA and in real-world vehicle tracking applications.
Implications
DiffSlack offers a scalable and efficient method for embedding hard inequality constraints into neural networks, making it suitable for safety-critical applications in robotics and autonomous systems. Its ability to ensure feasibility while maintaining performance could lead to advancements in various engineering fields where constraint satisfaction is crucial.
DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum
Optimization
Theory
Efficient ML
- DP-MacAdam combines adaptive clipping and adaptive momentum under differential privacy.
- The algorithm uses the same empirical statistics for both clipping and momentum, enhancing training efficiency.
- A novel bias correction factor is introduced for unbiased gradient variance estimation.
- Empirical results show improved performance over existing DP optimizers without manual tuning.
Summary
The paper introduces DP-MacAdam, a novel algorithm that integrates adaptive clipping and adaptive momentum within the framework of differentially private stochastic gradient descent (DP-SGD). Traditional DP-SGD relies on a fixed gradient clipping threshold, which can hinder performance due to its sensitivity to the choice of threshold. Existing adaptive clipping methods, such as AdaClip, improve gradient clipping by dynamically adjusting based on empirical statistics but do not utilize these statistics for momentum updates. Conversely, DP-Adam leverages momentum but does not incorporate adaptive clipping. DP-MacAdam addresses these limitations by using the same empirical mean and variance estimates for both adaptive clipping and momentum updates, thus enhancing the model's training efficiency without incurring additional privacy costs. The authors provide a bias correction factor for estimating gradient variance, ensuring unbiasedness despite the presence of differential privacy noise. Empirical evaluations demonstrate that DP-MacAdam outperforms existing state-of-the-art differentially private optimizers on benchmark datasets like MNIST and CIFAR-10, achieving better model utility without the need for manual tuning of clipping thresholds.
Methodology
The authors propose DP-MacAdam, which integrates AdaClip's adaptive clipping strategy with DP-Adam's momentum updates. The algorithm utilizes empirical mean and variance estimates from the gradients to dynamically adjust both clipping thresholds and momentum, ensuring that the updates are more informative and effective. A bias correction factor is derived to provide unbiased estimates of gradient variance, accounting for noise and initialization biases.
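For context, the fixed-threshold DP-SGD step that the summary contrasts against can be sketched as follows. This is the standard baseline (clip each per-sample gradient, sum, add Gaussian noise scaled by the threshold, average), not DP-MacAdam itself: the adaptive clipping, momentum, and bias-correction rules are not specified in enough detail here to reproduce. The function name and argument layout are hypothetical.

```python
import math
import random


def dp_sgd_step(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Return a privatized average gradient using a fixed clipping threshold."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Rescale so that each per-sample gradient has norm at most clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for k in range(dim):
            total[k] += grad[k] * scale
    # Gaussian noise with standard deviation proportional to the threshold.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [n / len(per_sample_grads) for n in noisy]


# With the noise switched off, only the clipping effect remains visible:
# the first gradient (norm 5) is scaled down, the second (norm 0.5) is kept.
grads = [[3.0, 4.0], [0.3, 0.4]]
clipped_mean = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=0.0,
                           rng=random.Random(0))
```

The sensitivity to `clip_norm` is visible even in this toy case: too small a threshold shrinks informative gradients, while too large a threshold inflates the noise, which is the tuning burden adaptive clipping aims to remove.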
Results
DP-MacAdam was empirically evaluated on datasets such as MNIST and CIFAR-10, showing superior performance compared to DP-SGD, AdaClip, and DP-Adam across various privacy budgets. The results indicate that DP-MacAdam achieves better model utility without requiring manual tuning of the clipping threshold, thus simplifying the implementation of differentially private training.
Implications
The development of DP-MacAdam has significant implications for privacy-preserving machine learning, particularly in scenarios where sensitive data is involved. By improving the efficiency of training algorithms while maintaining strong privacy guarantees, this work can facilitate the deployment of machine learning models in sensitive applications such as healthcare and finance.
Causal Longitudinal Prior-Fitted Networks for Counterfactual Outcome Prediction
Time Series
Theory
Efficient ML
- CAUSALLONGPFN is the first PFN-style model for longitudinal causal prediction under planned treatment sequences.
- The model is pretrained on synthetic data, allowing it to handle complex longitudinal dynamics without retraining.
- It provides competitive performance against traditional longitudinal causal estimators on various benchmarks.
- The approach eliminates the need for domain-specific training at test time, making it more efficient.
Summary
This paper introduces Causal Longitudinal Prior-Fitted Networks (CAUSALLONGPFN), a novel approach for predicting potential outcomes in longitudinal treatment scenarios characterized by time-varying confounding and heterogeneous patient dynamics. Unlike traditional longitudinal causal estimators that require separate models for each cohort, CAUSALLONGPFN leverages a pre-trained model that can make predictions without the need for gradient updates or domain-specific training at test time. The model is pre-trained on synthetic data generated from a broad prior over temporal structural causal models, allowing it to learn complex dynamics such as treatment-confounder feedback and delayed effects. At evaluation, CAUSALLONGPFN conditions on historical data and proposed treatment sequences to produce a predictive distribution of future outcomes. The authors evaluate the model against established baselines on various benchmarks, including cancer, HIV, and warfarin treatment scenarios, demonstrating its competitive performance in both counterfactual and factual predictions. The findings suggest that broad synthetic causal pretraining can serve as an effective alternative to repeated domain-specific training, particularly in settings where such training is costly or impractical.
Methodology
CAUSALLONGPFN is pre-trained on synthetic episodes that incorporate treatment-confounder feedback and other complex dynamics. At test time, the model is frozen and uses support trajectories and a query history to predict outcomes under specified treatment sequences without further training or fitting.
Results
CAUSALLONGPFN demonstrated competitive performance against established longitudinal causal models on benchmarks with ground-truth counterfactual labels and performed strongly in factual predictions using real-world clinical data from MIMIC-III, indicating its effectiveness as a frozen predictor.
Implications
The model's ability to provide accurate predictions without the need for extensive retraining could significantly reduce the computational burden in longitudinal causal inference, making it applicable in clinical settings where rapid decision-making is crucial.
A prism hierarchy of learning regimes in large linear autoencoders
Theory
Optimization
- Introduction of a prism hierarchy to classify extreme learning regimes in linear autoencoders.
- Identification of five basic extreme regimes associated with specific hyperparameter scaling relations.
- Extension of diagram-based methods to analyze finite training sets, enabling separate study of train and population losses.
- Derivation of explicit loss evolution expressions for four of the five regimes, with strong empirical validation.
Summary
This paper presents a comprehensive theoretical framework for understanding the learning dynamics of large weight-tied linear autoencoders. The authors propose a prism hierarchy that categorizes the extreme learning regimes based on input and latent dimensions, initialization magnitude, and training set size. They identify five basic regimes: large-data, small-data, mean-field, narrow-latent, and free, each associated with specific scaling relations among hyperparameters. The authors extend the diagram-based method from previous work to analyze finite training sets, allowing for the separate study of train and population learning trajectories. They derive explicit expressions for loss evolutions in four of the five regimes, demonstrating strong agreement with experimental results. This work provides a systematic understanding of the learning dynamics in linear autoencoders, highlighting the importance of regime classification in theoretical machine learning research.
Methodology
The authors utilize a diagram-based method to analyze the learning dynamics of weight-tied linear autoencoders. They introduce data-related edges and nodes in the diagrams to extend the analysis to finite training sets. The study involves deriving explicit expressions for train and population loss evolutions under gradient flow in various extreme regimes.
Results
The paper successfully categorizes the learning dynamics into five extreme regimes and provides explicit formulas for loss evolution in four of these regimes. The results show excellent agreement with experimental data, particularly in the large-data regime, where previous solutions were known, while new solutions were derived for the other three regimes.
Implications
This work enhances the theoretical framework for understanding learning dynamics in large linear models, which can inform the design and training of autoencoders and similar architectures. It emphasizes the significance of regime classification in machine learning research, potentially leading to improved generalization and performance in practical applications.
MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following
Reinforcement Learning
Large Language Models
Optimization
- Identification of three failure modes in group-relative policy updates: low-variance amplification, mean-centering blindness, and zero-variance collapse.
- Introduction of multi-temperature sampling to enhance reward diversity in small-batch settings.
- Development of dual-anchor advantages to restore learning signals in problematic reward scenarios.
- Application of bounded, asymmetric advantage shaping based on Prospect Theory to improve robustness.
Summary
This paper addresses the challenges of reinforcement learning (RL) in multi-constraint instruction following, particularly focusing on the instability of standard group-relative policy optimization (GRPO) under discrete, low-dispersion rewards. The authors identify three critical pathologies associated with z-score group normalization: low-variance amplification, mean-centering blindness, and zero-variance collapse. To mitigate these issues, they propose MDP-GRPO, which incorporates several innovative strategies: multi-temperature sampling to enhance reward dispersion, dual-anchor advantages to restore learning signals in homogeneous groups, prospect-theoretic shaping to bound updates and penalize violations, and asymmetric KL regularization. The proposed method is evaluated on various benchmarks, including FollowBench and IFEval, demonstrating significant improvements in constraint satisfaction and stable convergence, particularly with small group sizes, while maintaining general capabilities on tasks like MMLU and ARC.
Methodology
The authors propose MDP-GRPO, which integrates multi-temperature sampling for reward diversity, dual-anchor advantages for effective learning signals, and prospect-theoretic shaping for bounded updates. These components are designed to address the identified pathologies in group-relative policy optimization.
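Two of the pathologies of z-score group normalization are easy to reproduce numerically. The sketch below shows only the standard normalization that the summary criticizes, not the MDP-GRPO fixes; the function name and the epsilon value are illustrative assumptions.

```python
import statistics


def zscore_advantages(rewards, eps=1e-6):
    """Group-relative advantages: centre by the group mean, scale by group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Zero-variance collapse: identical rewards give no learning signal at all,
# whether every sample succeeded or every sample failed.
all_pass = zscore_advantages([1.0, 1.0, 1.0, 1.0])
all_fail = zscore_advantages([0.0, 0.0, 0.0, 0.0])

# Low-variance amplification: a 0.01 reward difference yields almost the same
# advantage magnitude as a full 1.0 difference.
tiny_gap = zscore_advantages([0.50, 0.50, 0.50, 0.51])
large_gap = zscore_advantages([0.0, 0.0, 0.0, 1.0])
```

Both effects become more likely with discrete rewards and small groups, which matches the setting the summary says the method targets.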
Results
MDP-GRPO outperforms standard GRPO, achieving up to a 5.0% improvement in strict constraint satisfaction on the Llama-3.2-3B model. The method also ensures stable convergence with small group sizes and retains general capabilities across various benchmarks.
Implications
The findings suggest that MDP-GRPO can enhance the reliability of reinforcement learning in applications requiring strict adherence to multiple constraints, such as legal and technical document generation, while also being adaptable to other reward structures.
On the training of physics-informed neural operators for solving parametric partial differential equations
Optimization
Theory
Efficient ML
- CViT architecture consistently delivers strong performance in PINO training.
- Optimization challenges such as gradient imbalance and causal violation arise in PINOs.
- Mitigation strategies from PINN training improve PINO predictive accuracy.
- Physics-informed training can outperform data-driven approaches under certain conditions.
Summary
This paper investigates the training of physics-informed neural operators (PINOs), which aim to learn solution operators for parametric partial differential equations (PDEs) by integrating physical constraints into the training process. Unlike traditional data-driven approaches, PINOs leverage the governing physics as supervision, enhancing data efficiency and generalization across varying parameter configurations. The authors systematically analyze the training pipeline of PINOs, focusing on architecture design, optimizer selection, loss balancing, and collocation-point sampling strategies. They evaluate three operator backbones, namely Deep Operator Network (DeepONet), Fourier Neural Operator (FNO), and Continuous Vision Transformer (CViT), across five diverse parametric PDE systems. The findings reveal that CViT consistently outperforms other architectures in terms of stability and performance. The study also identifies optimization challenges similar to those faced in physics-informed neural networks (PINNs), such as gradient conflicts and causal violations, and demonstrates that mitigation strategies from PINN training can be effectively applied to PINOs. Furthermore, the authors show that a well-designed physics-informed training pipeline can match or even exceed the performance of purely data-driven neural operators, providing a comprehensive understanding of the optimization challenges in PINO training and offering practical guidance for efficient physics-informed operator learning.
Methodology
The authors conducted a systematic empirical study of PINO training, evaluating three operator backbones (CViT, DeepONet, and FNO) across five parametric PDE systems. They analyzed key components of the training pipeline, including architecture design, optimizer choice, loss balancing, and sampling strategies, while also applying mitigation algorithms from PINN training to address optimization pathologies.
Results
The study found that CViT consistently provided the best performance across the evaluated benchmarks. It also confirmed that optimization issues identified in PINN training, such as gradient conflicts and causal violations, were present in PINOs. The application of PINN mitigation strategies led to improved accuracy in PINO predictions. Additionally, the research demonstrated that a physics-informed training pipeline could match or exceed the performance of data-driven models in various data regimes.
Implications
The findings suggest that PINOs can serve as effective tools for solving parametric PDEs in scientific computing, particularly in scenarios where labeled data is scarce. The structured training pipeline proposed can guide future research and applications in physics-informed machine learning, potentially impacting fields such as engineering, materials science, and real-time prediction.
Revisiting Prototype Rehearsal for Exemplar-Free Continual Learning: Manifold-Aware Boundary Sampling with Adaptive Class-Balanced Loss
Computer Vision
- Prototype rehearsal can be competitive in EFCIL if redesigned to consider nearest-enemy information and class imbalance.
- Constrained Expansive Over-Sampling (CEOS) generates boundary-aware rehearsal samples that respect the data manifold.
- Adaptive Class-Balanced (ACB) loss addresses the imbalance between old and new classes through temporal weighting.
- The proposed methods achieve state-of-the-art performance in EFCIL benchmarks, closing the gap with drift-compensation methods.
Summary
This paper addresses the challenges of exemplar-free class-incremental learning (EFCIL), particularly focusing on prototype rehearsal methods that have been traditionally used to mitigate catastrophic forgetting. The authors argue that the limitations of existing prototype rehearsal techniques stem from their failure to account for the geometry of the data manifold and the imbalance between old and new classes. To overcome these issues, they propose a novel approach that includes Constrained Expansive Over-Sampling (CEOS) and an Adaptive Class-Balanced (ACB) loss. CEOS generates boundary-aware rehearsal samples by interpolating old-class prototypes towards their nearest enemy features from new classes, ensuring that the samples remain on the correct side of the decision boundary. The ACB loss incorporates time-based class weighting to address the imbalance in sample counts between old and new classes, amplifying the influence of older prototypes when they are most informative. The proposed methods are shown to enhance the performance of prototype rehearsal, making it competitive with, and often superior to, recent drift-compensation methods across multiple EFCIL benchmarks.
Methodology
The authors introduce two main components: Constrained Expansive Over-Sampling (CEOS) and Adaptive Class-Balanced (ACB) loss. CEOS interpolates old-class prototypes towards nearest enemy features to create boundary-aware samples, while ACB loss applies time-based weighting to balance the influence of old and new classes during training.
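The interpolation step in CEOS can be sketched as below. This is a simplified illustration, not the authors' procedure: it assumes Euclidean distance for finding the nearest enemy, and it uses an interpolation weight below 0.5 as a crude stand-in for keeping the sample on the old class's side of the boundary. The actual constraint used in the paper is not described here, and the function names are hypothetical.

```python
import math


def nearest_enemy(prototype, new_features):
    """Return the new-class feature closest to an old-class prototype."""
    return min(new_features, key=lambda f: math.dist(prototype, f))


def boundary_sample(prototype, new_features, lam=0.25):
    """Move the prototype a fraction lam of the way towards its nearest enemy.

    lam < 0.5 keeps the sample closer to the prototype than to the enemy
    (an assumed proxy for staying on the correct side of the boundary).
    """
    enemy = nearest_enemy(prototype, new_features)
    return [p + lam * (e - p) for p, e in zip(prototype, enemy)]


old_prototype = [0.0, 0.0]
new_class_features = [[4.0, 0.0], [0.0, 2.0]]
sample = boundary_sample(old_prototype, new_class_features)
```

Samples generated this way sit between the stored prototype and the most confusable new-class region, which is where a rehearsal sample is most likely to affect the decision boundary.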
Results
The proposed methods demonstrate significant improvements in performance on multiple EFCIL benchmarks, often surpassing existing drift-compensation techniques. The integration of CEOS and ACB loss effectively mitigates catastrophic forgetting while maintaining competitive accuracy.
Implications
This work has implications for developing more effective continual learning systems that can operate under strict memory and privacy constraints, making it suitable for applications in fields such as robotics, autonomous systems, and real-time data processing.
From Prediction to Self: Developmental Conditions for Agency in Minimal Neural Systems
Theory
Robotics
Time Series
- Identifies four critical developmental conditions for agency in neural systems.
- Introduces agency gain as a measurable metric for self-awareness in predictive systems.
- Demonstrates that self-representation is contingent on causal usefulness.
- Falsifies 12 hypotheses that clarify the limitations of predictive coding and passive memory.
Summary
This paper investigates the transition from a predictive system to one that can distinguish its own causal influence, focusing on a minimal 192-dimensional Gated Recurrent Unit (GRU). Through 40 controlled experiments arranged in a developmental sequence, the author identifies four necessary conditions for a system to achieve self-world decomposition: (1) persistent state that forms stable attractors, (2) a causal action loop linking the system's output to its input, (3) proprioceptive feedback that makes implicit causal knowledge explicit, and (4) asynchronous awakening, where perceptual learning must consolidate before action learning begins. The study introduces the concept of agency gain (A = Err_world − Err_self) as a metric to measure the predictive advantage of a system that understands its own actions. The final configuration of the system demonstrates consistent performance improvements over a self-blind predictor in both periodic and chaotic environments. The paper also presents 12 falsified hypotheses that clarify the boundaries between predictive systems and those that possess self-awareness, emphasizing that self-representation is only maintained when it is causally useful.
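As a rough illustration of the agency gain metric, the sketch below assumes that Err_world is the prediction error of a self-blind predictor, that Err_self is the error of the self-aware predictor, and that both are mean squared errors. The summary does not state the error measure, so these are assumptions, and `agency_gain` is a hypothetical helper name.

```python
import numpy as np

def agency_gain(obs, pred_self_blind, pred_self_aware):
    """Agency gain A = Err_world - Err_self, using MSE as an assumed error measure."""
    obs = np.asarray(obs, dtype=float)
    err_world = np.mean((obs - np.asarray(pred_self_blind, dtype=float)) ** 2)
    err_self = np.mean((obs - np.asarray(pred_self_aware, dtype=float)) ** 2)
    # Positive values mean that modelling its own actions helps the predictor.
    return err_world - err_self
```

Under this reading, a positive gain indicates a predictive advantage from self-representation, and a value near zero indicates none.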
Methodology
The methodology involves a series of 40 controlled experiments using a minimal GRU architecture. Each experiment systematically adds components to the system, allowing for precise attribution of changes in performance metrics. The experiments are designed to evaluate the system's ability to distinguish between self-caused and world-caused changes in observations.
Results
The experiments reveal that the self-aware predictor consistently outperforms the self-blind predictor across different environments. The proposed agency gain metric effectively tracks the developmental process, and the final configuration retains a high level of causal encoding even after the removal of external training signals. The study also identifies several alternative approaches that fail to produce self-representation, delineating the necessary conditions for agency.
Implications
The findings have implications for understanding the development of agency in artificial systems, potentially informing the design of more advanced neural architectures that can achieve self-awareness. This research could also contribute to fields such as developmental robotics and cognitive science by providing insights into the mechanisms underlying self-representation and causal attribution.
Staged Factorial Screening for Budget-Constrained Micro-Pretraining
Optimization
Efficient ML
Theory
- Introduces a staged screening methodology for micro-pretraining that emphasizes factorial design and local refinement.
- Demonstrates that early recipe effects are significantly influenced by budget constraints, particularly in terms of batch size and model depth.
- Identifies specific configurations that retain performance effects under budget constraints, while random searches can yield competitive results without clear factor attribution.
- Recommends a bridge-centered approach for effective long-term performance across different hardware setups.
Summary
This paper addresses the challenge of budget-constrained micro-pretraining in automated research loops, where numerous candidate recipes must be evaluated on shared accelerators before committing to larger search budgets. The author proposes a staged fractional-factorial workflow aimed at identifying stable early performance differences under strict time constraints. The methodology involves running a series of experiments on a single-GPU training loop, testing various configurations across different time budgets (2, 5, and 10 minutes) and employing a range of techniques including full condition reruns and targeted anchor checks. The findings reveal that early penalties from total batch size, depth, and width are most significant at shorter budgets but diminish as the budget increases. The analysis shows that certain configurations (D, A, B, and C) maintain significant effects at 5 and 10 minutes, while others do not. The results support a structured approach to screening, confirming promising configurations, and refining them within a reduced space, ultimately recommending a bridge-centered methodology for effective micro-pretraining.
Methodology
The methodology involves a staged factorial screening process that includes initial screening experiments, focused confirmation of promising configurations, and local refinement within a reduced parameter space. The experiments are conducted on a single-GPU training loop with varying time budgets, using a combination of pilot and follow-up screens, targeted checks, and comparisons against random baselines.
Results
The results indicate that main penalties from batch size and model depth are most pronounced at shorter budgets and relax as the budget increases. Configurations D, A, B, and C show non-zero estimates after correction for false discovery rates, while configuration E does not. The bridge-centered approach consistently yields the lowest sample mean across different hosts, suggesting its effectiveness in identifying high-performing configurations.
Implications
The findings suggest that a structured approach to micro-pretraining can enhance the efficiency of automated research workflows, allowing for better early-stage decision-making in hyperparameter tuning. This methodology could be applied in various machine learning contexts where budget constraints are a concern, potentially leading to more effective model training and optimization strategies.
ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models
NLP
Large Language Models
Theory
- ERRORQUAKE-10K benchmark scores model responses on a continuous severity scale, revealing nuanced error distributions.
- A Non-Reducibility Theorem proves that error severity profiles and error rates are informationally non-redundant.
- Significant differences in error severity distributions exist among models with similar accuracy, indicating the need for more detailed evaluation metrics.
- Human validation confirms the reliability of the severity scoring system, with high inter-rater agreement.
Summary
This paper introduces ERRORQUAKE-10K, a novel benchmark designed to evaluate the error severity distributions of open-weight large language models (LLMs). Unlike traditional metrics that report a single error rate, ERRORQUAKE-10K scores model responses on a continuous severity scale from 0 to 4, capturing the nuanced differences in error types and their potential impacts. The study analyzes 21 open-weight models across 10,000 queries, revealing significant disparities in error severity distributions even among models with similar accuracy rates. The author establishes a Non-Reducibility Theorem, demonstrating that error severity profiles provide unique information beyond mere error rates. A human validation study confirms the reliability of the severity scoring system, and a taxonomy of error types shows that low-severity errors are predominantly retrieval mistakes, while high-severity errors often involve fabrications. The findings suggest that reporting error severity alongside accuracy is crucial for a comprehensive understanding of model performance.
Methodology
The study employs a dual-judge scoring system to evaluate model responses on a continuous 0-4 severity scale, validated through a 519-item human study. The ERRORQUAKE-10K benchmark consists of 10,000 queries across eight domains, stratified by difficulty. The severity distribution for each model is characterized, and statistical analyses are conducted to assess the relationships between error severity and model characteristics.
Results
The analysis of 210 model pairs reveals that 85 pairs have disjoint 95% confidence intervals for severity distribution indices at matched accuracy levels. The study finds a strong correlation between model size and severity distribution, with larger models exhibiting heavier severity tails. The human validation study shows high reliability in severity scoring, with an inter-rater correlation of 0.85.
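The figure of 210 pairs follows from 21 models: C(21, 2) = 210. A minimal sketch of the pairwise comparison is shown below, assuming each model has a (low, high) confidence interval for its severity index; the `disjoint_pairs` helper and its input format are hypothetical, and the matching by accuracy level described in the paper is not reproduced.

```python
from itertools import combinations
from math import comb

def disjoint_pairs(intervals):
    """Return model pairs whose (low, high) intervals do not overlap."""
    out = []
    items = sorted(intervals.items())
    for (a, (lo_a, hi_a)), (b, (lo_b, hi_b)) in combinations(items, 2):
        # Two intervals are disjoint when one ends before the other starts.
        if hi_a < lo_b or hi_b < lo_a:
            out.append((a, b))
    return out

n_pairs = comb(21, 2)  # number of unordered pairs among 21 models
```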
Implications
The findings suggest that evaluating LLMs should include error severity metrics alongside traditional accuracy measures, as this provides deeper insights into model performance and potential risks in deployment. This approach could lead to improved model selection and deployment strategies in real-world applications.
Quantifying the Privacy of Counterfactuals by Leveraging Membership Inference Attacks Against Synthetic Data
Theory
- Counterfactuals can reveal sensitive information and are vulnerable to privacy attacks.
- Membership inference attacks designed for synthetic data can be adapted to counterfactuals.
- Successful MIAs can be conducted without querying the model, using only the released counterfactuals.
- The study introduces an ensembling MIA that operates in a no-box setting, expanding the attack landscape.
Summary
This paper investigates the privacy implications of counterfactuals, which are often used in machine learning to explain model decisions by illustrating how changes to user profiles can lead to different outcomes. While counterfactuals serve as valuable tools for understanding model behavior, they can also be exploited by adversaries to conduct privacy attacks, particularly through membership inference attacks (MIAs). The authors draw parallels between counterfactuals and synthetic data, suggesting that MIAs developed for synthetic data can be adapted to counterfactuals. They demonstrate that it is possible to perform MIAs on counterfactuals without direct access to the underlying model, a significant shift from traditional approaches that require model queries. The study evaluates the effectiveness of various MIAs against counterfactuals generated by state-of-the-art mechanisms and introduces an ensembling MIA that operates in a no-box setting. The findings highlight the need for caution among model developers when releasing counterfactuals, as they may inadvertently expose sensitive information about the training data or model.
Methodology
The authors implemented an ensembling membership inference attack (MIA) against counterfactuals generated by various state-of-the-art counterfactual generation mechanisms. They compared the effectiveness of this attack with a counterfactual distance attack specifically designed for counterfactuals. The study focused on a no-box attack setting, where only the counterfactuals are available to the adversary, without direct access to the model.
Results
The results demonstrated that the ensembling MIA was effective in identifying membership information from counterfactuals, even in the absence of model queries. This finding indicates that counterfactuals can pose significant privacy risks, and existing protections against MIAs may not be sufficient.
Implications
The implications of this research are critical for the development and deployment of machine learning models that utilize counterfactuals, particularly in sensitive applications. It suggests that developers need to implement stricter privacy measures when releasing counterfactuals to prevent potential privacy breaches.
Generalized TV-ℓp Structured Priors for Bayesian T1 Mapping
Theory
- Introduction of a generalized TV-ℓp prior for Bayesian T1 mapping.
- Demonstrated properness of the proposed prior and its integration into a Bayesian framework.
- Evaluation against multiple existing methods shows improved uncertainty quantification.
- Results indicate lower variance and bias in estimates, enhancing reliability.
Summary
This paper introduces a novel family of structured spatial priors that combine total variation (TV) functions with ℓp norms, aimed at enhancing Bayesian T1 mapping in MRI. The proposed TV-ℓp prior is shown to be a proper distribution and is integrated into a Bayesian regression framework, allowing for effective uncertainty quantification in T1 mapping. The authors utilize the No-U-Turn Sampler (NUTS) for posterior inference. The method is rigorously evaluated against maximum-likelihood estimation and various Bayesian alternatives, including uniform, Gamma, and bounded TV priors, using both synthetic and real datasets. The results demonstrate that the TV-ℓp prior leads to more concentrated posterior densities, indicating reduced uncertainty, lower variance, and smaller bias in parameter estimates. This approach significantly improves spatial coherence in T1 maps and enhances the reliability of uncertainty quantification, making it a robust method for T1 mapping in medical imaging.
Methodology
The authors developed a Bayesian regression framework incorporating a generalized TV-ℓp structured prior. Posterior inference was performed using the No-U-Turn Sampler (NUTS). The method was evaluated through experiments on synthetic and real datasets, comparing it with maximum-likelihood estimation and various Bayesian priors.
Results
The TV-ℓp prior resulted in more concentrated posterior densities, indicating reduced uncertainty in estimates. The method consistently achieved lower variance and smaller negative bias compared to other approaches, leading to more reliable T1 mapping results.
Implications
The proposed method enhances the accuracy and reliability of T1 mapping in MRI, which is crucial for diagnosing and monitoring various medical conditions. The improved uncertainty quantification can lead to better clinical decision-making and patient outcomes.
Towards Unified and Data-Efficient Prognostics and Health Management with Tabular Foundation Models
Time Series
Efficient ML
- Tabular Foundation Models can effectively handle fragmented and partially observed industrial PHM data.
- In-context learning allows for task adaptation without retraining, reducing deployment overhead.
- The proposed models outperform traditional methods in low-data regimes and across various PHM tasks.
- Temporal context can be preserved in tabular representations, enhancing model performance.
Summary
This paper addresses the challenges of applying data-driven Prognostics and Health Management (PHM) in industrial settings, where data is often fragmented and poorly labeled. The authors propose a framework that utilizes Tabular Foundation Models for PHM tasks, particularly focusing on in-context learning to adapt to varying data conditions without the need for extensive retraining. By converting raw time-series data into tabular formats, the study demonstrates that these models can effectively perform prognostics and diagnostics across multiple tasks. The evaluation compares the proposed models against traditional sequence models, transformer baselines, and gradient-boosted trees, revealing that tabular foundation models achieve superior performance in low-data scenarios. The findings highlight the potential of these models to generalize across different assets and adapt to changing conditions, making them a viable solution for enhancing PHM practices in industry.
Methodology
The authors developed a framework that converts raw unit-level time-series signals into tabular representations. They employed Tabular Foundation Models and evaluated their performance on various PHM tasks using in-context learning. The models were compared with sequence models, transformer baselines, and gradient-boosted trees under a unified evaluation protocol.
Results
The results indicated that tabular foundation models achieved the best average ranks across prognostic and diagnostic tasks. They demonstrated competitive performance in low-data environments and showed that effective context construction is crucial for maintaining performance under subsampling conditions.
Implications
The findings suggest that tabular foundation models can significantly improve the scalability and efficiency of PHM systems in industrial applications. By enabling effective adaptation to varying operational conditions and reducing the need for extensive retraining, these models could facilitate broader adoption of advanced PHM techniques, ultimately leading to reduced downtime and maintenance costs.
Non-Negative Matrix Factorization for Event Data
Time Series
- Introduction of EventNMF, a continuous-time NMF model for event data.
- Directly models event times as Poisson processes, avoiding preprocessing pitfalls.
- Utilizes non-negative B-spline basis for latent temporal factors.
- Demonstrates effectiveness on synthetic and real-world datasets.
Summary
This paper introduces EventNMF, a novel continuous-time non-negative matrix factorization (NMF) model designed specifically for event data, which consists of instantaneous events emitted by entities over time. Traditional applications of NMF to event data typically involve preprocessing steps like binning or smoothing, which can obscure important temporal features and entity-level heterogeneities. EventNMF addresses these issues by modeling each entity's events as a Poisson process, where the intensity of events is expressed as a non-negative linear combination of latent temporal factors represented by a non-negative B-spline basis. The paper provides a mathematically principled framework that is computationally efficient and easy to implement. The authors derive efficient multiplicative updates for parameter estimation and demonstrate the model's effectiveness through evaluations on synthetic datasets with known latent factors, as well as real-world applications including neuronal spike train recordings, earthquake event analysis, and social network interaction data. The results indicate that EventNMF can uncover interpretable temporal patterns without the biases introduced by traditional binning methods.
Methodology
EventNMF operates by modeling the event intensity of each entity as a Poisson process, where the intensity is a non-negative linear combination of latent factors. These factors are represented using a non-negative B-spline basis. The authors derive efficient multiplicative updates for parameter estimation, allowing for direct analysis of event data without prior binning or smoothing.
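The likelihood behind this formulation can be sketched as follows. For an inhomogeneous Poisson process with intensity λ(t) on [0, T], the log-likelihood of the event times is the sum of log λ(t_i) minus the integral of λ over [0, T]. The sketch assumes a degree-1 B-spline ("hat") basis on uniform knots as a stand-in for whatever spline order the paper uses, and the function names are hypothetical; the multiplicative updates themselves are not reproduced here.

```python
import numpy as np

def hat_basis(t, n_basis, T):
    """Degree-1 B-spline ('hat') basis on uniform knots over [0, T]; shape (len(t), n_basis)."""
    h = T / (n_basis - 1)
    centers = np.arange(n_basis) * h
    t = np.asarray(t, dtype=float)
    return np.clip(1.0 - np.abs(t[:, None] - centers[None, :]) / h, 0.0, None)

def hat_integrals(n_basis, T):
    """Exact integral of each hat function over [0, T]."""
    h = T / (n_basis - 1)
    integ = np.full(n_basis, h)
    integ[[0, -1]] = h / 2.0  # boundary hats are half-triangles
    return integ

def poisson_loglik(event_times, w, H, T):
    """Log-likelihood of one entity's events; w: (K,), H: (K, J), both non-negative."""
    coef = w @ H  # spline coefficients of this entity's intensity, shape (J,)
    lam = hat_basis(event_times, H.shape[1], T) @ coef  # intensity at each event time
    return np.sum(np.log(lam)) - coef @ hat_integrals(H.shape[1], T)
```

Because the basis functions are non-negative and the weights are non-negative, the intensity is non-negative by construction, and the integral term has a closed form, so no binning of the event times is needed.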
Results
The evaluation of EventNMF on synthetic datasets showed its ability to recover known latent factors accurately. In real-world applications, it effectively identified temporal patterns in neuronal spike train recordings, earthquake events, and social interactions, outperforming traditional binned-count approaches.
Implications
EventNMF has significant implications for various fields that analyze event data, such as neuroscience, seismology, and social network analysis. By providing a method that retains the fine-grained temporal features of event data, it can enhance the understanding of underlying processes and dynamics in these domains.
Trust-Aware Predictive Emissions Monitoring for Gas Turbine Fleets with Limited Labelled Data
Time Series
- Introduces a trust-aware framework for emissions prediction in gas turbine fleets with limited labelled data.
- Combines multiple techniques for uncertainty quantification and confidence estimation to assess prediction reliability.
- Demonstrates significant reduction in prediction error through confidence-based filtering.
- Provides actionable insights for deploying predictive emissions monitoring systems in industrial settings.
Summary
This paper presents a novel trust-aware probabilistic framework for predictive emissions monitoring in gas turbine fleets, particularly in scenarios where emissions labels are scarce. The proposed framework integrates a multi-head recurrent prediction model with various components including confidence estimation, ensemble-based uncertainty quantification, and auxiliary feature prediction. By calibrating these components on available labelled data, the framework generates interpretable trust scores for each prediction, allowing for reliable assessments of prediction quality on unlabelled turbines. The authors demonstrate that confidence-based filtering significantly reduces the mean absolute error (MAE) of predictions, indicating a strong correlation between confidence estimates and prediction accuracy. The framework effectively identifies increased uncertainty in unlabelled and out-of-distribution samples, thus responding appropriately to distributional shifts. Overall, the proposed approach enhances the transparency and trustworthiness of predictive emissions monitoring systems across industrial fleets, enabling more informed decision-making.
Methodology
The methodology involves a multi-head recurrent prediction model that incorporates learned confidence estimation and ensemble-based uncertainty quantification. The framework calibrates various signals such as predictive uncertainty and feature-space distance analysis on labelled data to produce trust scores for predictions on unlabelled turbines.
Results
The results indicate that confidence-based filtering reduces the mean absolute error (MAE) from 0.202 to 0.070 for the highest-confidence 10% of predictions. The framework also effectively identifies increased uncertainty in unlabelled and out-of-distribution samples, confirming its robustness to distributional shifts.
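Confidence-based filtering of this kind can be sketched in a few lines: rank predictions by a confidence (trust) score, keep the top fraction, and compute the MAE on that subset. The helper name `mae_at_coverage` and the scalar confidence input are assumptions for illustration; the paper's trust score combines several calibrated signals that are not modelled here.

```python
import numpy as np

def mae_at_coverage(y_true, y_pred, confidence, keep_frac=0.10):
    """MAE over the keep_frac most confident predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    n_keep = max(1, int(round(keep_frac * len(y_true))))
    top = np.argsort(-confidence)[:n_keep]  # indices of the highest-confidence predictions
    return np.mean(np.abs(y_true[top] - y_pred[top]))
```

Comparing `mae_at_coverage(..., keep_frac=1.0)` with `mae_at_coverage(..., keep_frac=0.10)` gives the kind of before-and-after contrast reported above.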
Implications
The implications of this work extend to the deployment of predictive emissions monitoring systems in industrial contexts, where reliable emissions predictions are crucial. The trust-aware framework can enhance operational decision-making by providing transparency and reliability assessments, ultimately supporting regulatory compliance and environmental sustainability.
On Advantage Estimates for Max@K Policy Gradients
Reinforcement Learning
Large Language Models
Optimization
- Introduces a Leave-Two-Out baseline for policy-gradient estimators that ensures centered advantages.
- Develops MaxPO, an efficient method for optimizing max@K objectives in reinforcement learning.
- Provides a unified framework for understanding existing advantage estimators in the context of max@K.
- Empirical results show significant reductions in gradient variance and improved performance on reasoning tasks.
Summary
This paper addresses the challenges of reinforcement learning (RL) with sparse rewards, particularly in the context of optimizing inference-time objectives like pass@K and max@K. Existing policy-gradient estimators for these objectives often use different signals, leading to confusion about their relationships. The authors investigate this issue through the design of baselines and advantage centering. They introduce a Leave-Two-Out (L2O) baseline that maintains policy-gradient unbiasedness while ensuring that realized batch advantages are centered. This results in a new method called MaxPO (Max@K Policy Optimization), which features an efficient quadratic-time implementation and is suitable for group-based RL in large language model (LLM) post-training. The paper also derives a canonical finite-batch advantage for max@K, providing a unified perspective on existing estimators. Empirical results demonstrate that the L2O baseline significantly reduces gradient variance and enhances performance on LLM reasoning tasks, outperforming traditional methods.
Methodology
The authors analyze existing policy-gradient estimators for max@K objectives, focusing on the quality of advantage estimators. They propose a new L2O baseline that preserves unbiasedness while centering the advantage. The method is implemented efficiently in quadratic time and integrated into group-based RL frameworks. The paper also derives a canonical finite-batch advantage for max@K, facilitating comparison among different estimators.
Results
The proposed L2O baseline achieved an average reduction of 77.4% in gradient variance during training compared to existing methods. Additionally, it improved task-average pass@256 performance by 5.2% on one model and 2.4% on another across multiple math reasoning benchmarks, consistently outperforming strong baselines.
Implications
The findings suggest that optimizing inference-time objectives directly can enhance the efficiency and effectiveness of reinforcement learning in large language models. The proposed methods could lead to better exploration strategies and improved performance in various reasoning tasks, making them valuable for future research and applications in AI.
Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs
Large Language Models
Optimization
Efficient ML
- Discovery of a dominant-layer phenomenon in ZO fine-tuning, where tuning a single layer can match or exceed full-model performance.
- The dominant layer is task-agnostic but model-specific, identifiable through activation outlier analysis.
- Perturbation effects propagate effectively through the dominant layer, enhancing optimization signals under ZO updates.
- Dominant-layer ZO fine-tuning shows improved performance and significant training speedup compared to existing methods.
Summary
This paper investigates the phenomenon of zeroth-order (ZO) optimization in the context of fine-tuning large language models (LLMs). The authors reveal that during ZO fine-tuning, a single decoding layer, referred to as the 'dominant layer', significantly influences the adaptation process, often achieving performance comparable to or exceeding that of full-model ZO fine-tuning. The dominant layer is identified as task-agnostic yet model-specific, and its identification can be efficiently performed using a simple analysis of activation outliers in the pre-trained model. The study explains that the dominant layer's effectiveness stems from its high sensitivity to perturbations and its early position in the residual stream, which allows perturbation effects to propagate through subsequent layers. Extensive experiments conducted on LLaMA2-7B and Qwen3-8B across nine benchmarks demonstrate that fine-tuning solely the dominant layer not only improves performance but also accelerates training speed, achieving up to 4.52 times faster training compared to full-model methods.
Methodology
The authors conducted a systematic layer-wise analysis of multiple LLMs, fine-tuning one layer at a time while freezing others. They identified the dominant layer based on activation outlier patterns and analyzed the propagation of perturbation effects under ZO optimization.
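To make the single-layer setup concrete, here is a minimal sketch of a two-point zeroth-order (SPSA-style) update that perturbs and updates only one named layer while all others stay frozen. The dictionary-of-arrays parameter format, the function name, and the hyperparameters are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def zo_step_single_layer(params, layer, loss_fn, lr=0.05, eps=1e-3, seed=0):
    """One two-point zeroth-order update applied to params[layer] only."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params[layer].shape)  # random perturbation direction
    plus = dict(params)
    plus[layer] = params[layer] + eps * z
    minus = dict(params)
    minus[layer] = params[layer] - eps * z
    # Finite-difference estimate of the directional derivative along z (a scalar).
    g = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    new = dict(params)
    new[layer] = params[layer] - lr * g * z  # every other layer is left untouched
    return new
```

Only two forward passes are needed per step and no gradients are stored, which is why restricting the perturbation to one layer can reduce cost further.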
Results
Experiments on LLaMA2-7B and Qwen3-8B demonstrated that dominant-layer ZO fine-tuning improved average performance by 0.86% over full-model ZO fine-tuning and 0.61% over LoRA-based ZO fine-tuning, with training speedups ranging from 1.12× to 4.52×.
Implications
The findings suggest that focusing on a single dominant layer can significantly reduce training costs while maintaining model performance. This insight may inform future designs of ZO optimization methods, emphasizing the importance of layer-specific adaptations.
When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet
Large Language Models
Efficient ML
NLP
- Introduces a multiplication-only algorithm for matrix inversion in Gated DeltaNet, enhancing computational efficiency.
- Utilizes a truncated Neumann series with structural masking to eliminate sequential dependencies.
- Achieves up to 5× speedup in kernel execution on NPUs without sacrificing accuracy.
- Adapts the method for low-bit integer quantization, addressing dynamic range issues.
Summary
This paper addresses the computational bottleneck of matrix inversion in chunk-wise parallel linear attention, particularly in the context of Gated DeltaNet models used for long-context modeling. The authors propose a novel algorithm that utilizes a multiplication-only approach to approximate matrix inversion for strictly lower-triangular matrices, which are common in chunk-wise linear attention. By leveraging a truncated Neumann series expansion with structural masking and parallel residual correction, the proposed method eliminates sequential dependencies, thus enhancing parallelism and hardware utilization on NPUs. The authors also adapt their method for low-bit integer quantization, effectively managing the dynamic range expansion that arises from repeated matrix power operations. Experimental results demonstrate significant improvements, including up to 5× speedup in kernel execution and a 20% reduction in decode-layer overhead, while maintaining accuracy in both floating-point and low-precision settings. This work presents an efficient and scalable solution for linear attention mechanisms in large models.
Methodology
The authors reformulate the matrix inversion problem as an approximate computation using a low-order Neumann series expansion. They implement structured diagonal masking and parallel residual correction to preserve accuracy while enabling dense matrix multiplications. The method is designed to be compatible with low-bit integer operations, mitigating issues related to dynamic range expansion.
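The underlying identity can be sketched as follows. For a strictly lower-triangular n×n matrix L, L^n = 0, so (I − L)^{-1} = I + L + L^2 + ... + L^(n−1), and truncating the series earlier gives an approximation that needs only matrix multiplications. The sketch uses the product form (I + L)(I + L^2)(I + L^4)..., which doubles the number of series terms per factor. The sign convention, the function name, and the choice of product form are assumptions; the paper's structural masking, residual correction, and quantization handling are not reproduced.

```python
import numpy as np

def neumann_inverse(L, n_factors):
    """Approximate (I - L)^{-1} for strictly lower-triangular L using only matmuls.

    Uses (I + L)(I + L^2)(I + L^4)... = sum of L^k for k < 2**n_factors.
    """
    n = L.shape[0]
    inv = np.eye(n) + L
    power = L
    for _ in range(n_factors - 1):
        power = power @ power  # squares the current power of L
        inv = inv @ (np.eye(n) + power)  # doubles the number of series terms
    return inv
```

With `2**n_factors >= n` the result is exact up to floating-point error, because all higher powers of L vanish; fewer factors trade accuracy for fewer multiplications, and there is no sequential forward substitution as in a triangular solve.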
Results
The proposed method achieves up to 5× speedup in kernel-level execution and a 20% reduction in decode-layer overhead in experiments conducted on Qwen3.5-family models. The accuracy of the model is preserved under both floating-point and low-precision inference conditions.
Implications
This work has significant implications for the design of efficient algorithms in large-scale models, particularly in natural language processing tasks that require long-context capabilities. The proposed method can enhance the performance of models deployed on hardware with limited computational resources, making it suitable for real-time applications.
LLM Explainability with Counterfactual Chains and Causal Graphs
NLP
Large Language Models
Interpretability
- Introduces a causal paradigm for explainability in LLMs using causal graphs.
- Develops an MCMC-inspired method for generating counterfactuals to enhance data coverage.
- Evaluates the approach on three LLMs across various classification tasks.
- Demonstrates that discovered causal graphs reflect meaningful dependencies in LLM reasoning.
Summary
This paper introduces a novel approach to explainability in Large Language Models (LLMs) by utilizing causal graphs to model LLM inference processes. The authors propose a four-phase methodology that begins with generating class predictions from a target LLM and extracting class-discriminative concepts. They then employ a Markov Chain Monte Carlo (MCMC)-inspired counterfactual augmentation procedure to enhance the dataset by generating counterfactual examples, which leads to more stable causal discovery. The resulting causal graphs provide a transparent view of how LLMs perceive and organize concepts to arrive at predictions. The method is evaluated across three LLMs in tasks such as disease diagnosis and sentiment analysis, demonstrating that the discovered causal graphs effectively capture meaningful dependencies consistent with the reasoning of LLMs. This work aims to bridge the gap between opaque LLM decision-making and the need for interpretable, concept-level explanations.
Methodology
The proposed methodology consists of four phases: (1) generating class predictions from the LLM for a set of textual examples, (2) extracting class-discriminative concepts and mapping each example to concept states, (3) employing an MCMC-inspired counterfactual augmentation to expand the concept space, and (4) constructing causal graphs that represent the relationships between concepts and predictions.
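Phase (3) can be pictured as a random walk over concept states. The sketch below is only an illustration of an MCMC-style chain, assuming binary concept states, single-concept flips as proposals, and a hypothetical positive `score` function standing in for whatever plausibility signal the authors use; none of these details are confirmed by the summary.

```python
import random

def counterfactual_chain(concepts, score, steps=200, seed=0):
    """Metropolis-style walk over binary concept states.

    `score` is a hypothetical, strictly positive plausibility function;
    each step proposes flipping exactly one concept.
    """
    rng = random.Random(seed)
    state = list(concepts)
    samples = []
    for _ in range(steps):
        proposal = state[:]
        i = rng.randrange(len(state))
        proposal[i] = 1 - proposal[i]  # flip one concept to form a counterfactual
        accept = min(1.0, score(proposal) / score(state))
        if rng.random() < accept:
            state = proposal
        samples.append(tuple(state))
    return samples

# Usage: start from an observed concept vector and collect counterfactual states.
samples = counterfactual_chain([1, 0, 1], score=lambda s: 1.0 + sum(s))
```

Each collected state differs from its predecessor by at most one concept, so the chain explores neighbors of the observed examples rather than arbitrary points in concept space.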
Results
The experiments reveal that the causal graphs constructed through the proposed method capture significant dependencies in LLM reasoning. The MCMC-inspired augmentation leads to improved graph accuracy and stability, with discovered causal parents serving as strong predictors for each node. The results indicate a divergence in reasoning strategies between structured synthetic tasks and naturalistic data, highlighting the adaptability of LLMs.
Implications
This work has significant implications for enhancing the interpretability of LLMs, particularly in high-stakes domains such as healthcare and law, where understanding the reasoning behind model predictions is crucial. The framework can aid stakeholders in gaining insights into LLM decision-making processes, potentially increasing trust and accountability in AI systems.
Consistency Training Along the Transformer Stack
NLP
Large Language Models
Theory
- Introduction of two new consistency training methods: MLPCT and AttCT.
- Application of consistency training to four new threat models, enhancing model robustness.
- Discovery of cross-threat generalization, where training on one threat improves performance on others.
- Identification of a shared mechanism among new methods, with BCT operating distinctly.
Summary
This paper explores the concept of consistency training in transformer models, aiming to enhance model alignment by ensuring consistent behavior across varying contexts. The authors introduce two novel internal consistency targets: MLP Consistency Training (MLPCT), which focuses on matching post-activation states of MLP layers, and Attention Consistency Training (AttCT), which aligns per-head attention distributions. Additionally, the paper expands the application of consistency training to four new safety threats: persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment. The findings indicate that the proposed methods significantly reduce misalignment across various models and threat scenarios, demonstrating cross-threat generalization where training on one type of threat improves robustness against others. The study also reveals a shared residual-stream mechanism among the new methods while distinguishing BCT as a distinct approach. Overall, the results suggest that consistency training is a versatile framework for addressing a broader range of model pathologies.
Methodology
The authors developed two new consistency training methods, MLPCT and AttCT, which penalize discrepancies in MLP post-activation states and attention distributions, respectively. They applied these methods to various threat models and evaluated their effectiveness in reducing misalignment. The study utilized a consistency training framework that formalizes the relationship between perturbation sources, agreement targets, and disagreement metrics.
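A minimal sketch of the two penalties, assuming mean squared error as the disagreement metric for MLP post-activations and KL divergence for per-head attention distributions. Both metric choices and all function names are assumptions for illustration; the summary does not state which metrics the paper uses.

```python
import math

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mlpct_loss(clean_acts, perturbed_acts):
    """Mean squared gap between MLP post-activation vectors, averaged over layers."""
    pairs = list(zip(clean_acts, perturbed_acts))
    return sum(mse(c, p) for c, p in pairs) / len(pairs)

def attct_loss(clean_attn, perturbed_attn):
    """Mean KL(clean || perturbed) over heads; each head is one attention distribution."""
    pairs = list(zip(clean_attn, perturbed_attn))
    return sum(kl_divergence(c, p) for c, p in pairs) / len(pairs)

# Usage: one layer with two units, one head over two positions.
print(mlpct_loss([[1.0, 2.0]], [[1.0, 4.0]]))   # 2.0
print(attct_loss([[0.5, 0.5]], [[0.9, 0.1]]))   # positive
```

In both cases the clean-context forward pass supplies the agreement target and the perturbed-context pass is penalized for drifting from it.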
Results
The results showed that both MLPCT and AttCT performed competitively across all tested threat models, with some datasets showing significant improvements over baseline methods. The study also found that models trained on one type of threat exhibited improved robustness to others, indicating effective cross-threat generalization. The shared pathway mechanism among the new methods was highlighted, while BCT was noted for its distinct operational characteristics.
Implications
The findings suggest that consistency training can serve as a comprehensive approach to enhance the alignment of language models, potentially leading to safer and more reliable AI systems. This framework could be applied in various contexts where model behavior needs to be controlled and aligned with desired outcomes, particularly in sensitive applications.
PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training
Large Language Models
Optimization
Theory
- Introduction of the PC layer for polynomial weight preconditioning in LLMs.
- Empirical results show significant improvements in training efficiency and accuracy.
- Theoretical proof linking weight spectrum control to convergence rates in deep linear networks.
- No additional inference cost after training, making it practical for real-world applications.
Summary
This paper introduces a novel preconditioning (PC) layer designed to enhance the training of large language models (LLMs) by stabilizing weight conditioning through polynomial preconditioning. The PC layer reshapes the singular-value spectrum of weight matrices, ensuring that the weights remain well-conditioned throughout the training process. This is achieved without incurring additional computational costs during inference, as the preconditioned weights can be integrated back into the original architecture post-training. The authors provide theoretical justification for their approach, demonstrating that bounding the singular values of each layer leads to geometric convergence of gradient descent to global minima in certain deep linear networks. Empirical evaluations on Llama-1B and Llama-271M models show that the PC layer significantly improves token efficiency and zero-shot accuracy, outperforming standard transformer architectures when using both AdamW and Muon optimizers. The findings suggest that controlling the weight spectrum is crucial for optimizing LLM training, paving the way for further advancements in weight-based normalization techniques.
Methodology
The authors propose a PC layer that utilizes low-degree polynomial preconditioning to reshape the singular-value spectrum of weight matrices. This method avoids computationally expensive matrix decompositions by directly modifying the singular values through polynomial functions. The approach is integrated into the training of LLMs and evaluated using standard optimizers.
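To see how a polynomial can reshape singular values without a decomposition: if W = U diag(s) Vᵀ, then the matrix sum_k c_k (W Wᵀ)^k W equals U diag(s · p(s²)) Vᵀ with p(x) = sum_k c_k x^k. The sketch below applies this with the classic Newton-Schulz cubic coefficients (1.5, -0.5), which are an assumption chosen for illustration and not the paper's polynomial.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def poly_precondition(w, coeffs):
    """Return sum_k coeffs[k] * (W W^T)^k W using only matrix products."""
    gram = matmul(w, transpose(w))
    term = w
    out = [[coeffs[0] * x for x in row] for row in w]
    for c in coeffs[1:]:
        term = matmul(gram, term)  # next power of (W W^T) applied to W
        out = [[o + c * t for o, t in zip(orow, trow)]
               for orow, trow in zip(out, term)]
    return out

# Usage: a diagonal W makes the singular values easy to read off.
w = [[1.2, 0.0], [0.0, 0.5]]
pre = poly_precondition(w, (1.5, -0.5))
# Singular values move from (1.2, 0.5) to about (0.936, 0.6875),
# so the condition number drops from 2.4 to roughly 1.36.
```

Because the map only involves matrix multiplications, it can be applied during training and the resulting weights stored directly, which is consistent with the claim of no extra inference cost.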
Results
The PC layer demonstrated consistent improvements in token efficiency, achieving the same loss with fewer training tokens (2× with AdamW and 1.13× with Muon). Additionally, it enhanced zero-shot downstream accuracy and improved weight-spectrum conditioning, validating its effectiveness over standard transformer architectures.
Implications
The findings suggest that incorporating polynomial preconditioning can lead to more efficient training of large language models, potentially influencing future designs of normalization techniques in neural networks. This could enhance the performance of LLMs in various applications, including natural language processing tasks.
Representation Learning Enables Scalable Multitask Deep Reinforcement Learning
Reinforcement Learning
Robotics
Efficient ML
- Representation learning is essential for scalable multitask reinforcement learning.
- The MR.Q algorithm outperforms model-based methods and deep RL baselines.
- Increased model capacity leads to consistent performance improvements.
- Predictive objectives are critical for effective representation learning.
Summary
This paper addresses the challenge of scaling reinforcement learning (RL) to diverse multitask settings, arguing that representation learning is the key driver for scalability rather than model-based control. The authors propose a model-free algorithm called MR.Q, which integrates predictive, model-based representations with high-capacity value function approximation. This approach is evaluated in a multitask continuous control environment and demonstrates superior performance compared to a recent world-model-based method and various deep RL baselines. The results indicate that MR.Q not only reduces computational overhead but also enhances wall-clock efficiency. The study reveals that increasing model capacity consistently improves performance and that predictive representation learning is crucial for achieving these gains. The findings suggest that focusing on representation quality can lead to significant advancements in multitask RL, providing a more efficient alternative to complex planning methods.
Methodology
The authors developed MR.Q, a model-free actor-critic algorithm that incorporates auxiliary predictive objectives. They conducted experiments across a suite of multitask continuous control benchmarks, evaluating sample efficiency and performance at reduced training steps (10M) to isolate the effects of representation learning from prolonged training benefits.
Results
MR.Q outperformed the world-model-based method Newt and various deep RL baselines, achieving better sample efficiency and wall-clock performance. The algorithm also demonstrated stronger transfer capabilities to unseen tasks and highlighted the importance of predictive objectives, as performance significantly degraded without them.
Implications
The findings suggest that improving representation quality can enhance the efficiency and effectiveness of multitask RL systems, potentially leading to broader applications in robotics, game playing, and other domains where multitask learning is beneficial.
A Machine Learning-Based Framework for Discovering Huntington's Disease Stages: Integrating Graph Representation Learning and clustering to Uncover Progression Dynamics in Longitudinal Enroll-HD Dataset
Graph Learning
Time Series
Multimodal
- Developed an unsupervised machine learning framework for identifying Huntington's disease stages.
- Utilized graph-based representation learning to capture temporal relationships in longitudinal clinical data.
- Discovered four meaningful disease stages with clear clinical measurement boundaries.
- Achieved robust clustering performance, surpassing traditional clinical staging methods.
Summary
This paper presents an innovative machine learning framework aimed at identifying stages of Huntington's disease (HD) using longitudinal data from the Enroll-HD dataset. Traditional clinical staging methods rely on predefined thresholds and expert assessments, which can obscure variability and introduce inter-rater differences. To overcome these limitations, the authors developed an unsupervised machine learning approach that utilizes graph-based representation learning to capture temporal relationships in clinical data. The framework encodes longitudinal clinical data into compact latent representations, allowing for the identification of disease stages without relying on established clinical measurements. K-means++ clustering was employed to determine the number of distinct groups, with stability analysis confirming the robustness of the identified clusters. The study analyzed data from 302 individuals, resulting in the discovery of four meaningful disease stages, each corresponding to distinct clinical measurement boundaries. The framework demonstrated strong clustering performance, with a Silhouette score of 0.67 and a Davies–Bouldin index of 0.56, indicating clear separations between stages. This data-driven approach provides a more objective foundation for HD staging and has the potential to incorporate multimodal biomarkers for deeper insights into disease progression.
Methodology
The authors employed an unsupervised machine learning framework that integrates graph-based representation learning with K-means++ clustering. This approach encodes longitudinal clinical data into latent representations, capturing evolving disease patterns without relying on predefined clinical measurements. Stability analysis was conducted to assess the robustness of the identified clusters.
Results
The framework analyzed data from 302 individuals, resulting in the identification of four distinct disease stages. The clustering performance metrics included a Silhouette score of 0.67, a Davies–Bouldin index of 0.56, and a Calinski–Harabasz score of 453, indicating effective separation between stages. The identified stages aligned with clinically significant boundaries, demonstrating minimal overlap compared to existing clinical staging methods.
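For readers unfamiliar with the Silhouette score, the sketch below computes it from scratch on toy 2-D points. It assumes Euclidean distance and at least two clusters; the points and labels are made up for illustration and have nothing to do with the Enroll-HD data.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient (Euclidean distance, two or more clusters)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != l)
        total += (b - a) / max(a, b)
    return total / len(points)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # about 0.90: tight, well-separated
bad = silhouette_score(points, [0, 1, 0, 1])   # negative: clusters are interleaved
```

Values near 1 indicate compact, well-separated clusters, so the reported 0.67 sits in the range usually read as reasonably clear structure.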
Implications
This framework provides a more objective and data-driven method for staging Huntington's disease, potentially leading to improved patient grouping, personalized care, and treatment strategies. The integration of multimodal biomarkers could further enhance understanding of disease progression.
Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving
Large Language Models
Efficient ML
- Tangram improves multi-turn LLM serving efficiency by addressing non-uniform KV cache challenges.
- The system utilizes deterministic memory scheduling to eliminate runtime overhead.
- It employs a decoupled paging architecture to maximize memory reclamation and reduce fragmentation.
- Ahead-of-Time load balancing ensures uniform GPU utilization without runtime planning delays.
Summary
The paper presents Tangram, a novel serving system designed to enhance the efficiency of multi-turn Large Language Model (LLM) serving by addressing the challenges posed by non-uniform Key-Value (KV) caches. Multi-turn interactions require the retention of extensive dialogue history, leading to significant memory consumption and bandwidth strain on GPUs. While non-uniform KV compression allows for more efficient retention of critical information by varying the cache size per attention head, it introduces systemic inefficiencies such as memory fragmentation and scheduling complexities. Tangram tackles these issues through three innovative techniques: (1) Deterministic Budget Allocation, which assigns static memory footprints to attention heads to eliminate dynamic scheduling overhead; (2) Head Group Page, which clusters attention heads with similar retention needs into independent page tables to maximize memory reclamation; and (3) Ahead-of-Time Load Balancing, which precomputes optimal GPU workload distributions to ensure uniform utilization. Experimental results demonstrate that Tangram achieves up to 2.6× throughput improvement over existing systems while maintaining model accuracy, making it a significant advancement in LLM serving efficiency.
Methodology
Tangram employs three core techniques: Deterministic Budget Allocation for static memory assignment, Head Group Page for clustering attention heads into independent page tables, and Ahead-of-Time Load Balancing for precomputing GPU workload distributions. These methods leverage the stable retention patterns of attention heads to optimize memory management and scheduling.
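The three techniques can be sketched in a few lines under strong simplifying assumptions: each head has a fixed token budget, heads with identical budgets share a group, and load balancing is a greedy longest-first assignment. The page size, the grouping rule, and the greedy heuristic are all hypothetical stand-ins, not Tangram's actual algorithms.

```python
import math
from collections import defaultdict

PAGE_TOKENS = 16  # hypothetical page size in tokens

def group_heads_by_budget(head_budgets):
    """Heads with the same static budget share one group (one page table each)."""
    groups = defaultdict(list)
    for head, budget in enumerate(head_budgets):
        groups[budget].append(head)
    return dict(groups)

def pages_per_group(groups):
    # Every head in a group needs the same number of fixed-size pages.
    return {b: len(heads) * math.ceil(b / PAGE_TOKENS) for b, heads in groups.items()}

def balance_ahead_of_time(head_budgets, num_gpus):
    """Greedy assignment computed once before serving: largest budget first,
    always onto the currently least-loaded GPU."""
    loads = [0] * num_gpus
    assignment = [[] for _ in range(num_gpus)]
    for head in sorted(range(len(head_budgets)), key=lambda h: -head_budgets[h]):
        g = loads.index(min(loads))
        assignment[g].append(head)
        loads[g] += head_budgets[head]
    return assignment, loads

budgets = [128, 32, 128, 64, 32, 64, 256, 64]  # static per-head token budgets
groups = group_heads_by_budget(budgets)
pages = pages_per_group(groups)
assignment, loads = balance_ahead_of_time(budgets, num_gpus=2)
```

Because the budgets are fixed up front, all three results can be computed offline, which is the property the summary credits for removing runtime scheduling overhead.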
Results
Tangram demonstrates a throughput improvement of up to 2.6× compared to existing baseline systems, while fully preserving the accuracy of the model. The implementation addresses the systemic inefficiencies associated with non-uniform KV caches, leading to enhanced performance in multi-turn LLM serving.
Implications
The advancements presented in Tangram could significantly enhance the scalability and efficiency of AI assistants and other applications requiring multi-turn interactions, potentially leading to better user experiences and more effective deployment of LLMs in real-world scenarios.
CLaaS: Continual learning as a service for sample efficient online learning
NLP
Large Language Models
Reinforcement Learning
- CLaaS enables sample-efficient continual learning from real-world deployment experiences.
- The system utilizes an experience replay buffer to enhance gradient reuse during training.
- CLaaS outperforms traditional in-context learning methods in terms of retention and adaptability.
- The approach facilitates real-time improvements to agent performance through a chat API.
Summary
The paper introduces CLaaS (Continual Learning as a Service), a novel system designed to enhance the adaptability of large language model (LLM) agents in dynamic environments. As these agents encounter distribution shifts, they must learn from their experiences while retaining prior capabilities. The authors highlight the limitations of traditional in-context learning (ICL) methods, which rely on transient context and can lead to catastrophic forgetting. CLaaS addresses these challenges by enabling agents to learn from a continuous stream of scenarios through an experience replay buffer that supports asynchronous training. This approach allows for sample-efficient learning, where the system can reuse gradients from past experiences to improve performance in real-time. The authors evaluate CLaaS on an adversarial task, demonstrating that it significantly outperforms ICL in terms of forward transfer and retention of learned knowledge. The proposed system not only abstracts continual learning behind a chat API but also establishes a real-time feedback loop for ongoing improvements, marking a significant advancement in the field of online continual learning.
Methodology
The authors propose a continual learning framework that collects on-policy rollouts in an experience replay buffer. This buffer is used for asynchronous training, allowing for gradient reuse and efficient updates to the agent's policy. The system employs a hot-reloading mechanism to implement real-time improvements based on the collected experiences.
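A bounded experience replay buffer of the kind described can be sketched as follows. The class name, the FIFO eviction policy, and uniform sampling are assumptions for illustration; the summary does not specify how CLaaS stores or samples rollouts.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of rollouts; the oldest entries are evicted first."""

    def __init__(self, capacity, seed=0):
        self.items = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, rollout):
        self.items.append(rollout)

    def sample(self, batch_size):
        # Uniform sampling without replacement, capped at the current size.
        k = min(batch_size, len(self.items))
        return self.rng.sample(list(self.items), k)

# Usage: the serving path calls add(); an asynchronous trainer calls sample().
buf = ReplayBuffer(capacity=3)
for step in range(5):
    buf.add({"step": step})
batch = buf.sample(2)
```

Sampling the same stored rollouts across several updates is what lets a trainer reuse past experience instead of requiring fresh interactions for every gradient step.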
Results
In evaluations on an adversarial attack dataset, CLaaS achieved three times the final pass rate and half the forgetting compared to traditional in-context learning methods. This demonstrates the effectiveness of the proposed system in maintaining learned knowledge while adapting to new tasks.
Implications
The CLaaS framework has significant implications for the deployment of autonomous agents in dynamic environments, enabling them to continuously improve their performance based on real-world interactions. This could enhance the reliability and effectiveness of LLMs in various applications, including customer service, content generation, and interactive systems.
Less is MoE: Trimming Experts in Domain-Specialist Language Models
NLP
Large Language Models
Efficient ML
- Fisher importance outperforms traditional metrics for identifying critical model dimensions.
- Fisher-MoE enables fine-grained compression at the intermediate dimension level, preserving model capabilities.
- At a 50% compression ratio, Fisher-MoE reduces memory usage by ~45% and increases inference speed by 21%.
- The study reveals that model capabilities are distributed across experts but concentrated in a small subset of intermediate dimensions.
Summary
This paper addresses the challenges of deploying Mixture-of-Experts (MoE) models due to their large parameter footprint. Previous compression methods have failed when evaluated on general-purpose benchmarks, primarily because they operate at the expert level rather than focusing on the intermediate dimensions where model capabilities are concentrated. The authors introduce a novel approach called Fisher-MoE, which utilizes Fisher importance to identify and remove less critical intermediate dimensions within experts. This method allows for fine-grained compression, significantly reducing the model's weight memory by approximately 45% while improving inference throughput by 21%. The study demonstrates that Fisher importance is a superior metric for ranking parameter significance compared to existing heuristics, leading to better preservation of model performance across various tasks, including mathematical reasoning and code generation.
Methodology
The authors employ Fisher importance as a metric to evaluate the significance of intermediate dimensions in MoE models. They propose the Fisher-MoE method, which compresses the model by removing low-Fisher-score dimensions within each expert's feedforward network (FFN), rather than removing entire experts. This approach is validated through controlled experiments on various challenging benchmarks.
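As a rough illustration, a diagonal empirical Fisher score is the mean squared gradient over calibration examples, and pruning keeps the highest-scoring intermediate dimensions. The sketch simplifies each intermediate dimension to a single scalar gradient, whereas a real FFN dimension spans several weights whose scores would need to be aggregated; the function names are hypothetical.

```python
def fisher_scores(grad_samples):
    """Diagonal empirical Fisher: mean squared gradient per intermediate dimension.

    grad_samples[n][d] is the gradient for dimension d on calibration example n.
    """
    n = len(grad_samples)
    dims = len(grad_samples[0])
    return [sum(g[d] ** 2 for g in grad_samples) / n for d in range(dims)]

def dims_to_keep(scores, compression_ratio):
    """Indices of the highest-scoring dimensions that survive compression."""
    keep = len(scores) - int(len(scores) * compression_ratio)
    ranked = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
    return sorted(ranked[:keep])

# Usage: two calibration examples, four intermediate dimensions, 50% compression.
grads = [[0.1, 2.0, 0.0, 1.0],
         [0.3, 1.0, 0.1, -1.0]]
scores = fisher_scores(grads)        # roughly [0.05, 2.5, 0.005, 1.0]
kept = dims_to_keep(scores, 0.5)     # [1, 3]
```

Applying this inside every expert, rather than dropping whole experts, matches the fine-grained compression the summary describes.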
Results
The implementation of Fisher-MoE resulted in a significant reduction in model size and improved inference efficiency without sacrificing performance. Specifically, the model maintained its capabilities on general-purpose benchmarks while achieving a 50% compression ratio, demonstrating the effectiveness of targeting intermediate dimensions for compression.
Implications
The findings suggest that future research on MoE models should focus on intermediate dimension-level analysis for compression and optimization. This could lead to more efficient deployment of large language models in practical applications, enhancing their usability in resource-constrained environments.
Your GFlowNet Secretly Learns an Optimal Transport Plan
Generative Models
Graph Learning
Optimization
- Establishes a theoretical link between GFlowNets and optimal transport problems.
- Demonstrates that fixing the initial flow in GFlowNets leads to a Kantorovich OT formulation.
- Shows that GFlowNets can recover optimal transport plans and approximate solutions effectively.
- Expands the GFlowNet framework's applicability to large graph OT problems.
Summary
This paper establishes a theoretical connection between Generative Flow Networks (GFlowNets) and optimal transport (OT) problems, particularly in non-acyclic graph settings. The authors demonstrate that by fixing the initial flow distribution in a minimum-flow GFlowNet, the problem can be reformulated as a Kantorovich OT problem with graph-induced shortest path costs. This means that the learned GFlowNet policy effectively encodes an optimal transport plan from a source distribution to a target distribution. The authors provide a framework that allows GFlowNets to be applied to large graph OT problems through edge flows and neural parameterization. Experimental results show that GFlowNets can recover exact OT solutions when feasible and can efficiently approximate solutions in larger combinatorial spaces where exact methods become impractical. This work expands the applicability of GFlowNets beyond traditional sampling tasks to include computational optimal transport problems.
Methodology
The authors reformulate the minimum-flow GFlowNet learning problem as a linear programming problem. They establish that fixing the initial edge-flow distribution transforms the objective into a Kantorovich OT problem. The methodology involves using neural parameterization to learn policies that recover optimal couplings in graph structures, leveraging the properties of non-acyclic GFlowNets.
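For reference, the standard discrete Kantorovich problem with a shortest-path cost can be written as below. The notation is assumed for illustration: \mu is the source distribution, \nu the target distribution, d_G(x, y) the shortest-path distance in the graph, and \pi the transport plan; the paper's own symbols may differ.

```latex
\min_{\pi \ge 0} \; \sum_{x, y} d_G(x, y)\, \pi(x, y)
\quad \text{s.t.} \quad
\sum_{y} \pi(x, y) = \mu(x) \;\; \forall x,
\qquad
\sum_{x} \pi(x, y) = \nu(y) \;\; \forall y
```

Both the objective and the marginal constraints are linear in \pi, which is why the problem is a linear program.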
Results
The results indicate that GFlowNets can effectively recover exact solutions to OT problems when they are tractable and can scale to approximate solutions in larger combinatorial spaces. The experiments validate the theoretical findings, showing that the GFlowNet policies align with optimal transport plans and achieve comparable transport costs.
Implications
This research has significant implications for various applications in machine learning, particularly in areas requiring efficient sampling and transport of structured data, such as molecule generation, biological sequence design, and combinatorial optimization. The ability to apply GFlowNets to optimal transport problems opens new avenues for research and practical applications in graph-based learning tasks.
Sharp First-Order Lower Bounds for Higher-Order Smooth Nonconvex Optimization
Optimization
Theory
- Introduces dimension-free first-order lower bounds for higher-order smooth nonconvex functions.
- Establishes matching lower bounds of Ω(ϵ^{-7/4}) for Hessian-Lipschitz functions and Ω(ϵ^{-5/3}) for third-order-smooth functions.
- Utilizes a block-chain mechanism for constructing hard instances that preserve smoothness.
- Closes long-standing gaps in the lower-bound landscape for first-order oracle complexity.
Summary
This paper investigates the first-order oracle complexity of locating ϵ-stationary points in smooth nonconvex optimization, particularly under higher-order smoothness conditions. While existing methods achieve optimal rates of ϵ^{-2} with Lipschitz gradients, the introduction of higher-order smoothness allows for accelerated rates, such as ϵ^{-7/4} with Lipschitz Hessians and ϵ^{-5/3} with Lipschitz third derivatives. However, the matching lower bounds for these rates had not been established until now. The author presents a new dimension-free lower bound for higher-order smooth nonconvex functions, effectively closing the gap between previous lower bounds and the best-known upper bounds. The construction of the lower bound utilizes a block-chain mechanism that maintains smoothness while enforcing blockwise oracle revelation. This innovative approach not only provides the matching lower bounds but also enhances the understanding of the complexity landscape in higher-order smooth optimization.
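The two notions the rates depend on can be stated in standard form. The constant name L_2 is assumed notation and may not match the paper's.

```latex
% x is an \epsilon-stationary point of f:
\|\nabla f(x)\| \le \epsilon

% f has an L_2-Lipschitz Hessian:
\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L_2 \, \|x - y\| \quad \forall x, y
```

The oracle complexity counts how many gradient queries a first-order method needs, in the worst case over the function class, before it returns a point satisfying the first condition.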
Methodology
The author develops a new lower-bound construction for higher-order smooth nonconvex functions using a block-chain mechanism. This mechanism allows for the enforcement of blockwise oracle revelation while maintaining the necessary smoothness structure. The construction is dimension-free and applies to any finite smoothness order, providing a robust framework for analyzing first-order oracle complexity.
Results
The paper successfully proves dimension-free lower bounds for higher-order smooth nonconvex optimization, specifically achieving Ω(ϵ^{-7/4}) for Hessian-Lipschitz functions and Ω(ϵ^{-5/3}) for functions with Lipschitz third derivatives. These results match the best-known upper bounds, thereby resolving existing gaps in the literature.
Implications
The findings have significant implications for the optimization community, particularly in the development of efficient algorithms for nonconvex problems. By establishing sharp lower bounds, the paper provides a clearer understanding of the limits of first-order methods under higher-order smoothness, guiding future research and algorithm design.
End-to-End Subgraph Detection with GraphDETR
Graph Learning
- GraphDETR reformulates subgraph detection as a set prediction problem, enhancing efficiency and scalability.
- The framework allows for both exact and approximate matching of subgraphs, overcoming limitations of traditional methods.
- GraphDETR achieves strong performance in detecting molecular functional groups, with an average precision of 91.2 on the ChEMBL dataset.
- The model's architecture integrates GNNs and transformer-based set prediction, providing a unified and end-to-end solution.
Summary
The paper introduces GraphDETR, a novel deep learning framework for subgraph detection, which reformulates the problem as a set prediction task similar to DETR in object detection. Traditional combinatorial methods for subgraph isomorphism are limited by their NP-completeness and are only effective for small patterns or graphs. GraphDETR employs a graph neural network (GNN) to encode the target graph and utilizes a fixed set of learnable query vectors decoded through a transformer decoder to predict all occurrences of query patterns in a single forward pass. This end-to-end training approach, using bipartite matching, allows for both exact and approximate matching of subgraphs, enabling the detection of diverse patterns such as molecular structures and fuzzy patterns in larger graphs. The framework is evaluated on molecular functional group detection using the ChEMBL dataset, achieving a high average precision score, demonstrating its effectiveness and potential for broader applications in graph analysis.
Methodology
GraphDETR encodes the target graph using a graph neural network and employs a set of learnable query vectors that interact with the graph representation via a transformer decoder. The model is trained end-to-end using a bipartite matching loss to ensure accurate predictions of subgraph occurrences without duplicates.
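The bipartite matching step pairs each ground-truth subgraph with exactly one prediction before the loss is computed. The sketch below assumes predictions and targets are node sets, uses 1 minus Jaccard overlap as a hypothetical matching cost, and searches assignments by brute force; a practical system would use the Hungarian algorithm, and the paper's actual cost is not given in the summary.

```python
from itertools import permutations

def match_cost(pred_nodes, true_nodes):
    """1 - Jaccard overlap between a predicted and a true node set."""
    union = pred_nodes | true_nodes
    return 1.0 - len(pred_nodes & true_nodes) / len(union) if union else 0.0

def bipartite_match(preds, targets):
    """Minimum-cost one-to-one assignment of targets to predictions.

    Brute force over permutations, so only suitable for a handful of queries;
    assumes there are at least as many predictions as targets.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best), best_cost

preds = [{1, 2, 3}, {7, 8}, {4, 5, 6}]   # node sets decoded from three queries
targets = [{4, 5, 6}, {1, 2}]            # two ground-truth occurrences
matching, cost = bipartite_match(preds, targets)  # matching[t] = index of matched prediction
```

Because each target claims a distinct prediction, two queries cannot both be rewarded for the same occurrence, which is how set-prediction training discourages duplicates.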
Results
GraphDETR demonstrated strong performance in detecting molecular functional groups, achieving an average precision score of 91.2 on the ChEMBL dataset. The framework effectively identified various patterns, including cycles and cliques, in graphs containing up to 1000 nodes.
Implications
The introduction of GraphDETR has significant implications for various scientific domains, particularly in molecular analysis and network science. Its ability to detect complex subgraph patterns efficiently can enhance tasks such as reaction prediction, retrosynthesis planning, and molecular property modeling.
What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning
Robotics
- Introduction of A4D framework for affordance reasoning in robot planning.
- Mapping of visual observations to a functional latent space based on object functionalities.
- Achievement of 94% inference accuracy on existing affordances, outperforming previous methods.
- Improvement of new-affordance inference accuracy from ~70% to over 90% with minimal training data.
Summary
This paper addresses the limitations of existing robot planning systems that rely on appearance-based reasoning, which often fails to capture the functional capabilities of objects. The authors introduce A4D, a framework that maps visual observations into a functional latent space structured around affordances, enabling robots to reason about task-relevant functionalities (e.g., whether an object is 'movable') rather than just its appearance. A4D incorporates an affordance discovery mechanism that expands the latent space to accommodate unseen scenarios, allowing for uncertainty quantification in affordance inference. The evaluation of A4D across various planning tasks demonstrates its effectiveness, achieving high inference accuracy and significantly faster processing times compared to state-of-the-art methods. This approach not only enhances the generalizability of robot-object interactions but also improves the efficiency of real-time planning.
Methodology
The A4D framework utilizes a shared functional latent space to map visual observations and affordances. It employs an image encoder and a text encoder to project observations into this space. The framework quantifies uncertainty in affordance inference and selectively triggers an affordance discovery mechanism when existing affordances are insufficient. This allows for the expansion of the latent space to include new affordances based on proximity measurements.
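One way to picture inference in a shared latent space with an uncertainty trigger: compare an observation embedding against known affordance embeddings and fall back to discovery when nothing is close enough. The cosine-similarity measure, the 0.8 threshold, and the function names are assumptions for illustration, not details from the paper.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def infer_affordance(obs_embedding, affordances, threshold=0.8):
    """Return (name, similarity), or (None, similarity) when no known
    affordance is close enough and discovery would be triggered."""
    name, sim = max(((n, cosine(obs_embedding, e)) for n, e in affordances.items()),
                    key=lambda pair: pair[1])
    return (name, sim) if sim >= threshold else (None, sim)

# Usage with toy 2-D embeddings standing in for encoder outputs.
known = {"movable": [1.0, 0.0], "openable": [0.0, 1.0]}
confident = infer_affordance([0.9, 0.1], known)   # ("movable", about 0.99)
uncertain = infer_affordance([1.0, 1.0], known)   # (None, about 0.71)
```

Only the uncertain case would invoke the more expensive discovery mechanism, which fits the summary's description of triggering it selectively.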
Results
A4D achieved 94% accuracy in inferring existing affordances, surpassing state-of-the-art methods by over 20 percentage points. It also improved the accuracy of new-affordance inference from approximately 70% to over 90% using less than 10% of the original training data. Additionally, A4D enabled inference speeds that are 100 times faster than previous approaches.
Implications
The findings suggest that A4D can significantly enhance the adaptability and efficiency of robotic systems in dynamic environments, allowing for more effective interaction with novel objects. This has potential applications in various fields, including autonomous robotics, human-robot interaction, and assistive technologies.
Anomaly Detection for Electro-Hydrostatic Actuators using LSTM Autoencoder
Time Series
- Proposes an LSTM autoencoder framework for anomaly detection in EHA sensor signals.
- Achieves 99.0% average accuracy and 100% precision in detecting anomalies.
- Demonstrates effectiveness under various fault-injection scenarios.
- Addresses limitations of traditional anomaly detection methods in capturing temporal dependencies.
Summary
This paper addresses the challenge of anomaly detection in Electro-Hydrostatic Actuators (EHAs), which are critical in aerospace and industrial applications. Traditional statistical and classical machine learning methods often fail to effectively capture the temporal dependencies in EHA sensor data, leading to poor detection accuracy and high false-alarm rates. The authors propose an offline anomaly detection framework utilizing a Long Short-Term Memory (LSTM) autoencoder, specifically designed to handle univariate sensor signals such as temperature and pressure. The framework is calibrated using reconstruction-error distributions from validation sets and evaluated under various fault-injection scenarios. The results demonstrate that the LSTM autoencoder achieves an average accuracy of 99.0%, with precision reaching 100%, recall between 90.2% and 99.6%, and F1-scores from 93.1% to 99.8%. These findings indicate that the proposed method is highly sensitive to anomalies while maintaining a low false-alarm rate. The authors suggest that future work will adapt this framework for online anomaly detection, aiming for real-time performance in operational environments.
Methodology
The study employs a reconstruction-based LSTM autoencoder to analyze univariate EHA sensor signals. The model is calibrated using validation-set reconstruction-error distributions and evaluated through multiple fault-injection scenarios. Performance metrics include accuracy, precision, recall, and F1-score, along with sensitivity analyses under varying operational conditions.
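The calibration step can be sketched as follows, assuming the reconstruction errors have already been produced by a trained autoencoder (not shown). The mean-plus-k-standard-deviations rule and the error values are assumptions for illustration; the summary only states that thresholds are derived from validation-set reconstruction-error distributions.

```python
import statistics

def calibrate_threshold(val_errors, k=3.0):
    """Assumed rule: threshold = mean + k * std of validation errors."""
    return statistics.mean(val_errors) + k * statistics.pstdev(val_errors)

def flag_anomalies(errors, threshold):
    """Mark each window whose reconstruction error exceeds the threshold."""
    return [e > threshold for e in errors]

# Hypothetical per-window reconstruction errors on fault-free validation data.
val_errors = [0.010, 0.012, 0.011, 0.009, 0.013, 0.010, 0.011, 0.012]
threshold = calibrate_threshold(val_errors)

# Hypothetical test errors; the third window mimics an injected fault.
flags = flag_anomalies([0.011, 0.012, 0.090, 0.010], threshold)
```

A percentile of the validation errors would be an equally plausible calibration rule; the choice mainly trades recall against false alarms.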
Results
The LSTM autoencoder achieved an average accuracy of 99.0%, precision of 100%, recall ranging from 90.2% to 99.6%, and F1-scores between 93.1% and 99.8%. These results indicate high detection sensitivity and a very low false-alarm rate across all evaluated sensors.
Implications
The findings suggest that data-driven offline anomaly detection is feasible for EHAs, with potential applications in safety-critical systems. The future adaptation for online detection could enhance real-time monitoring capabilities, improving safety and reliability in aerospace and industrial contexts.
Adaptive state-action abstractions via rate-distortion
Reinforcement Learning
Robotics
Theory
- Introduces soft state-action abstractions that adaptively adjust granularity during learning.
- Develops a learning-abstraction decomposition that separates value error into learning and abstraction errors.
- Proposes an adaptive abstraction principle that refines abstractions based on learning progress.
- Demonstrates the effectiveness of the framework on classic tabular control benchmarks.
Summary
This paper addresses the challenge of dynamically adjusting the granularity of state-action abstractions in reinforcement learning (RL). The author proposes a principle for refining abstractions based on the relationship between learning error and abstraction error. Specifically, the paper introduces a performance certificate that decomposes value error into a Bellman residual, which measures the learning error, and an abstraction error bound defined by a bisimulation metric. The proposed method employs soft state-action abstractions derived from rate-distortion principles, allowing for continuous adjustment of resolution along state and action axes. The framework is validated through various tabular settings, demonstrating that it can achieve near-optimal performance even with significant lossy compression of state and action information. This work contributes to the understanding of how to refine abstractions in RL, providing a systematic approach to balance the trade-off between abstraction granularity and learning efficiency.
Methodology
The methodology involves creating a continuous family of soft state-action abstractions using rate-distortion principles. The paper formalizes a switching strategy that refines abstractions when the learning error (Bellman residual) becomes comparable to the abstraction error (bisimulation metric). This approach allows for independent control over the resolution of state and action information.
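A minimal sketch of the two ingredients described above, under stated assumptions: the soft assignment uses the standard rate-distortion form p(z|s) proportional to p(z) exp(-beta d(s, z)), and `should_refine` is a hypothetical reading of the switching rule. The distortion values and the names are illustrative, not the paper's.

```python
import math

def soft_assignment(distortions, prior, beta):
    """Soft abstraction: p(z|s) proportional to prior[z] * exp(-beta * d(s, z))."""
    weights = [p * math.exp(-beta * d) for p, d in zip(prior, distortions)]
    total = sum(weights)
    return [w / total for w in weights]

def should_refine(bellman_residual, abstraction_error):
    """Hypothetical switching rule: refine once the learning error has
    dropped to the level of the abstraction error."""
    return bellman_residual <= abstraction_error

# Distortion between one state and three abstract states (made-up values).
d = [0.1, 1.0, 2.0]
prior = [1 / 3, 1 / 3, 1 / 3]

coarse = soft_assignment(d, prior, beta=0.0)   # ignores distortion entirely
fine = soft_assignment(d, prior, beta=20.0)    # nearly a hard assignment
```

Raising beta moves the abstraction from fully lossy (every abstract state equally likely) toward a near-deterministic mapping, which is the continuous resolution control the summary refers to.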
Results
The results indicate that the adaptive abstraction method can effectively trace compression-distortion frontiers, achieving near-optimal performance under substantial compression of state and action information. The framework also identifies whether state, action, or their interaction is the limiting factor in performance for specific tasks.
Implications
The findings suggest that reinforcement learning agents can be designed to dynamically adjust their abstraction strategies, leading to more efficient learning processes. This has potential applications in robotics, where agents must adapt to varying levels of detail in their environments, as well as in other domains requiring efficient decision-making under uncertainty.
HoT-SSM: Higher-order Temporal Knowledge Graph Reasoning with State Space Models for Health Care
Graph Learning
Time Series
Interpretability
- Introduces a temporal knowledge-infused hypergraph framework for modeling EHR data.
- Proposes a dynamic hypergraph state space model to capture higher-order relationships and long-range temporal information.
- Demonstrates significant performance improvements over existing state-of-the-art models on clinical prediction tasks.
- Establishes theoretical guarantees for the robustness of the learned representations.
Summary
The paper introduces HoT-SSM, a novel framework designed to enhance the modeling of electronic health records (EHRs) through higher-order temporal knowledge graphs. Traditional medical knowledge graph (MKG) approaches struggle to capture complex interactions among clinical concepts and often overlook long-range temporal dependencies essential for accurate clinical predictions. HoT-SSM addresses these limitations by constructing hypergraphs that group semantically related clinical concepts into hyperedges, thereby preserving the clinical context of each patient visit. The framework employs a dynamic state space model to effectively learn the evolving latent state of patients over time while maintaining higher-order relational structures. The empirical evaluation on MIMIC-III and MIMIC-IV datasets demonstrates significant performance improvements in clinical prediction tasks, showcasing the effectiveness of jointly modeling higher-order interactions and long-range temporal dependencies. The proposed methodology not only enhances prediction accuracy but also provides a structured reasoning mechanism for clinical insights.
Methodology
HoT-SSM constructs hypergraphs for each patient visit by grouping related clinical concepts into hyperedges, utilizing domain knowledge. A dynamic state space model is then applied to these temporal hypergraphs to learn the evolving latent states of patients, capturing both higher-order interactions and long-range temporal dependencies.
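To make the two pieces concrete, here is a small sketch: an incidence matrix for one visit's hypergraph, and a diagonal linear state space recurrence over a sequence of visit vectors. The concept names, the per-visit features, and the constants `a` and `b` are hypothetical; the actual model is learned and considerably richer.

```python
def incidence_matrix(concepts, hyperedges):
    """Rows are concepts, columns are hyperedges; 1 marks membership."""
    return [[1 if c in edge else 0 for edge in hyperedges] for c in concepts]

def ssm_scan(inputs, a=0.9, b=1.0):
    """Diagonal linear recurrence h_t = a * h_{t-1} + b * x_t per dimension."""
    h = [0.0] * len(inputs[0])
    states = []
    for x in inputs:
        h = [a * hi + b * xi for hi, xi in zip(h, x)]
        states.append(h)
    return states

# One visit: related clinical concepts grouped into hyperedges (made-up example).
concepts = ["diabetes", "metformin", "hypertension"]
visit_edges = [{"diabetes", "metformin"}, {"hypertension"}]
H = incidence_matrix(concepts, visit_edges)

# Three visits, one hypothetical feature per concept.
visits = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
states = ssm_scan(visits)
```

With `a` close to 1, information from early visits decays slowly, which is how a state space recurrence can carry long-range temporal context.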
Results
The experiments conducted on MIMIC-III and MIMIC-IV datasets indicate that HoT-SSM significantly outperforms existing models in various clinical prediction tasks, validating its effectiveness in modeling complex clinical interactions and temporal dynamics.
Implications
The proposed framework has the potential to improve clinical decision-making processes by providing more accurate predictions and interpretable insights into patient health trajectories. It can be applied in various healthcare settings for tasks such as mortality prediction and treatment recommendation.
TailLoR: Protecting Principal Components in Parameter-Efficient Continual Learning
Efficient ML
NLP
Large Language Models
- TailLoR introduces a low-rank adaptation method that protects dominant singular components during continual learning.
- The method employs a soft spectral penalty to guide updates towards less critical, lower-rank components.
- TailLoR does not require access to prior task adapters, enhancing privacy for sequential adaptations.
- The approach matches or exceeds the performance of existing state-of-the-art continual learning methods.
Summary
The paper presents TailLoR, a novel low-rank adaptation method designed to enhance continual learning by protecting the principal components of pre-trained models. Traditional parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), often face challenges due to interference from overlapping update directions when adapting models across multiple tasks. TailLoR addresses this issue by utilizing the singular value decomposition (SVD) of weight matrices to create a fixed reference frame for updates. It introduces a soft spectral penalty that discourages modifications to dominant singular directions, thereby preserving prior knowledge while allowing for flexible adaptations in lower-rank 'tail' components. The authors demonstrate that TailLoR can effectively adapt to new tasks without requiring access to previous task adapters, which is crucial for maintaining user privacy. The method is evaluated across various continual learning tasks, showing competitive performance with state-of-the-art techniques while increasing the stable rank of the weight matrix, indicating improved adaptability and reduced interference.
Methodology
TailLoR utilizes singular value decomposition (SVD) to extract the structural geometry of pre-trained weight matrices. It defines updates in a spectral basis using low-rank matrices, applying a soft regularization that penalizes changes to dominant singular values while allowing more flexibility in lower-rank updates. This approach mitigates interference from overlapping updates in continual learning scenarios.
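The effect of a soft spectral penalty can be illustrated with a toy calculation. The penalty form below (each update coefficient weighted by its normalized singular value) is an assumption made for this sketch; the summary does not give the paper's exact regularizer, and `spectral_penalty`, `sigma`, and `lam` are hypothetical names.

```python
def spectral_penalty(singular_values, deltas, lam=1.0):
    """Assumed penalty: lam * sum_i (sigma_i / sigma_max) * delta_i ** 2.

    deltas[i] is the size of the update along the i-th singular direction
    of the frozen pre-trained weight matrix.
    """
    s_max = max(singular_values)
    return lam * sum((s / s_max) * d * d for s, d in zip(singular_values, deltas))

# Made-up singular spectrum, largest first.
sigma = [10.0, 5.0, 1.0, 0.1]

# The same unit-sized update costs far more on the dominant direction.
head_update = spectral_penalty(sigma, [1.0, 0.0, 0.0, 0.0])
tail_update = spectral_penalty(sigma, [0.0, 0.0, 0.0, 1.0])
```

Because the penalty is soft, updates along dominant directions are discouraged rather than forbidden, which leaves the optimizer free to use them when a task requires it.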
Results
The evaluation of TailLoR on a suite of continual learning tasks indicates that it achieves performance comparable to state-of-the-art methods. Additionally, it increases the stable rank of the weight matrix, suggesting enhanced adaptability and reduced interference when learning new tasks sequentially.
Implications
TailLoR's ability to protect critical model components while allowing for efficient adaptations has significant implications for applications in continual learning, particularly in scenarios where privacy and model stability are paramount. This method could be particularly useful in fields such as personalized AI, where user-specific adaptations are necessary without compromising previous knowledge.
Catastrophic Forgetting as Accessibility Collapse: A Three-Level Framework for Knowledge Persistence in Continual Learning
Theory
- Introduces a three-level framework for understanding knowledge in neural networks: storage, representation, and accessibility.
- Demonstrates that catastrophic forgetting is primarily an accessibility issue, with earlier layers retaining or improving their representational quality.
- Establishes the Accessibility Gap and Projection Energy as new diagnostic metrics for continual learning.
- Shows that a classifier reset can recover 75.7% of original task performance without modifying the backbone.
Summary
This paper challenges the conventional view of catastrophic forgetting in deep neural networks, which is often seen as the loss of previously learned knowledge. Instead, the authors propose a three-level framework that distinguishes between knowledge storage, representation, and accessibility. Through experiments on the Split CIFAR-100 dataset using a ResNet-18 architecture, they demonstrate that while task accuracy can drop to zero, significant task-relevant information remains encoded in the network's earlier layers. Their findings indicate that forgetting is primarily an accessibility issue rather than a complete erasure of knowledge. The authors introduce new metrics, the Accessibility Gap and Projection Energy, to diagnose and quantify these phenomena. They also show that a simple classifier reset can recover a substantial portion of the original task performance, highlighting the recoverability of knowledge in earlier layers. This work lays the groundwork for a repair-oriented approach to continual learning, emphasizing the importance of understanding knowledge accessibility in neural networks.
Methodology
The authors conducted controlled sequential training experiments on the Split CIFAR-100 dataset using a ResNet-18 architecture. They analyzed task accuracy, linear probe retention, and performed classifier reset experiments to assess knowledge recoverability. A layer-wise analysis was also conducted to understand the retention of knowledge across different layers of the network.
Results
The results revealed that after sequential training on 10 tasks, task accuracy collapsed to 0.000, while linear probe accuracy remained at 76% of checkpoint-level performance. A classifier reset recovered 75.7% of the original task accuracy. Layer-wise analysis indicated that earlier layers retained or improved their accuracy, while degradation was concentrated in the final layer and readout stage.
Implications
The findings suggest that strategies for continual learning should focus on enhancing knowledge accessibility rather than solely preventing forgetting. This could lead to more effective methods for training neural networks on multiple tasks without significant performance degradation. The introduction of new diagnostic metrics may also aid in the development of future continual learning frameworks.
When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training
Reinforcement Learning
Large Language Models
Optimization
- Identifies limitations in existing reinforcement learning methods for long-horizon LLM training, particularly regarding credit assignment.
- Proposes Evidence-Calibrated Policy Optimization (ECPO) as a solution to improve credit reliability.
- ECPO combines techniques to group rollouts and suppress noise, enhancing stability in training.
- Demonstrates significant performance improvements over GiGPO in empirical experiments.
Summary
This paper addresses the challenges of training long-horizon large language model (LLM) agents using reinforcement learning (RL), particularly the issue of assigning credit to intermediate decisions in environments with sparse and delayed rewards. The authors critique existing methods, such as GiGPO, which attempt to improve credit assignment by providing denser step-level advantages at repeated anchor states. They identify a significant limitation in these methods: the statistical unreliability of dense credit, which can lead to divergent anchor bias and oscillations during late-stage training. To overcome these issues, the authors propose a novel algorithm called Evidence-Calibrated Policy Optimization (ECPO). ECPO is a critic-free policy optimization approach that calibrates step-level credit before policy updates. It incorporates two key components: Evidence-Calibrated Action Advantage, which groups rollouts by canonical actions to reduce the impact of low-count estimates, and Variance-Gated Credit Weighting, which mitigates the influence of actions dominated by noise. The experimental results demonstrate that ECPO consistently outperforms strong baselines, including GiGPO, achieving significant improvements in success rates on tasks in ALFWorld and WebShop while maintaining minimal additional computational overhead.
Methodology
The authors developed ECPO, a critic-free policy optimization algorithm that calibrates step-level credit before policy updates. It utilizes Evidence-Calibrated Action Advantage to group rollouts by canonical actions and shrink low-count estimates, alongside Variance-Gated Credit Weighting to suppress noise-dominated anchor states.
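One way to picture the two calibration steps is the sketch below: returns at a single anchor state are grouped by action, each action's advantage is shrunk toward zero according to its sample count, and a variance gate down-weights noisy estimates. The shrinkage factor n / (n + prior_count), the gate 1 / (1 + var / var_scale), and all names are assumptions for illustration rather than the paper's formulas.

```python
import statistics
from collections import defaultdict

def calibrated_advantages(rollouts, prior_count=2.0, var_scale=1.0):
    """rollouts: list of (action, return) pairs observed at one anchor state."""
    baseline = statistics.mean([ret for _, ret in rollouts])
    by_action = defaultdict(list)
    for action, ret in rollouts:
        by_action[action].append(ret)

    advantages = {}
    for action, rets in by_action.items():
        n = len(rets)
        raw = statistics.mean(rets) - baseline
        shrink = n / (n + prior_count)                   # few samples: pull toward 0
        gate = 1.0 / (1.0 + statistics.pvariance(rets) / var_scale)  # noisy: down-weight
        advantages[action] = shrink * gate * raw
    return advantages

# Made-up rollouts: "open" is well supported, "look" was seen once,
# and "wait" has two wildly different returns.
rollouts = [("open", 1.0)] * 4 + [("look", 0.0), ("wait", 2.0), ("wait", -2.0)]
adv = calibrated_advantages(rollouts)
```

Here the single-sample action and the high-variance action both end up with much smaller credit than their raw advantages would suggest, which is the kind of damping the method aims for.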
Results
ECPO outperformed GiGPO by +5.2 success points on ALFWorld and +7.3 on WebShop with Qwen2.5-1.5B/7B models, while adding only 0.1% additional overhead in advantage computation. The final reward standard deviation was reduced, indicating improved stability during training.
Implications
The findings suggest that calibrating credit assignment in RL can lead to more stable and effective training of LLM agents in complex environments. This has potential applications in autonomous decision-making systems and other areas requiring long-horizon planning.
Robust and sparse support vector machine via hybrid truncated loss for supervised classification
Optimization
Theory
- Introduction of a hybrid truncated loss function (Lht) that balances boundedness and sparsity.
- Development of Lht-SVM for single-view classification with improved robustness to outliers.
- Extension to multi-view learning with MvLht-SVM, adhering to both consensus and complementarity principles.
- Demonstrated superior performance in accuracy and robustness compared to existing methods.
Summary
This paper addresses the limitations of traditional support vector machines (SVMs) in handling outliers and computational efficiency by introducing a hybrid truncated loss function (Lht). The authors develop the Lht-SVM model for single-view classification, which utilizes this new loss function to achieve both sparsity and robustness. They establish first-order necessary and sufficient optimality conditions using the concept of P-stationary points and propose an alternating direction method of multipliers (ADMM) combined with a working-set strategy to enhance computational efficiency and ensure global convergence. The model is further extended to multi-view learning, resulting in the MvLht-SVM, which incorporates structural information and view weights to adhere to the principles of consensus and complementarity. Experimental results demonstrate that Lht-SVM outperforms five existing single-view methods in terms of accuracy and noise robustness, while MvLht-SVM surpasses six multi-view methods across various metrics including accuracy, precision, recall, and F1-score.
Methodology
The authors propose a hybrid truncated loss function (Lht) and develop the Lht-SVM model for single-view classification. They establish optimality conditions using P-stationary points and design an alternating direction method of multipliers (ADMM) with a working-set strategy to optimize the model efficiently. The model is extended to multi-view learning, resulting in MvLht-SVM, which integrates structural information and view weights.
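The exact form of Lht is not given in this summary, so the sketch below uses a plain truncated hinge loss as a stand-in to show why truncation helps with outliers. The `cap` value and function names are hypothetical.

```python
def hinge(margin):
    """Standard hinge loss on the margin y * f(x); unbounded for outliers."""
    return max(0.0, 1.0 - margin)

def truncated_hinge(margin, cap=2.0):
    """Hinge loss capped at `cap`, so one outlier contributes at most `cap`."""
    return min(hinge(margin), cap)

# A well-classified point, a point inside the margin, and a gross outlier.
margins = [2.0, 0.5, -10.0]
plain = [hinge(m) for m in margins]
capped = [truncated_hinge(m) for m in margins]
```

The well-classified point contributes zero loss in both cases (the source of sparsity), while the outlier's contribution is bounded only under truncation (the source of robustness).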
Results
The experimental results indicate that Lht-SVM achieves higher accuracy with fewer support vectors and better noise robustness compared to five single-view methods. MvLht-SVM outperforms six multi-view methods in terms of accuracy, precision, recall, and F1-score, showcasing the effectiveness of the proposed models.
Implications
The proposed hybrid truncated loss and corresponding SVM models could enhance classification tasks in various domains, particularly where data is prone to outliers or comes from multiple sources. This approach may be beneficial in fields such as computer vision, natural language processing, and biomedical analysis, where robust and efficient classification is critical.
Intercomparison of Machine Learning Algorithms for Remote Sensing-based In-season Crop Mapping
Optimization
Time Series
Computer Vision
- In-season crop mapping is essential for timely responses to climate-related agricultural threats.
- Support Vector Machines outperformed other algorithms with a mean F1 score of 0.74 for almonds and 0.59 for corn.
- Interannual variability significantly affects mapping accuracy, indicating the need for robust validation methods.
- The study combines remote sensing data with crop rotation history for improved mapping accuracy.
Summary
This study addresses the critical need for in-season crop type mapping to enhance food security amidst climate-related threats. The authors highlight the limitations of existing USDA Cropland Data Layer products, which are only available post-harvest. By utilizing Harmonized Landsat-Sentinel surface reflectance imagery and crop rotation history, the authors successfully mapped corn in Iowa and almonds in California at a 30m resolution by early June in unseen years. The research involved comparing thousands of model configurations across ten machine learning algorithms through year-wise cross-validation. The findings revealed that Support Vector Machines achieved the highest performance, with a mean F1 score of 0.74 for almonds and 0.59 for corn across five unseen validation years. The study also identified interannual variation as a significant source of uncertainty, suggesting that ensemble methods or additional data could further enhance model performance. Future research directions include developing multiclass maps for various crop types and forecasting in-season crop yields.
Methodology
The authors employed a combination of Harmonized Landsat-Sentinel surface reflectance imagery and crop rotation history to create crop maps. They conducted a comprehensive evaluation of ten different machine learning algorithms by performing hyperparameter optimization and year-wise cross-validation, assessing model performance across multiple unseen years.
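Year-wise cross-validation can be sketched as a leave-one-year-out split, so that each model is scored on a year it never saw during training. The sample records and the `yearwise_folds` helper are hypothetical.

```python
def yearwise_folds(samples):
    """Yield (held_out_year, train, test), holding out one full year at a time."""
    years = sorted({s["year"] for s in samples})
    for year in years:
        train = [s for s in samples if s["year"] != year]
        test = [s for s in samples if s["year"] == year]
        yield year, train, test

# Made-up samples: two pixels per year across three years.
samples = [{"year": y, "pixel": i} for y in (2019, 2020, 2021) for i in range(2)]
folds = list(yearwise_folds(samples))
```

Splitting by year rather than at random keeps the interannual variability in the evaluation, which a random split would hide.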
Results
The study found that Support Vector Machines provided the best overall performance, achieving a mean F1 score of 0.74 for almond mapping and 0.59 for corn mapping by early June in unseen validation years. The analysis highlighted interannual variation as a major source of uncertainty in crop mapping.
Implications
The findings underscore the potential of machine learning in enhancing real-time agricultural monitoring and decision-making, which is crucial for food security. The methods developed could be applied to create timely crop maps and forecasts, aiding emergency management and agricultural planning.
Efficient Mean Curvature Computation on High-Dimensional Data Manifolds
Efficient ML
Theory
Graph Learning
- Introduces an exact algebraic identity that reduces the per-point mean curvature cost from O(m^4) to O(m^2) after eigendecomposition.
- Utilizes truncated SVD to further reduce computational complexity in high-dimensional settings.
- Demonstrates speedups of 50 to 300 times on real-world datasets without substantial accuracy loss.
- Establishes local mean curvature as a practical geometric feature for diverse machine learning applications.
Summary
This paper addresses the computational challenges of estimating local mean curvature in high-dimensional datasets, which is crucial for geometry-aware machine learning algorithms. The naive approach, based on k-nearest neighbor patches, incurs a prohibitive O(m^4) cost per point due to the explicit construction of a matrix H. The author introduces two key contributions to significantly reduce this cost. First, an exact algebraic identity is derived that eliminates the need for matrix H, reducing the per-point cost to O(m^2) after eigendecomposition. Second, the paper tackles the remaining O(m^3) bottleneck by leveraging the low rank of the local covariance matrix, replacing full eigendecomposition with a truncated SVD, resulting in a total cost of O(k^2m + kmp^2), where p = k - 1. Experimental results demonstrate speedups of 50 to 300 times compared to the original method, with minimal loss in accuracy. This work establishes a scalable and data-driven approach to local curvature estimation, making it applicable across various machine learning tasks, including deep learning pipelines.
Methodology
The methodology involves deriving an algebraic identity to reformulate the mean curvature estimator, eliminating the need for a costly matrix construction. It also employs truncated singular value decomposition to exploit the low rank of the local covariance matrix, thus optimizing the computational efficiency.
Results
The proposed method achieves a computational cost of O(k^2m + kmp^2) and demonstrates empirical speedups of 50 to 300 times over previous implementations, confirming its effectiveness in estimating local mean curvature in high-dimensional datasets.
Implications
This research has significant implications for geometric machine learning, enabling efficient estimation of curvature, which can enhance various tasks such as clustering, classification, and representation learning in high-dimensional spaces.
StableRCA: Robust Graph-Agnostic Mechanism-Level Root Cause Analysis
Graph Learning
Theory
Interpretability
- StableRCA is a graph-agnostic framework that identifies root causes without requiring a known causal graph.
- The framework utilizes local Markov boundaries to differentiate true root causes from marginal anomalies.
- Theoretical guarantees are provided for the identification of intervention targets based on conditional distribution shifts.
- Extensive experiments show StableRCA's robustness to graph misspecification and effectiveness across diverse datasets.
Summary
The paper introduces StableRCA, a novel framework for Root-Cause Analysis (RCA) that operates independently of a global causal graph. Traditional RCA methods often struggle with graph misspecification, leading to inaccurate identification of root causes in complex systems. StableRCA addresses this issue by focusing on local Markov boundaries and detecting conditional distribution shifts, which allows for robust identification of intervention targets. The authors leverage the Independent Causal Mechanism principle to ensure that true root causes can be distinguished from downstream effects. The framework combines marginal shift screening, local Markov boundary estimation, and conditional distribution shift detection, providing a scalable and effective solution for high-dimensional and heterogeneous data. Extensive experiments demonstrate that StableRCA outperforms existing methods in terms of accuracy and robustness across various application domains, including manufacturing, cloud computing, and healthcare.
Methodology
StableRCA employs a three-step approach: first, it screens for variables exhibiting marginal shifts; second, it estimates local Markov boundaries for these variables; and finally, it ranks potential root causes based on the strength of their conditional distribution shifts. This method avoids the need for global causal graph discovery and focuses on local conditioning to enhance accuracy.
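The difference between a marginal shift and a conditional (mechanism-level) shift can be seen in a two-variable toy system X -> Y. In the sketch below an intervention moves X only; both marginals shift, but Y's residual given X barely changes. The data are synthetic, Y's Markov boundary is assumed known to be {X} (the framework estimates it), and a standardized mean shift stands in for a proper conditional distribution shift test.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def shift_score(reference, observed):
    """Absolute mean shift in units of the reference standard deviation."""
    diff = statistics.mean(observed) - statistics.mean(reference)
    return abs(diff) / statistics.pstdev(reference)

noise = [0.1, -0.1, 0.2, -0.2, 0.0, 0.1, -0.1, 0.0]
x_ref = [float(i) for i in range(8)]
y_ref = [2 * x + e for x, e in zip(x_ref, noise)]
x_new = [x + 5.0 for x in x_ref]                   # intervention on X only
y_new = [2 * x + e for x, e in zip(x_new, noise)]  # Y's mechanism is unchanged

# Condition Y on its (assumed) Markov boundary {X} via regression residuals.
slope, intercept = fit_line(x_ref, y_ref)
res_ref = [y - (slope * x + intercept) for x, y in zip(x_ref, y_ref)]
res_new = [y - (slope * x + intercept) for x, y in zip(x_new, y_new)]

marginal = {"X": shift_score(x_ref, x_new), "Y": shift_score(y_ref, y_new)}
conditional = {"X": shift_score(x_ref, x_new), "Y": shift_score(res_ref, res_new)}
ranking = sorted(conditional, key=conditional.get, reverse=True)
```

Marginal screening alone would flag both variables; ranking by the conditional score puts X first, since Y's anomaly is explained by its parent.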
Results
The experiments conducted on synthetic benchmarks and five real-world datasets indicate that StableRCA is robust against graph misspecification, effective in identifying multiple intervention targets, and scalable to large systems. The results demonstrate strong accuracy and reliability across various application domains, outperforming traditional RCA methods.
Implications
StableRCA has significant implications for industries that rely on RCA for system diagnostics, such as manufacturing, IT services, and healthcare. Its ability to function without a global causal graph makes it particularly valuable in real-world scenarios where such graphs are difficult to obtain or inaccurate.
LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?")
Multimodal
- LEVANTE-bench provides a comprehensive dataset for comparing VLMs with children's cognitive performance.
- The benchmark evaluates VLMs across multiple cognitive tasks and languages, enhancing cross-cultural comparisons.
- Alignment between VLMs and children's cognitive abilities varies significantly across different scales of evaluation.
- Current VLM architectures show limitations in matching children's cognitive error distributions, particularly in complex reasoning tasks.
Summary
This paper introduces LEVANTE-bench, a benchmark designed to evaluate vision-language models (VLMs) against human cognitive development in children aged 5-12. By utilizing tasks from the Learning Variability Network (LEVANTE), the authors systematically assess VLMs across six cognitive tasks, comparing their performance to that of 1547 children from three countries. The study highlights the potential of VLMs in modeling human cognition but reveals that current architectures only partially align with children's cognitive abilities. The evaluation framework measures alignment at multiple scales, including task difficulty, item difficulty, and trial-level error distributions. Results indicate that while larger models generally align better with task difficulties, their performance varies significantly at the item level and in matching children's error patterns. Notably, even the best-performing VLMs struggle with complex reasoning tasks, suggesting limitations in their cognitive modeling capabilities.
Methodology
The authors constructed LEVANTE-bench using psychometrically validated tasks from the LEVANTE framework, assessing VLMs on six cognitive tasks. They compared model performance with data from children across three scales: task difficulty, item difficulty, and trial-level error distributions, using a dataset of over 1500 children.
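At the item-difficulty scale, one simple way to quantify alignment is to correlate per-item accuracy of children with per-item accuracy of a model. Pearson correlation is an assumed metric here, and the accuracy values are invented for the example; neither comes from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-item accuracy on five items, easiest to hardest for children.
child_acc = [0.9, 0.8, 0.6, 0.4, 0.3]
model_acc = [1.0, 0.9, 0.8, 0.5, 0.5]

item_alignment = pearson(child_acc, model_acc)
```

A high value means the model finds the same items hard that children do, which is a separate question from whether its overall accuracy matches theirs.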
Results
The analysis revealed that larger VLMs align better with task difficulties but show modest alignment on item difficulties and heterogeneous performance in trial-level error distributions. Smaller models sometimes matched younger children's errors more closely, and all models struggled with matrix reasoning and mental rotation tasks.
Implications
The findings suggest that while VLMs have potential as cognitive models, there are significant gaps in their alignment with human cognitive processes. This highlights the need for further research and development in VLM architectures to enhance their applicability in understanding human learning and cognition.
Differentiable Efficient Operator Search
Efficient ML
Multimodal
- Introduction of a unified operator space that consolidates various token reduction methods.
- Development of the Efficient Operator Search framework that automates the search for optimal operator configurations.
- Demonstration of competitive performance against existing baselines across multiple benchmarks.
- Reinterpretation of traditional token reduction techniques as special cases of a shared operator framework.
Summary
This paper introduces the Efficient Operator Search (EOS), a novel framework aimed at enhancing the efficiency of multimodal foundation models by unifying various human-designed token reduction operators into a single operator space. The authors argue that existing methods like pruning, merging, pooling, and adaptive reweighting can be interpreted as different regimes of a shared operator space, which can be controlled by continuous parameters. EOS transforms the design of efficient models from a manual process into an automatic search problem, allowing for the optimization of layer selection, token budgets, and operator parameters simultaneously. The proposed method is evaluated on multiple benchmarks, demonstrating that it can recover strong performance from existing baselines while also discovering improved operator configurations. This work suggests a paradigm shift in efficient modeling, providing a unified perspective on prior methods and laying a foundation for future research in this area.
Methodology
The Efficient Operator Search framework parameterizes the choices of where to reduce visual tokens, how many to retain, and which operator to apply as learnable variables. It employs differentiable relaxations to optimize these parameters under task and efficiency constraints, allowing for flexible combinations of compression techniques.
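The differentiable relaxation described above can be illustrated with a toy sketch. It is not the authors' implementation: the sigmoid-gated soft top-k mask, the two candidate operators (a prune-style masked mean and a pool-style uniform mean), and every name below are assumptions chosen for illustration, and tokens are reduced to scalar features to keep the example in pure Python.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_keep_mask(scores, budget, temperature=0.1):
    """Relaxed top-k: a sigmoid gate centered between the budget-th and next score."""
    ranked = sorted(scores, reverse=True)
    k = max(1, min(budget, len(scores)))
    if k < len(ranked):
        threshold = 0.5 * (ranked[k - 1] + ranked[k])
    else:
        threshold = ranked[-1] - 1.0  # the budget covers every token
    return [1.0 / (1.0 + math.exp(-(s - threshold) / temperature)) for s in scores]

def mixed_reduction(tokens, scores, budget, operator_logits):
    """Blend a prune-style and a pool-style summary using softmax operator weights."""
    mask = soft_keep_mask(scores, budget)
    prune_style = sum(m * t for m, t in zip(mask, tokens)) / sum(mask)
    pool_style = sum(tokens) / len(tokens)
    w_prune, w_pool = softmax(operator_logits)
    return w_prune * prune_style + w_pool * pool_style, mask

tokens = [4.0, 1.0, 3.0, 2.0]   # hypothetical scalar token features
scores = [0.9, 0.1, 0.8, 0.2]   # hypothetical importance scores
summary, mask = mixed_reduction(tokens, scores, budget=2, operator_logits=[2.0, 0.0])
```

Because the mask and the operator weights vary smoothly with the scores and operator logits, a gradient-based optimizer could in principle tune them together, which is the general idea behind treating operator choice as a search problem.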
Results
The EOS framework successfully recovers the performance of established hand-designed baselines and identifies superior operator configurations, leading to enhanced performance-efficiency trade-offs across various multimodal benchmarks.
Implications
The findings suggest that EOS could significantly streamline the design of efficient multimodal models, enabling researchers to focus on optimizing performance without the need for extensive manual tuning of compression techniques. This could lead to broader applications in areas requiring efficient processing of large-scale multimodal data.
MolE-RAG: Molecular Structure-Enhanced Retrieval-Augmented Generation for Chemistry
NLP
Large Language Models
Generative Models
- MolE-RAG is a training-free framework that enhances LLM-based molecular property prediction.
- It integrates three types of context: literature retrieval, molecular context injection, and structural retrieval.
- The framework significantly improves prediction accuracy across various tasks, outperforming SMILES-only baselines.
- Context source utility varies by model and task, indicating the need for adaptive strategies in molecular predictions.
Summary
The paper introduces MolE-RAG, a framework that improves molecular property prediction with large language models (LLMs) through retrieval-augmented generation. Because LLMs are trained mainly on natural language, they struggle with molecular representations such as SMILES. MolE-RAG addresses this by providing three complementary sources of context during inference: retrieved literature, molecule-specific context (including synonyms, identifiers, and physicochemical descriptors), and structurally similar molecules. The framework is evaluated across nine molecular property prediction tasks and shows significant improvements in ROC-AUC and regression RMSE over a baseline that uses SMILES alone. The study also finds that the utility of each context source varies across models and tasks, which points to the importance of tailored context augmentation for molecular predictions.
Methodology
MolE-RAG employs a retrieval-augmented generation approach that combines BM25-based textual retrieval, molecular context injection, and structural retrieval using task-adaptive molecular fingerprints. This multi-faceted context augmentation allows the framework to connect SMILES inputs with relevant chemical knowledge and examples without requiring additional model training.
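The structural-retrieval step can be sketched as a nearest-neighbor lookup over fingerprints. The summary does not say which similarity measure is used, so Tanimoto similarity is an assumption here (it is a common choice for fingerprints), and the molecule names and on-bit sets are made up for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    if not bits_a and not bits_b:
        return 0.0
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)

def retrieve_similar(query_bits, library, k=2):
    """Return the k library entries most similar to the query, best first."""
    scored = [(tanimoto(query_bits, bits), name) for name, bits in library.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by name
    return [(name, score) for score, name in scored[:k]]

# Hypothetical fingerprints: each set lists the indices of bits that are on.
library = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 9},
    "mol_c": {7, 8},
}
neighbors = retrieve_similar({1, 2, 3}, library, k=2)
# neighbors == [("mol_a", 0.75), ("mol_b", 0.5)]
```

In a pipeline of this kind, the retrieved neighbors and their known property values would be placed in the prompt as examples alongside the query SMILES.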
Results
Across nine molecular property prediction tasks, MolE-RAG improves ROC-AUC by up to 28 points on classification tasks and reduces regression RMSE by up to 67% relative to the SMILES-only baseline. The effectiveness of each context source varies, with some models benefiting more from specific types of context.
Implications
The findings suggest that MolE-RAG can significantly enhance the predictive capabilities of LLMs in chemistry, potentially leading to more efficient drug discovery processes by prioritizing promising candidates and reducing reliance on experimental testing.
Online KL-Regularized Reinforcement Learning with Function Approximation under Misspecification
Reinforcement Learning
Theory
Optimization
- Introduces KL misspecification formulations for contextual bandits and episodic RL.
- Establishes high-probability KL-regret guarantees with explicit misspecification terms.
- Combines Gibbs quadratic self-bounding inequalities with regression-based algorithms.
- Demonstrates that standard realizable KL-regularized settings are recoverable as special cases.
Summary
This paper addresses the challenges of KL-regularized contextual bandits and episodic reinforcement learning (RL) in the presence of model misspecification, where the chosen function approximation does not accurately represent the environment. Traditional guarantees in reinforcement learning rely on the assumption of realizability, which fails in misspecified models, leading to inadequate regret bounds. The authors introduce KL misspecification formulations tailored for contextual bandits and episodic RL, analyzing regression-based algorithms with Gibbs policy updates. They establish high-probability KL-regret guarantees that incorporate explicit misspecification terms, thereby recovering the standard realizable KL-regularized setting as a special case. The work highlights the importance of accounting for misspecification in KL-regularized contexts, providing a theoretical framework that connects KL-specific features with regret analysis. This research is particularly relevant for applications in reinforcement learning from human feedback (RLHF) and large language model (LLM) post-training pipelines, where stability and performance trade-offs are critical.
Methodology
The authors develop KL misspecification formulations for both contextual bandits and episodic RL, utilizing a pointwise misspecification model for bandits and a stagewise condition for episodic RL. They employ regression-based algorithms with Gibbs policy updates and analyze the resulting KL-regret through a combination of self-bounding inequalities and reductions that relate on-policy Q-gaps to Bellman residuals.
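A Gibbs policy update of the kind mentioned above has a standard closed form: for a reference policy and a regression estimate of the reward, the maximizer of expected reward minus beta times the KL divergence to the reference is proportional to the reference probability times exp(estimate / beta). The sketch below shows that generic update for a single context with finitely many actions. The function names and numbers are illustrative assumptions, and the paper's misspecification terms are not modeled.

```python
import math

def gibbs_policy(ref_probs, est_rewards, beta):
    """Policy proportional to ref_probs * exp(est_rewards / beta); ref_probs must be > 0."""
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, est_rewards)]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def kl_regularized_value(policy, ref_probs, rewards, beta):
    """Expected reward minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - beta * kl

ref = [1 / 3, 1 / 3, 1 / 3]      # hypothetical reference policy
rewards = [1.0, 0.0, 0.0]        # hypothetical regression estimates
policy = gibbs_policy(ref, rewards, beta=1.0)
value = kl_regularized_value(policy, ref, rewards, beta=1.0)
```

With these inputs the Gibbs policy shifts probability toward the first action while staying close to the reference, and its regularized value equals beta times the log of the reference-weighted average of exp(reward / beta).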
Results
The paper establishes high-probability KL-regret guarantees that explicitly account for misspecification, providing a theoretical foundation for KL-regularized contextual bandits and episodic RL. The results demonstrate that the proposed methods can effectively manage the complexities introduced by model misspecification, yielding regret bounds that are dependent on the level of misspecification and the structure of the chosen function class.
Implications
The findings have significant implications for the design of reinforcement learning algorithms, particularly in scenarios where model misspecification is prevalent, such as in RLHF and LLM applications. The theoretical guarantees provided can guide the development of more robust RL systems that maintain performance despite approximations in the underlying models.
Sharp Low-Degree Thresholds for Planted-vs-Planted Testing
Theory
- Establishment of sharp low-degree thresholds for planted-vs-planted testing.
- Development of a low-degree certificate framework for testing and recovery.
- Identification of strong and weak testing thresholds in planted models.
- Demonstration that testing thresholds do not depend on the specific pair of planted structures.
Summary
This paper establishes the first sharp thresholds for low-degree polynomial tests in planted-vs-planted settings, focusing on distinguishing between two structured planted mechanisms that generate observed data. The authors prove matching upper and lower bounds for counting communities in both the planted submatrix and planted dense subgraph models. The resulting testing threshold aligns with the known low-degree recovery threshold, while the task of weak testing shows a smooth transition rather than a sharp threshold. The authors develop a framework for planted-vs-planted testing that builds on latent-variable expansions and introduces new methods for identifying and pruning non-signal contributions. This work contributes to the understanding of sharp computational phase transitions in high-dimensional inference, particularly in the context of planted structures.
Methodology
The authors utilize a low-degree certificate framework to analyze planted-vs-planted testing. This framework reduces the low-degree lower bound to a linear certificate problem supported on informative subgraphs, allowing for the establishment of sharp thresholds. The methodology involves both theoretical analysis and the development of new techniques to handle non-signal contributions effectively.
Results
The paper presents sharp low-degree thresholds for counting communities in the planted submatrix and planted dense subgraph models. It identifies the exact leading constant for the low-degree strong-testing threshold and delineates the weak-testing scale. The results indicate that the strong-testing threshold remains consistent across different pairs of planted structures, challenging previous assumptions about the ease of distinguishing between certain configurations.
Implications
The findings have significant implications for high-dimensional inference tasks, particularly in areas involving community detection and structured data analysis. The sharp thresholds established can guide the development of efficient algorithms for distinguishing between complex distributions, enhancing the understanding of computational limits in statistical models.
Mitigating the Curse of Dimensionality in Uniform Convergence of Deep Neural Networks via Smooth Activations
Theory
- Establishes a theoretical lower bound demonstrating the curse of dimensionality for ReLU networks in uniform convergence.
- Develops a comprehensive theoretical framework for smooth DNNs, including novel pseudo-dimension and approximation guarantees.
- Proves that smooth DNNs can achieve better uniform convergence rates compared to ReLU networks across various regression tasks.
- Provides empirical support through simulation studies and real-world applications.
Summary
This paper presents a theoretical framework for the uniform convergence of deep neural networks (DNNs) with smooth activations. Standard ReLU networks achieve optimal rates in the L2(P) norm for nonparametric regression but suffer from the curse of dimensionality in uniform convergence. The authors establish a theoretical lower bound that illustrates this issue and emphasize the need for reliable uniform guarantees in various statistical tasks. To overcome these limitations, the paper analyzes smoothly activated DNNs, including both feedforward and residual architectures, and introduces novel pseudo-dimension bounds, non-asymptotic approximation guarantees, and Hölder-norm bounds for these models. The results demonstrate that smooth DNNs can effectively mitigate the curse of dimensionality by leveraging the low-dimensional hierarchical structure of target functions, providing stronger uniform convergence rates across multiple statistical contexts such as Huber, least-squares, quantile, and logistic regression. The findings are supported by simulation studies and a real-world application, positioning smooth DNNs as a robust alternative to ReLU networks for statistical learning tasks requiring uniform guarantees.
Methodology
The authors conduct a theoretical analysis of smooth DNNs, establishing bounds on pseudo-dimension, approximation errors, and uniform convergence rates. They utilize interpolation inequalities and develop a set of theoretical tools to analyze the performance of smooth activations in deep networks.
Results
The paper demonstrates that smooth DNNs can achieve non-asymptotic uniform convergence rates that are significantly better than those of ReLU networks, effectively mitigating the curse of dimensionality. The results are validated through both theoretical proofs and empirical simulations.
Implications
The findings suggest that smooth DNNs can be reliably used in statistical learning tasks that require uniform convergence, such as individualized treatment recommendations and confidence band construction, thereby enhancing the applicability of deep learning in statistical contexts.
SALT: When More Rollouts Don't Help in Group-Based Policy Optimization and How to Make Them Matter
Reinforcement Learning
Large Language Models
Optimization
- Identifies the inefficiency of increasing rollouts in GRPO-style RLVR due to low-rank redundancy in gradient features.
- Introduces SALT, a method that reweights group-relative updates to enhance learning effectiveness.
- Demonstrates that SALT improves update geometry and performance across diverse benchmarks.
- Provides a new perspective on the structural inefficiencies in group-based policy optimization.
Summary
The paper addresses a critical issue in reinforcement learning with verifiable rewards (RLVR), specifically in the context of Group Relative Policy Optimization (GRPO). The authors identify that simply increasing the number of rollouts does not enhance learning effectiveness due to the emergence of low-rank, signed geometries in policy-gradient features, leading to significant cancellation during aggregation. To mitigate this, they propose SALT (Subspace-Adaptive geometry pLug-in componenT), which reweights group-relative updates by estimating a dominant shared subspace from mini-batch gradient geometry. SALT decomposes the coefficients into shared and residual channels, amplifying the residual channel to counteract the effects of signed cancellation. The authors validate SALT through extensive experiments across various RLVR benchmarks and model scales, demonstrating that it improves effective update geometry and overall performance without altering the reward model or rollout sampling procedures.
Methodology
The authors propose SALT, which utilizes mini-batch gradient geometry to estimate a dominant shared subspace. It decomposes group-relative coefficients into shared and residual components and adaptively amplifies the residual channel to reduce cancellation effects during aggregation. This approach is evaluated against standard benchmarks in RLVR, focusing on both task performance and update-geometry diagnostics.
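The general idea of separating a shared direction from a residual and amplifying the residual can be shown with a small toy. This is only an illustration under strong assumptions and not the SALT algorithm: it uses a single dominant direction found by power iteration, splits each gradient feature (rather than each coefficient) into shared and residual parts, and applies a fixed, hypothetical `residual_gain`.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dominant_direction(features, iters=50):
    """Power iteration on sum_i g_i g_i^T to estimate the top shared direction."""
    dim = len(features[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        v = [0.0] * dim
        for g in features:
            c = dot(g, u)                  # (g g^T) u = (g . u) g
            for j in range(dim):
                v[j] += c * g[j]
        norm = math.sqrt(dot(v, v))
        if norm == 0.0:
            break
        u = [x / norm for x in v]
    return u

def reweighted_update(features, coeffs, residual_gain=2.0):
    """Aggregate coeff-weighted features, scaling the off-subspace part by residual_gain."""
    u = dominant_direction(features)
    update = [0.0] * len(u)
    for g, a in zip(features, coeffs):
        proj = dot(g, u)
        for j in range(len(u)):
            shared = proj * u[j]
            residual = g[j] - shared
            update[j] += a * (shared + residual_gain * residual)
    return update

# Hypothetical rollouts: nearly identical features, zero-mean group-relative coefficients.
features = [[1.0, 0.1], [1.0, -0.1], [1.0, 0.2], [1.0, -0.2]]
coeffs = [1.0, -1.0, 1.0, -1.0]
plain = reweighted_update(features, coeffs, residual_gain=1.0)
boosted = reweighted_update(features, coeffs, residual_gain=2.0)
```

In this toy, the shared component cancels under the signed coefficients, so only the small residual component survives aggregation. Scaling the residual doubles the surviving update without touching rewards or sampling.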
Results
SALT consistently outperformed traditional GRPO methods across various reasoning-oriented RLVR benchmarks and model scales. The experiments showed improved effective update directions and overall performance, confirming the efficacy of the proposed method in addressing the identified redundancy issues.
Implications
The findings suggest that enhancing the aggregation process in group-based policy optimization can lead to significant improvements in reinforcement learning applications, particularly in optimizing large language models and other reasoning-oriented tasks. SALT could be integrated into existing RL pipelines to boost learning efficiency without requiring changes to the underlying reward structures.
Causal Modeling of Selection in Evolution
Theory
- Distinction between static and evolutionary selection is crucial for causal discovery.
- Existing models for static selection do not adequately capture evolutionary processes.
- A new model for evolutionary selection is introduced, improving causal analysis.
- The proposed methodology is validated through experimental results.
Summary
This paper addresses the crucial role of selection in causal discovery, distinguishing between two forms of selection: static and evolutionary. Static selection is a one-time filtering process, while evolutionary selection involves repeated rounds of differential fitness affecting the data over generations. The authors critique existing methods that conflate these two forms, highlighting that the common graphical model used for static selection fails to accurately represent evolutionary selection, leading to false discoveries. To remedy this, they propose a new model specifically designed for evolutionary selection, along with a comprehensive procedure for identifying such models from data across multiple environments or generations. Experimental results demonstrate the effectiveness of their method in uncovering the underlying mechanisms of evolution from data, emphasizing the importance of correctly modeling selection in causal analysis.
Methodology
The authors developed a new model tailored for evolutionary selection and created a procedure for identifying these models from observational data across different environments or generations. They conducted experiments to validate the model's effectiveness in revealing the mechanisms underlying evolutionary processes.
Results
The experimental results confirmed that the new model successfully identifies the relevant mechanisms of evolutionary selection, outperforming traditional static selection models in terms of accuracy and reliability.
Implications
This work has significant implications for fields such as biology, social sciences, and any domain where understanding the dynamics of selection is essential for causal inference and decision-making. It highlights the need for appropriate modeling of selection processes to avoid biases in causal analysis.
CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction
NLP
Large Language Models
- CaliDist introduces a behavior-centric approach to calibrate LLMs by assessing their robustness to distractions.
- The method quantifies prediction changes and confidence shifts when prompts are perturbed with distractors.
- Extensive experiments show significant reductions in Expected Calibration Error (ECE) and Brier Score compared to existing methods.
- The findings suggest a strong correlation between prediction stability and model accuracy, indicating that susceptibility to distractions is a reliable proxy for error likelihood.
Summary
This paper introduces CaliDist, a post-hoc calibration method for Large Language Models (LLMs) that is built on behavioral robustness to distractions. Traditional calibration methods often fail to account for a model's stability when faced with irrelevant or misleading information. The authors argue that a model's confidence should reflect its ability to maintain accurate predictions under cognitive pressure. CaliDist measures how predictions and uncertainties change when input prompts are perturbed with semantic distractors, and uses the resulting stability signal to adjust the model's initial confidence scores. The method was evaluated on seven Natural Language Understanding classification benchmarks using six different LLMs, with significant improvements in calibration metrics. On average, CaliDist reduces the Expected Calibration Error (ECE) from 23% to 7%, a relative reduction of about 70%. This highlights behavioral stability as a key factor in model calibration and offers a new perspective on enhancing trustworthiness in LLMs.
Methodology
CaliDist systematically perturbs input prompts with targeted semantic distractors and measures how often the prediction changes and how much the confidence shifts. It aggregates these behavioral signals to adaptively scale the model's original confidence score, linking them to the model's accuracy.
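A minimal sketch of stability-based confidence scaling follows. The summary does not give the aggregation rule, so the linear scaling, the `floor` parameter, and the function names are assumptions made for illustration. Only prediction flips are used here, and confidence shifts are ignored.

```python
def stability_score(original_label, perturbed_labels):
    """Fraction of distractor-perturbed prompts that keep the original prediction."""
    if not perturbed_labels:
        return 1.0
    unchanged = sum(1 for label in perturbed_labels if label == original_label)
    return unchanged / len(perturbed_labels)

def scaled_confidence(confidence, stability, floor=0.5):
    """Shrink confidence toward floor * confidence as stability drops to zero."""
    return confidence * (floor + (1.0 - floor) * stability)

# Hypothetical case: one of four distractor prompts flips the prediction.
stability = stability_score("positive", ["positive", "positive", "negative", "positive"])
adjusted = scaled_confidence(0.9, stability)
# stability == 0.75, adjusted is approximately 0.7875
```

A fully stable prediction keeps its original confidence, while a prediction that flips under every distractor has its confidence halved with the default `floor`.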
Results
CaliDist reduced the Expected Calibration Error (ECE) from 23% to 7% on average across multiple benchmarks, a relative reduction of about 70%. The method consistently outperformed strong baseline calibration techniques.
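For readers unfamiliar with the metric, the sketch below computes ECE with the usual equal-width binning and checks the relative reduction quoted above. The bin count and the toy predictions are assumptions, not values from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece

# Toy data: four predictions at 95% confidence, three of them correct.
toy_ece = expected_calibration_error([0.95] * 4, [True, True, True, False])

# Relative reduction implied by the reported averages (23% down to 7%).
relative_reduction = (0.23 - 0.07) / 0.23
```

Here `toy_ece` is about 0.20, since the model claims 95% confidence but is right 75% of the time, and `relative_reduction` is about 0.696, consistent with the reported figure of roughly 70%.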
Implications
The findings suggest that enhancing behavioral robustness in LLMs can significantly improve their calibration, which is crucial for applications in high-stakes environments where trustworthiness and reliability are paramount. This could lead to safer and more effective deployment of LLMs in various domains.
PyCC.id: A package for hypothesis-driven equation discovery with structural identifiability
Time Series
Theory
Interpretability
- PyCC.id enables hypothesis-driven equation discovery using structural skeletons.
- The library reduces the ambiguity in model selection by incorporating domain knowledge and prior information.
- Structural skeletons ensure that the discovered models are physically consistent and interpretable.
- PyCC.id supports various paradigms for equation discovery, enhancing its flexibility and applicability.
Summary
The paper introduces PyCC.id, an open-source Python library designed for hypothesis-driven equation discovery, particularly focusing on the structural identifiability of ordinary differential equations (ODEs) derived from time-series data. The challenge of data-driven equation discovery lies in its ill-conditioned nature, where multiple mathematical models can fit the same data, leading to ambiguity and reliance on manual filtering by practitioners. PyCC.id addresses this by allowing users to define structural 'skeletons' based on their domain knowledge, which serve as templates for families of ODEs. This approach not only reduces the search space but also ensures that the identified models are physically consistent and interpretable. The library supports various equation discovery paradigms, including neural networks and symbolic regression, and facilitates the integration of prior information into the model-building process. By enforcing structural skeletons with identifiable properties, PyCC.id provides a systematic way to validate or discard proposed models, thus enhancing the reliability of the discovered equations.
Methodology
The methodology involves defining structural skeletons that represent families of ODEs, allowing practitioners to integrate hypotheses and constraints based on their domain knowledge. The library facilitates the iterative refinement of models and supports different parametrization strategies for the functions within the skeletons, including functional, parametric, and hybrid approaches.
Results
The implementation of PyCC.id allows for the systematic discovery of ODEs from time-dependent data, ensuring that the identified models are structurally identifiable and consistent with physical principles. The library's modularity enables the application of various equation discovery techniques, enhancing its utility across different scientific and engineering disciplines.
Implications
The development of PyCC.id has significant implications for fields requiring accurate modeling of dynamic systems, such as physics, engineering, and biology. By providing a structured approach to equation discovery, it can lead to more reliable models that better reflect underlying physical processes, ultimately aiding in scientific understanding and technological advancements.
Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents
Reinforcement Learning
NLP
Large Language Models
- CVT-RL introduces a constrained policy-gradient algorithm with dense verifiable rewards.
- The PCCC estimator evaluates the contribution of interventions to final success.
- Significant improvements in task success rates and evidence accuracy were achieved.
- The methodology reduces reward hacking incidents compared to existing baselines.
Summary
This paper introduces CVT-RL, a novel constrained policy-gradient algorithm designed to enhance the performance of long-horizon language agents in reinforcement learning (RL) settings. Traditional RL approaches often lead to issues such as unsupported evidence chains and belief drift, particularly in complex tasks requiring reasoning and tool use. CVT-RL addresses these challenges by incorporating dense verifiable rewards and a policy-conditioned counterfactual contribution (PCCC) estimator, which evaluates the impact of specific interventions on the likelihood of achieving verified success. The methodology includes various controlled interventions such as deletion and semantic substitution, and employs a selection-adjusted doubly robust estimator to improve the learning process. The algorithm was tested across multiple benchmarks, including long-context question answering and web/tool tasks, demonstrating significant improvements in task success rates and evidence accuracy while reducing instances of reward hacking. The results indicate that CVT-RL provides a more reliable framework for training language agents, ensuring that they engage in meaningful reasoning rather than exploiting shortcuts.
Methodology
The CVT-RL algorithm employs a constrained policy-gradient approach that integrates dense verifiable rewards, intervention-validity gating, and a PCCC estimator. It defines specific interventions to assess their impact on task outcomes and utilizes a frozen reference policy for counterfactual continuations. The methodology also includes diagnostics for overlap and failure modes in sequential language trajectories.
Results
CVT-RL improved average task success rates from 71.8% (non-causal RL) and 75.4% (information-matched baseline) to 78.9%. Evidence F1 scores increased from 78.9 to 82.8, and the reward-hacking rate fell from 7.2% to 3.9%. Human audits indicated a hacking rate of 4.6% for CVT-RL compared to 8.1% for the baseline, and adaptive attacks raised the hacking rate only to 7.1%. Statistical tests confirmed the significance of these improvements (p < 0.01).
Implications
The findings suggest that CVT-RL can significantly enhance the reliability and effectiveness of long-horizon language agents in complex tasks, making it a valuable approach for applications in natural language processing, automated reasoning, and tool use. This could lead to more robust AI systems capable of performing intricate tasks with higher accuracy and less exploitation of flawed reward structures.
Plug-and-Play Guidance for Discrete Diffusion Models via Gradient-Informed Logit Correction
Generative Models
- Introduction of GILC as a training-free guidance framework for discrete diffusion models.
- Utilization of a Jacobian-free mechanism for stable logit correction to address gradient instability.
- Formal connection established between GILC and policy gradients for handling non-differentiable objectives.
- Demonstration of state-of-the-art performance in constrained generation tasks across multiple scientific domains.
Summary
This paper introduces Gradient-Informed Logit Correction (GILC), a novel plug-and-play framework designed to enhance controllable generation in discrete diffusion models without the need for retraining. The authors address the challenges associated with high computational costs and gradient instability in discrete spaces by employing a Jacobian-free mechanism that corrects clean prediction logits. GILC leverages a pretrained denoising network as a variational proxy to estimate guidance signals, allowing it to accommodate both differentiable and non-differentiable reward functions. The framework is validated through extensive experiments in various scientific domains, including DNA and protein sequence generation, demonstrating that GILC achieves state-of-the-art performance while maintaining computational efficiency. The results indicate that GILC not only outperforms existing training-free methods but also rivals fine-tuning approaches, showcasing its potential for advancing discrete diffusion guidance.
Methodology
The GILC framework employs a variational method where a pretrained denoising network serves as a proxy for estimating value functions. It combines the Gumbel-Softmax trick with a Straight-Through estimator to maintain gradient flow in discrete spaces, and utilizes a Jacobian-free update to correct clean prediction logits for effective guidance.
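The Gumbel-Softmax relaxation mentioned above can be sketched in plain Python as a forward pass only. This is a generic illustration with hypothetical names, not GILC itself: there is no autograd here, so the straight-through estimator appears only as a comment, and the Jacobian-free logit correction is not shown.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_softmax_sample(logits, temperature, rng):
    """Return (soft, hard): a relaxed categorical sample and its one-hot argmax."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)            # guard against log(0)
        gumbel = -math.log(-math.log(u))        # standard Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    soft = softmax(noisy)
    winner = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    # Straight-through idea: an autodiff framework would output `hard` in the
    # forward pass while routing gradients through `soft`.
    return soft, hard

rng = random.Random(0)
logits = [math.log(0.7), math.log(0.2), math.log(0.1)]
soft, hard = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)
```

Lower temperatures push `soft` closer to `hard`. Across many draws, the winning index follows the categorical distribution defined by the logits, which is why the relaxation is useful for sampling discrete tokens while keeping a gradient path.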
Results
GILC was shown to significantly outperform existing training-free discrete diffusion guidance methods in terms of sample quality and computational efficiency. It achieved competitive results compared to fine-tuning approaches, establishing state-of-the-art performance in controlled discrete generation tasks across DNA, protein sequences, and molecular generation.
Implications
The GILC framework has the potential to streamline the process of generating optimized samples in various scientific and industrial applications, such as drug design and genetic engineering, without the need for extensive retraining or additional computational resources.
Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction
Theory
- Introduces the concept of 'two training clocks' to explain the separation of fitting and representation simplification in deep learning.
- Demonstrates that classification loss decreases logarithmically while representation simplification occurs on a polynomial time scale.
- Establishes a connection between deep linear networks and ReLU networks, highlighting the role of activation patterns in training dynamics.
- Provides empirical evidence through experiments on modular arithmetic tasks that support the theoretical framework.
Summary
This paper investigates the phenomenon of 'grokking' in deep learning, where a model achieves low training error but only later discovers a simpler underlying rule that generalizes well to new data. The authors introduce the concept of 'two training clocks' to formalize the separation between the fast decay of classification loss and the slower simplification of learned representations. They analyze deep linear networks and demonstrate that the classification loss can decrease logarithmically, while the representation simplification follows a polynomial time scale due to layerwise weight decay acting as a Schatten-type penalty. The study also extends these findings to ReLU networks, showing that once activation patterns stabilize, the network behaves like a linear model in active coordinates. The experiments conducted on modular addition tasks provide empirical support for the theory, illustrating that while the classifier can fit early, the representation continues to evolve, emphasizing the importance of extended training for improving generalization.
Methodology
The authors analyze deep linear networks using cross-entropy loss and layerwise weight decay to establish the two training clocks. They employ mathematical frameworks to describe the fast and slow training dynamics and extend the analysis to ReLU networks through conditional reductions, focusing on the behavior of gradients and activation patterns during training.
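One way to see why layerwise weight decay acts as a Schatten-type penalty is the following standard identity for depth-L linear factorizations. It is included as background, is not quoted from the paper, and the notation (W for the end-to-end matrix, sigma_i for its singular values) is assumed here:

```latex
\min_{W_L W_{L-1} \cdots W_1 = W} \; \sum_{\ell=1}^{L} \lVert W_\ell \rVert_F^2
  \;=\; L \, \lVert W \rVert_{S_{2/L}}^{2/L}
  \;=\; L \sum_{i} \sigma_i(W)^{2/L}
```

For L = 2 this reduces to twice the nuclear norm of W. Deeper networks give a smaller exponent 2/L, which favors low-rank end-to-end maps more strongly.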
Results
The study finds that in deep linear networks, the effective classifier can achieve low loss quickly, while the representation simplification takes longer due to the influence of layerwise weight decay. In ReLU networks, once activation patterns stabilize, the network behaves like a linear model, allowing for a two-stage learning process where the classifier fits first and the representation simplifies later. Experimental results on modular addition tasks corroborate these findings.
Implications
The insights from this research could inform training strategies for deep learning models, emphasizing the need for extended training periods to enhance generalization capabilities. Understanding the dynamics of fitting versus representation simplification may lead to improved model architectures and training protocols.
Diffusion Models for Adaptive Sequential Data Generation
Generative Models
Time Series
Theory
- Introduction of the AD-Seq framework for adapted sequential data generation.
- Ensures that generated values respect temporal dependencies and information flow.
- Development of a novel score-matching objective for parallel training.
- Statistical guarantees for score approximation and distribution estimation.
Summary
This paper addresses the challenge of generating realistic synthetic sequential data with diffusion models, which have succeeded in static data generation but struggle with temporal dependencies in sequential settings. The authors propose a framework called Adaptive Diffusion for Sequential Data Generation (AD-Seq), which generates time series data in an adapted manner, so that each generated value depends only on past and present information. The framework introduces a sequential forward-backward diffusion process that progressively injects and removes noise while conditioning on previously generated history. The authors also develop a score-matching objective for efficient parallel training and provide rigorous statistical guarantees for the framework. Empirical validation on synthetic data, including ARMA models and Gaussian processes, shows that AD-Seq is effective for constructing mean-variance optimal portfolios, suggesting potential applications in finance, healthcare, and operations research.
Methodology
The authors propose a sequential forward-backward diffusion framework that generates time series data by conditioning on previously generated history. They introduce a score-matching objective that allows the score functions to be trained in parallel, which keeps training scalable and efficient. The framework is analyzed with statistical learning theory, yielding guarantees for score approximation and distribution estimation.
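The idea of adapted generation can be illustrated with a toy example. The sketch below is not the authors' algorithm: it assumes an AR(1) target process, for which the conditional score of the noised value is known in closed form, and uses that analytic score in place of a learned score network. The noise schedule, the parameters `PHI` and `SIGMA`, and all function names are hypothetical.

```python
import math
import random

PHI, SIGMA = 0.8, 0.5      # assumed AR(1) parameters for the toy target
BETAS = [0.02] * 200       # assumed constant forward-noise schedule

def alpha_bars(betas):
    """Cumulative products of (1 - beta) for the forward noising process."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

ALPHA_BARS = alpha_bars(BETAS)

def conditional_score(y, step, prev_value):
    """Analytic score of the noised x_t given the history.

    For AR(1) the history enters only through x_{t-1}; a learned
    network would replace this function in a realistic setting.
    """
    a_bar = ALPHA_BARS[step]
    mean = math.sqrt(a_bar) * PHI * prev_value
    var = a_bar * SIGMA ** 2 + (1.0 - a_bar)
    return -(y - mean) / var

def sample_next(prev_value, rng):
    """Reverse diffusion for one time index, conditioned on the generated past."""
    y = rng.gauss(0.0, 1.0)  # start from (approximately) pure noise
    for step in reversed(range(len(BETAS))):
        beta = BETAS[step]
        y = (y + beta * conditional_score(y, step, prev_value)) / math.sqrt(1.0 - beta)
        if step > 0:
            y += math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return y

def generate_path(length, seed=0):
    """Generate one value per time index; x_t never sees future values."""
    rng = random.Random(seed)
    path, prev = [], 0.0   # assumed initial condition x_0 = 0
    for _ in range(length):
        prev = sample_next(prev, rng)
        path.append(prev)
    return path

short_path = generate_path(5, seed=1)
long_path = generate_path(8, seed=1)
```

Because each value is produced from the past alone, the first five entries of `long_path` coincide with `short_path` when the seed is shared. This is a simple check of the adaptedness property described above.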
Results
The AD-Seq framework generates synthetic sequential data that respects the required temporal dependencies and information flow. Empirical results on synthetic data show that it is effective for constructing mean-variance optimal portfolios, which suggests practical applicability.
Implications
The proposed framework has significant implications for various fields requiring sequential data generation, such as finance for portfolio optimization, healthcare for predictive modeling, and operations research for decision-making under uncertainty. It opens avenues for multi-step prediction and statistical inference while respecting causal structures.
Pretraining Recurrent Networks without Recurrence
Theory
Efficient ML
NLP
- Introduces Supervised Memory Training (SMT) as an alternative to BPTT for RNN training.
- SMT allows for time-parallel training and stable gradient paths, improving learning of long-range dependencies.
- Memory transition labels are generated using a Transformer model, decoupling memory representation from memory dynamics.
- SMT outperforms BPTT in various tasks, demonstrating its effectiveness in training nonlinear RNNs.
Summary
The paper introduces Supervised Memory Training (SMT), a novel approach to training recurrent neural networks (RNNs) that eliminates the need for recurrent credit propagation, a common challenge in traditional training methods like backpropagation through time (BPTT). BPTT suffers from issues such as vanishing gradients and limited parallelism due to its sequential nature. SMT addresses these limitations by transforming RNN training into a supervised learning problem focused on one-step memory transition labels. These labels are generated by a Transformer-based encoder-decoder model that predicts future states based on past information. By decoupling the tasks of memory representation and memory dynamics, SMT allows for time-parallel training of RNNs, resulting in a stable O(1) gradient path for long-range dependencies. The authors demonstrate that SMT significantly outperforms BPTT in tasks like language modeling and pixel sequence modeling, enabling RNNs to better capture long-range dependencies while facilitating parallel training. This approach not only enhances the training efficiency of RNNs but also opens avenues for scaling models that abstract temporal experiences.
Methodology
The authors propose SMT, which involves training a Transformer encoder to create memory representations that summarize past information necessary for predicting future states. This enables RNNs to learn memory updates in a supervised manner without the need for recurrent unrolling, thus allowing for parallel training.
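The one-step supervision idea can be sketched on a toy problem. The code below is an illustration under strong assumptions, not the authors' implementation: an exponential moving average computed directly from each prefix stands in for the Transformer-generated memory labels, the recurrent cell is a scalar linear map, and the fit is a closed-form least-squares solve rather than gradient descent. All names are hypothetical.

```python
import random

DECAY = 0.9  # assumed decay of the stand-in "teacher" memory

def teacher_memories(inputs):
    """Stand-in for the Transformer: one memory label per time step.

    Each label is computed directly from the prefix, with no recurrence.
    """
    labels = []
    for t in range(len(inputs)):
        labels.append(sum((1 - DECAY) * DECAY ** (t - k) * inputs[k]
                          for k in range(t + 1)))
    return labels

def fit_one_step_cell(inputs, labels):
    """Least-squares fit of m_t ~ a * m_{t-1} + b * x_t.

    Every (m_{t-1}, x_t, m_t) triple is an independent training example,
    so no unrolling through time is needed.
    """
    prev = [0.0] + labels[:-1]
    s_pp = sum(p * p for p in prev)
    s_px = sum(p * x for p, x in zip(prev, inputs))
    s_xx = sum(x * x for x in inputs)
    s_pm = sum(p * m for p, m in zip(prev, labels))
    s_xm = sum(x * m for x, m in zip(inputs, labels))
    det = s_pp * s_xx - s_px * s_px
    a = (s_pm * s_xx - s_xm * s_px) / det
    b = (s_xm * s_pp - s_pm * s_px) / det
    return a, b

def rollout(inputs, a, b):
    """Run the fitted cell recurrently, as it would be used at inference."""
    memory, states = 0.0, []
    for x in inputs:
        memory = a * memory + b * x
        states.append(memory)
    return states

rng = random.Random(0)
inputs = [rng.gauss(0.0, 1.0) for _ in range(200)]
labels = teacher_memories(inputs)
a, b = fit_one_step_cell(inputs, labels)
states = rollout(inputs, a, b)
```

In this toy setting the labels satisfy an exact one-step recurrence, so the fitted cell reproduces the teacher memory when rolled out recurrently. With a nonlinear cell and learned labels the fit would be approximate and would typically use gradient-based training.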
Results
SMT outperforms BPTT on language modeling and pixel sequence modeling tasks, capturing long-range dependencies more effectively while reducing the computational burden of sequential training.
Implications
SMT has the potential to enhance the scalability and efficiency of RNNs in various applications, particularly in scenarios requiring the modeling of temporal abstractions. It may also influence future research on representation learning and world models.