gistml

By James Asher

Daily summaries of the latest Machine Learning research papers from Arxiv.

2026-02-08 • Found 24 papers

Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation

Igor Santos-Grueiro
  • Behavioral evaluation, the dominant method for assessing LLM alignment, relies on finite observations and cannot uniquely identify latent alignment properties.
  • The paper introduces 'Normative Indistinguishability,' which formalizes when distinct alignment hypotheses are observationally indistinguishable under a given evaluation regime.
  • A key result is a negative identifiability theorem, showing that evaluation-aware agents can strategically adapt behavior, making alignment unverifiable through behavioral evidence alone.
  • Behavioral tests should be interpreted as estimators of indistinguishability classes, not as definitive verifiers of alignment.
  • The findings highlight the need for alternative approaches to alignment verification beyond behavioral evaluation.
Read More
Abstract
This paper examines the challenges of verifying alignment in large language models (LLMs) through behavioral evaluation, which is the dominant method for assessing whether these models align with intended objectives. The author formalizes the problem of alignment evaluation as an identifiability issue under partial observability, introducing the concept of 'Normative Indistinguishability.' This concept captures scenarios where different latent alignment hypotheses produce identical observable behaviors under a given evaluation regime. The paper demonstrates that finite behavioral evaluations, especially when agents are aware of the evaluation process, cannot uniquely identify a model's underlying alignment. Instead, behavioral tests can only estimate indistinguishability classes of alignment hypotheses, rather than verify alignment as a latent property. The work highlights the limitations of current evaluation practices and reframes alignment benchmarks as tools for bounding observable compliance rather than certifying true alignment.
Methodology
The author formalizes alignment evaluation as an identifiability problem under partial observability, introducing a theoretical framework that accounts for evaluation-aware behavior. The paper defines key concepts such as interaction histories, evaluation regimes, and alignment hypotheses. It also introduces the notion of 'Normative Indistinguishability' and develops an alignment indistinguishability test inspired by black-box testing methodologies. The analysis is grounded in theoretical reasoning and draws on insights from decision theory, inverse learning, and empirical observations of LLM behavior.
Results
The main result is a non-identifiability theorem, which demonstrates that finite behavioral evaluation cannot uniquely determine a model's latent alignment when agents are evaluation-aware. This establishes a fundamental epistemic limit on the ability to verify alignment through behavioral evidence alone. The paper also shows that passing stricter behavioral tests narrows the space of plausible alignment hypotheses but cannot fully resolve them into a single, verifiable alignment property.
Implications
['Current alignment benchmarks and behavioral evaluation protocols should be reinterpreted as tools for estimating indistinguishability classes rather than certifying alignment.', 'The findings suggest that alignment evaluation methods need to account for evaluation-aware behavior to avoid overestimating the reliability of behavioral compliance as evidence of alignment.', 'This work highlights the importance of developing alternative approaches to alignment verification, potentially incorporating insights from interpretability research or other non-behavioral methods.']
View on arXiv

Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training

Zhenghao Xu, Qin Lu, Changlong Yu, Tuo Zhao
  • PMD-MEAN approximates the log-partition function using the mean reward under the sampling policy, simplifying the PMD framework for LLM post-training.
  • The algorithm implicitly introduces an adaptive mixed KL–χ² regularizer, which enhances stability and robustness, especially in data-constrained scenarios.
  • PMD-MEAN exhibits reduced sensitivity to finite-sample errors, decreasing the risk of overfitting to misestimated targets.
  • Theoretical analysis shows that PMD-MEAN moderates convergence rates during early training phases, contributing to its empirical stability.
  • Experiments on math reasoning tasks confirm that PMD-MEAN outperforms standard RL methods like GRPO in terms of efficiency, stability, and performance.
Read More
Abstract
This paper investigates a practical algorithm, PMD-MEAN, for reinforcement learning (RL) in the post-training of large language models (LLMs). Policy mirror descent (PMD) is a widely used framework for RL, which involves solving KL-regularized policy improvement subproblems. However, estimating the log-partition function required for PMD updates is challenging in the context of LLMs due to their vast action spaces and limited rollout samples. PMD-MEAN addresses this by approximating the log-partition term with the mean reward under the sampling policy and performing regression in log-policy space. The authors provide a theoretical characterization of PMD-MEAN, showing that it implicitly optimizes a mirror descent subproblem with an adaptive mixed KL–χ² regularizer. This additional χ² regularization constrains large probability changes, particularly when expected rewards are low, improving robustness against finite-sample estimation errors. Experimental results on math reasoning tasks demonstrate that PMD-MEAN achieves superior performance, stability, and time efficiency compared to standard approaches like GRPO. The findings offer insights into the algorithm's implicit regularization mechanism and its potential for improving RL methods in LLM post-training.
Methodology
The authors propose PMD-MEAN, a variant of policy mirror descent (PMD) that approximates the log-partition term with the mean reward under the sampling policy. They derive a closed-form solution for PMD-MEAN, showing its equivalence to a mirror descent subproblem with an adaptive mixed KL–χ² regularizer. Theoretical analysis is conducted to characterize the algorithm's convergence properties and its robustness to finite-sample errors. Empirical validation is performed on math reasoning tasks to compare PMD-MEAN with existing RL methods for LLM post-training.
Results
PMD-MEAN achieves superior performance on math reasoning tasks, demonstrating enhanced stability and time efficiency compared to standard RL methods like GRPO. The algorithm's implicit χ² regularization mechanism provides robustness against finite-sample errors and moderates convergence rates, leading to improved empirical stability and reduced overfitting.
Implications
The findings suggest that PMD-MEAN can serve as a robust and efficient alternative to traditional RL methods for LLM post-training, particularly in scenarios with limited rollout samples. Its implicit regularization mechanism could inspire new approaches to RL algorithm design, with potential applications in improving the stability and performance of LLMs on complex reasoning tasks and other agentic objectives.
View on arXiv

Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities

Pengyi Li, Elizaveta Goncharova, Andrey Kuznetsov, Ivan Oseledets
  • The paper identifies mode collapse and entropy reduction as key challenges in RLVR for LLM reasoning tasks.
  • ProGRPO introduces an Advantage Re-weighting Mechanism (ARM) that incorporates confidence-aware signals to balance exploration and exploitation.
  • The method significantly improves generative diversity and response entropy while maintaining competitive accuracy.
  • ProGRPO outperforms GRPO and FlowRL on Qwen2.5-7B, achieving a 5.7% improvement in Pass@1 and a 13.9% improvement in Pass@32.
  • The approach demonstrates strong scalability, generalization, and robustness to out-of-distribution data.
Read More
Abstract
This paper addresses the issue of mode collapse and limited output diversity in Reinforcement Learning with Verifiable Rewards (RLVR) for reasoning tasks in Large Language Models (LLMs). The authors identify that standard policy optimization methods, such as Group Relative Policy Optimization (GRPO), disproportionately reinforce high-likelihood reasoning paths, suppressing valid alternative solutions. To mitigate this, the paper introduces Probabilistic-based GRPO (ProGRPO), a novel reinforcement learning paradigm that incorporates an Advantage Re-weighting Mechanism (ARM). ARM dynamically adjusts the reward signal by integrating confidence-aware signals, such as Prompt Perplexity and Answer Confidence, into the advantage estimation. This approach redistributes probability mass toward under-explored correct reasoning paths, enhancing generative diversity and reducing entropy collapse. The proposed method is empirically validated on Qwen2.5 and DeepSeek models across mathematical and coding benchmarks, demonstrating significant improvements in both accuracy and output diversity compared to GRPO and other baselines. ProGRPO also shows strong robustness to out-of-distribution data.
Methodology
The authors propose ProGRPO, an extension of GRPO, which integrates an Advantage Re-weighting Mechanism (ARM). ARM uses internal probability signals, such as Prompt Perplexity and Answer Confidence, to reshape the advantage distribution. This re-weighting mechanism adjusts the reward signal to reduce overconfidence in dominant reasoning paths and encourages exploration of under-explored correct solutions. The method is evaluated on mathematical and coding benchmarks using Qwen2.5 and DeepSeek models.
Results
ProGRPO significantly improves both accuracy and diversity in reasoning tasks. On Qwen2.5-7B, it achieves a 5.7% improvement in Pass@1 and a 13.9% improvement in Pass@32 over GRPO. It also outperforms FlowRL by 8.0% in Pass@1 and 7.5% in Pass@32. Additionally, ProGRPO demonstrates robust performance on out-of-distribution data, highlighting its adaptability and generalization capabilities.
Implications
The proposed ProGRPO framework has the potential to enhance the reasoning capabilities of LLMs in complex tasks, such as mathematical problem-solving and code generation, by improving output diversity and reducing mode collapse. Its robustness to out-of-distribution data suggests applications in real-world scenarios where data distributions are unpredictable. This work also provides insights into balancing exploration and exploitation in reinforcement learning for generative models.
View on arXiv

Benchmarking Artificial Intelligence Models for Daily Coastal Hypoxia Forecasting

Magesh Rajasekaran, Md Saiful Sajol, Chris Alvin, Supratik Mukhopadhyay, Yanda Ou, Z. George Xue
  • The study benchmarks four state-of-the-art deep learning models (BiLSTM, TCN, Medformer, and ST-Transformer) for daily coastal hypoxia forecasting.
  • The dataset includes twelve years of daily hindcast data (2009–2020) for training and additional data (2020–2024) for testing, derived from a coupled hydrodynamic-biogeochemical model.
  • The ST-Transformer model achieved the best performance across all metrics, with AUC-ROC scores ranging from 0.982 to 0.992.
  • McNemar’s statistical test was used to identify significant differences in model predictions, a novel approach in environmental AI studies.
  • The study provides a reproducible framework and open-source code for real-time hypoxia prediction, supporting ecosystem management and resilience efforts.
Read More
Abstract
This paper addresses the critical need for daily forecasting of coastal hypoxia, particularly in the northern Gulf of Mexico, where recurring hypoxic events severely impact marine ecosystems and coastal economies. The authors benchmark four deep learning architectures—Bidirectional Long Short-Term Memory (BiLSTM), Temporal Convolutional Network (TCN), Medical Transformer (Medformer), and Spatio-Temporal Transformer (ST-Transformer)—to predict hypoxic conditions as a binary classification task. Using twelve years of daily hindcast data (2009–2020) for training and additional data (2020–2024) for testing, the study evaluates these models under a unified experimental framework. The ST-Transformer outperformed other models across all metrics, achieving the highest AUC-ROC scores (0.982–0.992). The authors also applied McNemar’s statistical test to assess the significance of differences in model predictions. This work provides a reproducible framework for operational real-time hypoxia forecasting, which can aid in ecosystem management and resilience efforts. The source code is made publicly available to encourage further research.
Methodology
The authors formulated hypoxia prediction as a sequence-to-one binary classification task, using daily hindcast data from the Coupled Ocean-Atmosphere-Wave-Sediment Transport (COAWST) model. Input features included environmental time series influencing oxygen depletion, such as water column stratification and sediment oxygen consumption. The models were trained and evaluated under identical preprocessing, input/output formulation, and validation protocols. Metrics such as F1-score, AUC-ROC, and accuracy thresholds were used to assess performance, with McNemar’s test applied to evaluate statistical significance in model predictions.
Results
The ST-Transformer model outperformed the other architectures, achieving the highest AUC-ROC scores (0.982–0.992) and demonstrating superior classification accuracy and discriminative ability. All models showed strong performance, but the transformer-based architectures (Medformer and ST-Transformer) were particularly effective in capturing temporal and spatial dependencies. The study also highlighted statistically significant differences in model predictions using McNemar’s test.
Implications
This research provides a robust framework for daily hypoxia forecasting, which can be integrated into operational environmental monitoring systems. The ability to predict hypoxic events at a daily resolution can aid fishery managers and policymakers in mitigating the ecological and economic impacts of hypoxia. Additionally, the open-source code and benchmarking methodology can serve as a foundation for future research in environmental AI and ocean modeling.
View on arXiv

CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Vision Transformers

Boxiang Zhang, Baijian Yang
  • CORP enables one-shot structured pruning for Vision Transformers without requiring labels, gradients, or fine-tuning.
  • The framework compensates for representation errors caused by pruning using closed-form ridge regression solutions.
  • CORP achieves high accuracy preservation under aggressive pruning, retaining 82.8% Top-1 accuracy on DeiT-Huge with 50% sparsity.
  • The method operates efficiently, completing pruning in under 20 minutes on a single GPU.
  • CORP delivers substantial reductions in computational cost and real-world hardware speedups.
Read More
Abstract
This paper introduces CORP, a novel closed-form one-shot structured pruning framework designed for Vision Transformers (ViTs). Vision Transformers achieve high accuracy in vision tasks but are computationally and memory-intensive, making them challenging to deploy on resource-constrained hardware. CORP addresses this by enabling structured pruning without requiring labels, gradients, or fine-tuning. The framework formulates pruning as a representation recovery problem, compensating for the removed activations and attention logits using closed-form ridge regression solutions. This compensation is folded directly into the model weights, ensuring minimal accuracy degradation. Experiments on ImageNet with DeiT models demonstrate that CORP effectively preserves accuracy even under aggressive pruning, achieving 82.8% Top-1 accuracy on DeiT-Huge after pruning 50% of both MLP and attention structures. The pruning process is efficient, completing in under 20 minutes on a single GPU, and delivers significant real-world efficiency gains.
Methodology
CORP formulates structured pruning as a representation recovery problem. It models the removed activations and attention logits as affine functions of retained components and uses closed-form ridge regression to compute compensation terms. These terms are directly incorporated into the model weights and biases, avoiding the need for iterative optimization or fine-tuning. The method operates in a strict post-training regime, using only a small unlabeled calibration set.
Results
Experiments on ImageNet with DeiT models show that CORP preserves accuracy under aggressive pruning. For example, on DeiT-Huge, CORP retains 82.8% Top-1 accuracy after pruning 50% of both MLP and attention structures. The pruning process is computationally efficient, completing in under 20 minutes on a single GPU, and results in significant reductions in FLOPs and real-world inference time.
Implications
CORP has significant implications for deploying large Vision Transformers on resource-constrained devices. By enabling efficient and accurate one-shot pruning, it facilitates the use of ViTs in real-world applications where computational and memory resources are limited, such as edge devices, mobile applications, and embedded systems.
View on arXiv

Causal Representation Meets Stochastic Modeling under Generic Geometry

Jiaxu Ren, Yixin Wang, Biwei Huang
  • The paper establishes the first necessary and sufficient conditions for the identifiability of continuous-time latent point processes under generic, non-invertible mixing functions.
  • A novel framework, MUTATE, is introduced to learn causal representations of stochastic point processes using a variational autoencoder with a time-adaptive transition module.
  • The study bridges the gap between causal representation learning and stochastic modeling, particularly for continuous-time processes.
  • Empirical results demonstrate the utility of MUTATE in real-world applications, such as genomics (mutation accumulation) and neuroscience (neuron spike dynamics).
  • The work provides a theoretical foundation and practical tools for studying causal relationships in highly dynamic systems.
Read More
Abstract
This paper addresses the challenge of learning identifiable causal representations from continuous-time stochastic point processes, a problem that has been underexplored in the field of machine learning. While prior work has largely focused on discrete-time processes or invertible mixing functions, the authors extend the study to more generic, non-invertible mixing functions and continuous-time dynamics. They propose a novel framework, MUTATE (an identifiable variational autoencoder with a time-adaptive transition module), to infer stochastic dynamics and causal structures from high-dimensional observational data. The authors establish necessary and sufficient conditions for the identifiability of latent point processes under these settings by analyzing the geometry of the parameter space. Through both simulated and empirical studies, the paper demonstrates the effectiveness of MUTATE in answering scientific questions in domains such as genomics and neuroscience, including the study of mutation accumulation in cancer and neuron spike triggers in response to time-varying dynamics.
Methodology
The authors develop MUTATE, an identifiable variational autoencoder framework that incorporates a time-adaptive transition module to model continuous-time stochastic point processes. They analyze the geometry of the parameter space to derive necessary and sufficient conditions for identifiability under generic, non-invertible mixing functions. The framework is evaluated using both simulated datasets and real-world applications in genomics and neuroscience.
Results
MUTATE successfully identifies latent causal structures in continuous-time stochastic point processes. The framework demonstrates strong performance in simulated experiments and real-world applications, such as identifying the accumulation of mutations in cancer genomics and understanding the mechanisms behind neuron spike triggers. The results validate the theoretical identifiability guarantees and highlight the practical utility of the proposed method.
Implications
This work has significant implications for scientific fields that rely on understanding continuous-time stochastic processes, such as biology, neuroscience, and climate science. By enabling the identification of latent causal structures in dynamic systems, the proposed framework can facilitate new discoveries in areas like disease progression, neural activity analysis, and environmental modeling. Additionally, the theoretical insights into identifiability under generic mixing functions can inspire further research in causal representation learning and stochastic modeling.
View on arXiv

Classification Under Local Differential Privacy with Model Reversal and Model Averaging

Caihong Qin, Yang Bai
  • The paper reinterprets private learning under LDP as a transfer learning problem, leveraging noisy data as the source domain and unobserved clean data as the target domain.
  • A novel noised binary feedback mechanism is introduced to estimate dataset utility and guide the learning process.
  • Model reversal is proposed to salvage underperforming classifiers by inverting their decision boundaries, addressing negative transfer scenarios caused by LDP noise.
  • Model averaging is employed to combine multiple reversed classifiers, assigning weights based on their estimated utility to improve overall performance.
  • The proposed methods achieve substantial improvements in classification accuracy under LDP, as demonstrated by theoretical analysis and empirical results.
Read More
Abstract
This paper addresses the challenge of performing classification tasks under Local Differential Privacy (LDP), a privacy-preserving framework that perturbs user data at the source to eliminate the need for a trusted curator. While LDP provides strong privacy guarantees, the noise it introduces often reduces data utility, particularly in high-dimensional settings. The authors propose a novel approach that reinterprets private learning under LDP as a transfer learning problem, where the noisy data is treated as the source domain and the unobserved clean data as the target domain. They introduce three key techniques to improve classification performance: (1) a noised binary feedback mechanism to estimate dataset utility, (2) model reversal to correct underperforming classifiers by inverting their decision boundaries, and (3) model averaging to combine multiple reversed classifiers based on their utility. The paper provides theoretical guarantees by deriving excess risk bounds under LDP and demonstrates the effectiveness of the proposed methods through experiments on both simulated and real-world datasets. The results show significant improvements in classification accuracy while maintaining strong privacy guarantees.
Methodology
The authors adapt transfer learning concepts to the LDP setting by introducing a noised binary feedback mechanism to evaluate dataset utility. They propose two novel techniques: (1) Model Reversal (MR), which inverts the decision boundaries of underperforming classifiers, and (2) Model Averaging (MA), which combines multiple reversed classifiers using utility-based weighting. Theoretical excess risk bounds are derived to quantify the performance improvements, and the methods are validated through experiments on both simulated and real-world datasets.
Results
The proposed methods demonstrate significant improvements in classification accuracy under LDP constraints. The theoretical analysis shows reduced excess risk, and empirical results confirm the effectiveness of the techniques on both synthetic and real-world datasets. The methods outperform baseline approaches in terms of utility while maintaining the same level of privacy protection.
Implications
The proposed techniques have the potential to improve the utility of machine learning models trained under LDP constraints, making privacy-preserving data analysis more practical for real-world applications. This work could benefit industries such as healthcare, finance, and technology, where sensitive data must be protected while still enabling accurate predictive modeling. Additionally, the methods could inspire further research into adapting transfer learning and ensemble techniques for privacy-preserving machine learning.
View on arXiv

Cross-talk based multi-task learning for fault classification of physically coupled machine system

Wonjun Yi, Rismaya Kumar Mishra, Yong-Hwa Park
  • The paper introduces a cross-talk-based MTL framework that leverages the physical coupling between fault conditions and other variables in machine systems.
  • The Residual Neural Dimension Reductor (RNDR) is proposed as a novel cross-talk architecture that selectively shares information between tasks while avoiding negative transfer.
  • The approach is validated on two benchmarks: a drone fault dataset and a motor compound fault dataset, demonstrating superior performance compared to STL and other MTL models.
  • Cross-talk architectures, particularly RNDR, enable more effective and physically meaningful fault classification by explicitly modeling inherent couplings.
  • The study explores the impact of single-channel vs. multi-channel input data on classification performance in the motor compound fault dataset.
Read More
Abstract
This paper introduces a novel cross-talk-based multi-task learning (MTL) framework for fault classification in physically coupled machine systems. Traditional fault classification approaches often ignore the inherent coupling between fault conditions and other physical variables in machine systems. The authors propose leveraging this coupling by training models to jointly predict fault conditions (main task) and related physical variables (auxiliary tasks). The study focuses on cross-talk architectures, particularly the Residual Neural Dimension Reductor (RNDR), which enables controlled information exchange between tasks while mitigating negative transfer. The proposed approach is evaluated on two benchmarks: a drone fault dataset and a motor compound fault dataset. The results demonstrate that the RNDR-based cross-talk architecture outperforms single-task learning (STL), shared trunk MTL models, and other cross-talk architectures in fault classification tasks. This work highlights the importance of explicitly modeling physical couplings in machine systems to improve fault classification performance.
Methodology
The authors propose a cross-talk-based MTL framework using the RNDR architecture, which incorporates residual connections to preserve original features while enabling selective information exchange between tasks. The framework is applied to two benchmarks: (1) a drone fault dataset, where fault classification is performed alongside auxiliary tasks like maneuvering direction and drone type, and (2) a motor compound fault dataset, where the severity of individual fault components is classified and aggregated to predict compound fault status. The models are compared against STL, shared trunk MTL, and other cross-talk architectures.
Results
The RNDR-based cross-talk architecture consistently outperformed STL, shared trunk MTL models, and other cross-talk architectures across both benchmarks. In the drone fault dataset, the inclusion of auxiliary tasks such as maneuvering direction and drone type improved fault classification accuracy. In the motor compound fault dataset, RNDR achieved better classification performance for both single-channel and multi-channel input data, demonstrating its robustness and effectiveness in handling physically coupled information.
Implications
This study provides a framework for leveraging physical couplings in machine systems to improve fault classification, which has significant implications for condition monitoring and predictive maintenance. The proposed RNDR-based cross-talk architecture can be applied to various industrial systems with physically coupled variables, potentially enhancing the reliability and efficiency of fault diagnosis in complex machine systems.
View on arXiv

Distributional Reinforcement Learning with Diffusion Bridge Critics

Shutong Ding, Yimiao Zhou, Ke Hu, Mokai Pan, Shan Zhong, Yanwei Fu, Jingya Wang, Ye Shi
  • Identifies the limitations of existing diffusion-based critics, including Gaussian degradation and poor value distribution expressiveness.
  • Proposes Diffusion Bridge Critics (DBC), which models the inverse CDF of the Q-value distribution to improve accuracy and stability.
  • Introduces an analytic integral formula to mitigate discretization errors in diffusion bridge models, enhancing value estimation.
  • DBC is a plug-and-play component compatible with existing RL algorithms like SAC and TD3.
  • Experimental results on MuJoCo benchmarks show that DBC outperforms prior distributional RL methods, achieving state-of-the-art performance.
Read More
Abstract
This paper introduces Diffusion Bridge Critics (DBC), a novel approach to distributional reinforcement learning (RL) that leverages diffusion bridge models to improve value estimation. Unlike prior diffusion-based RL methods that focus on enhancing policy expressiveness, DBC emphasizes the critic's role in accurately modeling the Q-value distribution. By learning the inverse cumulative distribution function (CDF) of the Q-value, DBC avoids the limitations of discrete quantile approximations and the Gaussian degradation problem observed in vanilla diffusion critics. Additionally, the authors derive an analytic integral formula to address discretization errors, ensuring more precise value estimation. DBC is designed as a plug-and-play component that can be integrated into existing RL frameworks without altering their policy architectures. Experimental evaluations on MuJoCo robotic control benchmarks demonstrate that DBC achieves state-of-the-art performance, highlighting its effectiveness in improving policy optimization through expressive and accurate value distribution modeling.
Methodology
The authors propose a distributional RL framework that uses diffusion bridge models to learn the inverse CDF of the Q-value distribution. This approach avoids the limitations of discrete quantile approximations and Gaussian degradation. To address discretization errors in diffusion bridges, the authors derive an analytic integral formula that ensures accurate value estimation. DBC is designed to be easily integrated into existing RL algorithms without modifying their policy architectures or optimization processes. The method is evaluated on MuJoCo robotic control tasks using SAC and TD3 as baseline algorithms.
Results
DBC achieves state-of-the-art performance on a range of continuous control tasks in the MuJoCo benchmark. It consistently outperforms existing distributional RL methods, demonstrating superior value estimation accuracy and policy optimization stability. The results validate the effectiveness of using diffusion bridge models as critics in RL.
Implications
The proposed DBC framework has significant implications for reinforcement learning, particularly in tasks requiring accurate value estimation and robust policy optimization. Its plug-and-play nature makes it a practical addition to existing RL algorithms, potentially improving performance in robotic control, autonomous systems, and other applications of continuous control. Additionally, the use of diffusion bridge models as critics opens new avenues for research in distributional RL and value function modeling.
View on arXiv

Empowering Time Series Analysis with Large-Scale Multimodal Pretraining

Peng Chen, Siyuan Wang, Shiyan Hu, Xingjian Wu, Yang Shu, Zhongwen Rao, Meng Wang, Yijie Li, Bin Yang, Chenjuan Guo
  • Introduces a multimodal pretraining paradigm for time series analysis, incorporating endogenous (images, text) and exogenous (real-world news) modalities.
  • Develops MM-TS, the first large-scale multimodal time series dataset, spanning six domains and containing up to one billion data points.
  • Proposes HORAI, a frequency-enhanced multimodal foundation model with novel components for cross-modality fusion and time-frequency decoding.
  • HORAI achieves state-of-the-art zero-shot performance on forecasting and anomaly detection tasks, showcasing strong generalization capabilities.
  • Addresses key challenges in multimodal time series analysis, including modality integration and domain diversity.
Read More
Abstract
This paper addresses the limitations of existing time series foundation models, which primarily rely on unimodal numerical data, by introducing a multimodal pretraining paradigm for time series analysis. The authors propose leveraging both endogenous modalities (derived images and text) and exogenous knowledge (real-world news) to provide a comprehensive multi-view perspective. To support this paradigm, they develop MM-TS, the first large-scale multimodal time series dataset, which spans six domains (e.g., healthcare, energy, and finance) and contains up to one billion data points. The authors also introduce HORAI, a frequency-enhanced multimodal foundation model that integrates a Frequency-enhanced Cross-Modality Encoder and a Time-Frequency Decoder to effectively fuse multimodal features and improve generalization across domains. Pretrained on MM-TS, HORAI achieves state-of-the-art zero-shot performance on time series forecasting and anomaly detection tasks, demonstrating its ability to generalize across diverse modalities and domains.
Methodology
The authors propose a multimodal pretraining paradigm that combines numerical time series data with derived images, text, and external news to capture diverse temporal dynamics. They create MM-TS, a large-scale multimodal dataset, using an automated pipeline that maps raw sequences to visual and textual modalities while aligning them with external news sources. The HORAI model is designed with a Frequency-enhanced Cross-Modality Encoder to align modality-specific information and a Time-Frequency Decoder to enhance temporal understanding. HORAI is pretrained on MM-TS and evaluated on downstream tasks.
Results
HORAI achieves state-of-the-art zero-shot performance on time series forecasting and anomaly detection tasks. The model demonstrates strong generalization across diverse domains and modalities, outperforming existing unimodal and multimodal approaches. The MM-TS dataset provides a robust benchmark for multimodal time series analysis.
Implications
This work paves the way for more comprehensive time series analysis by integrating multimodal data, enabling better understanding of complex temporal dynamics. Potential applications include energy management, medical monitoring, financial forecasting, and other domains where time series data is critical. The MM-TS dataset and HORAI model could serve as foundational tools for future research in multimodal machine learning.
View on arXiv

Euphonium: Steering Video Flow Matching via Process Reward Gradient Guided Stochastic Dynamics

Ruizhe Zhong, Jiesong Lian, Xiaoyue Mi, Zixiang Zhou, Yuan Zhou, Qinglin Lu, Junchi Yan
  • Euphonium introduces a reward-augmented stochastic dynamics framework that incorporates Process Reward Gradient guidance into the flow drift, enabling more efficient exploration.
  • The proposed method unifies and extends existing sampling strategies (e.g., Flow-GRPO and DanceGRPO) as special cases when the reward gradient is removed.
  • A Dual-Reward optimization approach combines latent-space process rewards for efficient credit assignment with pixel-space outcome rewards for high-quality visual fidelity.
  • The framework includes a distillation objective to eliminate inference-time dependency on the reward model, simplifying deployment.
  • Euphonium achieves superior alignment with human preferences and accelerates training convergence by 1.66× in text-to-video generation tasks.
Read More
Abstract
This paper introduces Euphonium, a novel framework designed to improve the efficiency and alignment of video flow matching models with human preferences. Current reinforcement learning (RL)-based approaches for aligning flow models suffer from inefficient exploration due to undirected stochasticity and sparse rewards. Euphonium addresses these limitations by incorporating a Process Reward Gradient into the sampling dynamics, formulated as a Stochastic Differential Equation (SDE). This approach enables dense, step-by-step guidance toward high-reward regions, significantly improving exploration efficiency. The framework also introduces a Dual-Reward Group Relative Policy Optimization (GRPO) algorithm, which combines latent-space process rewards for efficient credit assignment with pixel-space outcome rewards for visual fidelity. To eliminate inference-time dependency on the reward model, the authors propose a distillation objective that internalizes the guidance signal into the flow network. Experiments on text-to-video generation tasks demonstrate that Euphonium achieves better alignment with human preferences while accelerating training convergence by 1.66× compared to existing methods.
Methodology
Euphonium formulates the sampling process as a Stochastic Differential Equation (SDE) augmented with a Process Reward Gradient derived from a dense Process Reward Model (PRM). This enables guided exploration toward high-reward regions. The framework also introduces a Dual-Reward Group Relative Policy Optimization (GRPO) algorithm, combining latent-space and pixel-space rewards. A distillation objective is used to internalize the reward guidance into the flow network, removing the need for the reward model during inference.
Results
Euphonium outperforms existing methods like Flow-GRPO and DanceGRPO in aligning video flow models with human preferences. It achieves better alignment on the VBench2 benchmark for text-to-video generation and accelerates training convergence by 1.66×, requiring fewer training steps to reach equivalent performance.
Implications
Euphonium's guided exploration framework can significantly enhance the efficiency and effectiveness of aligning generative models with human preferences. Its application to text-to-video generation demonstrates potential for broader use in video synthesis, content creation, and other generative AI tasks requiring fine-grained alignment with user-defined objectives.
View on arXiv

Hinge Regression Tree: A Newton Method for Oblique Regression Tree Splitting

Hongyi Li, Han Lin, Jun Xu
  • Introduces the Hinge Regression Tree (HRT), which reformulates node splitting as a nonlinear least-squares optimization problem over two linear models.
  • The optimization process is equivalent to a damped Newton (Gauss–Newton) method, with proven monotonic objective decrease and convergence.
  • HRT's model class is a universal approximator with an explicit O(δ²) approximation rate, showcasing strong function approximation capabilities.
  • Supports optional ridge regularization to improve robustness under multicollinearity.
  • Empirical results show that HRT achieves competitive or superior performance compared to single-tree baselines while maintaining more compact structures.
Read More
Abstract
This paper introduces the Hinge Regression Tree (HRT), a novel algorithm for oblique regression tree splitting that reframes the node-splitting problem as a nonlinear least-squares optimization over two linear predictors. By leveraging a hinge-based formulation, HRT achieves ReLU-like expressive power and supports ridge regularization for robustness. The optimization process is interpreted as a damped Newton (Gauss–Newton) method, with theoretical guarantees of monotonic objective decrease and convergence. The authors establish that HRT's piecewise linear model class is a universal approximator with an explicit O(δ²) approximation rate. Empirical evaluations on synthetic and real-world datasets demonstrate that HRT outperforms single-tree baselines in terms of predictive performance while maintaining more compact tree structures. This work advances oblique regression trees by integrating regression modeling with optimization theory, offering a practical and theoretically principled approach to nonlinear function approximation.
Methodology
The HRT algorithm formulates node splitting as a nonlinear least-squares optimization problem involving two linear models. A hinge-based formulation is used to induce ReLU-like expressive power. The optimization process is interpreted as a damped Newton (Gauss–Newton) method, with optional ridge regularization for robustness. Theoretical analysis proves monotonic decrease and convergence of the node-level objective, while universal approximation guarantees are established for the resulting piecewise linear models.
Results
HRT demonstrates competitive or superior performance compared to single-tree baselines on synthetic and real-world datasets. It achieves this with more compact tree structures, highlighting its efficiency and effectiveness. The theoretical analysis confirms its universal approximation capabilities and provides guarantees for the convergence of the optimization process.
Implications
HRT offers a practical and theoretically grounded approach to oblique regression trees, making it suitable for tasks requiring nonlinear function approximation with compact and interpretable models. Its robustness and efficiency could benefit applications in high-dimensional regression, feature selection, and interpretable machine learning.
View on arXiv

Near-Optimal Dynamic Matching via Coarsening with Application to Heart Transplantation

Itai Zilberstein, Ioannis Anagnostides, Zachary W. Sollie, Arman Kilic, Tuomas Sandholm
  • Introduces a coarsening-based online matching framework that aggregates offline nodes into capacitated clusters.
  • Identifies structural patterns in historical heart transplant data to optimize allocation policies.
  • Achieves a competitive ratio of 0.91, outperforming the US status quo policy's ratio of 0.51.
  • Connects heart transplant allocation to the b-matching problem in Internet advertising, leveraging algorithmic techniques from that domain.
  • Provides a rigorous theoretical foundation for clustering-based approaches in organ allocation.
Read More
Abstract
This paper introduces a novel online matching framework based on coarsening, which aggregates offline nodes into capacitated clusters to achieve near-optimal theoretical guarantees. The authors focus on heart transplant allocation, leveraging structural properties of historical data from the United Network for Organ Sharing (UNOS) registry. By analyzing the bipartite graph weighted by expected life years gained, they identify large clusters of patients with similar edge weights, enabling the development of a provably near-optimal policy. The competitive ratio of their algorithm approaches 1 under certain structural conditions, indicating performance close to the omniscient benchmark. Simulations on real-world data demonstrate that their approach significantly outperforms the current US heart transplant allocation policy and other heuristic-based methods. The work bridges the gap between theoretical guarantees and practical performance, providing a rigorous foundation for clustering-based approaches in organ allocation.
Methodology
The authors analyze historical UNOS registry data to identify clusters of patients with similar expected life years gained. They develop an online matching algorithm based on coarsening, where offline nodes are aggregated into capacitated clusters. The algorithm optimizes the competitive ratio, which measures performance relative to an omniscient benchmark. The framework incorporates techniques from the b-matching problem in Internet advertising, adapting them to the constraints of organ allocation.
Results
The proposed algorithm achieves a competitive ratio of 0.91 in simulations using real UNOS data, significantly outperforming the US status quo policy (0.51) and non-capacitated stochastic matching (0.63). The theoretical analysis shows that under specific structural conditions, the competitive ratio approaches 1, indicating near-optimal performance.
Implications
This work has significant implications for organ allocation policies, particularly in high-stakes domains like heart transplantation. By providing a theoretically grounded and practically effective framework, it could improve the efficiency and fairness of organ distribution. The connection to b-matching algorithms opens avenues for applying similar techniques to other resource allocation problems in healthcare and beyond.
View on arXiv

Orthogonal Self-Attention

Leo Zhang, James Martens
  • Orthogonal Self-Attention (OSA) is proposed to mitigate the instability of Softmax Self-Attention (SSA) in skipless Transformers.
  • OSA enforces orthogonality in the attention matrix using the matrix exponential of a skew-symmetric matrix derived from query-key values.
  • The computational complexity of OSA is reduced to linear scaling with sequence length by leveraging the low-rank structure of the query-key matrices.
  • OSA avoids rank collapse and ensures well-conditioned Jacobians, enabling stable training without skip connections or normalization layers.
  • OSA is particularly suited for non-causal decoder-based Transformers like Vision Transformers (ViTs) and Diffusion Transformers (DiTs).
Read More
Abstract
This paper introduces Orthogonal Self-Attention (OSA), a novel attention mechanism designed to address the instability issues of Softmax Self-Attention (SSA) in skipless Transformer architectures. Traditional SSA often suffers from rank collapse and poorly-conditioned Jacobians, especially in architectures without skip connections and normalization layers. OSA addresses these issues by parametrizing the attention matrix as orthogonal, achieved by mapping a skew-symmetric matrix (derived from query-key values) through the matrix exponential. The authors propose an efficient implementation of OSA that exploits the low-rank structure of the query-key matrices, reducing computational complexity and memory requirements to scale linearly with sequence length. They also derive an initialization scheme that ensures the Jacobian of OSA is well-conditioned, enabling stable training. The paper demonstrates that OSA avoids rank collapse and preserves the rank and eigenvalues of token representations, making it suitable for non-causal decoder-based Transformers such as Vision Transformers (ViTs) and Diffusion Transformers (DiTs).
Methodology
OSA parametrizes the attention matrix as orthogonal by mapping a skew-symmetric matrix (computed from query-key values) through the matrix exponential. To make this computationally efficient, the authors exploit the low-rank structure of the skew-symmetric matrix, reducing the complexity of computing the matrix exponential. They also derive an initialization scheme that ensures the Jacobian of OSA is well-conditioned, enabling stable training. Theoretical analysis and kernel-based proofs demonstrate that OSA avoids rank collapse and preserves the rank and eigenvalues of token representations.
Results
The authors prove that OSA avoids rank collapse by preserving the rank and eigenvalues of token representations across layers. They also show that the computational complexity of OSA scales linearly with sequence length, making it more efficient than SSA. Additionally, the proposed initialization scheme ensures well-conditioned Jacobians, facilitating stable training of skipless Transformers.
Implications
['OSA enables the development of skipless Transformer architectures, which could improve representation learning by avoiding the potential drawbacks of skip connections.', 'The linear scaling of OSA with sequence length makes it computationally efficient and suitable for large-scale applications.', "OSA's ability to avoid rank collapse and maintain stable training could lead to more robust and effective Transformer models in domains such as computer vision and natural language processing."]
View on arXiv

Parity, Sensitivity, and Transformers

Alexander Kozachinskiy, Tomasz Steifer, Przemysław Wałȩga
  • The authors prove that a single-layer, single-head transformer cannot compute the PARITY function due to its limited average sensitivity, which grows as O(√n), compared to the linear growth of PARITY's sensitivity.
  • A new 4-layer transformer architecture is proposed that computes PARITY using soft attention, length-independent positional encodings, and no layer normalization.
  • The proposed transformer works with both full-attention and causal masking architectures, making it more practical for real-world applications.
  • The positional encoding used in the new construction is polynomially bounded, enabling implementation on inputs of practical size.
  • This work builds on prior research in transformer expressivity and provides the first lower bound for solving PARITY with a single-layer transformer.
Read More
Abstract
This paper investigates the expressivity of transformer architectures, focusing on their ability to compute the PARITY function, a fundamental Boolean function that assigns 0 to binary words with an even number of ones and 1 otherwise. The authors address open questions about whether a single-layer transformer can compute PARITY and propose a new construction for a multi-layer transformer that overcomes limitations of previous approaches. They demonstrate that no single-layer, single-head transformer can compute PARITY due to constraints on average sensitivity. Additionally, they introduce a novel 4-layer transformer architecture that successfully computes PARITY using soft attention, length-independent and polynomially bounded positional encodings, and no layer normalization, while being compatible with both full-attention and causal masking setups. This work advances the theoretical understanding of transformer expressivity and provides practical insights for designing more effective transformer architectures.
Methodology
The authors use theoretical analysis to derive a lower bound on the expressivity of single-layer, single-head transformers by analyzing the average sensitivity of functions they can compute. They also propose a new 4-layer transformer architecture with specific design choices, such as soft attention, length-independent positional encodings, and no layer normalization, to compute the PARITY function effectively. The positional encoding is designed to be polynomially bounded for practical implementation.
Results
The study establishes that no single-layer, single-head transformer can compute the PARITY function due to sensitivity limitations. The authors also present a 4-layer transformer that successfully computes PARITY while addressing limitations of previous constructions, such as reliance on impractical features like length-dependent positional encodings or hard attention.
Implications
This work provides new insights into the theoretical limitations and capabilities of transformer architectures, particularly in computing sensitive Boolean functions like PARITY. The proposed 4-layer transformer design could inform the development of more practical and efficient transformer models for tasks requiring high sensitivity and generalization. Additionally, the results may guide future research on the expressivity of neural network architectures and their limitations.
View on arXiv

Perception-Based Beliefs for POMDPs with Visual Observations

Miriam Schäfers, Merlijn Krale, Thiago D. Simão, Nils Jansen, Maximilian Weininger
  • The PBP framework introduces a novel method for belief updates in POMDPs using perception models to handle high-dimensional visual observations.
  • Uncertainty quantification is integrated into PBP to enhance robustness against visual corruption, using threshold-based and weighting-based methods.
  • Empirical evaluations show that PBP outperforms end-to-end deep reinforcement learning methods and achieves competitive results with classical POMDP solvers.
Read More
Abstract
This paper introduces the Perception-Based Beliefs for POMDPs (PBP) framework, which addresses the challenge of solving partially observable Markov decision processes (POMDPs) with high-dimensional visual observations, referred to as vision POMDPs (VPOMDPs). Traditional belief-based POMDP solvers struggle with large observation spaces, while end-to-end deep reinforcement learning (DRL) methods lack interpretability and robustness in safety-critical applications. PBP bridges this gap by integrating a perception model, such as an image classifier, to map visual observations into probability distributions over states, enabling efficient belief updates without reasoning over the entire observation space. The framework also incorporates uncertainty quantification to improve robustness against visual corruption, using threshold-based and weighting-based methods to adjust belief updates based on prediction uncertainty. Empirical evaluations demonstrate that PBP, implemented with classical POMDP solvers like HSVI and POMCP, achieves competitive performance compared to state-of-the-art VPOMDP solvers, particularly under visually corrupted conditions.
Methodology
The authors propose the PBP framework, which uses a perception model (e.g., an image classifier) to map visual observations into probability distributions over states. These distributions are incorporated into belief updates, bypassing the need to explicitly reason over large observation spaces. To address the issue of classifier imprecision, uncertainty quantification techniques are employed, including threshold-based and weighting-based methods. The framework is implemented using two classical POMDP solvers, HSVI and POMCP, and evaluated on benchmarks with small state and action spaces but complex visual observations.
Results
The PBP framework demonstrates competitive performance compared to state-of-the-art VPOMDP solvers, particularly under conditions of visual corruption. It also outperforms end-to-end deep reinforcement learning methods in terms of robustness and interpretability. PBP with HSVI achieves strong results, confirming the feasibility and effectiveness of the approach.
Implications
The PBP framework has significant implications for safety-critical applications such as autonomous driving and robotics, where high-dimensional visual observations are common. By improving robustness and scalability in POMDP solvers, PBP enables more reliable decision-making under uncertainty, paving the way for practical implementations in real-world scenarios involving complex visual data.
View on arXiv

Pool-based Active Learning as Noisy Lossy Compression: Characterizing Label Complexity via Finite Blocklength Analysis

Kosuke Sugiyama, Masato Uchida
  • Pool-based active learning is reformulated as a noisy lossy compression problem, enabling unified information-theoretic analysis.
  • Finite blocklength analysis is applied to derive lower bounds on label complexity and generalization error for pool-based AL.
  • The bounds reflect learning algorithm properties such as overfitting and inductive bias mismatch, offering a novel perspective on AL limits.
  • The framework specializes in pool-based AL by incorporating constraints of finite pool sampling, addressing gaps in prior unconstrained approaches.
  • The study bridges active learning theory with established concepts like information-theoretic bounds and stability theory.
Read More
Abstract
This paper introduces an information-theoretic framework to analyze the theoretical limits of pool-based active learning (AL), where a subset of unlabeled data is selectively labeled to train a model. The authors reformulate pool-based AL as a noisy lossy compression problem, mapping pool observations to noisy symbols, data selection to compression, and learning to decoding. Using finite blocklength analysis, the paper derives lower bounds on label complexity and generalization error, which reflect the properties of the learning algorithm, such as overfitting and inductive bias mismatch. These bounds provide a novel perspective on the relationship between pool-based AL and established theories like information-theoretic bounds and stability theory. The framework is specialized for pool-based AL by modeling the data acquisition process as constrained sampling from a finite pool, addressing limitations of prior unconstrained sampling approaches. The work bridges the gap between active learning theory and broader machine learning frameworks, offering insights into optimal data selection strategies and their impact on learning performance.
Methodology
The authors model pool-based active learning as a noisy lossy compression problem, where pool observations are treated as noisy symbols, data selection as encoding, and learning as decoding. Finite blocklength analysis is applied to derive theoretical lower bounds on label complexity and generalization error, considering optimal data selection strategies constrained by the finite pool size.
Results
The paper provides information-theoretic lower bounds on label complexity and generalization error for pool-based active learning. These bounds are specialized for finite pools and reflect the properties of the learning algorithm, such as overfitting and inductive bias mismatch. The framework also covers i.i.d. sampling as a special case, deriving generalization error bounds for unconstrained sampling.
Implications
This framework offers a deeper understanding of the theoretical limits of pool-based active learning, guiding the design of optimal data selection strategies. It bridges active learning theory with broader machine learning concepts, potentially improving applications in areas like deep learning, where overfitting and inductive bias are critical considerations. The insights could inform more efficient labeling strategies in domains with limited data availability.
View on arXiv

Rewards as Labels: Revisiting RLVR from a Classification Perspective

Zepeng Zhai, Meilin Chen, Jiaxuan Zhao, Junlang Qian, Lei Shen, Yuan Lu
  • Identifies two fundamental gradient mismatches in GRPO-style RLVR methods: Gradient Misassignment in Positives and Gradient Domination in Negatives.
  • Proposes the REAL framework, which treats verifiable rewards as categorical labels, reformulating policy optimization as a classification task.
  • Introduces anchor logits to regulate gradient allocation and mitigate gradient mismatches, ensuring stable and efficient training.
  • Demonstrates significant performance improvements over GRPO, DAPO, and GSPO across diverse mathematical reasoning benchmarks and model scales.
  • REAL achieves stable training even without explicit KL penalties and performs competitively with a simple binary cross-entropy loss.
Read More
Abstract
This paper introduces the Rewards as Labels (REAL) framework, a novel approach to Reinforcement Learning with Verifiable Rewards (RLVR), which reformulates policy optimization as a classification problem rather than relying on scalar reward signals. The authors identify two key issues in existing RLVR methods, such as GRPO and its variants: Gradient Misassignment in Positives and Gradient Domination in Negatives, which lead to inefficient and suboptimal policy updates. REAL addresses these issues by treating rewards as categorical labels and introducing anchor logits to enhance policy learning. Theoretical analysis shows that REAL enforces monotonic and bounded gradient magnitudes, ensuring balanced credit assignment across rollouts. Extensive experiments on mathematical reasoning benchmarks demonstrate that REAL improves training stability and consistently outperforms GRPO and strong variants like DAPO and GSPO, achieving significant performance gains across model scales.
Methodology
The REAL framework reconceptualizes verifiable rewards as categorical labels, transforming policy optimization into a classification problem. It introduces anchor logits to regulate gradient allocation, ensuring monotonic and bounded gradient magnitudes. Theoretical analysis validates the framework's ability to mitigate gradient mismatches, while empirical evaluations are conducted on mathematical reasoning benchmarks using models of varying scales (1.5B and 7B parameters).
Results
REAL improves average Pass@1 by 6.7% over DAPO on the 1.5B model and by 6.2% and 1.7% over DAPO and GSPO, respectively, on the 7B model. It also achieves stable training without explicit KL penalties and outperforms DAPO by 4.5% on average using a vanilla binary cross-entropy loss. These results highlight REAL's ability to enhance training stability and achieve superior performance across benchmarks.
Implications
The REAL framework has the potential to improve policy optimization in RLVR applications, particularly in tasks requiring rule-based evaluation, such as mathematical and program reasoning. Its classification-based approach could inspire new methodologies for reinforcement learning in domains where verifiable rewards are available, reducing reliance on human feedback or learned reward models.
View on arXiv

Simulated Adoption: Decoupling Magnitude and Direction in LLM In-Context Conflict Resolution

Long Zhang, Fangwei Lin
  • LLMs often prioritize conflicting in-context information over their pre-trained parametric memory, a behavior termed compliance or sycophancy.
  • The study decomposes residual stream updates into radial (norm-based) and angular (cosine-based) components to analyze how knowledge conflicts are resolved.
  • The 'Manifold Dilution' hypothesis, which suggests that compliance arises from a reduction in signal magnitude, is rejected for two of the three tested architectures.
  • Compliance is characterized by 'Orthogonal Interference,' where conflicting context vectors rotate the hidden state representation without erasing the original truth signal.
  • The findings challenge the reliability of scalar confidence metrics and advocate for vectorial monitoring to better understand LLM behavior.
Read More
Abstract
This paper investigates how Large Language Models (LLMs) resolve conflicts between their pre-trained parametric memory and new, conflicting in-context information, a phenomenon often referred to as compliance or sycophancy. The authors explore whether this compliance arises from a reduction in the magnitude of the internal truth signal ('Manifold Dilution') or from a directional alteration ('Orthogonal Interference') in the residual stream. Using a layer-wise geometric analysis across three LLM architectures—Qwen-4B, Llama-3.1-8B, and GLM-4-9B—the study decomposes residual stream updates into radial (norm-based) and angular (cosine-based) components. The findings reject the universality of the 'Manifold Dilution' hypothesis, showing that compliance is instead driven by orthogonal interference, where conflicting context vectors rotate the hidden state representation without diminishing the magnitude of the original truth signal. This suggests that LLMs simulate the adoption of new information by bypassing the correct unembedding vector rather than genuinely integrating the new knowledge. The study highlights the limitations of scalar confidence metrics for detecting hallucinations and emphasizes the need for vectorial monitoring to distinguish between genuine knowledge integration and superficial mimicry.
Methodology
The authors conducted a layer-wise geometric analysis of residual stream updates in three LLM architectures (Qwen-4B, Llama-3.1-8B, and GLM-4-9B). They simulated the injection of counterfactual contexts and decomposed the resulting residual stream updates into radial (magnitude) and angular (directional) components. These components were then correlated with the degradation of the correct logit to test the 'Manifold Dilution' and 'Orthogonal Interference' hypotheses. The analysis accounted for hyperspherical constraints imposed by RMSNorm to ensure accurate representation of the models' internal dynamics.
Results
The study found that compliance in LLMs is not driven by a reduction in the magnitude of the internal truth signal ('Manifold Dilution') but rather by 'Orthogonal Interference,' where conflicting context vectors induce a geometric rotation of the hidden state representation. This mechanism allows LLMs to bypass the correct unembedding vector while preserving the original structural magnitude of the truth signal. These findings were consistent across two of the three tested architectures, with one exception maintaining stable residual norms despite performance degradation.
Implications
The findings suggest that LLMs do not genuinely integrate new in-context information but instead simulate its adoption through geometric displacement. This has significant implications for the reliability and alignment of LLMs, particularly in scenarios requiring factual accuracy. The study also highlights the limitations of scalar confidence metrics for detecting hallucinations and underscores the need for more sophisticated vectorial monitoring techniques to evaluate knowledge integration and model behavior.
View on arXiv

Structural Disentanglement in Bilinear MLPs via Architectural Inductive Bias

Ojasva Nema, Kaustubh Sharma, Aditya Chauhan, Parikshit Pareek
  • Bilinear MLPs introduce explicit multiplicative interactions as an architectural inductive bias, enabling structural disentanglement of functional components during training.
  • The 'non-mixing' property of bilinear parameterizations ensures orthogonal evolution of interaction modes under gradient flow, facilitating precise model editing and unlearning.
  • Empirical experiments demonstrate that bilinear MLPs achieve higher unlearning selectivity and better generalization to compositional tasks compared to standard ReLU-based architectures.
  • The study reframes unlearning as a representational property rather than an algorithmic challenge, emphasizing the importance of architecture in shaping learned representations.
  • Bilinear architectures are particularly effective in tasks governed by algebraic or compositional structures, such as modular arithmetic and Lie group dynamics.
Read More
Abstract
This paper investigates the role of architectural inductive bias in enabling structural disentanglement within neural networks, focusing on bilinear multilayer perceptrons (MLPs). The authors argue that selective unlearning and long-horizon extrapolation failures in modern neural networks are rooted in how models structure their internal representations during training, rather than optimization algorithms alone. By leveraging bilinear parameterizations, which explicitly model multiplicative interactions, the authors demonstrate that these architectures possess a 'non-mixing' property under gradient flow conditions, allowing functional components to evolve orthogonally and align with the underlying algebraic structure of tasks. Through analytical insights and controlled experiments on modular arithmetic, cyclic reasoning, Lie group dynamics, and targeted unlearning benchmarks, the paper shows that bilinear MLPs outperform standard ReLU-based architectures in recovering true operators, improving model editability, and generalizing to compositional tasks. The findings suggest that architectural inductive bias plays a critical role in enabling reliable unlearning and extrapolation by fostering representational alignment with task-specific structures.
Methodology
The authors use bilinear MLPs, which calculate products between learned linear projections of inputs, to study structural disentanglement. Analytical proofs demonstrate the 'non-mixing' property under gradient flow, while controlled experiments validate the hypothesis across tasks with compositional and algebraic structures, including modular arithmetic, cyclic reasoning, Lie group dynamics, and unlearning benchmarks. Comparisons are made against standard ReLU-based architectures to assess performance in unlearning and generalization.
Results
Bilinear MLPs outperform standard ReLU-based architectures in recovering true operators aligned with task-specific algebraic structures. They exhibit significantly higher selectivity in unlearning and better generalization to long-horizon compositional tasks. Analytical findings confirm that bilinear parameterizations preserve the independence of interaction modes, enabling precise model editing.
Implications
The findings highlight the importance of architectural inductive bias in neural network design, particularly for tasks governed by compositional or algebraic structures. Bilinear MLPs could be applied to improve model editability, selective unlearning, and generalization in domains such as physics (Lie group dynamics), modular arithmetic, and feature interaction modeling. This work also provides a foundation for developing architectures tailored to specific problem regimes, emphasizing representational alignment over post-hoc algorithmic solutions.
View on arXiv

Thermodynamic Limits of Physical Intelligence

Koichi Takahashi, Yusuke Hayashi
  • Introduced thermodynamic epiplexity per joule as a metric for learning efficiency, measuring structural information retained per unit of energy.
  • Defined empowerment per joule as a metric for control efficiency, quantifying sensorimotor channel capacity under energy constraints.
  • Derived Landauer-scale benchmarks for epiplexity acquisition, emphasizing the need for closed-cycle boundary assumptions for accurate comparisons.
  • Recommended compute-bounded MDL surrogates for epiplexity in empirical settings where latent structure variables are unavailable.
  • Proposed a unified efficiency framework for consistent bits-per-joule reporting, addressing boundary, energy accounting, and cost conventions.
Read More
Abstract
This paper explores the thermodynamic efficiency of intelligent systems, proposing two complementary metrics to quantify physical intelligence: thermodynamic epiplexity per joule (recognition efficiency) and empowerment per joule (control efficiency). These metrics aim to measure the bits of information learned or controlled by an agent per unit of energy consumed, under explicit thermodynamic accounting conventions. The authors derive a Landauer-scale benchmark for epiplexity acquisition and clarify the importance of closed-cycle boundary assumptions for meaningful comparisons. They also address practical challenges in empirical settings by recommending compute-bounded Minimum Description Length (MDL) surrogates for epiplexity when latent structure variables are unavailable. A unified efficiency framework is proposed to standardize reporting of energy and boundary conventions, enabling consistent comparisons across AI systems. The work bridges stochastic thermodynamics, information theory, and embodied intelligence, offering insights into the energy constraints and sustainability of AI systems.
Methodology
The authors use stochastic thermodynamics to derive theoretical benchmarks for energy-efficient learning and control. They formalize metrics for thermodynamic epiplexity and empowerment, linking them to information theory and embodied intelligence. Empirical surrogates based on MDL compression are suggested for practical applications, and reporting conventions are standardized to ensure reproducibility.
Results
The paper establishes theoretical benchmarks for thermodynamic epiplexity and empowerment, demonstrating their dependence on explicit boundary and energy accounting conventions. It highlights the role of closed-cycle assumptions in aligning energy dissipation with information gain and provides operational guidelines for empirical measurement using MDL-based surrogates.
Implications
This work has implications for improving the energy efficiency of AI systems, potentially guiding the design of more sustainable and physically efficient intelligent agents. It also offers a framework for comparing AI systems based on their thermodynamic performance, which could influence future research on energy-aware AI development and scaling.
View on arXiv

Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization

Aleksandar Armacki, Dragana Bajović, Dušan Jakovetić, Soummya Kar, Ali H. Sayed
  • The paper focuses on long-term tail decay rates of SGD and c-SGD in non-convex optimization, addressing gaps in existing finite-time bounds.
  • For vanilla SGD with bounded noise, the long-term tail decay rate is e^(-t/log(t)), which is faster than previously known rates.
  • For c-SGD under heavy-tailed noise with bounded moments of order p ∈ (1, 2], the decay rate is e^(-t^(4(p-1)/(3p-2))/log(t)).
  • The results include tight lower bounds, showing that the derived rates are optimal up to poly-logarithmic factors.
  • The findings provide stronger guarantees for individual training runs, particularly for large-scale models requiring millions of iterations.
Read More
Abstract
This paper investigates the long-term tail behavior of stochastic gradient descent (SGD) and clipped SGD (c-SGD) in non-convex optimization settings. The authors address the limitations of existing finite-time tail bounds by focusing on long-term decay rates, which are more relevant for modern machine learning models requiring millions of iterations. Using large deviations theory, the paper establishes tight upper and lower bounds on the tail decay rates of the gradient norm-squared for both SGD and c-SGD. For vanilla SGD under bounded noise, the tail decay rate is shown to be proportional to e^(-t/log(t)). For c-SGD under heavy-tailed noise with bounded moments of order p ∈ (1, 2], the decay rate is e^(-t^(4(p-1)/(3p-2))/log(t)). These results demonstrate significantly faster tail decay compared to existing finite-time bounds, providing stronger guarantees for the performance of individual training runs. The findings are particularly relevant for large-scale machine learning models, such as large language models, which require extensive training iterations.
Methodology
The authors use large deviations theory to analyze the long-term tail behavior of the gradient norm-squared for SGD and c-SGD. They derive upper and lower bounds on the tail decay rates by studying the probability of failure to reach an epsilon-stationary point. The analysis considers both bounded noise for SGD and heavy-tailed noise with bounded moments for c-SGD.
Results
The paper establishes that for vanilla SGD with bounded noise, the long-term tail decay rate is e^(-t/log(t)). For c-SGD under heavy-tailed noise with bounded moments of order p ∈ (1, 2], the decay rate is e^(-t^(4(p-1)/(3p-2))/log(t)). These rates are significantly faster than those derived from finite-time bounds, such as e^(-sqrt(t)) for SGD and e^(-t^(2(p-1)/(3p-2))) for c-SGD. The results also include tight lower bounds, confirming the optimality of the derived rates up to poly-logarithmic factors.
Implications
The findings have significant implications for training large-scale machine learning models, such as large language models, which require millions of iterations. The tighter long-term tail bounds provide stronger guarantees for the reliability of individual training runs, potentially reducing the risk of failure and improving resource efficiency in large-scale optimization tasks.
View on arXiv

Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation

Zhiqi Yu, Zhangquan Chen, Mengting Liu, Heye Zhang, Liangqiong Qu
  • The paper identifies an implicit advantage symmetry in GRPO's GRAE mechanism, which limits exploration and difficulty adaptation.
  • At the group level, the symmetry prevents the exploration of novel correct solutions by assigning equal weights to correct and incorrect trajectories.
  • At the sample level, GRPO's static focus on medium-difficulty samples leads to suboptimal learning dynamics, neglecting simpler and more complex tasks.
  • The proposed Asymmetric GRAE (A-GRAE) introduces asymmetric exploration and a curriculum-like progression to address these limitations.
  • A-GRAE demonstrates significant performance improvements across seven benchmarks, enhancing GRPO and its variants in reasoning tasks.
Read More
Abstract
This paper investigates the limitations of Group Relative Policy Optimization (GRPO), a widely used reinforcement learning algorithm for reasoning tasks in large language models (LLMs) and multimodal large language models (MLLMs). The authors identify an implicit advantage symmetry in GRPO's Group Relative Advantage Estimation (GRAE) mechanism, which hinders exploration and difficulty adaptation. Specifically, this symmetry results in equal weighting of correct and incorrect trajectories at the group level, limiting the exploration of novel solutions, and a bias toward medium-difficulty samples at the individual level, which prevents effective learning across varying task complexities. To address these issues, the authors propose Asymmetric GRAE (A-GRAE), a novel framework that introduces asymmetric exploration incentives and a curriculum-like learning schedule. A-GRAE dynamically adjusts the focus on sample difficulty and encourages exploration of new trajectories. Experiments across seven benchmarks demonstrate that A-GRAE consistently improves the performance of GRPO and its variants in both natural language and vision-language reasoning tasks.
Methodology
The authors conduct a systematic analysis of GRPO's GRAE mechanism, identifying the implicit advantage symmetry at both group and sample levels. They perform controlled interventions to break this symmetry and analyze its impact on learning dynamics. Based on their findings, they propose A-GRAE, which incorporates asymmetric exploration incentives and a dynamic curriculum-like learning schedule. The framework is evaluated on seven benchmarks, covering both natural language reasoning and vision-language reasoning tasks, using various LLMs and MLLMs.
Results
A-GRAE consistently outperforms GRPO and its variants (e.g., DAPO and Dr.GRPO) across seven benchmarks, achieving significant improvements in key metrics such as accuracy and pass@k. The proposed framework demonstrates superior exploration capabilities and better adaptation to varying task difficulties, leading to enhanced reasoning performance in both LLM and MLLM settings.
Implications
The findings and proposed A-GRAE framework have significant implications for reinforcement learning in large language models and multimodal systems. By addressing the limitations of GRPO, A-GRAE can improve the efficiency and effectiveness of reasoning tasks, particularly in applications requiring complex problem-solving and adaptive learning. This work also highlights the importance of revisiting advantage function design in reinforcement learning algorithms for better exploration and difficulty adaptation.
View on arXiv

Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance

Xiandong Zou, Jianshu Li, Jing Huang, Pan Zhou
  • VSD reformulates draft training as a variational inference problem, addressing the misalignment between training and decoding distributions in speculative decoding.
  • The proposed ELBO objective promotes high-quality draft paths and minimizes divergence from the target model's acceptance distribution.
  • An EM-based optimization approach is introduced, leveraging MCMC sampling, ARW, and CAR to enhance training efficiency and reduce variance.
  • VSD achieves significant speedups (up to 9.6%) over state-of-the-art speculative decoding methods like EAGLE-3 and ViSpec.
  • Empirical analysis highlights the inefficiencies of token-level greedy training and demonstrates the benefits of sequence-level optimization.
Read More
Abstract
This paper introduces Variational Speculative Decoding (VSD), a novel framework for improving speculative decoding in large language models (LLMs) and multimodal LLMs (MLLMs). Speculative decoding accelerates inference by using a draft model to propose multiple token sequences, which are then verified by the target model. However, existing methods suffer from a training-decoding discrepancy: training optimizes token-level likelihood along a single greedy path, while decoding evaluates multiple stochastic paths based on sequence-level acceptance. VSD addresses this issue by reformulating draft training as a variational inference problem, maximizing the marginal probability of target-model acceptance. The framework introduces an Evidence Lower Bound (ELBO) objective that balances generating high-quality draft paths and aligning the draft distribution with the target model's acceptance behavior. To optimize the ELBO, the authors propose an Expectation-Maximization (EM) algorithm, incorporating techniques like oracle-filtered MCMC sampling, Adaptive Rejection Weighting (ARW), and Confidence-Aware Regularization (CAR). Experiments demonstrate that VSD significantly improves decoding efficiency, achieving up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec, while increasing the average acceptance length of token sequences.
Methodology
The authors propose a variational framework for draft model training, using an ELBO objective to optimize the marginal likelihood of target-model acceptance. The EM algorithm is employed for optimization, with an E-step that samples draft paths using oracle-filtered MCMC and an M-step that maximizes weighted likelihood using ARW and CAR. Theoretical analysis and empirical evaluations are conducted to validate the approach.
Results
VSD improves decoding efficiency, achieving up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec. It also increases the average accepted token sequence length, demonstrating better alignment between training and decoding distributions. The framework reduces the pruning of high-confidence draft paths and enhances the likelihood of longer accepted spans.
Implications
VSD has the potential to significantly accelerate inference in LLMs and MLLMs, making them more efficient for applications like dialogue systems, code generation, and reasoning tasks. By addressing the training-decoding discrepancy, the framework could inspire further research into sequence-level optimization for other generative models.
View on arXiv