gistml

By James Asher

Daily summaries of the latest Machine Learning research papers from arXiv.

2026-02-09 • Found 24 papers

Assessing Electricity Demand Forecasting with Exogenous Data in Time Series Foundation Models

Wei Soon Cheong, Lian Lian Jiang, Jamie Ng Suat Ling
  • Foundation models exhibit mixed effectiveness in leveraging exogenous features for electricity demand forecasting, with performance varying across models, forecasting horizons, and geographic contexts.
  • Chronos-2 achieves the best performance among foundation models in zero-shot settings, but the baseline LSTM often outperforms all foundation models in stable climates like Singapore.
  • Model architecture plays a critical role in performance, with features like channel-mixing (TinyTimeMixers) and grouped attention (Chronos-2) consistently improving exogenous feature utilization.
  • Foundation models demonstrate advantages primarily in variable climates, suggesting geographic context significantly impacts their effectiveness.
  • The study underscores the need for domain-specific adaptations in time-series foundation models for energy applications.
Abstract
This paper evaluates the performance of time-series foundation models for electricity demand forecasting, with a focus on their ability to leverage exogenous features such as weather and temporal variables. The study compares several foundation models (MOIRAI, MOMENT, TinyTimeMixers, ChronosX, and Chronos-2) against a baseline LSTM with reversible instance normalization across electricity markets in Singapore and Australia. The evaluation is conducted at both hourly and daily granularities under three feature configurations: all features, selected features, and target-only. The results reveal that while foundation models like Chronos-2 perform well in zero-shot settings, the baseline LSTM often outperforms them in stable climates like Singapore, particularly for short-term forecasts. The study highlights the importance of model architecture (e.g., channel-mixing in TinyTimeMixers and grouped attention in Chronos-2) and geographic context in determining the effectiveness of exogenous feature integration. The findings challenge the assumption of universal superiority of foundation models and emphasize the need for domain-specific adaptations in the energy sector.
Methodology
The authors conducted an empirical evaluation of five time-series foundation models (MOIRAI, MOMENT, TinyTimeMixers, ChronosX, and Chronos-2) and a baseline LSTM with reversible instance normalization. The models were tested on electricity demand forecasting tasks in Singapore and Australian markets at hourly and daily granularities. Three feature configurations were used: all features, selected features, and target-only. The study assessed the models' ability to incorporate exogenous features and their performance across different geographic and climatic contexts.
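The baseline mentioned here pairs an LSTM with reversible instance normalization (RevIN), which normalizes each input window by its own statistics and inverts that transform on the forecast. Below is a minimal PyTorch sketch of that idea; the layer sizes, horizon, and feature handling are illustrative assumptions, not the paper's exact configuration.
```python
import torch
import torch.nn as nn

class RevIN(nn.Module):
    """Reversible instance normalization: normalize each window by its own
    statistics, then invert the transform on the model's forecast."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.affine_weight = nn.Parameter(torch.ones(num_features))
        self.affine_bias = nn.Parameter(torch.zeros(num_features))

    def normalize(self, x):                      # x: (batch, time, features)
        self.mean = x.mean(dim=1, keepdim=True)
        self.std = x.std(dim=1, keepdim=True) + self.eps
        return (x - self.mean) / self.std * self.affine_weight + self.affine_bias

    def denormalize(self, y):                    # y: (batch, horizon, features)
        return (y - self.affine_bias) / self.affine_weight * self.std + self.mean

class RevINLSTM(nn.Module):
    def __init__(self, num_features, hidden=64, horizon=24):
        super().__init__()
        self.revin = RevIN(num_features)
        self.lstm = nn.LSTM(num_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon * num_features)
        self.horizon, self.num_features = horizon, num_features

    def forward(self, x):
        x = self.revin.normalize(x)
        _, (h, _) = self.lstm(x)
        y = self.head(h[-1]).view(-1, self.horizon, self.num_features)
        return self.revin.denormalize(y)

# toy usage: 8 windows of 168 hourly steps, 1 target plus 3 exogenous features
out = RevINLSTM(num_features=4)(torch.randn(8, 168, 4))
print(out.shape)  # torch.Size([8, 24, 4])
```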
Results
Chronos-2 emerged as the best-performing foundation model in zero-shot settings, leveraging grouped attention for multivariate and covariate-informed forecasting. However, the baseline LSTM frequently outperformed all foundation models in Singapore's stable climate, particularly for short-term horizons. Foundation models showed greater advantages in variable climates, such as Australia, where exogenous features like weather play a more significant role. The effectiveness of exogenous feature integration was found to depend heavily on model architecture and geographic context.
Implications
The findings suggest that while time-series foundation models hold promise for electricity demand forecasting, their universal applicability is limited. Domain-specific adaptations and careful consideration of geographic and climatic contexts are essential for maximizing their effectiveness. These insights could guide the development of more specialized forecasting models for the energy sector, particularly in regions with stable or highly variable climates.
View on arXiv

CyIN: Cyclic Informative Latent Space for Bridging Complete and Incomplete Multimodal Learning

Ronghao Lin, Qiaolin He, Sijie Mai, Ying Zeng, Aolin Xiong, Li Huang, Yap-Peng Tan, Haifeng Hu
  • CyIN introduces an informative latent space using token- and label-level Information Bottlenecks to enhance multimodal learning by capturing task-relevant features and reducing noise.
  • The framework employs a cross-modal cyclic translation mechanism to reconstruct missing modalities, improving performance in incomplete multimodal scenarios.
  • CyIN unifies complete and incomplete multimodal learning into a single model, eliminating the need for separate models for different missing modality combinations.
  • Extensive experiments on four multimodal datasets demonstrate CyIN's superior performance and robustness across various missing modality scenarios.
  • The framework addresses key limitations of existing multimodal learning methods, such as sensitivity to missing modalities and performance trade-offs between complete and incomplete inputs.
Abstract
This paper introduces CyIN, a novel framework designed to address the challenges of multimodal learning in scenarios where some modalities may be missing during inference. Traditional multimodal models often assume the availability of all modalities during both training and inference, which is not always feasible in real-world applications. CyIN bridges the gap between complete and incomplete multimodal learning by constructing an informative latent space using token- and label-level Information Bottlenecks (IB). These bottlenecks enable the extraction of task-relevant features while filtering out noise. Additionally, CyIN employs a cross-modal cyclic translation mechanism to reconstruct missing modalities using the remaining ones, ensuring robustness in incomplete multimodal scenarios. The framework is unified, meaning it can handle both complete and incomplete multimodal inputs without requiring separate models for different missing modality combinations. Extensive experiments on four datasets demonstrate that CyIN achieves state-of-the-art performance in both complete and incomplete multimodal learning tasks, showcasing its robustness and generalizability.
Methodology
CyIN constructs an informative latent space using token- and label-level Information Bottlenecks (IB) to capture task-relevant features and remove noise. The token-level IB facilitates low-level multimodal interactions, while the label-level IB introduces high-level semantic guidance. The framework also employs a cross-modal cyclic translation mechanism, which reconstructs missing modalities by leveraging the remaining ones through forward and reverse propagation. This unified approach allows CyIN to optimize both complete and incomplete multimodal learning simultaneously.
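The Information Bottleneck component can be illustrated with the standard variational IB objective: compress features into a stochastic latent code while keeping it predictive of the label. A minimal sketch under that generic reading (the dimensions and the beta weight are illustrative assumptions; CyIN's token- and label-level variants are not reproduced):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalIB(nn.Module):
    """Encode features into a Gaussian latent z; train with task loss + beta * KL."""
    def __init__(self, in_dim, z_dim, num_classes, beta=1e-3):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * z_dim)   # predicts mean and log-variance
        self.cls = nn.Linear(z_dim, num_classes)
        self.beta = beta

    def forward(self, h, labels):
        mu, logvar = self.enc(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        task = F.cross_entropy(self.cls(z), labels)
        return task + self.beta * kl

loss = VariationalIB(256, 64, 7)(torch.randn(32, 256), torch.randint(0, 7, (32,)))
```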
Results
CyIN achieves state-of-the-art performance on four multimodal datasets, demonstrating its effectiveness in both complete and incomplete multimodal learning scenarios. The framework shows significant robustness to missing modalities and outperforms existing methods in terms of generalization and handling dynamic missing modality conditions.
Implications
CyIN has significant implications for real-world applications where multimodal data is often incomplete or unreliable, such as healthcare, autonomous driving, and human-computer interaction. By addressing the challenges of missing modalities, CyIN enables more robust and efficient multimodal systems, potentially improving their adoption in dynamic and unpredictable environments.
View on arXiv

Decoupled Orthogonal Dynamics: Regularization for Deep Network Optimizers

Hao Chen, Jinghui Yuan, Hanmin Zhang
  • Identifies the 'Radial Tug-of-War' issue in AdamW, where parameter norm growth and directional updates conflict, causing radial oscillations and noise.
  • Proposes Orthogonal Dynamics Decoupling (ODD) to separate radial norm control from tangential feature learning.
  • Introduces AdamO, an optimizer that uses SGD-style updates for radial dynamics and adaptive preconditioning for tangential updates.
  • Incorporates curvature-adaptive radial step sizing and architecture-aware rules to handle scale-invariant layers and low-dimensional parameters.
  • Empirical results show that AdamO improves generalization and stability over AdamW across vision and language tasks.
Abstract
This paper introduces a novel optimization algorithm, AdamO, which addresses the limitations of the widely-used AdamW optimizer. The authors identify a fundamental issue in AdamW, termed the 'Radial Tug-of-War,' where the entanglement of parameter norm growth and directional updates leads to radial oscillations and noise in second-moment estimates. To resolve this, the authors propose Orthogonal Dynamics Decoupling (ODD), which separates the radial (norm) and tangential (directional) dynamics of parameter updates. AdamO implements this decoupling by using an SGD-style update for radial norm control and confining Adam's adaptive preconditioning to the tangential subspace. Additionally, AdamO introduces curvature-adaptive radial step sizing and architecture-aware rules to improve optimization stability and generalization. Experimental results on vision and language tasks demonstrate that AdamO outperforms AdamW in terms of generalization and stability without adding significant computational overhead or complexity.
Methodology
The authors propose a radial-tangential decomposition of parameter updates, where gradients are projected into radial and tangential subspaces. Radial updates are treated as a one-dimensional control problem using SGD-style updates with curvature-adaptive step sizes, while tangential updates use Adam's adaptive preconditioning. The optimizer also applies architecture-aware rules to handle specific parameter structures, such as scale-invariant layers. The method avoids complex constraints by maintaining subspace closure for gradients, states, and updates.
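The radial-tangential decomposition at the heart of ODD can be written down directly: project the gradient onto the current weight direction (radial part), keep the orthogonal remainder (tangential part), and update each with a different rule. A schematic sketch of one step, assuming a plain SGD radial update and an Adam-style tangential update; AdamO's curvature-adaptive step sizing and architecture-aware rules are not reproduced here.
```python
import torch

def odd_step(w, grad, exp_avg, exp_avg_sq, lr=1e-3, radial_lr=1e-3,
             betas=(0.9, 0.999), eps=1e-8):
    """One decoupled update: SGD along the weight direction, Adam orthogonal to it."""
    w_dir = w / (w.norm() + eps)
    g_radial = (grad * w_dir).sum() * w_dir        # component along w (controls the norm)
    g_tangent = grad - g_radial                    # component orthogonal to w (direction)

    # Adam-style moments restricted to the tangential subspace
    exp_avg.mul_(betas[0]).add_(g_tangent, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(g_tangent, g_tangent, value=1 - betas[1])
    tangential_update = exp_avg / (exp_avg_sq.sqrt() + eps)

    w = w - radial_lr * g_radial - lr * tangential_update
    return w, exp_avg, exp_avg_sq

w = torch.randn(128)
state = (torch.zeros(128), torch.zeros(128))
w, *state = odd_step(w, torch.randn(128), *state)
```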
Results
AdamO consistently outperforms AdamW in experiments on vision and language tasks. It achieves better generalization and stability, as evidenced by higher test accuracy and smoother decision boundaries. For example, AdamO achieved a test accuracy of 96.67% compared to 92.00% for AdamW in one of the experiments. Additionally, AdamO resulted in smaller parameter norms and reduced radial oscillations, demonstrating its effectiveness in addressing the Radial Tug-of-War issue.
Implications
The proposed AdamO optimizer has the potential to become a new standard for training deep neural networks, particularly in applications where generalization and stability are critical. Its ability to decouple radial and tangential dynamics could inspire further research into geometry-aware optimization techniques, leading to more efficient and robust training algorithms for large-scale machine learning models.
View on arXiv

Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, Junxian He
  • Introduced KERNELGYM, a robust distributed GPU environment for RL-based kernel generation, addressing challenges like reward hacking and lazy optimization.
  • Proposed Turn-level Reinforce-Leave-One-Out (TRLOO) for unbiased advantage estimation in multi-turn RL training.
  • Developed Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to improve training stability and prevent trivial optimizations.
  • The DR. KERNEL-14B model achieved competitive performance with Claude-4.5-Sonnet and GPT-5, with 31.6% of generated kernels achieving at least a 1.2× speedup over the Torch reference on the KernelBench Level-2 subset.
  • Sequential test-time scaling further improved DR. KERNEL-14B's performance, with 47.8% of kernels achieving a 1.2× speedup when selecting the best candidate across all turns.
Abstract
This paper introduces DR. KERNEL, a reinforcement learning (RL) framework designed to optimize GPU kernel code generation using Triton, a high-level GPU programming language. The authors address key challenges in RL-based kernel generation, such as reward hacking (where models exploit loopholes in reward systems) and lazy optimization (where models produce trivial but inefficient solutions). To tackle these issues, the authors develop KERNELGYM, a robust distributed GPU environment that supports multi-turn RL training, reward hacking detection, and data collection. They also propose Turn-level Reinforce-Leave-One-Out (TRLOO), an unbiased advantage estimation method for multi-turn RL, and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to improve training stability and prevent trivial optimizations. The resulting model, DR. KERNEL-14B, demonstrates competitive performance with state-of-the-art models like Claude-4.5-Sonnet and GPT-5 on the KernelBench benchmark, achieving significant speedups in GPU kernel execution. The paper also explores sequential test-time scaling, further enhancing the model's performance. All resources, including the environment, training code, models, and datasets, are made publicly available.
Methodology
The authors developed KERNELGYM, a distributed GPU environment for RL training, which includes features like reward hacking checks, multi-turn interaction support, and long-term RL training capabilities. They identified a biased policy gradient issue in GRPO and proposed the TRLOO method for unbiased advantage estimation. To address training stability and prevent lazy optimization, they introduced Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS). The model was trained on Triton kernel generation tasks using these methods and evaluated on the KernelBench benchmark.
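The leave-one-out baseline underlying TRLOO is simple to state: each sampled rollout's advantage is its reward minus the mean reward of the other rollouts in its group, which keeps the estimator unbiased. A minimal sketch of that group-level computation (the turn-level bookkeeping described in the paper is not shown):
```python
import numpy as np

def leave_one_out_advantages(rewards):
    """Advantage of each rollout = its reward minus the mean of the others' rewards."""
    rewards = np.asarray(rewards, dtype=float)
    n = rewards.size
    total = rewards.sum()
    loo_mean = (total - rewards) / (n - 1)   # mean over the other n-1 rollouts
    return rewards - loo_mean

print(leave_one_out_advantages([1.0, 0.0, 0.0, 1.0]))  # [ 0.667 -0.667 -0.667  0.667]
```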
Results
DR. KERNEL-14B achieved competitive performance with state-of-the-art models, surpassing Claude-4.5-Sonnet and GPT-5 on the KernelBench Level-2 subset. Specifically, 31.6% of the generated kernels achieved at least a 1.2× speedup over the Torch reference, compared to 26.7% for Claude-4.5-Sonnet and 28.6% for GPT-5. With sequential test-time scaling, the 1.2× speedup rate increased to 47.8%.
Implications
The proposed framework has significant implications for automating GPU kernel optimization, reducing the need for manual engineering expertise, and improving the efficiency of large-scale AI systems. By addressing key challenges in RL-based kernel generation, this work paves the way for more effective and scalable AI-driven code optimization in high-performance computing.
View on arXiv

EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization

Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, Lizhu Zhang
  • EBPO introduces a shrinkage estimator that combines local group statistics with global historical statistics to stabilize advantage estimation in reinforcement learning.
  • The method addresses key limitations of GRPO, including high variance with small group sizes and vanishing gradients in saturated failure regimes.
  • Theoretical analysis shows that EBPO reduces mean squared error (MSE), prevents vanishing gradients, and ensures bounded entropy decay.
  • Empirical results demonstrate that EBPO consistently outperforms GRPO and other baselines across benchmarks like AIME and OlympiadBench, achieving superior stability and performance.
  • EBPO is particularly effective in resource-constrained settings and benefits from difficulty-stratified curriculum learning.
Abstract
This paper introduces Empirical Bayes Policy Optimization (EBPO), a novel reinforcement learning framework designed to address stability challenges in Group Relative Policy Optimization (GRPO) for tasks involving verifiable rewards, such as reasoning in large language models (LLMs). GRPO, while computationally efficient, suffers from high variance in advantage estimation with small group sizes and vanishing gradients in failure regimes. EBPO mitigates these issues by employing a shrinkage estimator that combines local group statistics with global historical statistics, dynamically adjusted using Empirical Bayes principles and Welford’s online algorithm. The authors provide theoretical guarantees that EBPO reduces mean squared error (MSE), prevents vanishing gradients, and ensures bounded entropy decay. Empirical evaluations on benchmarks like AIME and OlympiadBench demonstrate that EBPO outperforms GRPO and other baselines, achieving superior training stability, higher performance, and greater sample efficiency, particularly in resource-constrained settings. The framework also benefits from difficulty-stratified curriculum learning, further enhancing its effectiveness.
Methodology
EBPO employs an Empirical Bayes framework to stabilize advantage estimation in reinforcement learning with verifiable rewards. It replaces GRPO's local baseline with a shrinkage estimator that dynamically balances local group statistics with a global prior, which is updated using Welford’s online algorithm. This approach reduces variance and ensures non-zero gradients in failure scenarios. The method is validated through theoretical proofs and empirical evaluations on diverse benchmarks.
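The shrinkage baseline described here blends the local group mean with a running global mean maintained by Welford's online algorithm. A small sketch of that mechanism, with the shrinkage weight treated as a fixed constant (the actual Empirical Bayes weighting in EBPO is an assumption not reproduced here):
```python
import numpy as np

class WelfordStats:
    """Numerically stable running mean/variance (Welford's online algorithm)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        return self.m2 / max(self.n - 1, 1)

def shrunk_advantages(group_rewards, global_stats, lam=0.5):
    """Blend the local group baseline with the global running mean."""
    group_rewards = np.asarray(group_rewards, dtype=float)
    baseline = (1 - lam) * group_rewards.mean() + lam * global_stats.mean
    for r in group_rewards:
        global_stats.update(r)
    return group_rewards - baseline

stats = WelfordStats()
for r in [0.2, 0.8, 0.5, 0.9]:
    stats.update(r)
print(shrunk_advantages([0.0, 0.0, 1.0], stats))  # gradients stay non-zero even in failure-heavy groups
```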
Results
EBPO achieves state-of-the-art performance across multiple benchmarks, including AIME and OlympiadBench. It outperforms GRPO by over 11% in resource-constrained settings (group size of 8) and demonstrates superior training stability and sample efficiency. The framework also effectively leverages difficulty-stratified curriculum learning to further enhance performance.
Implications
EBPO has significant implications for improving the stability and efficiency of reinforcement learning in tasks involving verifiable rewards, such as mathematical reasoning and code generation in large language models. Its ability to handle small group sizes and failure regimes makes it particularly valuable for resource-constrained training scenarios. The framework could be extended to other domains where stable and efficient policy optimization is critical.
View on arXiv

Feedback Control for Multi-Objective Graph Self-Supervision

Karish Grover, Theodore Vasiloudis, Han Xie, Sixing Lu, Xiang Song, Christos Faloutsos
  • ControlG introduces a closed-loop control framework for multi-objective graph SSL, addressing challenges like gradient interference, nonstationary utility, and objective starvation.
  • The framework uses temporal allocation, dedicating specific time blocks to individual objectives rather than blending them per step, reducing interference and improving training stability.
  • ControlG employs state estimation signals to measure objective demand and interference, a Pareto-aware planner for budget allocation, and a PID-style feedback controller for execution.
  • The method produces interpretable and auditable schedules, providing insights into the contribution of each objective during training.
  • ControlG consistently outperforms state-of-the-art methods across nine datasets, demonstrating its effectiveness in multi-objective graph SSL.
Abstract
This paper introduces ControlG, a novel control-theoretic framework for addressing the challenges of multi-objective graph self-supervised learning (SSL). Traditional approaches to multi-task learning in graph SSL often rely on per-step gradient mixing, which can lead to suboptimal compromises between conflicting objectives, resulting in issues such as gradient interference, nonstationary objective utility, and starvation of certain objectives. ControlG addresses these challenges by framing multi-objective coordination as a feedback-controlled temporal allocation problem. Instead of blending objectives at every step, ControlG allocates dedicated time blocks to individual objectives, reducing interference and enabling adaptive prioritization. The framework employs a combination of state estimation, Pareto-aware planning, and feedback-based scheduling to dynamically allocate optimization budgets to different objectives. ControlG demonstrates superior performance across nine datasets, outperforming state-of-the-art baselines while providing interpretable and auditable training schedules.
Methodology
ControlG employs a three-step process: (1) state estimation to measure objective demand (via spectral indicators) and interference (via gradient geometry), (2) Pareto-aware planning to allocate optimization budgets using log-hypervolume sensitivities, and (3) a PID-style feedback controller to convert continuous allocation targets into discrete task schedules. This approach ensures adaptive and fair allocation of training resources to different objectives over time, avoiding common failure modes such as drift, disagreement, and drought.
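The execution layer is described as a PID-style controller tracking a target allocation for each objective. A generic discrete PID sketch of that kind of tracking loop follows; the error signal and how the output is discretized into task blocks are assumptions, not ControlG's exact rule.
```python
class PID:
    """Discrete PID controller: output = kp*e + ki*sum(e) + kd*(e - prev_e)."""
    def __init__(self, kp=1.0, ki=0.1, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_error = 0.0, 0.0

    def step(self, target, measured):
        error = target - measured
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# track a 40% time share for one objective given its observed share so far
controller = PID()
observed_share = 0.10
for _ in range(5):
    adjustment = controller.step(target=0.40, measured=observed_share)
    observed_share += 0.5 * adjustment          # toy plant response
    print(round(observed_share, 3))             # converges toward the 0.40 target
```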
Results
ControlG outperformed state-of-the-art multi-objective graph SSL methods across nine datasets, demonstrating improved performance and stability. The framework also produced interpretable training schedules, showing how objectives were prioritized and adapted over time based on their utility and interference levels.
Implications
ControlG provides a robust and interpretable solution for multi-objective graph SSL, which could be applied to a wide range of graph-based tasks, such as link prediction, node classification, and graph-level representation learning. Its ability to dynamically adapt to changing objective utilities and resolve gradient conflicts has the potential to improve multi-task learning in other domains beyond graph SSL.
View on arXiv

GAS: Enhancing Reward-Cost Balance of Generative Model-assisted Offline Safe RL

Zifan Liu, Xinran Li, Shibo Chen, Jun Zhang
  • GAS addresses the lack of trajectory stitching in GM-assisted OSRL by augmenting and relabeling datasets at the transition level.
  • Novel goal functions trained with expectile regression estimate optimal reward-cost targets, enabling better tradeoffs between reward maximization and constraint satisfaction.
  • Dataset reshaping ensures a balanced reward-cost distribution, improving training stability and efficiency.
  • GAS achieves superior safety performance under tight constraints and improves reward maximization by 6% under loose constraints.
  • The method bypasses Bellman backups, mitigating OOD issues inherent in traditional OSRL approaches.
Abstract
This paper introduces Goal-Assisted Stitching (GAS), a novel algorithm designed to address key challenges in Generative Model-assisted Offline Safe Reinforcement Learning (OSRL). OSRL aims to learn policies that maximize rewards while satisfying safety constraints using pre-collected datasets, avoiding risky online exploration. While generative model (GM)-based methods bypass the Out-of-Distribution (OOD) issues of traditional OSRL approaches, they struggle with trajectory stitching and balancing reward maximization with constraint satisfaction under imbalanced reward-cost conditions. GAS enhances stitching capabilities by augmenting and relabeling datasets at the transition level, enabling the construction of high-quality trajectories from suboptimal data. It introduces goal functions trained via expectile regression to estimate optimal reward and cost targets, guiding policy training for better reward-cost tradeoffs. Additionally, GAS reshapes datasets to ensure a more uniform reward-cost distribution, improving training stability and efficiency. Empirical evaluations across 12 scenarios and 8 baselines demonstrate GAS's superior performance in balancing safety and rewards, achieving a 6% improvement under loose constraints and robust safety under tight thresholds.
Methodology
GAS employs three key innovations: (1) Temporal Segmented Return Augmentation and Transition-level Return Relabeling to enhance trajectory stitching, (2) Expectile regression-based goal functions to estimate optimal reward-cost targets, and (3) Dataset reshaping to balance reward-cost distributions. These techniques collectively improve policy training and performance under constrained settings.
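Expectile regression, used here to learn the goal functions, replaces the symmetric squared loss with one that weights over- and under-estimation asymmetrically, so a high expectile tracks near-optimal targets. A short sketch of that loss (the networks and return/cost targets it would be applied to are not reproduced):
```python
import torch

def expectile_loss(pred, target, tau=0.9):
    """Asymmetric squared loss: errors where target > pred get weight tau,
    the rest weight (1 - tau); tau = 0.5 weights both sides equally."""
    diff = target - pred
    weight = torch.where(diff > 0,
                         torch.full_like(diff, tau),
                         torch.full_like(diff, 1.0 - tau))
    return (weight * diff.pow(2)).mean()

pred = torch.zeros(4)
target = torch.tensor([1.0, -1.0, 2.0, -2.0])
print(expectile_loss(pred, target, tau=0.9))   # under-estimates dominate the loss
```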
Results
Empirical evaluations across 12 scenarios and 8 baselines demonstrate GAS's effectiveness. It outperforms existing methods in balancing reward maximization and constraint satisfaction, achieving a 6% improvement in performance under loose constraints and robust safety under tight thresholds.
Implications
GAS has significant implications for safety-critical applications such as autonomous driving and financial portfolio management, where balancing performance and safety is crucial. Its ability to leverage suboptimal datasets and adapt to varying reward-cost conditions makes it a promising approach for real-world constrained decision-making tasks.
View on arXiv

How Controlling the Variance can Improve Training Stability of Sparsely Activated DNNs and CNNs

Emily Dent, Jared Tanner
  • The paper extends the Edge-of-Chaos (EoC) theory to study the effects of variance on sparsely activated networks, focusing on sparsity-inducing activation functions like CReLUτ,m and CSTτ,m.
  • Increasing the fixed variance of the Gaussian process improves training stability and accuracy, even with sparsity levels as high as 90%.
  • The proposed approach enables computational efficiency by leveraging sparsity in hidden layers, reducing energy consumption in DNNs and CNNs.
  • The study highlights the importance of variance control in mitigating training instability caused by sparsity-inducing activation functions.
  • The findings provide a theoretical framework for designing sparse neural networks with improved expressivity and stability.
Abstract
This paper investigates the role of controlling the variance of Gaussian processes in improving the training stability and accuracy of deep neural networks (DNNs) and convolutional neural networks (CNNs) with sparsity-inducing activation functions. The authors extend the Edge-of-Chaos (EoC) theory, which characterizes the intermediate layers of deep networks as Gaussian processes, to study the effects of variance on sparsely activated networks. They propose that increasing the fixed variance of the Gaussian process can enhance the expressivity of networks with high sparsity levels (up to 90%) in hidden layers, while also improving training stability. The study introduces novel sparsity-inducing activation functions, such as shifted and clipped ReLU (CReLUτ,m) and CSTτ,m, and demonstrates their effectiveness in achieving high sparsity without sacrificing model accuracy. The findings suggest that controlling variance is a promising approach to reduce energy consumption in machine learning models by enabling efficient computation with sparse hidden layers.
Methodology
The authors build on the Edge-of-Chaos (EoC) theory, which models the intermediate layers of deep networks as Gaussian processes. They analyze the impact of fixed Gaussian process variance on training stability and accuracy in networks with sparsity-inducing activation functions. Two specific activation functions, CReLUτ,m and CSTτ,m, are introduced and evaluated. The study uses mathematical analysis and experiments on DNNs and CNNs to validate the proposed approach.
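Reading CReLUτ,m as a shifted-then-clipped ReLU, as the abstract describes, inputs below the threshold τ map to exact zeros (creating sparsity) and outputs are capped at m. A small sketch under that reading; the exact parameterization is an assumption about the paper's definition.
```python
import torch

def crelu(x, tau=0.5, m=2.0):
    """Shifted, clipped ReLU: zero below tau, linear in between, capped at m."""
    return torch.clamp(torch.relu(x - tau), max=m)

x = torch.randn(10000)
y = crelu(x)
print(f"fraction of exact zeros: {(y == 0).float().mean():.2f}")  # high sparsity for tau > 0
```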
Results
The experiments demonstrate that increasing the fixed variance of the Gaussian process significantly improves training stability and accuracy in DNNs and CNNs with high sparsity levels (up to 90%). The proposed activation functions, CReLUτ,m and CSTτ,m, enable consistent training and maintain high accuracy even with extreme sparsity in hidden layers.
Implications
The findings have significant implications for improving the computational efficiency of machine learning models by enabling the use of highly sparse hidden layers. This can lead to reduced energy consumption, making neural networks more suitable for deployment on resource-constrained devices such as edge devices. Additionally, the study provides a theoretical foundation for designing sparse neural networks with enhanced training stability and expressivity, which could benefit a wide range of applications in deep learning.
View on arXiv

Inverse Depth Scaling From Most Layers Being Similar

Yizhou Liu, Sara Kangaslahti, Ziming Liu, Jeff Gore
  • The loss in LLMs scales inversely with depth: adding layers keeps reducing loss, but with diminishing returns rather than the compounding gains deeper composition could offer.
  • Most layers in LLMs function similarly, forming an ensemble that reduces error through averaging, rather than through hierarchical or compositional learning.
  • The study identifies three potential regimes for depth utilization: compositional assembly, procedural assembly, and ensemble averaging, with LLMs predominantly operating in the ensemble averaging regime.
  • Controlled experiments on toy residual networks confirm that ensemble averaging explains the observed inverse depth scaling in LLMs.
  • The findings highlight inefficiencies in how LLMs utilize depth and suggest that architectural innovations are needed to improve depth efficiency.
Abstract
This paper investigates how depth contributes to the performance of large language models (LLMs) and proposes a quantitative framework to understand the scaling behavior of loss with depth. The authors analyze the behavior of hidden states across layers in LLMs and find that most layers operate in a regime they term 'ensemble averaging,' where layers act as redundant estimators that reduce error through averaging, rather than through compositional or procedural learning. This leads to an inverse proportionality between loss and depth. The study combines empirical analysis of LLMs with controlled experiments on toy residual networks to validate their findings. The results suggest that current LLM architectures use depth inefficiently, and improving their efficiency may require architectural innovations that encourage compositional use of depth.
Methodology
The authors conducted empirical evaluations of hidden state behaviors across layers in LLMs to identify depth utilization patterns. They compared these findings with theoretical expectations from three regimes of depth utilization: compositional assembly, procedural assembly, and ensemble averaging. Additionally, they designed toy model experiments with residual networks to study loss scaling and hidden state dynamics under controlled conditions.
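The ensemble-averaging regime rests on a one-line statistics fact: averaging N independent, similarly noisy estimators reduces squared error roughly as 1/N, which is exactly the inverse-depth behavior described above. A toy numpy check of that scaling (purely illustrative, not the paper's residual-network experiment):
```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0

for n_estimators in [1, 2, 4, 8, 16, 32]:
    # each "layer" contributes an independent noisy estimate of the same target
    estimates = true_value + rng.normal(0.0, 1.0, size=(100_000, n_estimators))
    mse = ((estimates.mean(axis=1) - true_value) ** 2).mean()
    print(f"{n_estimators:2d} estimators -> MSE ~ {mse:.4f}")   # falls roughly as 1/N
```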
Results
The study finds that LLMs predominantly operate in the ensemble averaging regime, where most layers perform similar transformations, reducing error through averaging. This results in an inverse depth scaling of loss, where loss decreases proportionally to the inverse of depth. The findings suggest that LLMs do not fully exploit the potential of depth for compositional or hierarchical learning.
Implications
The results have significant implications for the design of more efficient LLM architectures. Encouraging compositional use of depth, rather than relying on ensemble averaging, could lead to more efficient models with better performance. This work also advances the understanding of neural scaling laws by providing a detailed framework for analyzing depth scaling in LLMs, which could guide future research in model optimization and architecture design.
View on arXiv

Joint Embedding Variational Bayes

Amin Oji, Paul Fieguth
  • Introduces Variational Joint Embedding (VJE), a probabilistic framework for self-supervised learning that avoids reconstruction and contrastive objectives.
  • Maximizes a symmetric conditional ELBO on paired embeddings, enabling probabilistic representations with anisotropic uncertainty.
  • Uses a Student–t likelihood model with polar decomposition to address training instabilities caused by norm-induced factors.
  • Achieves performance comparable to state-of-the-art non-contrastive methods on standard benchmarks.
  • Outperforms baseline methods in anomaly detection tasks, demonstrating the utility of probabilistic semantics.
Abstract
This paper introduces Variational Joint Embedding (VJE), a novel framework that combines variational inference with joint embedding for self-supervised learning. Unlike traditional contrastive and non-contrastive methods, which rely on pointwise similarity or discrepancy objectives, VJE employs a probabilistic latent-variable model directly on encoder embeddings. The framework maximizes a symmetric conditional evidence lower bound (ELBO) without requiring input reconstruction or contrastive objectives. VJE uses a heavy-tailed Student–t likelihood model with a polar decomposition to decouple directional and radial factors, addressing norm-induced instabilities during training. Additionally, an amortized inference network parameterizes a diagonal Gaussian variational posterior, enabling feature-wise uncertainty modeling without auxiliary projection heads. The proposed method achieves competitive performance on standard self-supervised learning benchmarks (ImageNet-1K, CIFAR-10/100, STL-10) and demonstrates superior anomaly detection capabilities on CIFAR-10, showcasing its ability to produce probabilistic representations with calibrated uncertainty.
Methodology
The authors propose a latent-variable model that defines a normalized probabilistic likelihood directly on encoder embeddings. The training objective maximizes a symmetric conditional ELBO, where an amortized inference network parameterizes a diagonal Gaussian variational posterior. The likelihood is instantiated using a Student–t model with polar decomposition to decouple directional and radial factors, mitigating training instabilities. The framework avoids auxiliary projection heads and contrastive objectives, focusing on probabilistic representation learning.
Results
VJE achieves competitive performance with leading non-contrastive self-supervised learning methods on benchmarks such as ImageNet-1K, CIFAR-10/100, and STL-10 under linear and k-NN evaluation protocols. Additionally, it demonstrates superior performance in anomaly detection on CIFAR-10, leveraging its probabilistic modeling capabilities for likelihood-based scoring.
Implications
The VJE framework provides a principled alternative to traditional self-supervised learning methods by introducing probabilistic semantics into representation learning. This has potential applications in uncertainty-sensitive domains such as medical diagnosis, anomaly detection, and reinforcement learning, where calibrated uncertainty and density-based scoring are critical for reliability and interpretability.
View on arXiv

Knowing When to Answer: Adaptive Confidence Refinement for Reliable Audio-Visual Question Answering

Dinh Phu Tran, Jihoon Jeong, Saad Wazir, Seongah Kim, Thao Do, Cem Subakan, Daeyoung Kim
  • The paper formalizes Reliable Audio-Visual Question Answering (R-AVQA) as a selective prediction problem, emphasizing abstention over incorrect answers.
  • Adaptive Confidence Refinement (ACR) is proposed as a lightweight, learnable confidence estimation framework that refines MSP-based confidence scores.
  • ACR introduces two novel components: a Residual Risk Head for capturing residual uncertainty and a Confidence Gating Head for assessing MSP reliability.
  • The method achieves state-of-the-art performance on three AVQA datasets and demonstrates robustness under in- and out-of-distribution settings.
  • This work establishes new benchmarks and baselines for R-AVQA, advancing the field of reliable multimodal reasoning.
Abstract
This paper introduces the concept of Reliable Audio-Visual Question Answering (R-AVQA), a framework that prioritizes abstention over incorrect answers to improve the reliability of AVQA systems. The authors propose Adaptive Confidence Refinement (ACR), a novel method for confidence estimation that enhances the reliability of AVQA models. ACR builds on the Maximum Softmax Probability (MSP) baseline by introducing two learned components: a Residual Risk Head to predict residual uncertainty and a Confidence Gating Head to assess the trustworthiness of MSP. This approach addresses the limitations of existing AVQA models, which often fail to abstain from answering when uncertain, leading to overconfident and incorrect predictions. ACR is evaluated across three AVQA datasets and multiple backbone architectures, demonstrating state-of-the-art performance in risk-coverage trade-offs and robustness under distribution shifts and data biases.
Methodology
The authors propose Adaptive Confidence Refinement (ACR), which enhances the Maximum Softmax Probability (MSP) baseline by integrating two learned components: a Residual Risk Head to predict low-magnitude residual uncertainty and a Confidence Gating Head to determine the reliability of MSP. ACR uses multimodal features and pre-softmax logits to refine confidence scores. The method is evaluated on three AVQA datasets using three representative backbone architectures, comparing its performance against existing baselines such as MSP, Monte Carlo Dropout (MCD), and other calibration-based methods.
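The MSP baseline that ACR refines is simply the maximum softmax probability used as a confidence score, with the model abstaining below a threshold; reliability is then read off a risk-coverage curve. A minimal sketch of that selective-prediction loop (the two learned ACR heads are not modeled here, and the class count and threshold are illustrative):
```python
import torch
import torch.nn.functional as F

def msp_selective_prediction(logits, labels, threshold=0.5):
    """Answer only when the maximum softmax probability exceeds the threshold."""
    probs = F.softmax(logits, dim=-1)
    confidence, prediction = probs.max(dim=-1)
    answered = confidence >= threshold
    coverage = answered.float().mean().item()
    if answered.any():
        risk = (prediction[answered] != labels[answered]).float().mean().item()
    else:
        risk = 0.0
    return coverage, risk

logits = 3 * torch.randn(256, 10)      # toy logits over 10 answer classes
labels = torch.randint(0, 10, (256,))
print(msp_selective_prediction(logits, labels))   # (coverage, error rate on answered questions)
```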
Results
ACR consistently outperforms existing methods in risk-coverage trade-offs across three AVQA datasets (MUSIC-AVQA, MUSIC-AVQA-R, and MUSIC-AVQA-v2.0) and three backbone models. It demonstrates superior reliability under both in-distribution and out-of-distribution settings, as well as in scenarios with data biases. For example, ACR enables models to maintain a low error rate (e.g., 1%) while significantly increasing the coverage of answered questions compared to state-of-the-art methods.
Implications
The proposed ACR framework has significant implications for the development of reliable multimodal AI systems, particularly in applications where incorrect answers could have severe consequences, such as assistive technologies for individuals with sensory impairments. By enabling models to abstain from answering when uncertain, ACR enhances the trustworthiness and practical utility of AVQA systems in real-world scenarios.
View on arXiv

Laplacian Representations for Decision-Time Planning

Dikshant Shehmar, Matthew Schlegel, Matthew E. Taylor, Marlos C. Machado
  • The paper highlights the importance of state representations in decision-time planning, emphasizing the need to preserve both local and long-horizon structures.
  • Laplacian representations, derived from the eigenvectors of the graph Laplacian, provide a latent space that captures state-space distances at multiple time scales.
  • The proposed ALPS algorithm uses Laplacian representations for hierarchical planning, enabling subgoal discovery and mitigating compounding errors in long-horizon tasks.
  • ALPS outperforms model-free RL baselines on goal-conditioned tasks from the OGBench benchmark, demonstrating its practical effectiveness.
  • The approach leverages the Augmented Lagrangian Laplacian Objective (ALLO) to efficiently learn Laplacian representations from sampled data.
Abstract
This paper addresses the challenge of decision-time planning in model-based reinforcement learning (RL), focusing on the importance of state representations that support both local cost computation and long-horizon planning. The authors propose the use of Laplacian representations, which embed states into a latent space based on the eigenvectors of the graph Laplacian induced by the environment's dynamics. These representations preserve meaningful state-space distances across multiple time scales, enabling effective subgoal discovery and mitigating compounding errors in long-horizon planning. Building on these properties, the authors introduce a hierarchical planning algorithm called Augmented Laplacian Planning with Subgoals (ALPS). ALPS leverages the Laplacian representation to identify subgoals and estimate distances, facilitating hierarchical decision-time planning. Empirical evaluations on offline goal-conditioned RL tasks from the OGBench benchmark demonstrate that ALPS outperforms commonly used model-free RL baselines, showcasing its effectiveness in long-horizon tasks.
Methodology
The authors employ Laplacian representations, which are derived from the eigenvectors of the graph Laplacian of the environment's dynamics. These representations are learned using the Augmented Lagrangian Laplacian Objective (ALLO) to overcome computational challenges. The ALPS algorithm is then introduced, which uses the Laplacian representation to identify subgoals and estimate distances for hierarchical decision-time planning. The algorithm is evaluated on offline goal-conditioned RL tasks from the OGBench benchmark.
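The Laplacian representation itself is just the leading non-trivial eigenvectors of the graph Laplacian of the state-transition graph; distances in that embedding reflect how far apart states are under the dynamics. A small numpy sketch on a toy chain environment (the ALLO procedure used in the paper to learn these from sampled data is not shown):
```python
import numpy as np

# toy chain of 8 states with bidirectional transitions
n = 8
adjacency = np.zeros((n, n))
for s in range(n - 1):
    adjacency[s, s + 1] = adjacency[s + 1, s] = 1.0

degree = np.diag(adjacency.sum(axis=1))
laplacian = degree - adjacency
eigvals, eigvecs = np.linalg.eigh(laplacian)       # eigenvalues in ascending order

d = 3                                              # embedding dimension
embedding = eigvecs[:, 1:d + 1]                    # skip the constant eigenvector

def latent_distance(s, g):
    return np.linalg.norm(embedding[s] - embedding[g])

print(latent_distance(0, 1), latent_distance(0, 7))  # nearby states end up closer
```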
Results
The proposed ALPS algorithm demonstrated superior performance compared to commonly used model-free RL baselines on a suite of offline goal-conditioned RL tasks from the OGBench benchmark. The Laplacian representation effectively decomposed long-horizon tasks into subgoals, enabling more efficient and accurate planning.
Implications
The use of Laplacian representations in decision-time planning has the potential to improve the efficiency and accuracy of model-based RL, particularly in long-horizon tasks. This approach could be applied to various domains, including robotics, autonomous navigation, and complex decision-making systems, where hierarchical planning and subgoal discovery are critical.
View on arXiv

Learning, Solving and Optimizing PDEs with TensorGalerkin: an efficient high-performance Galerkin assembly algorithm

Shizheng Wen, Mingyuan Chi, Tianwei Yu, Ben Moseley, Mike Yan Michelis, Pu Ren, Hao Sun, Siddhartha Mishra
  • Introduces TensorGalerkin, a GPU-optimized framework for efficient Galerkin assembly of stiffness matrices and load vectors.
  • Reformulates Galerkin assembly as a tensorized Map-Reduce operation, minimizing Python overhead and maximizing computational efficiency.
  • Supports three main applications: numerical PDE solving (TensorMesh), physics-informed operator learning (TensorPils), and PDE-constrained optimization (TensorOpt).
  • Demonstrates significant computational efficiency and accuracy gains over traditional FEM solvers and physics-informed machine learning methods.
  • Addresses key challenges in low-data regimes, out-of-distribution generalization, and dynamic mesh handling.
Abstract
This paper introduces TensorGalerkin, a high-performance framework for solving, optimizing, and learning partial differential equations (PDEs) with a variational structure. The framework addresses computational bottlenecks in traditional finite element methods (FEM) and physics-informed machine learning approaches by reformulating Galerkin assembly as a tensorized Map-Reduce operation. TensorGalerkin achieves significant efficiency improvements by leveraging GPU-optimized operations, minimizing Python overhead, and avoiding the inefficiencies of traditional FEM assembly methods. The framework is implemented in PyTorch and supports three main applications: numerical PDE solving (TensorMesh), physics-informed operator learning (TensorPils), and PDE-constrained optimization (TensorOpt). Extensive benchmarks on 2D and 3D elliptic, parabolic, and hyperbolic PDEs demonstrate TensorGalerkin's superior computational efficiency and accuracy compared to existing methods. The framework is particularly effective for dynamic meshes, low-data regimes, and out-of-distribution scenarios, making it a versatile tool for scientific computing and machine learning applications.
Methodology
The authors develop TensorGalerkin, which reformulates Galerkin assembly as a tensorized Map-Reduce operation. The Map stage evaluates local stiffness matrices and load vectors using dense tensor contractions, while the Reduce stage handles domain topology via deterministic sparse matrix multiplication. This approach eliminates Python-level bottlenecks and leverages GPU acceleration. The framework is implemented in PyTorch and integrates analytical shape gradients to avoid the inefficiencies of automatic differentiation.
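The Map-Reduce view of assembly can be illustrated on the simplest possible case: 1D linear elements, where the Map stage builds every local stiffness matrix in one batched tensor operation and the Reduce stage scatter-adds them into the global matrix. A toy PyTorch sketch of that pattern; TensorGalerkin's general quadrature, 2D/3D elements, and sparse Reduce are not reproduced.
```python
import torch

def assemble_1d_stiffness(nodes):
    """Global stiffness matrix for -u'' with piecewise-linear elements on `nodes`."""
    n = nodes.numel()
    elements = torch.stack([torch.arange(n - 1), torch.arange(1, n)], dim=1)   # (E, 2)
    h = nodes[elements[:, 1]] - nodes[elements[:, 0]]                           # (E,)

    # Map: all local 2x2 stiffness matrices at once, (1/h) * [[1,-1],[-1,1]]
    local = torch.tensor([[1.0, -1.0], [-1.0, 1.0]])
    k_local = local.unsqueeze(0) / h.view(-1, 1, 1)                             # (E, 2, 2)

    # Reduce: scatter-add the local blocks into the global matrix
    K = torch.zeros(n, n)
    rows = elements.unsqueeze(2).expand(-1, 2, 2)
    cols = elements.unsqueeze(1).expand(-1, 2, 2)
    K.index_put_((rows.reshape(-1), cols.reshape(-1)), k_local.reshape(-1),
                 accumulate=True)
    return K

K = assemble_1d_stiffness(torch.linspace(0.0, 1.0, 6))
print(K)   # tridiagonal matrix with 1/h scaling on a uniform mesh
```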
Results
TensorGalerkin achieves significant computational efficiency and accuracy improvements over traditional FEM solvers and physics-informed machine learning methods. Benchmarks on 2D and 3D elliptic, parabolic, and hyperbolic PDEs demonstrate its ability to handle dynamic meshes and outperform legacy CPU-based FEM solvers. The framework also excels in low-data and out-of-distribution scenarios, providing robust solutions for PDE-constrained optimization and physics-informed learning.
Implications
TensorGalerkin has the potential to revolutionize scientific computing and machine learning applications involving PDEs. Its efficiency and versatility make it suitable for tasks such as real-time numerical simulations, inverse design, uncertainty quantification, and physics-informed learning in low-data settings. By addressing key bottlenecks in FEM and physics-informed methods, TensorGalerkin could enable faster and more accurate solutions for complex physical systems across various scientific and engineering domains.
View on arXiv

MAGPrompt: Message-Adaptive Graph Prompt Tuning for Graph Neural Networks

Long D. Nguyen, Binh P. Nguyen
  • MAGPrompt introduces a novel approach to graph prompt tuning by directly modifying the message-passing process in GNNs.
  • The framework uses learnable prompt parameters to reweight neighbor messages and inject task-specific signals during aggregation.
  • MAGPrompt is compatible with common GNN architectures (e.g., GCN, GIN) and various pre-training strategies, ensuring broad applicability.
  • Experimental results show that MAGPrompt outperforms prior graph prompt tuning methods in few-shot settings and matches fine-tuning performance in full-shot scenarios.
  • The method addresses limitations of existing approaches by enabling task-specific control over neighborhood interactions in GNNs.
Abstract
This paper introduces MAGPrompt, a novel framework for message-adaptive graph prompt tuning that enhances the adaptability of pre-trained Graph Neural Networks (GNNs) to downstream tasks. Unlike existing graph prompt tuning methods that focus on modifying node features, hidden representations, or graph topology, MAGPrompt directly intervenes in the message-passing process of GNNs. It achieves this by introducing learnable prompt parameters that reweight messages from neighboring nodes and inject task-specific prompt signals during message aggregation, while keeping the backbone GNN frozen. This approach enables task-specific control over neighborhood interactions, addressing the limitations of prior methods that fail to adapt message-passing mechanisms. MAGPrompt is compatible with common GNN architectures and pre-training strategies, making it versatile across various node- and graph-level tasks. Experimental results demonstrate that MAGPrompt outperforms existing graph prompt tuning methods in few-shot settings and achieves performance comparable to full fine-tuning in full-shot scenarios.
Methodology
MAGPrompt introduces lightweight, learnable prompt parameters into the message-passing process of pre-trained GNNs. These parameters reweight incoming messages from neighboring nodes and add task-specific prompt vectors during message aggregation. The framework is designed to work with common GNN architectures like GCN and GIN and supports various pre-training strategies. By freezing the backbone GNN and focusing on task-specific message modulation, MAGPrompt avoids the need for extensive fine-tuning while improving adaptability to downstream tasks.
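The core idea, reweighting neighbor messages and injecting a prompt vector during aggregation while the backbone stays frozen, can be sketched with a plain dense message-passing layer. The gating form and where the prompt is added are illustrative assumptions, not MAGPrompt's exact formulation.
```python
import torch
import torch.nn as nn

class PromptedMessagePassing(nn.Module):
    """Frozen GCN-style layer with learnable message reweighting and a prompt vector."""
    def __init__(self, frozen_linear, dim):
        super().__init__()
        self.lin = frozen_linear
        for p in self.lin.parameters():
            p.requires_grad = False                   # backbone stays frozen
        self.gate = nn.Linear(2 * dim, 1)             # scores each (receiver, sender) pair
        self.prompt = nn.Parameter(torch.zeros(dim))  # task-specific signal

    def forward(self, x, adj):                        # x: (N, dim), adj: (N, N) in {0, 1}
        n = x.size(0)
        pair = torch.cat([x.unsqueeze(1).expand(n, n, -1),     # receiver features
                          x.unsqueeze(0).expand(n, n, -1)],    # sender features
                         dim=-1)
        weights = torch.sigmoid(self.gate(pair)).squeeze(-1) * adj
        messages = weights @ x                         # reweighted neighbor aggregation
        return torch.relu(self.lin(messages + self.prompt))

layer = PromptedMessagePassing(nn.Linear(16, 16), dim=16)
out = layer(torch.randn(5, 16), (torch.rand(5, 5) > 0.5).float())
print(out.shape)  # torch.Size([5, 16])
```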
Results
MAGPrompt consistently outperforms prior representation-based graph prompt tuning methods in few-shot settings across diverse node- and graph-level datasets. It also achieves competitive performance compared to full fine-tuning in full-shot scenarios, demonstrating its effectiveness as a parameter-efficient alternative for adapting pre-trained GNNs.
Implications
MAGPrompt has significant implications for improving the adaptability and efficiency of pre-trained GNNs in various applications, such as social network analysis, traffic forecasting, and biomedical informatics. By enabling task-specific control over message passing, it addresses the limitations of existing graph prompt tuning methods and reduces the reliance on extensive fine-tuning, making it a valuable tool for resource-constrained settings and few-shot learning scenarios.
View on arXiv

Perception-Based Beliefs for POMDPs with Visual Observations

Miriam Schäfers, Merlijn Krale, Thiago D. Simão, Nils Jansen, Maximilian Weininger
  • PBP separates visual observation interpretation from belief-based planning by using a perception model to map images to probability distributions over states.
  • Uncertainty quantification is integrated into the framework to improve robustness against visual corruption, using threshold-based and weighting-based methods.
  • PBP is compatible with existing POMDP solvers, enabling scalability to high-dimensional observation spaces while maintaining competitive performance compared to state-of-the-art VPOMDP solvers.
Abstract
This paper introduces the Perception-Based Beliefs for POMDPs (PBP) framework, designed to address the challenges of solving partially observable Markov decision processes (POMDPs) with high-dimensional visual observations, referred to as vision POMDPs (VPOMDPs). Traditional POMDP solvers struggle with massive observation spaces, such as camera images, making them impractical for real-world applications like autonomous driving. PBP tackles this issue by integrating a perception model, specifically an image classifier, to map visual observations into probability distributions over states. These distributions are then used for belief updates, enabling efficient planning without explicitly reasoning over the entire observation space. The framework also incorporates uncertainty quantification to handle classifier imprecision, introducing two methods—threshold-based and weighting-based—to improve robustness against visual corruption. The authors empirically evaluate PBP using classical POMDP solvers (HSVI and POMCP) on benchmarks with small state and action spaces but complex visual observations. Results demonstrate that PBP outperforms end-to-end deep reinforcement learning methods and enhances robustness under corrupted visual inputs, making it a promising approach for safety-critical applications.
Methodology
The authors propose a novel belief update rule that incorporates predictions from an image classifier into POMDP solvers. They implement two uncertainty quantification methods to adjust belief updates based on the classifier's confidence. The framework is tested using two classical POMDP solvers, HSVI and POMCP, on benchmarks with complex visual observations but manageable state and action spaces.
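The belief update described here follows the usual Bayes filter, except that the observation likelihood is supplied by the image classifier's predicted distribution over states. A small numpy sketch of one update step; the transition matrix and classifier outputs below are toy placeholders.
```python
import numpy as np

def perception_belief_update(belief, transition, classifier_probs):
    """belief: (S,) prior over states; transition: (S, S) with entry [s, s'] = P(s'|s)
    for the taken action; classifier_probs: (S,) classifier distribution given the image."""
    predicted = transition.T @ belief                  # propagate belief through the dynamics
    posterior = classifier_probs * predicted           # weight by the perceived observation
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5, 0.0])
transition = np.array([[0.8, 0.2, 0.0],
                       [0.0, 0.8, 0.2],
                       [0.0, 0.0, 1.0]])
classifier_probs = np.array([0.1, 0.7, 0.2])           # e.g. softmax output of the classifier
print(perception_belief_update(belief, transition, classifier_probs))
```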
Results
Empirical evaluations show that PBP achieves competitive performance compared to state-of-the-art VPOMDP solvers, particularly under conditions of visual corruption. PBP also outperforms end-to-end deep reinforcement learning methods in terms of robustness and interpretability.
Implications
The PBP framework has significant implications for real-world applications of POMDPs, such as autonomous driving and robotics, where high-dimensional visual observations are common. By improving scalability and robustness, PBP enables safer and more reliable decision-making in environments with partial observability and visual noise.
View on arXiv

Position: Capability Control Should be a Separate Goal From Alignment

Shoaib Ahmed Siddiqui, Eleni Triantafillou, David Krueger, Adrian Weller
  • Capability control should be treated as a separate goal from alignment, as it focuses on defining and enforcing operational boundaries on model behavior.
  • The authors propose a three-layer framework for capability control: data-based, learning-based, and system-based interventions.
  • A defense-in-depth approach is recommended to combine complementary controls across the model lifecycle.
  • Challenges include the dual-use nature of knowledge, adversarial attacks, and the probabilistic limitations of current machine learning paradigms.
  • Capability control is essential to mitigate risks of misuse and failure in foundation models, particularly in real-world deployments.
Abstract
This position paper argues that capability control—imposing operational limits on permissible behaviors of foundation models—should be treated as a distinct goal from alignment, which focuses on aligning models with human values and preferences. The authors propose a framework for capability control across three layers of the model lifecycle: (i) data-based control, which involves filtering or curating training data; (ii) learning-based control, which modifies model weights or representations during training; and (iii) system-based control, which enforces guardrails at the deployment stage. They advocate for a defense-in-depth approach that combines these layers to address the limitations of each when used in isolation. The paper highlights challenges such as the dual-use nature of knowledge, adversarial circumvention, and the probabilistic nature of machine learning systems. The authors emphasize that while capability control is often treated as part of alignment, it requires separate attention due to its distinct objectives and operational challenges.
Methodology
The paper organizes capability control mechanisms into three layers: (i) data-based control, which involves filtering, curating, or synthesizing training data to limit undesirable behaviors; (ii) learning-based control, which includes techniques like supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), adversarial training, and unlearning to modify model weights or representations; and (iii) system-based control, which applies post-deployment guardrails such as input/output filters, chain-of-thought (CoT) monitors, and information-flow policies. The authors advocate for combining these methods to create a robust defense-in-depth strategy.
Results
As a position paper, the work does not present empirical results but provides a conceptual framework for understanding and implementing capability control. It identifies the limitations of current approaches and outlines open challenges for future research.
Implications
The paper highlights the importance of treating capability control as a distinct goal to reduce risks associated with foundation models, such as malicious misuse and unintended failures. The proposed framework could guide the development of safer AI systems, particularly in high-stakes applications like autonomous vehicles, healthcare, and content moderation. By addressing the challenges of capability control, the research could contribute to more robust and secure AI deployments in the real world.
View on arXiv

Privileged Information Distillation for Language Models

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, Massimo Caccia
  • Introduced π-Distill, a joint teacher-student training framework, and OPSD, an RL-based alternative, for leveraging privileged information during training.
  • Demonstrated that these methods outperform industry-standard SFT+RL baselines, even when the baselines have access to full Chain-of-Thought (CoT) reasoning.
  • Showcased strong generalization capabilities to out-of-distribution tasks across multiple benchmarks and environments.
  • Analyzed the critical factors for effective learning with PI, including the importance of mitigating distributional shifts and maximizing the utility of privileged information.
  • Proposed a framework for transforming frontier model trajectories into varying types of privileged information with different levels of information density.
Abstract
This paper introduces two novel methods, π-Distill and On-Policy Self-Distillation (OPSD), for leveraging training-time privileged information (PI) to improve the performance of language models in multi-turn, agentic environments. Privileged information refers to additional context or data available during training but not at inference time. The authors address the challenge of transferring capabilities learned with PI to test-time policies that must operate without it. π-Distill employs a joint teacher-student framework where a PI-conditioned teacher and an unconditioned student share parameters and are trained simultaneously, enabling the transfer of privileged knowledge. OPSD, on the other hand, uses reinforcement learning (RL) with a reverse KL penalty to align the student with the PI-conditioned teacher. The proposed methods are evaluated on tasks such as Travel Planner and τ-Bench retail, as well as out-of-domain generalization tasks, demonstrating significant improvements over standard supervised fine-tuning (SFT) and RL baselines, even when these baselines have access to full Chain-of-Thought (CoT) reasoning, which the proposed methods do not require. The study also provides an in-depth analysis of the factors that influence effective learning with PI, such as maximizing PI utility and mitigating distributional shifts between teacher and student.
Methodology
The authors propose two complementary methods: (1) π-Distill, which trains a PI-conditioned teacher and an unconditioned student jointly within a shared-parameter model, and (2) OPSD, which uses reinforcement learning with a reverse KL penalty to align the student with the PI-conditioned teacher. Both methods are designed to transfer knowledge from training-time privileged information to test-time policies that operate without it. The methods are evaluated on multi-turn agentic environments using benchmarks such as Travel Planner, τ-Bench retail, and GEM tool-use tasks.
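A minimal sketch of the shared-parameter distillation idea, assuming a Hugging Face-style causal LM interface; the function name, loss weighting, and sequence slicing are illustrative assumptions, not taken from the paper, and the next-token shift is omitted for brevity:

    import torch
    import torch.nn.functional as F

    def joint_distill_loss(model, student_ids, teacher_ids, labels, beta=0.1):
        """student_ids: query + completion without privileged information (PI);
        teacher_ids: the same sequence prefixed with PI; labels: (B, T) completion
        tokens. Both sequences are assumed to end with the same T completion tokens."""
        student_logits = model(student_ids).logits          # student pass (no PI)
        with torch.no_grad():
            teacher_logits = model(teacher_ids).logits      # PI-conditioned pass, same weights
        T = labels.size(-1)
        s = student_logits[:, -T:, :]                       # completion positions only
        t = teacher_logits[:, -T:, :]
        # Supervised term: the student predicts the reference completion.
        nll = F.cross_entropy(s.reshape(-1, s.size(-1)), labels.reshape(-1))
        # Reverse KL(student || teacher) pulls the student toward the PI-conditioned teacher.
        kl = F.kl_div(F.log_softmax(t, dim=-1), F.log_softmax(s, dim=-1),
                      log_target=True, reduction="batchmean")
        return nll + beta * kl

OPSD replaces the supervised term with an on-policy RL objective while keeping a reverse-KL penalty toward the PI-conditioned teacher.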
Results
The proposed methods, π-Distill and OPSD, consistently outperform standard SFT+RL baselines across multiple benchmarks, including Travel Planner and τ-Bench retail. π-Distill is particularly effective, achieving superior performance even without access to full Chain-of-Thought reasoning, which is typically required by standard baselines. The methods also demonstrate strong generalization to out-of-distribution tasks, achieving significant gains on eight additional GEM tool-use environments.
Implications
The proposed methods have significant implications for improving the performance of language models in complex, multi-turn, and agentic tasks where full reasoning traces are unavailable. By enabling effective use of privileged information during training, these methods could enhance the capabilities of language models in real-world applications such as planning, decision-making, and interactive systems. Additionally, the ability to generalize to out-of-distribution tasks suggests potential for broader applicability in diverse domains.
View on arXiv

SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel

Jose Miguel Luna, Taha Bouhsine, Krzysztof Choromanski
  • SLAY introduces a geometry-aware attention mechanism based on the Yat-kernel, which couples alignment and proximity of query-key pairs.
  • The method enforces unit-norm constraints on queries and keys, ensuring attention depends only on angular alignment.
  • Using Bernstein's Theorem, the Yat-kernel is reformulated into a nonnegative mixture of polynomial-exponential kernels, enabling linear-time computation.
  • SLAY achieves performance comparable to standard softmax attention while outperforming prior linear-time mechanisms like Performers and Cosformers.
  • The approach enables scalable Transformers for long-context tasks without sacrificing accuracy or stability.
Read More
Abstract
This paper introduces SLAY (Spherical Linearized Attention with Yat-Kernels), a novel linear-time attention mechanism designed to address the computational inefficiencies of standard softmax attention in Transformers. SLAY builds on the Yat-kernel, a geometry-aware similarity function inspired by inverse-square interactions in physics, which couples alignment and proximity of query-key pairs. By constraining queries and keys to the unit sphere and leveraging Bernstein's Theorem, the authors reformulate the Yat-kernel into a nonnegative mixture of polynomial-exponential product kernels. This enables a strictly positive random-feature approximation, achieving linear time complexity (O(L)) for attention computation. SLAY retains the expressiveness and performance of softmax attention while outperforming existing linear-time attention mechanisms like Performers and Cosformers. The proposed method is validated on various tasks, including language and vision, demonstrating competitive quality and significant computational efficiency.
Methodology
SLAY leverages the Yat-kernel, a geometry-aware similarity function, and constrains queries and keys to the unit sphere. The kernel is reformulated using Bernstein's Theorem into a nonnegative mixture of polynomial-exponential product kernels. This is further approximated using strictly positive random features, enabling linear-time attention computation. The method is tested empirically on language and vision tasks, comparing its performance to both standard softmax attention and other linearized attention mechanisms.
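The sketch below shows the generic pattern SLAY instantiates: unit-norm queries and keys passed through a strictly positive random-feature map, giving attention that is linear in sequence length. The feature map here is a standard positive softmax-kernel estimator, not the paper's Yat-kernel mixture, and scaling details are simplified:

    import torch
    import torch.nn.functional as F

    def positive_features(x, proj):
        # x: (L, d) unit-norm inputs; proj: (d, m) Gaussian projections.
        # exp(x @ proj - ||x||^2 / 2) is strictly positive, keeping the
        # attention normalizer stable.
        return torch.exp(x @ proj - 0.5 * (x ** 2).sum(-1, keepdim=True))

    def linear_attention(q, k, v, num_features=256):
        q = F.normalize(q, dim=-1)                 # unit-sphere constraint on queries
        k = F.normalize(k, dim=-1)                 # and keys
        proj = torch.randn(q.size(-1), num_features)
        qf, kf = positive_features(q, proj), positive_features(k, proj)   # (L, m)
        kv = kf.T @ v                              # (m, d_v) summary, cost O(L)
        z = qf @ kf.sum(0)                         # (L,) normalizer
        return (qf @ kv) / z.unsqueeze(-1)

Because the (m, d_v) summary is built once, the cost grows linearly with sequence length rather than quadratically.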
Results
SLAY achieves nearly indistinguishable performance from standard softmax attention while maintaining linear time and memory scaling. It consistently outperforms prior linear-time attention mechanisms like Performers and Cosformers in terms of both accuracy and computational efficiency. Empirical evaluations demonstrate its effectiveness across various tasks, including language and vision, with significant speed improvements.
Implications
SLAY has the potential to enable scalable Transformers for long-context tasks, such as natural language processing and computer vision, without the computational bottlenecks of quadratic attention mechanisms. Its geometry-aware design and linear-time complexity make it a promising approach for resource-constrained environments and applications requiring efficient processing of large-scale data.
View on arXiv

Shared LoRA Subspaces for almost Strict Continual Learning

Prakhar Kaushik, Ankit Vaidya, Shravan Chaudhari, Rama Chellappa, Alan Yuille
  • Introduces 'Share,' a parameter-efficient continual learning method that dynamically updates a shared low-rank subspace for large pretrained models.
  • Achieves up to 100× parameter reduction and 281× memory savings compared to traditional LoRA methods.
  • Supports replay-free, almost strict continual learning without requiring multiple task-specific adapters or increased model size.
  • Facilitates forward and backward knowledge transfer while minimizing catastrophic forgetting.
  • Validated across multiple tasks and modalities, demonstrating scalability and effectiveness for real-world applications.
Read More
Abstract
This paper introduces 'Share,' a novel approach to Parameter-Efficient Continual Finetuning (PaCT) for large pretrained models. Share addresses the challenges of catastrophic forgetting and inefficiency in traditional fine-tuning methods by dynamically learning and updating a shared low-rank subspace. This shared subspace captures core knowledge from past tasks and incrementally integrates new information, enabling forward and backward knowledge transfer while minimizing interference. Unlike existing methods, Share does not rely on data replay, multiple task-specific adapters, or increased model size, making it suitable for almost strict continual learning. The approach achieves significant computational and memory efficiency, with up to 100× parameter reduction and 281× memory savings compared to traditional LoRA methods, while maintaining performance comparable to jointly trained models. Share is validated across diverse tasks, including image classification, natural language understanding, 3D pose estimation, and text-to-image generation, demonstrating its scalability and effectiveness for lifelong learning in large-scale AI systems.
Methodology
The proposed method, Share, constructs and dynamically updates a shared low-rank subspace that captures essential knowledge from past tasks and integrates new information. The approach initializes the subspace using existing LoRA adapters and fine-tunes only a small number of principal coefficients for each new task. Knowledge from new tasks is projected into the shared subspace, and older knowledge is analytically reprojected to prevent forgetting. This enables efficient forward and backward knowledge transfer without requiring data replay or additional task-specific adapters.
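A rough sketch of the shared-subspace mechanics under our reading of the summary above; the basis construction, rank choice, and function names are illustrative assumptions, not the authors' implementation:

    import torch

    def build_shared_basis(lora_updates, rank):
        """lora_updates: list of (d_out, d_in) delta-W matrices from past LoRA
        adapters for the same layer. Returns an orthonormal basis for their
        shared column space."""
        stacked = torch.cat(list(lora_updates), dim=1)
        U, _, _ = torch.linalg.svd(stacked, full_matrices=False)
        return U[:, :rank]                         # (d_out, rank) shared subspace

    def project_task_update(U, delta_w):
        """Express a new task's update inside the shared subspace, so only a
        small coefficient matrix needs to be trained or stored per task."""
        coeffs = U.T @ delta_w                     # (rank, d_in) task coefficients
        return coeffs, U @ coeffs                  # reconstruction within span(U)

Storing one shared basis plus a small coefficient block per task, instead of a full adapter per task, is where the parameter and memory savings come from.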
Results
Share achieves up to 100× reduction in trainable parameters and 281× memory savings compared to traditional LoRA methods. It maintains performance comparable to jointly trained models while supporting scalable, asynchronous continual learning. Experiments across diverse tasks, including image classification, 3D pose estimation, natural language understanding, and text-to-image generation, validate its effectiveness and practicality.
Implications
Share provides a scalable and efficient solution for lifelong learning in large-scale AI systems, reducing computational and memory requirements while addressing catastrophic forgetting. Its ability to support almost strict continual learning without data replay or task-specific adapters makes it suitable for real-world applications, including resource-constrained environments and hybrid task streams. This approach could significantly enhance the adaptability and sustainability of large pretrained models in various domains.
View on arXiv

TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation

Junhan Kim, Yeo Jeong Park, Seungwoo Son, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon
  • TurboBoA introduces joint quantization of multiple out-channels, reducing sequential bottlenecks and achieving over a three-fold speedup compared to BOA.
  • A novel error correction mechanism mitigates error propagation from previously quantized layers, improving overall accuracy.
  • Adaptive grid computation with iterative refinement ensures alignment between quantization grids and updated weights, reducing attention reconstruction errors.
  • TurboBoA achieves state-of-the-art results in both weight-only and weight-activation quantization when combined with outlier suppression techniques.
  • The algorithm is backpropagation-free, making it computationally efficient and suitable for billion-scale LLMs.
Read More
Abstract
TurboBoA is a novel post-training quantization (PTQ) algorithm designed to improve the efficiency and accuracy of quantizing large language models (LLMs). While existing methods like GPTQ are efficient, they suffer from accuracy degradation in low-bit quantization due to their assumption of layer-wise independence. BOA addresses this issue by incorporating inter-layer dependencies within attention modules, but its sequential quantization process introduces significant computational overhead. TurboBoA builds on BOA by introducing three key innovations: (1) joint quantization of multiple out-channels with a closed-form error compensation rule, which accelerates the process by over three times; (2) a mechanism to correct errors propagated from previously quantized layers, reducing cumulative inaccuracies; and (3) adaptive grid computation with iterative refinement to align quantization grids with updated weights. Extensive experiments demonstrate that TurboBoA not only significantly outperforms BOA in terms of speed but also achieves state-of-the-art accuracy in both weight-only and weight-activation quantization when combined with outlier suppression techniques.
Methodology
TurboBoA builds on the BOA framework by introducing three key innovations: (1) joint quantization of multiple out-channels with a closed-form error compensation rule to reduce sequential operations, (2) a correction mechanism to address error propagation from previously quantized layers, and (3) adaptive grid computation with coordinate descent refinement to iteratively align quantization grids with updated weights. These methods are designed to accelerate the quantization process while maintaining or improving accuracy. The algorithm is evaluated on large-scale LLMs using extensive experiments.
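For context, here is a minimal sketch of the error-compensated, column-by-column quantization pattern that GPTQ-style methods (the lineage BOA and TurboBoA build on) use; TurboBoA's joint multi-out-channel update, inter-layer error correction, and adaptive grid refinement are not reproduced here, and the INT4 grid and names are placeholders:

    import torch

    def quantize_value(w, scale):
        return torch.clamp(torch.round(w / scale), -8, 7) * scale   # toy INT4 grid

    def gptq_style_quantize(W, H_inv, scale):
        """W: (rows, cols) weights; H_inv: (cols, cols) inverse Hessian of the layer
        inputs; scale: per-row step size. Columns are quantized left to right and
        the residual error is pushed onto the not-yet-quantized columns."""
        W = W.clone()
        Q = torch.zeros_like(W)
        for j in range(W.size(1)):
            Q[:, j] = quantize_value(W[:, j], scale)
            err = (W[:, j] - Q[:, j]) / H_inv[j, j]
            # Spread the quantization error over the remaining columns.
            W[:, j + 1:] -= err.unsqueeze(1) * H_inv[j, j + 1:].unsqueeze(0)
        return Q

Per the summary above, TurboBoA's speedup comes from replacing strictly sequential per-channel processing of this kind with a closed-form joint update over groups of out-channels.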
Results
TurboBoA achieves a more than three-fold speedup compared to BOA while maintaining or improving accuracy. It delivers state-of-the-art results in both weight-only and weight-activation quantization when combined with outlier suppression techniques. Experimental results demonstrate its effectiveness across various LLM architectures, particularly in low-bit quantization regimes (e.g., INT2).
Implications
TurboBoA has significant implications for deploying large language models on resource-constrained hardware. Its ability to perform efficient and accurate quantization without backpropagation makes it highly suitable for real-world applications, such as edge computing, mobile devices, and other environments with limited computational resources. Additionally, its improvements in low-bit quantization could enable faster inference and reduced memory usage, facilitating broader adoption of LLMs in various industries.
View on arXiv

Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation

Zhiqi Yu, Zhangquan Chen, Mengting Liu, Heye Zhang, Liangqiong Qu
  • The paper identifies 'implicit advantage symmetry' in GRAE as a core limitation of GRPO, hindering exploration and difficulty adaptation.
  • Asymmetric GRAE (A-GRAE) is proposed to address these issues by introducing asymmetric exploration incentives and a curriculum-like learning schedule.
  • Controlled experiments reveal that breaking the symmetry in GRAE improves exploration and learning efficiency.
  • A-GRAE achieves consistent performance improvements across seven benchmarks, including natural language and vision-language reasoning tasks.
  • The findings prompt a rethinking of advantage function design in reinforcement learning with verifiable rewards (RLVR).
Read More
Abstract
This paper investigates the limitations of Group Relative Policy Optimization (GRPO), a reinforcement learning framework widely used for reasoning tasks in large language models (LLMs) and multi-modal language models (MLLMs). The authors identify a previously overlooked issue in GRPO's Group Relative Advantage Estimation (GRAE) mechanism, termed 'implicit advantage symmetry.' This symmetry results in two critical problems: (i) limited exploration due to equal weighting of correct and incorrect trajectories at the group level, and (ii) suboptimal learning dynamics caused by a static focus on medium-difficulty samples. To address these issues, the authors propose Asymmetric GRAE (A-GRAE), which introduces asymmetric weighting to encourage exploration and employs a curriculum-like learning progression to adapt to varying task difficulties. Extensive experiments across seven benchmarks demonstrate that A-GRAE consistently improves GRPO and its variants, achieving significant gains in reasoning performance and key metrics such as accuracy and pass@k.
Methodology
The authors conduct a systematic analysis of GRAE's implicit advantage symmetry at both group and sample levels, using theoretical and empirical methods. They design controlled interventions to break this symmetry and evaluate the impact on learning dynamics. A-GRAE is introduced as a novel framework that incorporates asymmetric weighting for trajectory exploration and a curriculum-like progression for sample difficulty adaptation. The proposed method is tested on seven benchmarks using various LLMs and MLLMs, comparing its performance against GRPO and its variants.
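A toy sketch of group-relative advantage estimation and a hypothetical asymmetric re-weighting, to make the symmetry under discussion concrete; the weights w_pos and w_neg are placeholders, not the paper's A-GRAE schedule:

    import torch

    def grae(rewards, eps=1e-6):
        """rewards: (group_size,) scalar rewards for rollouts of one prompt.
        Standard GRPO advantage: z-scored within the group, hence symmetric."""
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    def asymmetric_grae(rewards, w_pos=1.5, w_neg=0.75, eps=1e-6):
        adv = grae(rewards, eps)
        # Up-weight above-average rollouts and down-weight below-average ones,
        # breaking the implicit symmetry between correct and incorrect trajectories.
        return torch.where(adv > 0, w_pos * adv, w_neg * adv)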
Results
A-GRAE consistently outperforms GRPO and its variants across seven benchmarks, demonstrating significant improvements in reasoning performance metrics such as accuracy and pass@k. The proposed method enhances exploration of novel solutions and optimizes learning efficiency by dynamically adapting to task difficulty. These results validate the effectiveness of the proposed modifications to the GRAE mechanism.
Implications
The findings have significant implications for the design of reinforcement learning algorithms in the context of large language models and multi-modal reasoning tasks. By addressing the limitations of GRPO, A-GRAE can improve the efficiency and adaptability of RLVR-based systems, enabling better performance on complex reasoning tasks. This work also highlights the importance of revisiting and refining core components of reinforcement learning frameworks to address fundamental limitations.
View on arXiv

Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance

Xiandong Zou, Jianshu Li, Jing Huang, Pan Zhou
  • VSD reformulates draft training as a variational inference problem, targeting sequence-level acceptance rather than token-level likelihood.
  • The framework introduces an ELBO objective to balance high-quality draft generation and alignment with the target model's distribution.
  • An EM algorithm is developed to optimize the ELBO, leveraging techniques like MCMC sampling, ARW, and CAR to enhance training efficiency.
  • VSD resolves the training-decoding discrepancy, improving acceptance length and inference speedup in speculative decoding.
  • Experiments show up to 9.6% speedup over state-of-the-art methods like EAGLE-3 and ViSpec, with consistent improvements across LLMs and MLLMs.
Read More
Abstract
This paper introduces Variational Speculative Decoding (VSD), a novel framework for improving speculative decoding in large language models (LLMs) and multimodal LLMs (MLLMs). Speculative decoding accelerates inference by using a draft model to propose multiple tokens, which are then verified by the target model. However, existing methods suffer from a training-decoding discrepancy: training optimizes token-level likelihood along a single greedy path, while decoding operates over stochastic multi-path sampling and ranks paths based on sequence-level acceptance. VSD addresses this discrepancy by reformulating draft training as a variational inference problem, maximizing the marginal likelihood of target-model acceptance. The framework introduces an Evidence Lower Bound (ELBO) that balances generating high-quality draft paths and aligning the draft distribution with the target model's preferences. To optimize the ELBO, the authors propose an Expectation-Maximization (EM) algorithm, incorporating techniques like oracle-filtered MCMC sampling, Adaptive Rejection Weighting (ARW), and Confidence-Aware Regularization (CAR). Theoretical analysis demonstrates that VSD improves expected acceptance length and decoding speedup. Experiments across various LLMs and MLLMs show that VSD achieves up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec, significantly enhancing decoding efficiency without compromising quality.
Methodology
The authors propose a variational inference-based training framework for speculative decoding. They derive an ELBO objective to optimize the marginal likelihood of target-model acceptance and use an Expectation-Maximization algorithm for optimization. The E-step involves oracle-filtered MCMC sampling to approximate the variational posterior, while the M-step employs Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR) to maximize weighted likelihood and reduce variance.
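For reference, the token-level verification rule that sequence-level acceptance is built from in standard speculative decoding is sketched below; this is the generic accept/resample recipe, not the VSD objective, and the tensor layout is an assumption:

    import torch

    def verify_draft(draft_tokens, p_draft, p_target):
        """draft_tokens: (k,) proposed token ids; p_draft, p_target: (k, vocab)
        probabilities from the draft and target models at each drafted position."""
        accepted = []
        for i, tok in enumerate(draft_tokens):
            ratio = (p_target[i, tok] / p_draft[i, tok].clamp_min(1e-12)).clamp(max=1.0)
            if torch.rand(()) < ratio:
                accepted.append(int(tok))           # kept with prob min(1, p_t / p_d)
            else:
                residual = (p_target[i] - p_draft[i]).clamp_min(0)
                residual = residual / residual.sum().clamp_min(1e-12)
                accepted.append(int(torch.multinomial(residual, 1)))
                break                               # stop at the first rejection
        return accepted

Training the draft model to lengthen the accepted prefix under this rule, rather than to imitate single greedy continuations, is precisely the training-decoding gap VSD targets.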
Results
VSD achieves significant improvements in decoding efficiency, with up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec. The framework also increases the average accepted token length and reduces the discrepancy between training and decoding distributions. Empirical evaluations on benchmarks like MT-bench, HumanEval, and GSM8K confirm the effectiveness of VSD across LLMs and MLLMs.
Implications
VSD has the potential to enhance the efficiency of LLMs and MLLMs in applications requiring fast and accurate inference, such as real-time dialogue systems, code generation, and complex reasoning tasks. By addressing the training-decoding discrepancy, VSD could pave the way for more robust and scalable speculative decoding methods in large-scale AI systems.
View on arXiv

Verification of the Implicit World Model in a Generative Model via Adversarial Sequences

András Balogh, Márk Jelasity
  • Introduces an adversarial framework to evaluate the soundness of implicit world models in generative models.
  • Focuses on chess as a domain with a well-defined rule-based world model for empirical evaluation.
  • Finds that none of the tested models are fully sound, but training recipes and dataset quality significantly impact performance.
  • Demonstrates that linear board state probes have limited causal influence on model predictions.
  • Highlights the need for improved training techniques to better align generative models with formal world models.
Read More
Abstract
This paper investigates the soundness of implicit world models in generative sequence models, focusing on their ability to adhere to the true structure of a domain's rules. Using chess as a testbed, the authors propose an adversarial sequence generation framework to evaluate whether these models produce valid outputs consistent with the formal rules of the domain. The study reveals that none of the evaluated models are fully sound, but training techniques, dataset quality, and adversarial strategies significantly influence performance. The paper also explores the role of linear board state probes in understanding model behavior and finds limited causal connections between the extracted board states and the models' predictions. The findings provide insights into the limitations of current generative models and suggest pathways for improving their adherence to formal world models.
Methodology
The authors use adversarial sequence generation to test the soundness of generative models trained on chess game transcripts. They evaluate models trained on datasets of varying quality and size, employing different training recipes. Adversarial sequences are crafted to force the models into generating invalid predictions. Additionally, linear board state probes are used to analyze the causal role of extracted board states in the models' predictions.
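A toy version of the soundness check such a framework needs, using the python-chess library to replay a model-generated move sequence and flag the first rule violation; the SAN input format is an illustrative assumption:

    import chess

    def first_illegal_move(moves_san):
        """moves_san: list of SAN strings, e.g. ['e4', 'e5', 'Nf3']. Returns the
        index of the first move that violates the rules, or None if all are legal."""
        board = chess.Board()
        for i, san in enumerate(moves_san):
            try:
                board.push_san(san)   # raises a ValueError subclass on illegal moves
            except ValueError:
                return i
        return None

The adversarial part of the framework then searches for game prefixes that maximize the chance the model's next prediction fails a check of this kind.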
Results
The study finds that none of the tested generative models are fully sound, as they fail to consistently adhere to the formal rules of chess. However, certain training techniques and high-quality datasets improve soundness significantly. The analysis also reveals that linear board state probes, while useful for understanding information encoding, do not have a strong causal connection to the models' predictions.
Implications
The proposed adversarial framework provides a practical tool for evaluating the soundness of implicit world models in generative models. The findings highlight the limitations of current training methods and suggest the need for improved approaches to better align generative models with formal world models. This work has implications for applications in domains requiring strict adherence to formal rules, such as automated reasoning, game-playing AI, and formal language processing.
View on arXiv

When Are RL Hyperparameters Benign? A Study in Offline Goal-Conditioned RL

Jan Malte Töpperwien, Aditya Mohan, Marius Lindauer
  • Offline goal-conditioned RL (GCRL) provides a controlled setting to study hyperparameter sensitivity by fixing data distributions and standardizing objectives.
  • QRL (quasimetric representation learning) exhibits broad, stable near-optimal hyperparameter regions, while HIQL (bootstrapped TD-learning) shows sharp optima and significant sensitivity to non-stationary data.
  • Destructive gradient interference in bootstrapped objectives is identified as a key factor contributing to hyperparameter sensitivity in HIQL.
  • Offline GCRL is generally more robust to hyperparameter changes compared to online RL, even under degraded data quality.
  • The study highlights the potential to design more robust RL algorithms by addressing the dynamics of bootstrapping and gradient interference.
Read More
Abstract
This paper investigates the sensitivity of reinforcement learning (RL) algorithms to hyperparameter configurations, specifically in the context of offline goal-conditioned RL (GCRL). The authors aim to determine whether hyperparameter sensitivity is an intrinsic property of RL or if it is amplified by specific training mechanisms such as bootstrapping and non-stationary data distributions. By leveraging the controlled environment of offline GCRL, where data distributions are fixed and non-stationarity can be explicitly introduced, the study compares two representative algorithms: HIQL (a bootstrapped TD-learning method) and QRL (a quasimetric representation learning method). The results show that offline GCRL is generally more robust to hyperparameter changes compared to online RL. However, the study identifies significant differences between the two algorithms: QRL demonstrates stable, broad near-optimal hyperparameter regions, while HIQL exhibits sharp optima and high sensitivity, particularly under non-stationary data regimes. The authors attribute this divergence to destructive gradient interference in bootstrapped objectives, which is more pronounced in HIQL. These findings suggest that hyperparameter sensitivity in RL is not inevitable but can be mitigated through careful algorithmic design.
Methodology
The authors use offline goal-conditioned RL (GCRL) as a controlled experimental setup to isolate the effects of hyperparameter sensitivity. They compare two algorithms: HIQL, which uses bootstrapped temporal difference (TD) learning, and QRL, which employs quasimetric representation learning without bootstrapping. The study involves constructing offline datasets with varying data quality (ranging from expert to exploratory trajectories) and introducing scheduled shifts in data quality to simulate non-stationary regimes. Extensive hyperparameter sweeps are conducted, and the resulting performance landscapes are analyzed using response surfaces, ε-optimality mass, and phase-resolved fANOVA profiles. Inter-goal gradient interference is also measured to explain differences in sensitivity between the two algorithms.
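A small sketch of the inter-goal gradient-interference measurement: cosine similarity between per-goal loss gradients, where strongly negative values signal destructive interference. The loss function and batch structure are illustrative placeholders:

    import torch
    import torch.nn.functional as F

    def gradient_interference(model, loss_fn, batch_goal_a, batch_goal_b):
        """Returns cos(grad_a, grad_b) over all model parameters; values near -1
        indicate that the two goals' updates destructively interfere."""
        def flat_grad(batch):
            model.zero_grad()
            loss_fn(model, batch).backward()
            return torch.cat([p.grad.reshape(-1) for p in model.parameters()
                              if p.grad is not None])
        g_a, g_b = flat_grad(batch_goal_a), flat_grad(batch_goal_b)
        return F.cosine_similarity(g_a, g_b, dim=0)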
Results
The study finds that hyperparameter landscapes in offline GCRL are more stable and robust compared to online RL. QRL maintains broad and stable near-optimal hyperparameter regions, particularly when modest expert data (approximately 20%) is present. In contrast, HIQL exhibits sharp optima and significant sensitivity to changes in hyperparameter configurations, especially under non-stationary data regimes. The divergence is explained by stronger destructive gradient interference in HIQL, which arises from the interaction between bootstrapped targets and shifting data distributions.
Implications
The findings suggest that hyperparameter sensitivity in RL is not an intrinsic property but can be exacerbated by specific training mechanisms such as bootstrapping. This opens up opportunities to design more robust RL algorithms by addressing destructive gradient interference and leveraging the stability of representation learning approaches like QRL. The study also highlights the utility of offline GCRL as a controlled framework for analyzing RL dynamics and improving algorithmic robustness.
View on arXiv