gistml

By James Asher

Daily summaries of the latest Machine Learning research papers from Arxiv.

2026-02-07 • Found 24 papers

A Causal Perspective for Enhancing Jailbreak Attack and Defense

Licheng Pan, Yunsheng Lu, Jiexi Liu, Jialing Tao, Haozhe Feng, Hui Xue, Zhixuan Chu, Kui Ren
  • The paper introduces Causal Analyst, a framework that integrates LLMs and GNN-based causal discovery to analyze jailbreak vulnerabilities in LLMs.
  • A dataset of 35,000 jailbreak attempts across seven LLMs was created, annotated with 37 interpretable prompt features to support causal analysis.
  • Key prompt features, such as 'Positive Character' and 'Number of Task Steps,' were identified as direct causal drivers of jailbreak success.
  • The framework enables two practical applications: a Jailbreaking Enhancer to improve attack success rates and a Guardrail Advisor to detect and mitigate malicious intent.
  • The causal approach outperforms non-causal methods in robustness, interpretability, and effectiveness in addressing jailbreak vulnerabilities.
Abstract
This paper introduces a novel framework, Causal Analyst, to analyze and address jailbreak vulnerabilities in large language models (LLMs) from a causal perspective. Jailbreak attacks exploit LLMs to generate harmful or policy-violating outputs, and understanding the mechanisms behind these attacks is critical for improving LLM safety. The authors construct a comprehensive dataset of 35,000 jailbreak attempts across seven LLMs, annotated with 37 human-readable prompt features. By combining LLM-based prompt encoding with graph neural networks (GNNs) for causal graph learning, the framework identifies direct causal relationships between specific prompt features and jailbreak occurrences. Key findings reveal that features like 'Positive Character' and 'Number of Task Steps' are primary drivers of jailbreak success. The paper demonstrates the utility of these insights through two applications: a Jailbreaking Enhancer that improves attack success rates by targeting causal features, and a Guardrail Advisor that uses the causal graph to detect and mitigate malicious intent in obfuscated queries. Extensive experiments validate the robustness and interpretability of the causal analysis, showing its superiority over non-causal approaches.
Methodology
The authors developed a dataset of 35,000 jailbreak attempts using 100 attack templates and 50 harmful queries, annotated with 37 human-readable prompt features. They combined LLM-based prompt encoding with GNN-based causal graph learning to reconstruct causal pathways between prompt features and jailbreak outcomes. The framework judges jailbreak success by Answer Harmfulness (AH) rather than by simple refusal bypass.
Results
The causal analysis identified specific prompt features as direct contributors to jailbreak success. The Jailbreaking Enhancer application significantly improved attack success rates by targeting these features, while the Guardrail Advisor effectively extracted malicious intent from obfuscated queries. Experimental results validated the robustness and interpretability of the causal framework, demonstrating its superiority over non-causal approaches.
Implications
This work provides a novel, interpretable approach to understanding and mitigating jailbreak vulnerabilities in LLMs. The insights can guide the development of more robust defenses against adversarial attacks and improve the safety and reliability of LLMs in real-world applications. Additionally, the causal framework could be extended to other domains where understanding the causal drivers of system vulnerabilities is critical.
View on arXiv

A Simple Reduction Scheme for Constrained Contextual Bandits with Adversarial Contexts via Regression

Dhruv Sarkar, Abhishek Sinha
  • Introduces a reduction-based algorithmic framework for constrained contextual bandits with adversarial contexts.
  • Leverages online regression oracles to estimate reward and cost functions under the realizability assumption.
  • Proposes an inverse-gap-weighting (IGW) policy with adaptive learning rates to balance exploration, exploitation, and constraint satisfaction.
  • Achieves improved regret and cumulative constraint violation (CCV) bounds compared to prior methods.
  • Provides a unified and modular analysis framework that simplifies the study of constrained contextual bandits.
Abstract
This paper addresses the problem of constrained contextual bandits (CCB) in adversarial settings, where contexts are chosen adversarially, and actions yield both random rewards and costs. The authors propose a novel reduction-based algorithmic framework that extends the SquareCB framework to handle long-term constraints in adversarial environments. Their approach leverages online regression oracles to estimate reward and cost functions, which are then used to construct surrogate objectives. These surrogate objectives are optimized using an inverse-gap-weighting (IGW) policy with adaptive learning rates, balancing exploration, exploitation, and constraint satisfaction. The proposed method achieves improved regret and cumulative constraint violation (CCV) bounds compared to prior work, particularly under adversarial contexts. The analysis is modular and transparent, relying on a single key inequality, and the framework is flexible enough to accommodate various feasibility assumptions. This work advances the state-of-the-art in constrained contextual bandits by providing robust guarantees in adversarial settings, which are critical for applications like recommendation systems, clinical trials, and resource-constrained decision-making.
Methodology
The authors extend the SquareCB framework by incorporating long-term constraints into the contextual bandit problem with adversarial contexts. They use online regression oracles to estimate the mean reward and cost functions, which are then used to construct surrogate objectives. These objectives are optimized using an inverse-gap-weighting (IGW) policy with adaptive learning rates. The analysis relies on a regret decomposition scheme and a single key inequality to derive performance guarantees.
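For readers unfamiliar with inverse-gap weighting, the sketch below shows the SquareCB-style action distribution the framework builds on. The scalar `scores` stand in for the paper's surrogate objective (which combines reward and cost estimates); the function names, constants, and the example values are illustrative, not the authors' implementation.

```python
import numpy as np

def igw_distribution(scores: np.ndarray, gamma: float) -> np.ndarray:
    """Inverse-gap-weighting (SquareCB-style) action distribution.

    `scores` are per-action estimates from the regression oracle; in the
    constrained setting they would be a surrogate combining reward and
    cost predictions (details follow the paper, not reproduced here).
    """
    k = len(scores)
    best = int(np.argmax(scores))
    gaps = scores[best] - scores          # non-negative gaps to the leader
    probs = 1.0 / (k + gamma * gaps)      # small probability for large gaps
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()       # leader takes the remaining mass
    return probs

# Example: 4 actions, surrogate scores, exploration parameter gamma
p = igw_distribution(np.array([0.1, 0.7, 0.4, 0.2]), gamma=20.0)
action = np.random.default_rng(0).choice(4, p=p)
```

Larger `gamma` concentrates probability on the empirically best action; per the summary, the constrained variant additionally adapts the learning rate and folds constraint satisfaction into the surrogate objective.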
Results
The proposed algorithm achieves improved regret and cumulative constraint violation (CCV) bounds under various feasibility assumptions, such as almost sure feasibility with general costs and expected feasibility with non-negative costs. The framework outperforms existing methods in adversarial settings, providing robust guarantees for constrained contextual bandits.
Implications
This work has significant implications for applications requiring decision-making under uncertainty with long-term constraints, such as personalized recommendation systems, clinical trials, and resource-constrained optimization problems. The ability to handle adversarial contexts makes the proposed framework particularly valuable in dynamic and competitive environments like online auctions and security games.
View on arXiv

Autodiscover: A reinforcement learning recommendation system for the cold-start imbalance challenge in active learning, powered by graph-aware Thompson sampling

Parsa Vares
  • AutoDiscover addresses the cold-start imbalance challenge in active learning by dynamically adapting query strategies using reinforcement learning.
  • The system models scientific literature as a heterogeneous graph, capturing relationships between documents, authors, and metadata.
  • A Heterogeneous Graph Attention Network (HAN) and Discounted Thompson Sampling (DTS) agent enable adaptive decision-making during the review process.
  • AutoDiscover outperforms static active learning baselines on the SYNERGY benchmark, demonstrating higher screening efficiency and better handling of minimal initial labels.
  • The TS-Insight dashboard provides interpretability and transparency for the system's decision-making process.
Abstract
This paper introduces AutoDiscover, a novel reinforcement learning-based framework designed to address the cold-start imbalance challenge in active learning, particularly in the context of systematic literature reviews (SLRs). Traditional active learning systems often rely on static query strategies, which fail to adapt to the dynamic nature of the screening process and overlook the relational structure inherent in scientific literature. AutoDiscover reframes active learning as an online decision-making problem, leveraging a heterogeneous graph representation of scientific literature to capture relationships between documents, authors, and metadata. A Heterogeneous Graph Attention Network (HAN) is used to learn node representations, which are then utilized by a Discounted Thompson Sampling (DTS) agent to dynamically manage and adapt a portfolio of query strategies. The system balances exploration and exploitation, adapting to the non-stationary dynamics of the review process. Evaluated on the SYNERGY benchmark dataset, AutoDiscover outperforms static baselines in screening efficiency, particularly in cold-start scenarios with minimal initial labels. The work also introduces TS-Insight, an open-source visual analytics dashboard for interpreting the agent's decisions, enhancing transparency and usability. This framework has the potential to significantly accelerate systematic literature reviews, reducing the manual workload and improving the discovery of relevant studies.
Methodology
AutoDiscover employs a heterogeneous graph representation of scientific literature to capture structural relationships between documents, authors, and metadata. A Heterogeneous Graph Attention Network (HAN) learns node representations, which are used by a Discounted Thompson Sampling (DTS) agent to manage a portfolio of query strategies. The agent dynamically balances exploration and exploitation, adapting to the evolving utility of different strategies based on real-time feedback from human-in-the-loop labels. The system is evaluated on the SYNERGY benchmark dataset, which includes 26 datasets with known ground truth for performance comparison.
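As a rough illustration of the strategy-selection loop, here is a minimal Beta-Bernoulli Discounted Thompson Sampling agent. The discount factor, the reward definition (1 if the screened document turns out to be relevant), and the number of strategies are assumptions made for the sketch, not details taken from the paper.

```python
import numpy as np

class DiscountedTS:
    """Discounted Thompson Sampling over a portfolio of query strategies.

    Successes/failures are discounted each round so the agent tracks the
    non-stationary utility of each strategy as the review progresses.
    """
    def __init__(self, n_strategies: int, discount: float = 0.95):
        self.alpha = np.ones(n_strategies)   # pseudo-counts of successes
        self.beta = np.ones(n_strategies)    # pseudo-counts of failures
        self.discount = discount
        self.rng = np.random.default_rng(0)

    def select(self) -> int:
        samples = self.rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, strategy: int, reward: float) -> None:
        # Discount all counts (older feedback matters less), then credit
        # the strategy that produced the latest human-labelled document.
        self.alpha = self.discount * self.alpha
        self.beta = self.discount * self.beta
        self.alpha[strategy] += reward
        self.beta[strategy] += 1.0 - reward

agent = DiscountedTS(n_strategies=3)
s = agent.select()
agent.update(s, reward=1.0)   # e.g. the screened document was relevant
```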
Results
AutoDiscover demonstrated superior screening efficiency compared to static active learning baselines on the SYNERGY benchmark. It was particularly effective in addressing the cold-start challenge, successfully bootstrapping discovery from minimal initial labels. The system identified relevant studies at a significantly higher rate, reducing the manual effort required for systematic literature reviews.
Implications
AutoDiscover has the potential to revolutionize systematic literature reviews by significantly reducing the time and effort required to identify relevant studies. This can accelerate evidence-based research in fields such as public health, engineering, and climate science, where timely access to relevant literature is critical. The adaptive nature of the system ensures that it remains effective across diverse and evolving datasets, making it a valuable tool for researchers and policymakers.
View on arXiv

Can vision language models learn intuitive physics from interaction?

Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz
  • Vision language models (VLMs) struggle to develop generalizable intuitive physics, even with interactive training using reinforcement learning.
  • Interactive training, modeled after human learning through interaction, does not significantly outperform non-interactive supervised fine-tuning in generalization tasks.
  • Both interactive and non-interactive methods enable VLMs to perform well on training tasks but fail to generalize to new tasks or contexts.
  • Physical quantities like tower stability are decodable from model activations, but this competence does not translate into improved task performance.
  • The study highlights the limitations of current VLM architectures and training paradigms in learning robust physical intuitions.
Abstract
This paper investigates whether vision language models (VLMs) can develop intuitive physics capabilities through interaction with their environment, inspired by cognitive science theories that emphasize the importance of active engagement in learning physical dynamics. The authors compare two training paradigms: an interactive condition where VLMs learn through trial-and-error using reinforcement learning (RL) and a non-interactive condition where VLMs are shown optimal action sequences. The study focuses on tasks such as building stable towers of blocks and generalizing to related tasks like judging tower stability. Despite achieving high performance on training tasks, neither approach enables VLMs to generalize robustly to new physical tasks or contexts. The authors also explore the decodability of physical quantities (e.g., tower stability) from model activations but find no evidence that interaction-based training improves generalization. The findings highlight the limitations of current VLMs in acquiring generalizable physical intuitions, even when trained interactively.
Methodology
The authors operationalized interaction using one-step reinforcement learning (RL), where VLMs were tasked with building stable towers of blocks in a simulated physics environment. Models in the interactive condition learned through trial-and-error, receiving rewards based on tower stability, while models in the non-interactive condition were shown optimal action sequences. The study evaluated generalization by testing models on unseen tower configurations and a related task of judging tower stability. Additionally, the authors analyzed model activations to assess the decodability of physical quantities.
Results
The study found no significant differences between the interactive and non-interactive conditions in terms of generalization. Both methods produced models that performed well on training tasks but failed to generalize to new tasks or contexts. While physical quantities like tower stability were decodable from model activations, this did not translate into improved performance on generalization tasks.
Implications
The findings suggest that current VLM architectures and training paradigms, including interactive reinforcement learning, are insufficient for developing generalizable intuitive physics. This highlights the need for new approaches to enable VLMs to acquire robust physical reasoning capabilities, which could have implications for applications in robotics, autonomous systems, and human-like AI reasoning.
View on arXiv

Chunky Post-Training: Data Driven Failures of Generalization

Seoirse Murray, Allison Qi, Timothy Qian, John Schulman, Collin Burns, Sara Price
  • Chunky post-training refers to the unintended generalization of spurious correlations from discrete chunks of post-training data.
  • SURF is a tool for identifying unintended model behaviors at runtime, while TURF traces these behaviors back to specific training data patterns.
  • The study finds that these failures are prevalent across both proprietary and open-source LLMs, highlighting a systemic issue in current post-training practices.
  • Failures often arise from imbalanced or underspecified training data, leading to behaviors that conflict with user expectations or task requirements.
  • The authors provide open-source tools and a results explorer to facilitate further research and auditing of LLM behaviors.
Abstract
This paper introduces the concept of 'chunky post-training,' a phenomenon where large language models (LLMs) generalize unintended patterns from discrete chunks of post-training data. These chunks, designed to teach specific behaviors, often encode spurious correlations that lead to miscalibrated or unexpected model behaviors. For example, models may incorrectly associate specific prompt features (e.g., formatting or phrasing) with certain behaviors, resulting in failures such as rejecting true facts or misinterpreting user intent. To address this, the authors propose two tools: SURF (Surfacing Unintended Response Failures), a black-box pipeline for identifying these unintended behaviors during inference, and TURF (Tracing Unintended Responses via Features), which traces these failures back to specific patterns in the training data. The study demonstrates that these failures are widespread across both frontier models (e.g., GPT-5.1, Claude 4.5) and open models (e.g., TĂĽlu 3). The authors argue that understanding and mitigating these issues is critical for improving user trust, evaluation reliability, and the overall alignment of LLMs with intended behaviors.
Methodology
The authors developed two tools: SURF, a black-box auditing pipeline that identifies unintended behaviors during inference, and TURF, which maps these behaviors to specific features in the post-training data. These tools were applied to several state-of-the-art LLMs (e.g., Claude 4.5, GPT-5.1, Gemini 3, Grok 4.1) and an open-source model (TĂĽlu 3). The study involved analyzing model responses to various prompts and identifying patterns of misgeneralization linked to training data artifacts.
Results
The study demonstrates that chunky post-training failures are widespread across both proprietary and open-source LLMs. These failures often manifest as miscalibrated behaviors, such as rejecting true facts or misinterpreting user intent, and can be traced back to specific patterns in the training data. The authors provide empirical evidence that these issues are caused by imbalanced or underspecified data chunks used during post-training.
Implications
The findings highlight the need for more rigorous auditing and curation of post-training datasets to mitigate unintended behaviors in LLMs. The proposed tools, SURF and TURF, can help developers identify and address these issues, potentially improving model reliability, user trust, and evaluation accuracy. This research also underscores the importance of understanding the impact of training data on model behavior, which is critical for advancing the development of aligned and trustworthy AI systems.
View on arXiv

E-Globe: Scalable ε-Global Verification of Neural Networks via Tight Upper Bounds and Pattern-Aware Branching

Wenting Li, Saif R. Kazi, Russell Bent, Duo Zhou, Huan Zhang
  • E-Globe introduces a hybrid verification framework combining tight upper bounds (via NLP–CC) and relaxation-based lower bounds within a branch-and-bound (BaB) framework.
  • The NLP–CC formulation ensures feasibility-preserving upper bounds and enables efficient pruning of unsafe subproblems.
  • Warm-started NLP solves and pattern-aligned strong branching significantly accelerate the verification process.
  • E-Globe achieves tighter bounds and faster verification compared to traditional MIP-based methods, especially on MNIST and CIFAR-10 datasets.
  • The method provides a scalable solution for certifying neural network robustness in safety-critical applications.
Abstract
This paper introduces E-Globe, a novel hybrid verification framework for neural networks that addresses the scalability-completeness trade-off in formal verification. The proposed method combines tight upper bounds, derived from a nonlinear program with complementarity constraints (NLP–CC), with relaxation-based lower bounds within a branch-and-bound (BaB) framework. The NLP–CC formulation preserves the ReLU input-output graph, ensuring that any feasible solution provides a valid counterexample and enables efficient pruning of unsafe subproblems. The authors also propose two key optimizations: warm-started NLP solves, which minimize constraint updates for faster computation, and pattern-aligned strong branching, which prioritizes splits that effectively tighten relaxations. The framework is evaluated on MNIST and CIFAR-10 datasets, demonstrating tighter upper bounds, faster per-node solves, and significant speedups over traditional mixed-integer programming (MIP)-based verification methods. These results highlight E-Globe's potential for scalable and robust neural network verification in safety-critical applications.
Methodology
The authors propose a hybrid verification approach that integrates tight upper bounds from a nonlinear program with complementarity constraints (NLP–CC) and relaxation-based lower bounds within a branch-and-bound (BaB) framework. The NLP–CC formulation preserves the ReLU input-output graph, ensuring valid counterexamples and efficient pruning. To enhance computational efficiency, the authors introduce warm-started NLP solves, which minimize updates to the constraint matrix, and pattern-aligned strong branching, which prioritizes splits aligned with the activation patterns of neurons.
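The paper's NLP-CC program is not reproduced in this summary, but the exact complementarity encoding of a single ReLU, which constraints of this kind rely on, can be written as follows (z is the pre-activation, y the post-activation):

```latex
y \ge 0, \qquad y - z \ge 0, \qquad y\,(y - z) = 0
\quad\Longleftrightarrow\quad y = \max(0, z)
```

Any point satisfying these constraints reproduces the ReLU input-output graph exactly, which is why a feasible solution of such a program yields a valid counterexample rather than a relaxation artifact.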
Results
E-Globe achieves tighter upper bounds than projected gradient descent (PGD) across a wide range of perturbation radii. The proposed warm-starting and pattern-aligned branching techniques yield significant speedups, reducing the computational cost of verification. On MNIST and CIFAR-10 datasets, the method outperforms traditional MIP-based verification approaches, demonstrating faster convergence and improved scalability.
Implications
E-Globe provides a scalable and efficient framework for verifying the robustness of neural networks, making it particularly suitable for safety-critical applications such as autonomous systems, power grids, and medical diagnostics. By achieving tighter bounds and faster verification, the method enables more reliable deployment of neural networks in real-world scenarios where robustness guarantees are essential.
View on arXiv

EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization

Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, Lizhu Zhang
  • EBPO introduces a shrinkage estimator that combines local group statistics with global historical statistics to stabilize advantage estimation.
  • The framework addresses GRPO's limitations, including high variance with small group sizes and vanishing gradients in saturated failure regimes.
  • Theoretical analysis proves that EBPO achieves lower MSE, non-zero gradients in failure scenarios, and bounded entropy decay.
  • Empirical results demonstrate superior performance and training stability across benchmarks like AIME and OlympiadBench.
  • EBPO is particularly effective in resource-constrained settings and benefits from difficulty-stratified curriculum learning.
Abstract
This paper introduces Empirical Bayes Policy Optimization (EBPO), a novel framework designed to address stability challenges in Group Relative Policy Optimization (GRPO) for Reinforcement Learning with Verifiable Rewards (RLVR). GRPO, while computationally efficient, suffers from high variance in advantage estimation with small group sizes and vanishing gradients in saturated failure regimes. EBPO mitigates these issues by incorporating a shrinkage estimator that dynamically balances local group statistics with global historical performance statistics, estimated using Welford’s online algorithm. The authors theoretically demonstrate that EBPO reduces the Mean Squared Error (MSE) of advantage estimation, prevents vanishing gradients, and ensures bounded entropy decay. Empirical evaluations on benchmarks such as AIME and OlympiadBench show that EBPO outperforms GRPO and other baselines, particularly in resource-constrained settings and when combined with difficulty-stratified curriculum learning.
Methodology
EBPO reframes advantage estimation in GRPO using Empirical Bayes (EB) inference. It employs a shrinkage estimator that dynamically adjusts the baseline by combining local group statistics with a global prior, which is updated using Welford’s online algorithm. This approach reduces variance and ensures informative gradients even in saturated failure regimes. The framework is validated through theoretical analysis and empirical evaluations on diverse RLVR benchmarks.
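A minimal sketch of the two ingredients, with a fixed shrinkage weight for readability (the paper derives the weighting from empirical-Bayes reasoning rather than fixing it by hand); variable names and the example rewards are illustrative.

```python
import numpy as np

class WelfordTracker:
    """Welford's online algorithm for a running mean and variance."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def var(self) -> float:
        return self.m2 / self.n if self.n > 1 else 0.0

def shrunk_advantages(group_rewards: np.ndarray, global_stats: WelfordTracker,
                      lam: float = 0.5) -> np.ndarray:
    """Advantages against a baseline shrunk toward the global history.

    `lam` controls how far the local group mean is pulled toward the global
    running mean (an illustrative fixed weight for this sketch).
    """
    local_mean = group_rewards.mean()
    baseline = lam * local_mean + (1.0 - lam) * global_stats.mean
    return group_rewards - baseline

tracker = WelfordTracker()
for r in [0.0, 1.0, 0.0, 0.0]:       # historical per-rollout rewards
    tracker.update(r)
adv = shrunk_advantages(np.array([0.0, 0.0, 0.0, 0.0]), tracker)
print(adv)  # non-zero even when the whole group fails, unlike plain GRPO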
Results
EBPO consistently outperforms GRPO and other baselines across multiple benchmarks, achieving over 11% improvement in resource-constrained settings with small group sizes (G=8). It demonstrates enhanced training stability, reduced variance in advantage estimation, and superior performance when combined with difficulty-stratified curriculum learning.
Implications
EBPO provides a robust and efficient solution for stabilizing advantage estimation in RLVR, making it particularly valuable for improving the reasoning capabilities of Large Language Models (LLMs) in tasks like mathematical reasoning and code generation. Its ability to operate effectively in resource-constrained settings and leverage curriculum learning suggests potential applications in scalable, cost-efficient training of advanced AI systems.
View on arXiv

EdgeMask-DG*: Learning Domain-Invariant Graph Structures via Adversarial Edge Masking

Rishabh Bhattacharya, Naresh Manwani
  • Proposes EdgeMask-DG, a min-max adversarial framework for learning domain-invariant edge masks in graph structures.
  • Introduces EdgeMask-DG*, which extends EdgeMask-DG by leveraging enriched graph structures that integrate original topology with feature-derived edges.
  • Utilizes a GAT backbone to incorporate learned continuous edge masks as edge attributes, improving the model's ability to focus on robust substructures.
  • Achieves state-of-the-art performance on multiple Graph-DG benchmarks, including citation networks, social networks, and e-commerce graphs.
  • Improves worst-case domain accuracy on the Cora OOD benchmark to 78.0%, a 3.8 percentage point improvement over prior methods.
Abstract
This paper introduces EdgeMask-DG*, a novel framework for Graph Domain Generalization (Graph-DG) that addresses structural distribution shifts in graph neural networks (GNNs). The method builds upon EdgeMask-DG, a min-max adversarial learning framework where an edge masker network learns to generate sparse masks over graph edges to challenge a task GNN. EdgeMask-DG* extends this by applying the adversarial masking principle to an enriched graph structure that combines the original topology with feature-derived edges (e.g., k-Nearest Neighbors and spectral clustering). This enriched representation allows the model to discover domain-invariant structural patterns even when the original graph topology is noisy or domain-specific. The framework employs a Graph Attention Network (GAT) backbone, which integrates the learned edge masks as edge attributes to enhance message passing. Experimental results demonstrate that EdgeMask-DG* achieves state-of-the-art performance across various benchmarks, including citation networks, social networks, and e-commerce graphs, significantly improving worst-case domain accuracy.
Methodology
EdgeMask-DG* employs a min-max adversarial learning framework where an edge masker network generates sparse masks over graph edges to challenge a task GNN. The task GNN is trained to perform robustly under these adversarial conditions, encouraging it to learn domain-invariant structural patterns. The framework extends this approach to an enriched graph representation, which combines the original graph topology with feature-derived edges using k-Nearest Neighbors and spectral clustering. A GAT backbone is used to incorporate the learned edge masks as edge attributes, enhancing the model's ability to focus on relevant structural information.
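The toy loop below illustrates the min-max edge-masking idea on a dense adjacency matrix: the masker tries to make the task model's loss worse while staying sparse, and the task model is trained to stay accurate under the adversarial mask. Both modules are deliberately tiny stand-ins for the paper's masker network and GAT backbone, and the enriched k-NN/spectral edges and cross-domain setup are omitted.

```python
import torch
import torch.nn as nn

n, d, c = 50, 16, 3                           # nodes, feature dim, classes
x = torch.randn(n, d)
adj = (torch.rand(n, n) < 0.1).float()        # toy graph
labels = torch.randint(0, c, (n,))

masker = nn.Linear(2 * d, 1)                  # scores each edge from its endpoints
task = nn.Linear(d, c)                        # mean aggregation + linear head
opt_m = torch.optim.Adam(masker.parameters(), lr=1e-2)
opt_t = torch.optim.Adam(task.parameters(), lr=1e-2)

def edge_mask() -> torch.Tensor:
    pair = torch.cat([x.unsqueeze(1).expand(n, n, d),
                      x.unsqueeze(0).expand(n, n, d)], dim=-1)
    return torch.sigmoid(masker(pair)).squeeze(-1) * adj   # soft mask on real edges

def task_loss(mask: torch.Tensor) -> torch.Tensor:
    agg = (mask @ x) / mask.sum(dim=1, keepdim=True).clamp_min(1.0)
    return nn.functional.cross_entropy(task(agg), labels)

for step in range(100):
    # Masker step: maximize the task loss while keeping the mask sparse.
    loss_m = -task_loss(edge_mask()) + 0.1 * edge_mask().mean()
    opt_m.zero_grad(); loss_m.backward(); opt_m.step()
    # Task step: minimize the loss under the (detached) adversarial mask.
    loss_t = task_loss(edge_mask().detach())
    opt_t.zero_grad(); loss_t.backward(); opt_t.step()
```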
Results
EdgeMask-DG* achieves new state-of-the-art results on diverse Graph-DG benchmarks, including citation networks (ACM, DBLP, Citation, ArXiv), social networks (Facebook-100, Twitch), and e-commerce graphs (Amazon-Photo, Elliptic). On the Cora OOD benchmark, it improves the worst-case domain accuracy to 78.0%, representing a 3.8 percentage point improvement over the previous state-of-the-art method.
Implications
EdgeMask-DG* has significant implications for applications where graph data is subject to structural distribution shifts, such as social network analysis, bioinformatics, and recommendation systems. By enabling GNNs to generalize across domains with varying graph topologies, the framework can improve the robustness and reliability of machine learning models in real-world scenarios involving heterogeneous or evolving data sources.
View on arXiv

EntRGi: Entropy Aware Reward Guidance for Diffusion Language Models

Atula Tejaswi, Litu Rout, Constantine Caramanis, Sanjay Shakkottai, Sujay Sanghavi
  • Introduces EntRGi, an entropy-aware reward guidance mechanism for discrete diffusion language models.
  • Addresses the limitations of existing methods that rely on continuous relaxations or the straight-through estimator for gradient propagation.
  • Uses entropy to dynamically interpolate between continuous and discrete token embeddings, ensuring reliable inputs for the reward model.
  • Demonstrates consistent performance improvements over state-of-the-art methods across multiple benchmarks and reward models.
  • Provides detailed empirical analysis to explain the mechanisms driving EntRGi's effectiveness.
Abstract
This paper introduces EntRGi, a novel entropy-aware reward guidance mechanism for discrete diffusion language models (dLLMs). Diffusion language models generate text by iteratively denoising masked sequences, but their discrete token outputs make gradient propagation from reward models challenging. Existing methods address this issue by either using continuous relaxations of discrete tokens or employing the straight-through estimator (STE). However, these approaches suffer from degraded gradient feedback or optimization mismatches. EntRGi overcomes these limitations by dynamically modulating the use of continuous relaxations and hard token embeddings based on the model's confidence, as measured by entropy. This ensures that the reward model receives reliable inputs while improving gradient-based reward guidance. The authors validate EntRGi on a 7B-parameter diffusion language model across three reward models and three multi-skill benchmarks, demonstrating consistent improvements over state-of-the-art methods. The paper also provides a detailed analysis of EntRGi's mechanisms and its advantages over prior approaches.
Methodology
The authors propose EntRGi, which dynamically interpolates between continuous relaxations and hard token embeddings based on the entropy of the model's predictions. This approach ensures that the reward model receives inputs it can reliably interpret during the denoising process. Both the diffusion language model and the reward model are kept frozen during inference, and gradients from the reward model are used to modify the logits of the masked positions in the diffusion process.
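A minimal sketch of the entropy-gated blending at masked positions follows. Mapping normalized entropy directly to the mixing weight is an illustrative choice, and the reward-gradient step that consumes these embeddings is omitted.

```python
import torch

def entropy_gated_embedding(logits: torch.Tensor,
                            embedding_table: torch.Tensor) -> torch.Tensor:
    """Blend continuous and discrete token embeddings by prediction entropy.

    logits:          (seq_len, vocab) denoiser logits at masked positions
    embedding_table: (vocab, dim) reward model's input embeddings
    Low-entropy (confident) positions lean on the hard token embedding,
    high-entropy positions keep the softmax-weighted relaxation. The
    normalisation of entropy to a [0, 1] gate is an illustrative choice.
    """
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)     # (seq_len,)
    gate = entropy / torch.log(torch.tensor(float(logits.size(-1))))  # in [0, 1]

    soft_emb = probs @ embedding_table        # continuous relaxation
    hard_ids = probs.argmax(dim=-1)
    hard_emb = embedding_table[hard_ids]      # discrete token embedding

    return gate.unsqueeze(-1) * soft_emb + (1 - gate).unsqueeze(-1) * hard_emb

vocab, dim = 100, 16
emb = entropy_gated_embedding(torch.randn(8, vocab), torch.randn(vocab, dim))
```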
Results
EntRGi was evaluated on a 7B-parameter diffusion language model using three reward models and three multi-skill benchmarks. The approach consistently outperformed state-of-the-art methods in terms of reward-guided text generation. The authors also conducted a detailed empirical analysis, showing that EntRGi's entropy-aware mechanism effectively balances the trade-off between continuous and discrete representations, leading to more reliable optimization.
Implications
EntRGi has significant implications for controllable text generation in discrete diffusion language models. By improving reward-guided optimization, it enables more effective inference-time steering of large language models without requiring expensive retraining. This approach could be applied to tasks such as stylization, semantic editing, and solving inverse problems in text generation.
View on arXiv

Escaping Local Minima Provably in Non-convex Matrix Sensing: A Deterministic Framework via Simulated Lifting

Tianqi Shen, Jinji Yang, Junze He, Kunhan Gao, Ziye Ma
  • Introduces the Simulated Oracle Direction (SOD) Escape framework to deterministically escape local minima in non-convex optimization.
  • Leverages simulated over-parameterization to identify escape directions without the computational cost of explicit over-parameterization.
  • Provides theoretical guarantees for escaping spurious local minima without relying on randomness or heuristics.
  • Demonstrates the effectiveness of the framework in low-rank matrix sensing problems through numerical experiments.
  • Highlights the potential for extending the approach to other non-convex optimization problems.
Abstract
This paper addresses the challenge of escaping spurious local minima in non-convex optimization, specifically in the context of low-rank matrix sensing. The authors propose a novel deterministic framework called Simulated Oracle Direction (SOD) Escape, which leverages insights from over-parameterization without incurring its computational costs. Over-parameterization has been shown to improve optimization landscapes by converting local minima into strict saddle points, but its practical implementation is computationally expensive. The SOD framework simulates the escape directions from an over-parameterized space and projects them back into the original parameter space, ensuring a strict decrease in the objective function. Unlike existing methods that rely on random perturbations or heuristic rules, this approach is theoretically grounded and deterministic. Numerical experiments demonstrate that the proposed method reliably escapes spurious local minima and converges to global optima with minimal computational overhead. The framework has broader implications for non-convex optimization problems beyond matrix sensing.
Methodology
The authors develop a deterministic escape mechanism by simulating the effects of over-parameterization in a high-dimensional space and projecting the resulting escape directions back into the original parameter space. This approach avoids the computational and memory costs of explicit over-parameterization. The framework is applied to the structured matrix sensing problem, where the goal is to recover a low-rank positive semidefinite matrix from linear measurements. Theoretical analysis ensures that the projected escape directions lead to a strict decrease in the objective function.
Results
The proposed SOD framework successfully escapes spurious local minima and converges to global optima in numerical experiments on matrix sensing problems. The method achieves this with minimal computational overhead compared to explicit over-parameterization, demonstrating its efficiency and reliability.
Implications
The SOD framework has significant implications for non-convex optimization, particularly in machine learning and signal processing applications. By providing a computationally efficient and theoretically grounded method for escaping local minima, it could improve optimization performance in tasks such as phase retrieval, quantum tomography, collaborative filtering, and power system state estimation. Additionally, the approach may inspire new strategies for addressing challenging optimization landscapes in other domains.
View on arXiv

Exact Recovery in the Data Block Model

Amir R. Asadi, Akbar Davoodi, Ramin Javadi, Farzad Parvaresh
  • The paper extends the stochastic block model (SBM) by incorporating node-associated data, formalized as the Data Block Model (DBM).
  • A novel metric, Chernoff–TV divergence, is introduced to characterize the exact recovery threshold in the DBM.
  • An efficient algorithm is proposed to achieve the exact recovery threshold, with a matching impossibility result below the threshold.
  • Simulations demonstrate the benefits of leveraging vertex data for community detection, especially in challenging regimes.
  • The study bridges network-based and data-driven approaches to clustering, offering a refined understanding of community detection limits.
Abstract
This paper investigates the problem of exact recovery in community detection within the Data Block Model (DBM), an extension of the stochastic block model (SBM) that incorporates node-associated data. While traditional SBMs focus solely on graph connectivity, the DBM integrates additional vertex data, such as node attributes, to enhance community detection. The authors introduce the Chernoff–TV divergence as a metric to establish a sharp threshold for exact recovery in the DBM. They propose an efficient algorithm that achieves this threshold and provide a matching converse result demonstrating the impossibility of exact recovery below the threshold. Simulations validate the theoretical findings and highlight the advantages of incorporating node-specific data as side information, particularly in scenarios where graph structure alone is insufficient for accurate community detection.
Methodology
The authors use the Chernoff–TV divergence to derive a sharp threshold for exact recovery in the DBM. They design an efficient algorithm that achieves this threshold and prove a converse result showing the impossibility of recovery below the threshold. The theoretical results are supported by simulations that validate the effectiveness of incorporating node-specific data.
Results
The paper establishes a sharp threshold for exact recovery in the DBM using the Chernoff–TV divergence. The proposed algorithm achieves this threshold efficiently, and simulations confirm the theoretical predictions. The results demonstrate that incorporating node attributes significantly improves community detection performance, especially in cases where graph connectivity alone is insufficient.
Implications
The findings have implications for improving community detection in real-world networks enriched with node-specific data, such as social media, biological networks, and citation networks. The integration of vertex data can enhance applications like recommendation systems, fraud detection, and biological analysis by enabling more accurate identification of community structures.
View on arXiv

How Controlling the Variance can Improve Training Stability of Sparsely Activated DNNs and CNNs

Emily Dent, Jared Tanner
  • The paper extends the Edge-of-Chaos (EoC) theory to activation functions that induce high sparsity, such as CReLU and CST.
  • Controlling the fixed variance (q*) of the Gaussian process improves training stability and accuracy in sparsely activated networks.
  • The proposed method enables effective training of DNNs and CNNs with sparsity levels up to 90%, reducing computational and energy costs.
  • The study highlights the importance of variance control in mitigating training instability caused by sparsifying activation functions.
  • The findings suggest potential applications in energy-efficient machine learning models for edge devices.
Abstract
This paper explores how controlling the variance of the Gaussian process in the intermediate layers of deep neural networks (DNNs) and convolutional neural networks (CNNs) can improve training stability, particularly for sparsely activated networks. The authors extend the Edge-of-Chaos (EoC) initialization theory to analyze activation functions that induce high levels of sparsity, such as shifted and clipped ReLU (CReLU) and CST. They demonstrate that increasing the fixed variance of the Gaussian process (q*) enhances the stability of training and allows for effective training of networks with sparsity levels as high as 90%. This approach not only improves the computational efficiency of DNNs and CNNs by reducing energy consumption but also maintains high accuracy despite the sparsity. The work provides a theoretical foundation for the relationship between variance control, activation sparsity, and training stability, supported by experimental results.
Methodology
The authors extend the EoC initialization theory to analyze the behavior of sparsity-inducing activation functions, such as CReLU and CST, which are zero around the origin. They investigate the impact of varying the fixed variance (q*) of the Gaussian process on training stability and accuracy. Theoretical analysis is complemented by experiments on DNNs and CNNs with varying sparsity levels to validate the proposed approach.
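To make the sparsity-variance interplay concrete, the toy snippet below uses a generic shifted-and-clipped ReLU and Gaussian pre-activations with variance q*; the shift and clip values are illustrative, not the paper's parametrization of CReLU or CST.

```python
import numpy as np

def shifted_clipped_relu(x: np.ndarray, shift: float = 1.0,
                         clip: float = 2.0) -> np.ndarray:
    """Illustrative sparsifying activation: zero below `shift`,
    linear above it, clipped at `clip`."""
    return np.clip(x - shift, 0.0, clip)

rng = np.random.default_rng(0)
for q_star in [0.5, 1.0, 4.0]:
    pre = rng.normal(0.0, np.sqrt(q_star), size=100_000)  # pre-activations ~ N(0, q*)
    act = shifted_clipped_relu(pre)
    print(f"q*={q_star}: {np.mean(act > 0):.1%} of units active")
```

For a fixed shift, raising q* increases the fraction of active units, which illustrates why the fixed-point variance of the Gaussian process interacts with how aggressively the activation sparsifies.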
Results
The study demonstrates that increasing the fixed variance (q*) improves the stability of training and enables effective learning in DNNs and CNNs with sparsity levels up to 90%. The proposed approach achieves near-full accuracy even at high sparsity levels, showcasing its potential for energy-efficient deep learning models.
Implications
The findings have significant implications for the development of energy-efficient machine learning models, particularly for deployment on resource-constrained edge devices. By enabling stable training of highly sparse networks, the proposed approach can reduce computational and energy costs while maintaining model performance. This work also opens new avenues for further research into variance control and sparsity in neural network design.
View on arXiv

Joint Embedding Variational Bayes

Amin Oji, Paul Fieguth
  • VJE combines variational inference and joint embedding to enable probabilistic self-supervised learning without reconstruction or contrastive objectives.
  • The framework uses a symmetric conditional ELBO and a Student–t likelihood with polar decomposition to stabilize training and capture uncertainty.
  • VJE achieves competitive performance with leading non-contrastive methods on benchmarks like ImageNet-1K, CIFAR-10/100, and STL-10.
  • The probabilistic nature of VJE enables effective anomaly detection, outperforming comparable self-supervised baselines.
  • VJE eliminates the need for negative samples and auxiliary projection heads, simplifying the training process.
Abstract
This paper introduces Variational Joint Embedding (VJE), a novel framework that integrates joint embedding and variational inference to enable self-supervised learning of probabilistic representations. Unlike traditional contrastive and non-contrastive methods, VJE avoids the need for negative samples or reconstruction-based objectives. Instead, it employs a symmetric conditional evidence lower bound (ELBO) to train a latent-variable model directly on encoder embeddings. The framework uses a Student–t likelihood with polar decomposition to decouple directional and radial factors, mitigating norm-induced instabilities during training. VJE also incorporates an amortized inference network to parameterize a diagonal Gaussian variational posterior, capturing anisotropic uncertainty without requiring auxiliary projection heads. The proposed method achieves competitive performance with state-of-the-art non-contrastive baselines on standard benchmarks such as ImageNet-1K, CIFAR-10/100, and STL-10. Additionally, VJE demonstrates superior performance in anomaly detection tasks, showcasing its ability to produce probabilistic representations with calibrated uncertainty.
Methodology
VJE defines a latent-variable model directly on encoder embeddings and trains it using a symmetric conditional ELBO. The conditional likelihood is modeled with a Student–t distribution using polar decomposition to decouple directional and radial factors. An amortized inference network parameterizes a diagonal Gaussian variational posterior, sharing feature-wise variances with the likelihood scale to capture anisotropic uncertainty. The target branch is detached during training to implement fixed-observation conditioning.
Results
VJE achieves performance comparable to state-of-the-art non-contrastive methods on benchmarks such as ImageNet-1K, CIFAR-10/100, and STL-10 under linear and k-NN evaluation protocols. Additionally, in a one-class anomaly detection task on CIFAR-10, VJE outperforms other self-supervised baselines, demonstrating the utility of its probabilistic representations.
Implications
The introduction of VJE provides a principled alternative to traditional pointwise energy-based objectives in self-supervised learning. Its probabilistic representations with calibrated uncertainty have potential applications in uncertainty-sensitive domains such as medical diagnosis, anomaly detection, and reinforcement learning. By eliminating the need for negative samples and auxiliary projection heads, VJE simplifies training and broadens the applicability of self-supervised learning methods.
View on arXiv

Knowing When to Answer: Adaptive Confidence Refinement for Reliable Audio-Visual Question Answering

Dinh Phu Tran, Jihoon Jeong, Saad Wazir, Seongah Kim, Thao Do, Cem Subakan, Daeyoung Kim
  • The paper formalizes the concept of reliability in AVQA by framing it as a selective prediction problem, where models can abstain from answering to avoid incorrect predictions.
  • Adaptive Confidence Refinement (ACR) is introduced as a novel, learnable confidence estimation framework that refines the Maximum Softmax Probability (MSP) baseline using multimodal features and pre-softmax logits.
  • ACR incorporates two key components: a Residual Risk Head for predicting residual uncertainty and a Confidence Gating Head to assess MSP reliability.
  • The proposed method outperforms existing baselines across three AVQA datasets and architectures, achieving state-of-the-art risk-coverage trade-offs.
  • This work highlights the importance of reliability in AVQA systems, particularly for applications involving users with sensory impairments.
Abstract
This paper addresses the challenge of improving the reliability of Audio-Visual Question Answering (AVQA) systems by introducing a novel framework called Reliable Audio-Visual Question Answering (R-AVQA). The authors propose Adaptive Confidence Refinement (ACR), a lightweight, learnable confidence estimation method that enhances the reliability of AVQA models by enabling them to abstain from answering when predictions are uncertain. Unlike existing methods that rely on fixed heuristics or post-hoc confidence estimation, ACR refines the Maximum Softmax Probability (MSP) baseline by introducing two learned components: a Residual Risk Head to predict residual uncertainty and a Confidence Gating Head to assess the reliability of MSP. The proposed method is evaluated on three AVQA datasets and across three different AVQA architectures, demonstrating consistent improvements in risk-coverage trade-offs, particularly in challenging scenarios such as out-of-distribution generalization and data bias. This work establishes a foundation for developing reliable AVQA systems that can provide accurate answers while abstaining from uncertain predictions, which is particularly important for applications involving users with sensory impairments.
Methodology
The authors propose Adaptive Confidence Refinement (ACR), which builds on the Maximum Softmax Probability (MSP) baseline. ACR introduces two learned components: (1) a Residual Risk Head that predicts low-magnitude residuals to capture uncertainty signals missed by MSP, and (2) a Confidence Gating Head that determines the trustworthiness of MSP. ACR uses a simple linear fusion of these components, modulated by an input-adaptive weighting mechanism. The method is evaluated on three AVQA datasets (MUSIC-AVQA, MUSIC-AVQA-R, and MUSIC-AVQA-v2.0) and three representative AVQA architectures, comparing its performance against existing baselines such as MSP, Monte Carlo Dropout (MCD), and calibration-based methods.
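A rough sketch of the fusion described above; the single-linear heads, the tanh-bounded residual, and the gating form are placeholders chosen for readability rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AdaptiveConfidence(nn.Module):
    """Sketch of MSP refinement with a residual-risk head and a gating head.

    `features` are fused audio-visual features from the frozen AVQA model;
    both heads here are single linear layers purely for illustration.
    """
    def __init__(self, feat_dim: int):
        super().__init__()
        self.residual_head = nn.Linear(feat_dim, 1)   # predicts a small correction
        self.gate_head = nn.Linear(feat_dim, 1)       # how much to trust raw MSP

    def forward(self, logits: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        msp = logits.softmax(dim=-1).max(dim=-1).values             # baseline confidence
        residual = torch.tanh(self.residual_head(features)).squeeze(-1) * 0.1
        gate = torch.sigmoid(self.gate_head(features)).squeeze(-1)
        refined = gate * msp + (1 - gate) * (msp + residual)        # input-adaptive fusion
        return refined.clamp(0.0, 1.0)

acr = AdaptiveConfidence(feat_dim=512)
conf = acr(torch.randn(4, 42), torch.randn(4, 512))   # answer only if conf > threshold
```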
Results
ACR consistently outperforms existing methods in terms of risk-coverage trade-offs across all evaluated datasets and architectures. For example, on the MUSIC-AVQA dataset, ACR enables models to answer a significantly higher percentage of questions (compared to the state-of-the-art QA-TIGER method) while maintaining a low error rate of 1%. The method also demonstrates robustness under out-of-distribution and data bias scenarios, establishing its effectiveness for reliable AVQA tasks.
Implications
The proposed ACR framework has significant implications for the development of reliable AVQA systems, particularly for applications involving users with sensory impairments who rely on accurate and trustworthy responses. By enabling models to abstain from answering when uncertain, ACR reduces the risk of providing incorrect information, which is critical in high-stakes or assistive technology scenarios. Additionally, the learnable confidence estimation approach introduced in this work could inspire further research in multimodal reasoning tasks requiring reliability and uncertainty estimation.
View on arXiv

Large-scale Score-based Variational Posterior Inference for Bayesian Deep Neural Networks

Minyoung Kim
  • The paper proposes a scalable score-based variational inference method for Bayesian deep neural networks.
  • The method combines score-matching loss with a proximal penalty term, enabling the use of noisy, unbiased mini-batch scores.
  • It avoids computationally expensive operations like Hessian inversion and supports richer variational density families beyond Gaussians.
  • The approach is demonstrated on large-scale models, including Vision Transformers and ResNets, for tasks like visual recognition and time-series forecasting.
  • The method achieves faster convergence and better scalability compared to traditional ELBO-based and Gaussian score-matching approaches.
Abstract
This paper introduces a novel score-based variational inference (VI) method for Bayesian deep neural networks (BNNs) that is scalable to large-scale models and datasets. Bayesian neural networks offer advantages such as uncertainty quantification, robustness to noise, and resistance to overfitting, but their posterior inference is computationally challenging. While traditional VI methods like ELBO-based approaches are widely used, score-based methods have shown promise in certain scenarios. However, existing score-based methods are often computationally prohibitive for large-scale BNNs due to issues like reliance on Hessian computations and the inability to handle mini-batch stochastic gradients. The proposed method addresses these limitations by combining a score-matching loss with a proximal penalty term, enabling the use of noisy, unbiased mini-batch scores through stochastic gradients. This approach avoids reparameterized sampling and supports richer variational density families beyond Gaussian distributions. The method is demonstrated to be effective on benchmarks involving large-scale models like Vision Transformers (ViT) and ResNets for tasks such as visual recognition and time-series forecasting.
Methodology
The proposed method introduces a novel optimization objective that combines score-matching loss with a proximal penalty term. This design avoids the need for reparameterized sampling and allows for the use of noisy, unbiased mini-batch scores through stochastic gradients. Unlike existing Gaussian score-matching methods, the approach is scalable to large-scale problems and supports richer variational density families. The method is evaluated on benchmarks involving large-scale models like Vision Transformers and ResNets.
Results
The proposed method demonstrates effectiveness on several benchmarks, including visual recognition and time-series forecasting tasks. It achieves faster convergence and better scalability compared to traditional ELBO-based and Gaussian score-matching methods. The approach is shown to work well with large-scale models like Vision Transformers and ResNets, highlighting its practical applicability to modern deep learning architectures.
Implications
This work has significant implications for scalable Bayesian inference in deep learning. By enabling efficient posterior inference for large-scale models, the method can improve uncertainty quantification, robustness, and generalization in applications such as computer vision, time-series forecasting, and other domains requiring large neural networks. The ability to handle richer variational density families also opens up possibilities for more accurate and flexible Bayesian modeling.
View on arXiv

Layer-wise LoRA fine-tuning: a similarity metric approach

Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao
  • The paper introduces a systematic method to select transformer layers for fine-tuning based on their contribution to changes in internal representations.
  • The proposed method reduces trainable parameters in LoRA-based fine-tuning by up to 50% while maintaining predictive performance.
  • The approach is orthogonal to existing LoRA techniques and can be easily integrated with them to enhance computational efficiency.
  • The method is validated on encoder-only, decoder-only, and multimodal models, showing negligible or no performance degradation on benchmarks like GLUE and GSM8K.
  • The similarity metric used to measure layer importance is based on the difference between input and output representations of each layer.
Abstract
This paper addresses the computational inefficiencies of fine-tuning large language models (LLMs) by proposing a novel method to systematically select specific transformer layers for fine-tuning using Low-Rank Adaptation (LoRA). The authors argue that not all layers contribute equally to model adaptation and introduce a similarity metric to measure the importance of each layer based on its contribution to changes in internal representations. By fine-tuning only the most relevant layers, the proposed method reduces the number of trainable parameters by up to 50% compared to standard LoRA fine-tuning, while maintaining or even improving predictive performance across various tasks and architectures. The approach is compatible with existing LoRA-based techniques and is validated on encoder-only, decoder-only, and multimodal models, demonstrating competitive results on benchmarks such as GLUE, GSM8K, and coding tasks.
Methodology
The authors propose a similarity metric to evaluate the importance of transformer layers by measuring the difference between input and output representations of each layer. Layers with lower similarity are deemed more important for task-specific adaptation. Using this metric, they systematically select a subset of layers for fine-tuning with LoRA modules. This approach is applied to various architectures, including encoder-only, decoder-only, and multimodal models, and is compatible with existing LoRA techniques.
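One simple way to instantiate the layer-scoring idea, assuming a Hugging Face-style model that returns all hidden states; cosine similarity is used here as a stand-in for the paper's similarity metric, and keeping the top half of the layers mirrors the reported parameter reduction rather than a prescription from the paper.

```python
import torch

@torch.no_grad()
def layer_importance(hidden_states: list[torch.Tensor]) -> torch.Tensor:
    """Score each layer by how little it preserves its input representation.

    `hidden_states` is the list returned by a transformer with
    output_hidden_states=True (length n_layers + 1, each (batch, seq, dim)).
    Lower input-output similarity means the layer changes representations
    more and is ranked as more important for fine-tuning.
    """
    scores = []
    for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:]):
        sim = torch.nn.functional.cosine_similarity(
            h_in.flatten(1), h_out.flatten(1), dim=-1).mean()
        scores.append(1.0 - sim)          # importance = 1 - similarity
    return torch.stack(scores)

# Attach LoRA adapters only to the most important half of the layers.
importance = layer_importance([torch.randn(2, 16, 64) for _ in range(13)])
selected = importance.topk(k=len(importance) // 2).indices.tolist()
```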
Results
The proposed method achieves up to a 50% reduction in trainable parameters compared to standard LoRA fine-tuning, with minimal or no loss in predictive performance. On encoder-only architectures, the method shows negligible performance drops on the GLUE benchmark. For decoder-only architectures, it achieves small drops or even improvements in tasks like mathematical problem-solving and coding. The method also demonstrates competitive results on multimodal models, maintaining performance while significantly reducing computational costs.
Implications
This work has significant implications for the scalability and accessibility of fine-tuning large language models, particularly in resource-constrained environments. By reducing the computational burden of fine-tuning, the proposed method enables more researchers and organizations to adapt LLMs for specific tasks. Additionally, the approach could inspire further research into layer-wise optimization techniques for other parameter-efficient fine-tuning methods.
View on arXiv

Optimism Stabilizes Thompson Sampling for Adaptive Inference

Shunxing Yan, Han Zhong
  • The paper identifies optimism as a unifying principle to stabilize Thompson Sampling for adaptive inference in multi-armed bandits.
  • Two optimistic modifications to TS are proposed: variance inflation and mean bonus, both of which ensure stability for K-armed Gaussian bandits (K ≥ 2).
  • The stability results enable asymptotically valid inference, including confidence intervals and hypothesis tests, even under adaptive data collection.
  • The authors resolve an open problem by extending stability guarantees from two-armed to general K-armed bandits, including cases with multiple optimal arms.
  • The proposed methods achieve stability with only a mild additional regret cost, maintaining efficiency in exploration-exploitation trade-offs.
Abstract
This paper investigates the stability of Thompson Sampling (TS) in the context of adaptive inference for multi-armed bandits, particularly focusing on the challenges posed by adaptive data collection. The authors identify optimism as a key mechanism to restore stability in TS, which is essential for enabling valid asymptotic inference. They propose two optimistic modifications to TS: variance inflation and mean bonus. Both approaches ensure stability by concentrating arm-specific sample sizes around deterministic scales, even in challenging regimes with multiple optimal arms. The paper extends prior work by proving stability guarantees for general K-armed Gaussian bandits (K ≥ 2), resolving an open question about TS stability in multi-arm settings. The authors also demonstrate that their methods incur only a mild additional regret cost while enabling asymptotically valid inference, such as confidence intervals and hypothesis testing, under adaptive data collection.
Methodology
The authors analyze the stability of Thompson Sampling in K-armed Gaussian bandits by introducing two optimistic modifications: (1) variance-inflated TS, which increases the posterior sampling variance, and (2) mean-bonus TS, which adds a positive bonus to the posterior mean. They derive theoretical guarantees for stability by showing that arm-specific sample sizes concentrate around deterministic scales, enabling asymptotic normality of studentized sample means. The analysis includes rigorous proofs and new techniques for stability analysis in multi-armed bandits.
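A minimal sketch of the two optimistic variants on a K-armed Gaussian bandit follows, assuming constant inflation and bonus factors; the paper's exact schedules and priors are not reproduced here.

```python
import numpy as np

def optimistic_ts(means, T=10_000, sigma=1.0, inflate=2.0, bonus=0.0, seed=0):
    """Gaussian Thompson Sampling with two optimistic tweaks (illustrative).

    inflate > 1 widens the posterior sampling variance ("variance inflation");
    bonus > 0 adds a positive shift to the posterior mean ("mean bonus",
    scaled here by the posterior standard deviation as one plausible choice).
    """
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    for t in range(T):
        post_mean = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
        post_var = sigma**2 / np.maximum(counts, 1)
        samples = rng.normal(post_mean + bonus * np.sqrt(post_var),
                             np.sqrt(inflate * post_var))
        arm = t if t < K else int(np.argmax(samples))  # play each arm once first
        reward = rng.normal(means[arm], sigma)
        counts[arm] += 1
        sums[arm] += reward
    return counts  # stability ~ counts concentrating around deterministic scales
```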
Results
The paper proves that both variance-inflated TS and mean-bonus TS satisfy stability for K-armed Gaussian bandits, including cases with multiple optimal arms. The stability guarantees enable asymptotically valid inference, such as constructing Wald-type confidence intervals. The proposed methods achieve these results with only a mild additional regret cost, maintaining the efficiency of Thompson Sampling for exploration and exploitation.
Implications
The findings have significant implications for adaptive experimentation and online A/B testing, where valid statistical inference is critical despite adaptive data collection. The proposed optimistic TS variants can be applied to improve the reliability of inference in real-world applications, such as clinical trials, online recommendation systems, and dynamic pricing, while preserving the efficiency of decision-making.
View on arXiv

Orthogonal Self-Attention

Leo Zhang, James Martens
  • Orthogonal Self-Attention (OSA) is proposed to address the instability of Softmax Self-Attention (SSA) in skipless Transformers.
  • OSA enforces orthogonality in the attention matrix using the matrix exponential of skew-symmetric matrices derived from query-key values.
  • The computational complexity of OSA scales linearly with sequence length by leveraging the low-rank structure of the query-key matrices.
  • OSA avoids rank collapse and ensures well-conditioned Jacobians, enabling stable training without skip connections or normalization layers.
  • OSA is particularly suited for non-causal decoder-based Transformers, such as Vision Transformers (ViTs) and Diffusion Transformers (DiTs).
Read More
Abstract
This paper introduces Orthogonal Self-Attention (OSA), a novel attention mechanism designed to address the instability issues of Softmax Self-Attention (SSA) in skipless Transformer architectures. SSA, a core component of Transformers, suffers from rank collapse and poorly-conditioned Jacobians when skip connections and normalization layers are removed, which hinders stable training and representation learning. OSA mitigates these issues by enforcing orthogonality in the attention matrix, achieved by mapping skew-symmetric matrices (derived from query-key values) through the matrix exponential. The authors propose an efficient implementation of OSA that exploits the low-rank structure of the query-key matrices, reducing computational complexity to scale linearly with sequence length. They also derive an initialization scheme that ensures the Jacobian of OSA is well-conditioned, enabling stable training. The paper demonstrates that OSA preserves the rank of token representations, avoiding the rank collapse problem of SSA. While OSA is limited to non-causal decoder-based Transformers, such as Vision Transformers (ViTs) and Diffusion Transformers (DiTs), it provides a promising alternative for training deep models without skip connections or normalization layers.
Methodology
OSA is implemented by parametrizing the attention matrix as an orthogonal matrix using the matrix exponential of skew-symmetric matrices derived from query-key values. The authors propose an efficient computation of the matrix exponential by exploiting the low-rank structure of the query-key matrices, reducing computational complexity to O(N). They also derive an initialization scheme to ensure the Jacobian of OSA is well-conditioned, facilitating stable training. Theoretical analysis is conducted to demonstrate that OSA avoids rank collapse and preserves the rank and eigenvalues of token representations.
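A naive O(N^3) sketch of the core construction is shown below, assuming single-head inputs and using PyTorch's batched matrix exponential; the paper's linear-time low-rank evaluation and initialization scheme are omitted.

```python
import torch

def orthogonal_self_attention(q, k, v):
    """Naive sketch of Orthogonal Self-Attention (OSA).

    q, k, v: [batch, seq, dim]. The attention matrix is the matrix exponential
    of a skew-symmetric matrix built from query-key scores, so it is exactly
    orthogonal. The low-rank O(N) computation from the paper is not shown.
    """
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # [batch, seq, seq]
    skew = scores - scores.transpose(-2, -1)                 # S = -S^T
    attn = torch.matrix_exp(skew)                            # exp(S) is orthogonal
    return attn @ v
```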
Results
Theoretical analysis shows that OSA avoids the rank collapse issue associated with SSA and ensures well-conditioned Jacobians, enabling stable training of skipless Transformers. The computational complexity of OSA is reduced to linear scaling with sequence length, making it more efficient than SSA. The authors also provide mathematical proofs to support their claims about the stability and efficiency of OSA.
Implications
OSA provides a pathway for training deep Transformer architectures without relying on skip connections or normalization layers, which could lead to improved representation learning and more efficient models. Its application to non-causal decoder-based Transformers, such as ViTs and DiTs, could enhance their performance and stability in tasks like image recognition and generative modeling.
View on arXiv

SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel

Jose Miguel Luna, Taha Bouhsine, Krzysztof Choromanski
  • SLAY introduces a geometry-aware attention mechanism based on the Yat-kernel, inspired by inverse-square physics interactions.
  • The method enforces unit-norm constraints on queries and keys, decoupling alignment from distance.
  • SLAY achieves linear O(L) time complexity by reformulating the Yat-kernel using Bernstein's Theorem and approximating it with positive random features.
  • Empirical evaluations show SLAY matches softmax attention in performance while outperforming other linear-time attention mechanisms.
  • SLAY enables scalable Transformers without the typical trade-offs associated with attention linearization.
Read More
Abstract
This paper introduces SLAY (Spherical Linearized Attention with Yat-Kernels), a novel linear-time attention mechanism that leverages the geometry-aware Yat-kernel, inspired by inverse-square interactions in physics. SLAY addresses the computational inefficiencies of the Yat-kernel by constraining queries and keys to the unit sphere, ensuring attention depends solely on angular alignment. Using Bernstein's Theorem, the authors reformulate the spherical Yat-kernel as a nonnegative mixture of polynomial-exponential product kernels, enabling a strictly positive random-feature approximation. This allows SLAY to achieve linear O(L) time complexity while maintaining the expressive power of softmax attention. Empirical results demonstrate that SLAY performs comparably to standard softmax attention and outperforms prior linear-time mechanisms like Performers and Cosformers, making it a scalable and efficient alternative for long-context Transformers.
Methodology
The authors reformulate the Yat-kernel using Bernstein's Theorem to express it as a nonnegative mixture of polynomial-exponential product kernels. They then approximate this kernel using strictly positive Tensor Product Random Features. By constraining queries and keys to the unit sphere, SLAY ensures geometry-aware attention that operates in linear time. The approach is validated through theoretical analysis and empirical benchmarks on language and vision tasks.
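For intuition, the sketch below shows the general recipe SLAY follows — unit-normalize queries and keys, map them through strictly positive random features, and exploit associativity of (φ(Q)φ(K)ᵀ)V for O(L) cost — using Performer-style softmax-kernel features as a stand-in; the actual Yat-kernel feature map is not reproduced here.

```python
import torch
import torch.nn.functional as F

def positive_feature_linear_attention(q, k, v, num_feats=64, seed=0):
    """Generic positive random-feature linear attention on the unit sphere.

    Illustrative only: the feature map phi below approximates the softmax
    kernel, standing in for SLAY's Yat-kernel features.
    """
    g = torch.Generator().manual_seed(seed)
    d = q.shape[-1]
    q = F.normalize(q, dim=-1)                       # unit-norm queries
    k = F.normalize(k, dim=-1)                       # unit-norm keys
    w = torch.randn(num_feats, d, generator=g)       # random projections

    def phi(x):  # strictly positive features
        proj = x @ w.T
        return torch.exp(proj - x.pow(2).sum(-1, keepdim=True) / 2) / num_feats ** 0.5

    qf, kf = phi(q), phi(k)                                   # [batch, L, m]
    kv = torch.einsum('blm,bld->bmd', kf, v)                  # O(L m d)
    z = 1.0 / (torch.einsum('blm,bm->bl', qf, kf.sum(1)) + 1e-6)
    return torch.einsum('blm,bmd->bld', qf, kv) * z.unsqueeze(-1)
```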
Results
SLAY achieves performance nearly indistinguishable from standard softmax attention while maintaining linear time and memory scaling. It consistently outperforms prior linear-time attention mechanisms, such as Performers and Cosformers, in both quality and efficiency. Speed tests confirm its scalability for long-context tasks.
Implications
SLAY has significant implications for scaling Transformers to handle long-context sequences efficiently without sacrificing performance. Its geometry-aware design and linear-time complexity make it suitable for applications in natural language processing, computer vision, and other domains requiring efficient attention mechanisms for large-scale data.
View on arXiv

Stable but Wrong: When More Data Degrades Scientific Conclusions

Zhipeng Zhang, Kai Li
  • Inference can be stable but wrong: Standard procedures may confidently converge to incorrect conclusions under unobservable reliability drift.
  • Conventional diagnostic tools, such as residual statistics and goodness-of-fit measures, fail to detect this epistemic failure.
  • Accumulating more data in the presence of unobservable drift amplifies errors rather than correcting them, deepening commitment to incorrect conclusions.
  • The findings reveal a fundamental limitation of data-driven inference, independent of model complexity or algorithm sophistication.
  • Scientific inference must be governed by explicit constraints on observational integrity, rather than relying solely on data availability and internal diagnostics.
Read More
Abstract
This paper challenges the widely held assumption that accumulating more data always improves the reliability of scientific conclusions. The authors identify a structural regime where standard inference procedures, despite being stable, well-calibrated, and passing diagnostic checks, systematically converge to incorrect conclusions due to unobservable reliability drift in the data. This drift, which arises from gradual and latent changes in observational reliability (e.g., instrument degradation or environmental shifts), cannot be detected by conventional diagnostics or corrected by simply collecting more data. Using theoretical analysis and synthetic experiments, the authors demonstrate that additional data in such scenarios amplifies errors rather than correcting them, creating an 'epistemic trap' where confidence in incorrect conclusions grows irreversibly. The paper argues for a paradigm shift in data-driven science, emphasizing the need for explicit, externally validated constraints on the integrity of the observational process to ensure epistemic validity.
Methodology
The authors use a combination of theoretical analysis and minimal synthetic experiments to study the effects of unobservable reliability drift on inference. They formalize a mathematical model where observational data is subject to a slowly varying, latent bias that cannot be detected within any finite observation window. They then simulate this scenario to demonstrate how standard inference procedures behave under such conditions, analyzing the stability, convergence, and diagnostic signals of the resulting estimates.
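A minimal sketch of the failure mode is given below, assuming a toy setting where a slowly growing latent bias contaminates observations of a fixed quantity; the parameters are illustrative and not the paper's experimental settings.

```python
import numpy as np

def drift_demo(T=20_000, true_mean=0.0, drift_rate=1e-4, seed=0):
    """Minimal illustration of 'stable but wrong' inference under latent drift.

    Observations of a fixed quantity pick up a slowly growing, unobservable
    bias. The running estimate keeps converging (its standard error shrinks
    like 1/sqrt(n)) while drifting away from the truth, so more data only
    hardens an incorrect conclusion.
    """
    rng = np.random.default_rng(seed)
    bias = drift_rate * np.arange(T)              # latent reliability drift
    obs = true_mean + bias + rng.normal(0, 1, T)  # what is actually recorded
    n = np.arange(1, T + 1)
    running_mean = np.cumsum(obs) / n
    std_err = 1.0 / np.sqrt(n)                    # nominal uncertainty
    return running_mean[-1], std_err[-1]          # confident, yet biased

print(drift_demo())  # estimate near 1.0 with a tiny reported error; truth is 0.0
```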
Results
The study establishes that under unobservable reliability drift, standard inference procedures converge confidently to biased estimates, with posterior uncertainty contracting over time. Conventional diagnostic signals, such as residuals and goodness-of-fit measures, remain normal and fail to detect the underlying bias. Furthermore, the accumulation of additional data exacerbates the error rather than correcting it, reinforcing confidence in incorrect conclusions. These findings highlight a structural epistemic trap where stability, convergence, and confidence become misleading indicators of validity.
Implications
The paper challenges the assumption that more data inherently improves scientific inference, emphasizing the need for external validation of observational integrity. This has significant implications for fields relying on large-scale observational data, such as climate science, medicine, and astronomy. It calls for a shift in scientific practice, advocating for inference to be treated as a governed activity with explicit safeguards against unobservable biases. The findings also underscore the importance of developing new methodologies and diagnostic tools to detect and address reliability drift in data-driven research.
View on arXiv

TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference

Jiyoung Park, Hankyu Jang, Changseok Song, Wookeun Jung
  • TIDE introduces a serving-engine-native framework for adaptive speculative decoding, maintaining draft–target alignment under dynamic workloads.
  • The framework enables zero-overhead training data generation by reusing hidden states computed during inference, eliminating the need to reload or recompute the target model.
  • Adaptive runtime control mechanisms dynamically decide when to activate speculative decoding and training, optimizing resource usage.
  • TIDE supports heterogeneous GPU utilization by decoupling inference and training, improving system efficiency.
  • The prototype implementation of TIDE achieves up to 1.15× throughput improvement and reduces draft training time by 1.67× compared to existing methods.
Read More
Abstract
This paper introduces TIDE (Temporal Incremental Draft Engine), a novel framework designed to improve the efficiency of large language model (LLM) inference through adaptive speculative decoding. Speculative decoding accelerates LLM inference by using a smaller draft model to propose multiple tokens, which are then verified by a larger target model. However, its effectiveness is highly dependent on the alignment between the draft and target models, which can degrade under dynamic, non-stationary workloads. TIDE addresses this challenge by incrementally adapting the draft model based on recent inference behavior, leveraging short-term temporal locality in workloads. It achieves this without additional computational overhead by reusing hidden states generated during inference as training signals. TIDE also incorporates adaptive runtime control to determine when speculative decoding and training are beneficial, avoiding unnecessary overhead. Furthermore, it decouples inference and training processes, enabling efficient use of heterogeneous GPU clusters. The framework demonstrates significant throughput improvements and reduced training overhead across diverse real-world workloads.
Methodology
TIDE integrates speculative decoding with online draft model adaptation into a high-performance inference engine. It reuses hidden states from the target model during inference to generate training signals without additional overhead. Adaptive runtime control mechanisms are employed to activate speculative decoding and training only when beneficial. The framework also decouples inference and training processes, allowing them to run on different GPU types for efficient resource utilization.
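A high-level sketch of one serving iteration under these ideas is shown below; `draft`, `target`, `buffer`, and `controller` are hypothetical interfaces used for illustration, not TIDE's actual API.

```python
def serve_step(draft, target, prompt_state, k=4, buffer=None, controller=None):
    """Illustrative TIDE-style serving iteration.

    Key ideas: (1) the target model's hidden states from the verification pass
    are reused as free training signal for the draft model, and (2) a runtime
    controller decides whether speculation and training currently pay off.
    """
    if controller.speculation_helps():
        proposal = draft.propose(prompt_state, num_tokens=k)      # cheap draft tokens
        accepted, hidden = target.verify(prompt_state, proposal)  # one target pass
    else:
        accepted, hidden = target.decode(prompt_state, num_tokens=1)

    buffer.add(hidden, accepted)            # zero-overhead training data
    if controller.training_helps(buffer):
        draft.update(buffer.sample())       # may run asynchronously on spare GPUs
    return accepted
```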
Results
TIDE achieves up to 1.15× throughput improvement over static speculative decoding methods and reduces draft training time by 1.67× compared to approaches that recompute training signals. The framework demonstrates consistent performance improvements across diverse real-world workloads, showcasing its ability to adapt to dynamic inference demands.
Implications
TIDE has significant implications for deploying LLMs in production environments, particularly in scenarios with dynamic and evolving workloads. By improving inference efficiency and reducing computational overhead, TIDE can enable faster and more cost-effective deployment of LLMs for tasks such as natural language processing, code generation, and other reasoning-intensive applications. Its ability to leverage heterogeneous hardware resources also makes it a practical solution for large-scale, resource-constrained systems.
View on arXiv

Unbiased Single-Queried Gradient for Combinatorial Objective

Thanawat Sornwanee
  • Introduces the Easy Stochastic Gradient (ESG) algorithm for unbiased gradient estimation in combinatorial optimization problems.
  • ESG requires only a single query to the black-box oracle per Monte Carlo sample, significantly reducing computational costs.
  • The method uses a product-Bernoulli relaxation to convert discrete optimization problems into smooth, continuous ones.
  • ESG generalizes existing methods like REINFORCE and provides a new class of stochastic gradient estimators.
  • The proposed approach is well-suited for high-dimensional problems where oracle queries are expensive.
Read More
Abstract
This paper addresses the challenge of gradient estimation in combinatorial optimization problems where the objective function is defined over binary variables and can only be accessed via a black-box oracle. Traditional methods for estimating gradients in such settings often require multiple queries or suffer from high variance, making them computationally expensive or inefficient. The author proposes a novel stochastic gradient estimation method called the Easy Stochastic Gradient (ESG) algorithm, which provides an unbiased gradient estimate using only a single query to the oracle. The method leverages a product-Bernoulli relaxation to transform the discrete optimization problem into a continuous one, enabling gradient-based optimization. ESG incorporates pathwise differentiation and importance sampling to ensure unbiased gradient estimation while maintaining computational efficiency. The paper demonstrates that ESG generalizes existing methods like REINFORCE and introduces new stochastic gradient estimators. The proposed method is particularly useful in scenarios where oracle queries are costly, such as preference optimization or high-dimensional integer programming.
Methodology
The paper reformulates combinatorial optimization problems using a product-Bernoulli relaxation, enabling the use of continuous optimization techniques. The Easy Stochastic Gradient (ESG) algorithm is introduced, which constructs a single-query stochastic gradient estimator using pathwise differentiation and importance sampling. The algorithm ensures unbiased gradient estimation by leveraging autodifferentiation and a carefully designed computational graph.
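For orientation, the sketch below shows the REINFORCE special case that ESG generalizes, applied to a product-Bernoulli relaxation with one oracle query per Monte Carlo sample; ESG's pathwise and importance-sampling construction itself is not reproduced.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def single_query_grad(f, theta, rng):
    """One-query score-function gradient for E_{x ~ Bern(sigmoid(theta))}[f(x)]."""
    p = sigmoid(theta)
    x = (rng.random(theta.shape) < p).astype(float)   # one binary sample
    value = f(x)                                       # single oracle query
    score = x - p                                      # grad of log Bernoulli w.r.t. theta
    return value * score, value

# toy usage: maximize the number of ones in a 10-bit string
rng = np.random.default_rng(0)
theta = np.zeros(10)
for _ in range(2000):
    g, _ = single_query_grad(lambda x: x.sum(), theta, rng)
    theta += 0.1 * g
print(sigmoid(theta).round(2))  # probabilities pushed toward 1
```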
Results
The ESG algorithm is shown to produce unbiased gradient estimates with only one oracle query per Monte Carlo sample. It also provides an unbiased estimate of the objective function value itself. The paper demonstrates that ESG generalizes existing methods like REINFORCE and achieves computational efficiency in scenarios where traditional methods are prohibitive.
Implications
The proposed ESG algorithm has significant implications for combinatorial optimization problems in fields such as integer programming, preference optimization, and machine learning. By reducing the computational cost of gradient estimation, ESG enables scalable optimization in high-dimensional settings where oracle queries are expensive or limited.
View on arXiv

Visualizing the loss landscapes of physics-informed neural networks

Conor Rowan, Finn Murphy-Blanchard
  • Loss landscapes of physics-informed neural networks share properties with those of data-driven machine learning tasks, including smoothness and convexity near solutions.
  • The Deep Ritz method and squared residual loss formulations yield similar loss landscapes, contrary to expectations.
  • Loss landscape visualization techniques can provide valuable insights into the optimization dynamics of physics-informed machine learning models.
Read More
Abstract
This paper investigates the loss landscapes of physics-informed neural networks (PINNs), which are trained using loss functions derived from differential operators rather than large datasets. The authors provide a comprehensive review of existing literature on loss landscape visualization, primarily focused on data-driven machine learning tasks like image classification. They extend these techniques to the domain of scientific machine learning, specifically comparing two formulations of physics-based loss functions: the Deep Ritz method and the squared residual form. Through empirical analysis, the study reveals that the loss landscapes of PINNs exhibit properties similar to those observed in traditional machine learning tasks, such as smoothness, well-conditioning, and convexity near solutions. Unexpectedly, the two physics-informed loss formulations often produce comparable landscapes, challenging assumptions about the complexity of PINN optimization. The work aims to introduce loss landscape visualization techniques to the scientific machine learning community and provide insights into the optimization dynamics of PINNs.
Methodology
The authors conducted a literature review of loss landscape studies and applied visualization techniques to analyze the loss landscapes of PINNs. They empirically compared two physics-based loss formulations (Deep Ritz and squared residual) using methods such as monotonic linear interpolation, Hessian eigenvalue analysis, and exploration of solution manifolds. They also examined intrinsic dimensionality and optimization trajectories to characterize the landscapes.
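As a minimal sketch of the visualization idea, the function below evaluates a 1-D loss slice along a random direction through a flat parameter vector; filter normalization, 2-D slices, and the Hessian eigenvalue analysis used in the paper are omitted.

```python
import numpy as np

def loss_slice(loss_fn, params, num_points=41, scale=1.0, seed=0):
    """1-D loss-landscape slice along a random direction (illustrative).

    params is a flat numpy vector of network weights and loss_fn maps such a
    vector to a scalar physics-informed loss (e.g. a squared PDE residual or
    a Deep Ritz energy).
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=params.shape)
    direction *= np.linalg.norm(params) / np.linalg.norm(direction)  # match scale
    alphas = np.linspace(-scale, scale, num_points)
    return alphas, np.array([loss_fn(params + a * direction) for a in alphas])
```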
Results
The study found that the loss landscapes of PINNs are smooth, well-conditioned, and convex near solutions, similar to data-driven machine learning tasks. Both the Deep Ritz and squared residual loss formulations produced comparable landscapes, suggesting that the choice of formulation may not significantly impact optimization dynamics. Additionally, the authors observed no evidence of problematic local minima in the loss landscapes.
Implications
The findings challenge the assumption that physics-informed neural networks have inherently complex loss landscapes, suggesting that optimization may be more straightforward than previously thought. Loss landscape visualization techniques can be leveraged to better understand and improve the training of PINNs, potentially enhancing their performance in solving differential equations and modeling physical systems. The study also bridges the gap between traditional machine learning and scientific machine learning, encouraging further exploration of optimization dynamics in physics-informed models.
View on arXiv

ZeroS: Zero-Sum Linear Attention for Efficient Transformers

Jiecheng Lu, Xu Han, Yan Sun, Viresh Pati, Yubin Kim, Siddhartha Somani, Shihao Yang
  • ZeroS eliminates the uniform zero-order term in softmax attention, addressing the bias and enabling sharper attention distributions.
  • The mechanism supports both positive and negative weights, allowing for contrastive operations within a single attention layer.
  • ZeroS achieves linear complexity (O(N)) while theoretically expanding the expressivity of attention mechanisms beyond convex combinations.
  • Empirical evaluations show ZeroS matches or exceeds the performance of standard softmax attention across multiple benchmarks.
  • The proposed method is mathematically stable and scalable for long-context scenarios.
Read More
Abstract
This paper introduces Zero-Sum Linear Attention (ZeroS), a novel linear attention mechanism designed to address key limitations of existing linear attention methods in Transformers. Linear attention methods reduce computational complexity from O(N^2) to O(N), but often underperform compared to standard softmax attention due to two fundamental issues: the restriction to convex combinations, which limits expressivity to additive information blending, and uniform weight bias, which dilutes attention in long contexts. ZeroS resolves these issues by removing the constant zero-order term in softmax attention and reweighting residuals to enable zero-sum weights. This allows for both positive and negative values, enabling contrastive operations within a single attention layer. ZeroS maintains linear complexity while expanding the set of representable functions and achieving performance comparable to or better than standard softmax attention across various sequence modeling benchmarks.
Methodology
ZeroS modifies the softmax attention mechanism by subtracting the uniform zero-order term, creating zero-sum weights that support signed values. It employs radial-angular decoupling to separate magnitude and direction, reintroducing directional effects through signed cos θ terms. The implementation uses separable logits and gating combined with linearizable angular computations via prefix sums, ensuring O(Nd^2) runtime and O(d^2) memory.
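A quadratic-time conceptual sketch of the zero-sum idea follows; the residual reweighting shown is one illustrative choice, and the paper's radial-angular decoupling and O(Nd^2) linearization are omitted.

```python
import torch

def zero_sum_attention(q, k, v):
    """Conceptual sketch of zero-sum attention weights.

    Subtracting the uniform 1/N term from each row of the softmax attention
    matrix yields weights that sum to zero, so a token mixes signed
    (contrastive) combinations of values rather than a purely convex blend.
    """
    n = q.shape[-2]
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    zero_sum = attn - 1.0 / n                 # each row now sums to zero
    return v + zero_sum @ v                   # signed mixing on top of a residual
```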
Results
ZeroS demonstrates superior performance compared to existing linear attention methods and matches or exceeds standard softmax attention on various sequence modeling benchmarks. It maintains linear time complexity while achieving sharper attention distributions and greater expressivity.
Implications
ZeroS has significant implications for efficient Transformer architectures, particularly in applications requiring long-context modeling such as natural language processing, vision, and speech tasks. Its ability to perform contrastive operations and maintain computational efficiency makes it suitable for large-scale systems and real-time applications.
View on arXiv