gistml

By James Asher

Daily summaries of the latest Machine Learning research papers from Arxiv.

2026-02-03 • Found 24 papers

Bayesian Integration of Nonlinear Incomplete Clinical Data

Lucía González-Zamorano, Nuria Balbás-Esteban, Vanessa Gómez-Verdejo, Albert Belenguer-Llorens, Carlos Sevilla-Salcedo
  • BIONIC is a Bayesian framework for integrating multimodal clinical data with structured missingness.
  • It uses pretrained embeddings for unstructured modalities and directly incorporates structured clinical variables.
  • The model explicitly handles missingness at the variable, modality, and label levels, enabling robust imputation and semi-supervised learning.
  • Sparsity-inducing priors adaptively reduce the dimensionality of embeddings, improving performance in small-cohort settings.
  • BIONIC provides intrinsic interpretability, enabling population-level analysis of modality relevance and clinically meaningful insights.
Abstract
This paper introduces BIONIC (Bayesian Integration of Nonlinear Incomplete Clinical data), a probabilistic framework designed to address the challenges of multimodal clinical data integration under structured missingness. Clinical datasets often feature high dimensionality, heterogeneous modalities (e.g., medical images, clinical text, structured records), and non-random missing data due to real-world constraints. BIONIC leverages pretrained embeddings for complex modalities and integrates them with structured clinical variables through a Bayesian generative-discriminative architecture. The framework explicitly models missingness at the variable, modality, and label levels, enabling robust learning in partially observed and semi-supervised settings. BIONIC also incorporates sparsity-inducing priors to adaptively reduce the dimensionality of embeddings, improving performance in limited-cohort scenarios. The model provides intrinsic interpretability by propagating relevance from the latent space to input features, supporting clinically meaningful insights. Experiments on three multimodal clinical datasets demonstrate that BIONIC outperforms baseline methods in predictive accuracy, particularly under incomplete data conditions, while also offering principled uncertainty quantification and interpretability.
Methodology
BIONIC employs a Bayesian generative-discriminative architecture that integrates heterogeneous multimodal data using pretrained embeddings for complex modalities (e.g., medical images, clinical text) and structured clinical variables. Missingness is explicitly modeled at multiple levels (variable, modality, label) within a shared latent space, enabling probabilistic imputation and semi-supervised learning. Sparsity-inducing priors are applied to adaptively reduce the dimensionality of embeddings, and interpretability is achieved through relevance propagation from the latent space to input features.
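To make the multi-level missingness handling concrete, the sketch below masks unobserved variables, modalities, and labels out of a joint reconstruction-plus-classification loss. It is a generic PyTorch illustration under assumed tensor layouts (per-modality dictionaries, a dummy class index for unlabeled rows), not BIONIC's Bayesian formulation; the sparsity priors and relevance propagation are omitted.

import torch
import torch.nn.functional as F

def masked_multimodal_loss(recons, targets, masks, logits, labels, label_mask):
    # recons/targets/masks are dicts keyed by modality. Each mask flags observed
    # entries (1) vs missing ones (0): zeros inside a mask cover variable-level gaps,
    # and an all-zero mask drops that modality for a sample (modality-level missingness).
    loss = 0.0
    for m in recons:
        err = (recons[m] - torch.nan_to_num(targets[m])) ** 2  # missing targets may be NaN-filled
        loss = loss + (err * masks[m]).sum() / masks[m].sum().clamp(min=1)
    # Label-level missingness: unlabeled samples carry label_mask = 0 and a dummy class
    # index, so only labeled rows contribute (semi-supervised setting).
    ce = F.cross_entropy(logits, labels, reduction="none")
    return loss + (ce * label_mask).sum() / label_mask.sum().clamp(min=1)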
Results
BIONIC was evaluated on three multimodal clinical and biomedical datasets, demonstrating superior predictive performance compared to baseline multimodal methods, particularly in scenarios with incomplete data. The framework also provided robust uncertainty quantification and intrinsic interpretability, enabling clinically meaningful insights into modality relevance.
Implications
BIONIC has significant potential for real-world clinical applications, where data is often incomplete and heterogeneous. Its ability to handle structured missingness, provide robust predictions, and offer interpretable insights makes it a valuable tool for clinical decision-making, personalized medicine, and population-level analysis. Additionally, its Bayesian formulation and use of pretrained embeddings make it adaptable to various clinical domains and data types.
View on arXiv

Bridging Time and Frequency: A Joint Modeling Framework for Irregular Multivariate Time Series Forecasting

Xiangfei Qiu, Kangjia Yan, Xvyuan Liu, Xingjian Wu, Jilin Hu
  • TFMixer is a unified framework that bridges time-domain and frequency-domain modeling for irregular multivariate time series forecasting.
  • The Global Frequency Module introduces a learnable Non-Uniform Discrete Fourier Transform (NUDFT) to handle irregular timestamps without interpolation.
  • The Local Time Module addresses information density imbalance using a query-based patch mixing mechanism to aggregate temporal features selectively.
  • TFMixer explicitly decouples global periodic structures from local temporal dynamics, enabling more accurate forecasting.
  • Extensive experiments show that TFMixer outperforms state-of-the-art methods on multiple real-world IMTS datasets.
Abstract
This paper introduces TFMixer, a novel framework for forecasting irregular multivariate time series (IMTS). IMTS data, characterized by non-uniform sampling and asynchronous variables, poses significant challenges for traditional time-series models. TFMixer addresses these challenges by combining time-domain and frequency-domain modeling in a unified framework. The proposed model includes a Global Frequency Module, which employs a learnable Non-Uniform Discrete Fourier Transform (NUDFT) to extract spectral representations directly from irregular timestamps, and a Local Time Module, which uses a query-based patch mixing mechanism to handle unevenly distributed temporal observations. These modules operate in parallel, and their outputs are fused to generate forecasts. Additionally, TFMixer leverages inverse NUDFT for explicit seasonal extrapolation. Extensive experiments on real-world datasets demonstrate that TFMixer achieves state-of-the-art performance, outperforming existing methods in IMTS forecasting tasks.
Methodology
TFMixer combines two parallel modules: (1) the Global Frequency Module, which uses a learnable NUDFT to extract spectral representations from irregularly sampled data, and (2) the Local Time Module, which employs a query-based patch mixing mechanism to aggregate temporal features from unevenly distributed observations. The outputs of these modules are fused in the Output Module to generate forecasts, with inverse NUDFT used for seasonal extrapolation. This design explicitly separates global and local temporal dependencies for improved modeling.
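As a concrete reference point for the frequency branch, the snippet below evaluates a non-uniform DFT at learnable frequencies directly from irregular timestamps, so no interpolation or resampling is required. This is a minimal sketch of the general NUDFT mechanism, not TFMixer's Global Frequency Module; the shapes, normalization, and frequency initialization are assumptions.

import torch
import torch.nn as nn

class LearnableNUDFT(nn.Module):
    def __init__(self, num_freqs):
        super().__init__()
        # trainable analysis frequencies (the initialization here is an arbitrary choice)
        self.freqs = nn.Parameter(torch.linspace(0.0, 1.0, num_freqs))

    def forward(self, values, timestamps):
        # values, timestamps: [batch, length] with irregular, per-sample timestamps
        phase = 2 * torch.pi * timestamps.unsqueeze(-1) * self.freqs   # [B, L, K]
        real = (values.unsqueeze(-1) * torch.cos(phase)).sum(dim=1)    # Re X_k
        imag = -(values.unsqueeze(-1) * torch.sin(phase)).sum(dim=1)   # Im X_k
        return torch.stack([real, imag], dim=-1)                       # [B, K, 2]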
Results
TFMixer achieves state-of-the-art performance on multiple real-world IMTS datasets, demonstrating superior accuracy compared to existing methods. The model effectively captures both global periodic structures and local temporal dynamics, addressing key challenges in irregular time-series forecasting.
Implications
TFMixer has significant implications for domains where irregular multivariate time series data is prevalent, such as healthcare, environmental monitoring, and finance. By enabling accurate forecasting under irregular sampling conditions, TFMixer can support proactive decision-making, improve system stability, and enhance predictive analytics in these critical applications.
View on arXiv

Certain Head, Uncertain Tail: Expert-Sample for Test-Time Scaling in Fine-Grained MoE

Yuanteng Chen, Peisong Wang, Nanxin Zeng, Yuantian Shao, Gang Li, Jing Liu, Jian Cheng
  • The authors identify a novel pattern in fine-grained MoE routing: a 'certain head' of high-confidence experts and an 'uncertain tail' of low-confidence experts.
  • They propose 'Expert-Sample,' a method that combines deterministic activation of high-confidence experts with stochastic sampling of low-confidence experts to enhance diversity and stability during test-time scaling.
  • Expert-Sample is a training-free, plug-and-play method that complements existing token-level sampling techniques.
  • The method improves pass@n and verification-based accuracy across multiple fine-grained MoE models and tasks, including math, knowledge reasoning, and code generation.
  • On Qwen3-30B-A3B-Instruct, Expert-Sample improves pass@32 from 85.4% to 91.9% and verification-based accuracy from 59.1% to 62.6%.
Abstract
This paper introduces 'Expert-Sample,' a novel test-time scaling method for fine-grained Mixture-of-Experts (MoE) models, which are characterized by hundreds of experts per layer and multi-expert activation per token. The authors identify a key pattern in MoE routing: a 'certain head' of high-confidence experts and an 'uncertain tail' of low-confidence experts. They propose leveraging this structure to improve performance during inference by deterministically activating high-confidence experts while introducing controlled stochasticity in the low-confidence tail. This approach enhances diversity in reasoning paths without sacrificing stability. The method is training-free, requires no architectural modifications, and complements existing token-level sampling techniques. Experimental results on tasks such as math, knowledge reasoning, and code generation demonstrate significant improvements in pass@n and verification-based accuracy across multiple fine-grained MoE models, including Qwen3-30B-A3B-Instruct.
Methodology
The authors conduct an empirical study of fine-grained MoE routing, identifying a pattern of high-confidence ('certain head') and low-confidence ('uncertain tail') experts. They propose the Expert-Sample method, which deterministically activates top-ranked experts while sampling from the low-confidence tail using temperature-scaled router logits. This method is applied at each layer during inference and preserves original gating weights for output aggregation. The approach is evaluated on multiple fine-grained MoE models and benchmarks, including Qwen3-30B-A3B-Instruct.
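The selection rule itself is simple enough to sketch. Below, the top-h experts (the 'certain head') are taken deterministically and the remaining k-h are drawn without replacement from the temperature-scaled tail, while aggregation weights come from the original router probabilities. This is a hedged illustration of the described mechanism, not the authors' code; the renormalization over selected experts is a common MoE convention assumed here.

import torch

def expert_sample(router_logits, k, h, tau=1.0):
    # router_logits: [tokens, num_experts]; select k experts per token, h of them deterministic
    probs = torch.softmax(router_logits, dim=-1)              # original gating weights
    head = torch.topk(router_logits, h, dim=-1).indices       # high-confidence 'certain head'
    tail_logits = (router_logits / tau).scatter(-1, head, float("-inf"))
    tail = torch.multinomial(torch.softmax(tail_logits, dim=-1), k - h)  # stochastic 'uncertain tail'
    experts = torch.cat([head, tail], dim=-1)
    weights = torch.gather(probs, -1, experts)
    return experts, weights / weights.sum(dim=-1, keepdim=True)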
Results
Expert-Sample consistently improves performance across tasks and models. For example, on the Qwen3-30B-A3B-Instruct model evaluated on the GPQA-Diamond dataset, pass@32 increased from 85.4% to 91.9%, and verification-based accuracy improved from 59.1% to 62.6%. The method also demonstrated enhanced structural diversity in reasoning paths, enabling the discovery of correct solutions that standard token-level sampling often misses.
Implications
The proposed Expert-Sample method provides a new dimension for test-time scaling in fine-grained MoE models, enabling more effective exploration of solution spaces without sacrificing output stability. This approach has potential applications in improving the performance of large language models on complex reasoning tasks, including mathematical problem-solving, knowledge reasoning, and code generation. Its training-free and plug-and-play nature makes it a practical addition to existing inference pipelines.
View on arXiv

DROGO: Default Representation Objective via Graph Optimization in Reinforcement Learning

Hon Tik Tse, Marlos C. Machado
  • The Default Representation (DR) is a reward-aware extension of the successor representation, capturing rewards obtained between states.
  • DROGO introduces a scalable objective to directly approximate the principal eigenvector of the DR using neural networks.
  • The method adapts the graph-drawing objective (GDO) in log-space and employs natural gradient optimization for stability.
  • DROGO is computationally efficient and scales to high-dimensional state representations, such as pixel-based inputs.
  • Empirical results show that DROGO effectively learns eigenvectors for reward shaping and other RL applications.
Abstract
This paper introduces DROGO (Default Representation Objective via Graph Optimization), a novel approach to efficiently compute the principal eigenvector of the Default Representation (DR) in reinforcement learning (RL). The DR is a reward-aware generalization of the successor representation (SR) and has been shown to be effective in applications such as reward shaping, exploration, and option discovery. However, prior methods for computing the DR's principal eigenvector rely on approximating the full DR matrix followed by eigendecomposition, which is computationally expensive and infeasible for high-dimensional environments. DROGO addresses this limitation by deriving an objective function that allows a neural network to directly approximate the principal eigenvector of the DR. The method leverages a reparameterized graph-drawing objective (GDO) in log-space and incorporates natural gradient optimization to stabilize training. Empirical evaluations demonstrate DROGO's robustness across various grid-world environments and its utility in reward shaping tasks.
Methodology
The authors derive a novel objective function based on the graph-drawing objective (GDO) to approximate the principal eigenvector of the DR. This objective is reparameterized in log-space to address numerical instabilities, and natural gradient optimization is used to account for the geometry of the log-space. The method replaces traditional quadratic norm constraints with an alternative constraint, providing theoretical justification for this choice. Neural networks are trained to optimize this objective, enabling the direct computation of the DR's principal eigenvector without requiring full matrix approximation or eigendecomposition.
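For context, the graph-drawing objective being adapted can be written, for a single eigenfunction f over states, roughly as

\[
\min_{f}\;\; \mathbb{E}_{(s,s')\sim\mathcal{P}}\!\left[(f(s)-f(s'))^{2}\right]
\quad\text{s.t.}\quad \mathbb{E}_{s\sim d}\!\left[f(s)^{2}\right]=1,
\]

where \(\mathcal{P}\) is a distribution over transitions (or graph edges) and \(d\) a state distribution, with the constraint typically enforced by a penalty in practice. The DR-specific weighting, the log-space reparameterization, the alternative constraint, and the natural-gradient update are the paper's contributions and are not captured by this generic form.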
Results
DROGO was evaluated in grid-world environments with different state representations, including coordinates and pixels. The method demonstrated robustness and scalability, successfully learning the principal eigenvector of the DR in all tested scenarios. Additionally, the learned eigenvectors were applied to reward shaping, showing improved performance compared to baseline methods. The results highlight DROGO's ability to handle high-dimensional state spaces and its practical utility in RL tasks.
Implications
DROGO provides a scalable and efficient approach to compute reward-aware representations in reinforcement learning, enabling its application in high-dimensional environments. This has implications for tasks such as reward shaping, exploration, option discovery, and transfer learning. By addressing the computational limitations of prior methods, DROGO opens up new possibilities for leveraging the Default Representation in complex RL settings.
View on arXiv

David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning

Samuel Nellessen, Tal Kachman
  • Introduces 'Tag-Along Attacks,' a novel threat model where adversaries manipulate safety-aligned agents through conversation to execute prohibited actions.
  • Proposes SLINGSHOT, a reinforcement learning framework that autonomously discovers attack strategies under strict black-box constraints.
  • Reveals that learned attacks converge to short, instruction-like patterns rather than multi-turn persuasion, making them efficient and transferable.
  • Demonstrates high attack success rates (e.g., 67.0% against Qwen2.5-32B-Instruct-AWQ and 56.0% against Gemini 2.5 Flash) across various agent families, including closed-source and defensively fine-tuned models.
  • Establishes the importance of verifiable, optimization-driven safety evaluation in agentic environments, leveraging objective ground-truth metrics.
Abstract
This paper introduces a novel threat model called 'Tag-Along Attacks,' where a tool-less adversary manipulates a safety-aligned autonomous agent (Operator) into executing prohibited actions through conversational inputs alone. The authors propose SLINGSHOT, a reinforcement learning (RL) framework designed to autonomously discover and exploit vulnerabilities in agentic systems. SLINGSHOT operates under strict black-box constraints, interacting with the Operator only through conversational exchanges without access to internal states or environment data. The study demonstrates that SLINGSHOT can effectively learn attack strategies that exploit gaps in safety training, achieving high success rates across multiple agent families, including closed-source and defensively fine-tuned models. The work highlights the unique vulnerabilities of agentic systems, particularly in multi-turn, tool-augmented environments, and emphasizes the need for robust safety measures in the face of such adversarial threats.
Methodology
The authors formalize the Tag-Along Attack threat model within a two-agent system, where a smaller adversary (SLINGSHOT) manipulates a larger, safety-aligned Operator. SLINGSHOT is trained using a 'cold-start' reinforcement learning framework to discover attack strategies autonomously. The framework operates under strict black-box constraints, with SLINGSHOT only interacting with the Operator via conversational inputs and observing its responses. Success is defined as the Operator executing actions it would normally refuse, without detecting manipulation. The approach is validated in a tool-integrated environment with sensitive data, and the learned attack strategies are tested across multiple agent families to assess transferability.
Results
SLINGSHOT achieved a 67.0% success rate against the Qwen2.5-32B-Instruct-AWQ Operator on extreme-difficulty tasks, significantly outperforming the 1.7% baseline. It reduced the expected attempts to first success from 52.3 to 1.3 on solved tasks. The framework also demonstrated strong zero-shot transferability, achieving a 56.0% success rate against the closed-source Gemini 2.5 Flash model and a 39.2% success rate against the defensively fine-tuned Meta-SecAlign-8B model. These results highlight the effectiveness of SLINGSHOT in discovering and exploiting vulnerabilities in agentic systems.
Implications
The study underscores the growing risks posed by adversarial attacks in autonomous agent systems, particularly as these systems become more integrated into sensitive, tool-augmented environments. The findings highlight the need for more robust safety training and evaluation methods that go beyond superficial alignment to address emergent vulnerabilities. The proposed SLINGSHOT framework also provides a valuable tool for automated red-teaming, enabling systematic discovery of adversarial inputs and contributing to the development of safer AI systems.
View on arXiv

Expanding the Capabilities of Reinforcement Learning via Text Feedback

Yuda Song, Lili Chen, Fahim Tajwar, Rémi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, Andrea Zanette
  • The paper formalizes RL from Text Feedback (RLTF), a framework that uses text feedback during training to improve single-turn test-time performance.
  • Two methods are proposed: RLTF-SD (Self Distillation) and RLTF-FM (Feedback Modeling), which internalize feedback to enhance learning.
  • Theoretical analysis supports the design choices of RLTF-SD and RLTF-FM, particularly in their ability to transform feedback into effective supervision.
  • Empirical evaluations on benchmarks like Reasoning Gym, MATH500, and WritingBench show significant performance improvements over traditional RL methods.
  • The approach highlights the potential of leveraging abundant, natural-language feedback to make RL more efficient and scalable.
Abstract
This paper introduces a novel framework called Reinforcement Learning from Text Feedback (RLTF), which leverages natural language feedback to improve reinforcement learning (RL) models. Traditional RL methods often rely on sparse scalar rewards, which provide limited information for learning, or on costly demonstrations for dense supervision. RLTF bridges this gap by using text feedback, which is richer than scalar rewards but less expensive than full demonstrations. The authors propose two methods to internalize text feedback during training: Self Distillation (RLTF-SD) and Feedback Modeling (RLTF-FM). RLTF-SD uses feedback-conditioned second-turn outputs as implicit demonstrations to improve single-turn policy performance, while RLTF-FM predicts feedback as an auxiliary task to enhance representation learning. The paper provides theoretical analysis for both methods and evaluates them on diverse benchmarks, including reasoning puzzles, competition math, and creative writing tasks. Results show that RLTF-SD and RLTF-FM outperform baseline methods, demonstrating the potential of text feedback to expand the capabilities of RL systems.
Methodology
The authors propose two methods for leveraging text feedback in RL: (1) RLTF-SD, which trains the single-turn policy by treating feedback-conditioned second-turn outputs as implicit demonstrations, and (2) RLTF-FM, which predicts feedback as an auxiliary objective to improve representation learning. Both methods are analyzed theoretically and evaluated empirically on diverse benchmarks. The experiments compare these methods against strong baselines using scalar rewards and text feedback.
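A minimal sketch of the RLTF-SD data flow is given below; policy_generate and feedback_fn are hypothetical caller-supplied callables, not interfaces from the paper, and any filtering or weighting of the resulting pairs is omitted.

def rltf_sd_pairs(policy_generate, feedback_fn, prompts):
    # Build (prompt, target) pairs in which the feedback-conditioned second turn
    # serves as an implicit demonstration for the single-turn policy.
    pairs = []
    for prompt in prompts:
        first = policy_generate(prompt)                               # single-turn attempt
        feedback = feedback_fn(prompt, first)                         # natural-language critique
        second = policy_generate(prompt, context=(first, feedback))   # revised, feedback-conditioned turn
        pairs.append((prompt, second))   # later used for supervised single-turn training
    return pairs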
Results
The proposed methods, RLTF-SD and RLTF-FM, consistently outperform baseline approaches across multiple benchmarks, including reasoning puzzles, competition math, and creative writing tasks. These methods demonstrate improved single-turn test-time performance by effectively internalizing feedback during training. The results highlight the scalability and efficiency of using text feedback as a rich supervision signal.
Implications
This work has significant implications for improving RL systems, particularly in settings where traditional scalar rewards are insufficient or demonstrations are costly. By leveraging abundant text feedback, RLTF can enhance the performance of large language models (LLMs) in tasks requiring reasoning, creativity, and problem-solving. The approach could be applied to chatbot refinement, automated tutoring systems, and other domains where human feedback is naturally available.
View on arXiv

From Perception to Action: Spatial AI Agents and World Models

Gloria Felicia, Nolan Bryant, Handi Putra, Ayaan Gazali, Eliel Lobo, Esteban Rojas
  • The paper introduces a three-axis taxonomy connecting agentic AI capabilities (memory, planning, tool use) with spatial intelligence tasks (navigation, manipulation, geospatial analysis) across spatial scales (micro, meso, macro).
  • Hierarchical memory systems are critical for long-horizon spatial tasks, enabling agents to accumulate and utilize experience effectively.
  • Integrating graph neural networks (GNNs) with large language models (LLMs) shows promise for structured spatial reasoning.
  • World models are essential for safe deployment of AI systems across different spatial scales, from centimeter-level manipulation to kilometer-scale urban planning.
  • The authors identify six grand challenges and advocate for unified evaluation frameworks to standardize cross-domain assessments in spatial AI.
Abstract
This paper addresses the gap between symbolic reasoning capabilities of large language models (LLMs) and the spatial intelligence required for embodied AI systems to perceive, reason, and act in physical environments. The authors propose a unified three-axis taxonomy that connects agentic AI capabilities (memory, planning, and tool use) with spatial intelligence tasks (navigation, manipulation, and geospatial analysis) across three spatial scales (micro, meso, and macro). By reviewing over 2,000 papers and citing 742 works, the authors identify key challenges and opportunities in integrating agentic reasoning with spatial tasks. They emphasize the importance of hierarchical memory systems for long-horizon spatial tasks, the integration of graph neural networks (GNNs) with LLMs for structured spatial reasoning, and the role of world models in ensuring safe deployment across spatial scales. The paper concludes with six grand challenges and calls for unified evaluation frameworks to standardize cross-domain assessments, aiming to advance spatially-aware autonomous systems in robotics, autonomous vehicles, and geospatial intelligence.
Methodology
The authors conducted an extensive literature review of over 2,000 papers, citing 742 works from top-tier venues. They developed a three-axis taxonomy to unify agentic AI capabilities, spatial intelligence tasks, and spatial scales. The taxonomy was used to analyze existing research and identify gaps, challenges, and promising directions for future work.
Results
The analysis revealed that hierarchical memory systems are crucial for long-horizon tasks, GNN-LLM integration is a promising approach for spatial reasoning, and world models are necessary for safe deployment across spatial scales. The taxonomy provides a structured framework to guide future research and system design in spatial AI.
Implications
The proposed taxonomy and findings have significant implications for advancing spatially-aware autonomous systems in robotics, autonomous vehicles, and geospatial intelligence. By bridging the gap between perception and action, the framework can guide the development of more robust, generalizable, and safe AI systems capable of operating in complex physical environments.
View on arXiv

Generalized Radius and Integrated Codebook Transforms for Differentiable Vector Quantization

Haochen You, Heng Zhang, Hongyang He, Yuqi Li, Baojing Liu
  • GRIT-VQ introduces a radius-based surrogate for hard vector quantization, enabling differentiable updates while maintaining hard assignments during inference.
  • An integrated transform mechanism ties codebook updates through shared parameters, promoting coordination and reducing code collapse.
  • Theoretical analysis establishes conditions for stable gradient flow and improved codebook utilization across diverse quantizers.
  • GRIT-VQ improves performance metrics such as reconstruction error, generative quality, and recommendation accuracy across benchmarks.
  • The framework addresses codebook under-utilization and training instability common in large-scale VQ systems.
Abstract
This paper introduces GRIT-VQ (Generalized Radius & Integrated Transform-Vector Quantization), a novel framework for differentiable vector quantization (VQ) that addresses the limitations of traditional VQ methods, such as non-differentiable hard nearest-neighbor assignments and codebook under-utilization. GRIT-VQ employs a generalized radius-based surrogate for hard assignments, enabling smooth and geometry-aware updates during backpropagation while preserving hard assignments in the forward pass. Additionally, it introduces an integrated transform mechanism that updates all codebook entries through shared parameters, promoting coordinated evolution and mitigating collapse into dominant codes. The authors provide theoretical analysis to ensure stable gradient flow and improved codebook utilization. Experiments across image reconstruction, image generation, and recommendation benchmarks demonstrate that GRIT-VQ consistently improves reconstruction error, generative quality, recommendation accuracy, and codebook usage compared to existing VQ methods.
Methodology
The authors propose GRIT-VQ, which replaces the standard straight-through estimator with a generalized radius-based surrogate for latent updates. This surrogate allows flexible control over update magnitude while preserving the quantization direction. Additionally, a data-agnostic integrated transform mechanism updates all codebook entries via shared low-dimensional parameters, ensuring coordinated evolution of the codebook. The framework is tested in autoencoding and tokenization architectures across multiple benchmarks.
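For orientation, the baseline that GRIT-VQ departs from is standard hard vector quantization with the straight-through estimator, sketched below. GRIT-VQ replaces the detached shortcut with a radius-controlled surrogate that preserves the quantization direction while scaling the update magnitude, and ties codebook updates through shared low-dimensional parameters; neither component is reproduced in this sketch.

import torch

def vq_straight_through(z, codebook):
    # z: [N, D] latents, codebook: [K, D] code vectors
    idx = torch.cdist(z, codebook).argmin(dim=-1)   # hard nearest-neighbour assignment
    z_q = codebook[idx]
    # Straight-through estimator: the forward pass uses z_q, the backward pass copies
    # gradients to z unchanged. This detached shortcut is the part GRIT-VQ replaces.
    return z + (z_q - z).detach(), idx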
Results
GRIT-VQ consistently outperforms existing VQ methods in image reconstruction, image generation, and recommendation tasks. It achieves lower reconstruction error, higher generative quality, and improved recommendation accuracy. Furthermore, it significantly increases codebook utilization, addressing the issue of collapse into dominant codes observed in traditional VQ approaches.
Implications
GRIT-VQ has potential applications in generative modeling, semantic tokenization, and recommendation systems, where discrete representations are crucial. Its ability to improve codebook utilization and training stability could benefit large-scale systems requiring efficient and robust vector quantization. Additionally, its geometry-aware updates and coordinated codebook evolution may inspire further advancements in differentiable quantization techniques.
View on arXiv

High-accuracy sampling for diffusion models and log-concave distributions

Fan Chen, Sinho Chewi, Constantinos Daskalakis, Alexander Rakhlin
  • Introduces the First-Order Rejection Sampling (FORS) algorithm, achieving polylog(1/δ) complexity for sampling tasks.
  • Demonstrates exponential improvement in sampling efficiency compared to prior methods, particularly for diffusion models and log-concave distributions.
  • Applies under minimal data assumptions, non-uniform Lipschitz conditions, and low intrinsic dimensionality, making the method broadly applicable.
  • Achieves δ-error with significantly fewer gradient evaluations, even under noisy score estimates.
  • Provides theoretical guarantees and improves upon existing results in diffusion-based generative modeling.
Abstract
This paper introduces a novel meta-algorithm, First-Order Rejection Sampling (FORS), which achieves high-accuracy sampling for diffusion models and log-concave distributions using only gradient (score) evaluations. The authors address a long-standing challenge in sampling: achieving polylogarithmic complexity in the target accuracy δ, which represents an exponential improvement over prior methods. The proposed approach is applicable under minimal assumptions on the data distribution and the score error, and it significantly reduces the computational complexity of sampling tasks. The paper demonstrates that FORS achieves δ-error in polylog(1/δ) steps for various settings, including minimal data assumptions, non-uniform Lipschitz conditions, and low intrinsic dimensionality of the data. These results provide a breakthrough in the theoretical understanding of diffusion-based generative models and log-concave sampling, offering new efficiency guarantees while maintaining robustness to score estimation errors.
Methodology
The authors propose the FORS algorithm, which simulates rejection sampling using only first-order (gradient) queries. The approach leverages L2-accurate score estimates to achieve high-accuracy guarantees. Theoretical analysis is conducted to derive complexity bounds under different assumptions, including minimal data assumptions, non-uniform Lipschitz conditions, and intrinsic dimensionality. The algorithm is designed to minimize the number of gradient evaluations while maintaining robustness to score estimation errors.
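As a reminder of the primitive being simulated, classical rejection sampling draws a proposal and accepts it with a likelihood-ratio probability:

\[
X \sim q, \qquad \Pr[\text{accept}\mid X] \;=\; \frac{p(X)}{M\,q(X)}, \qquad p(x) \le M\,q(x)\ \text{ for all } x,
\]

and accepted draws are exact samples from \(p\), which is what makes high-accuracy guarantees possible. How FORS carries out this comparison using only score (gradient) evaluations of the target is the paper's contribution and is not reproduced here.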
Results
The FORS algorithm achieves δ-error in polylog(1/δ) steps under various conditions: (1) O(d log²(M/δ) + log³(M/δ)) complexity under minimal data assumptions, where M is a function of the data's second moment; (2) O(√dL log^(3/2)(M/δ) + L log²(M/δ)) complexity under non-uniform Lipschitz conditions; and (3) O(d⋆ log²(M/δ) + log³(M/δ)) complexity for low-dimensional data distributions, where d⋆ is the intrinsic dimension. These results represent an exponential improvement over prior methods, which typically scale polynomially with 1/δ.
Implications
This work has significant implications for both theoretical and practical aspects of machine learning. The FORS algorithm provides a new foundation for efficient sampling in diffusion-based generative models, enabling faster and more accurate generation of high-quality samples. It also advances the field of log-concave sampling, with potential applications in Bayesian inference, optimization, and probabilistic modeling. The reduction in computational complexity could make high-accuracy sampling more accessible for large-scale and high-dimensional problems.
View on arXiv

Hyperbolic Graph Neural Networks Under the Microscope: The Role of Geometry-Task Alignment

Dionisia Naddeo, Jonas Linkerhägner, Nicola Toschi, Geri Skenderi, Veronica Lachi
  • HGNNs excel at recovering low-distortion representations compared to Euclidean models, but only when the task requires preserving the input graph's hyperbolic geometry.
  • The performance advantage of HGNNs depends on the alignment between the input graph's geometry and the downstream task, not solely on the graph's hyperbolicity.
  • Synthetic experiments and theoretical analyses reveal that increasing hidden dimensionality can reduce the gap between hyperbolic and Euclidean models in terms of distortion.
  • Link prediction tasks are more geometry-aligned with hyperbolic representations, while node classification tasks often do not benefit from hyperbolic geometry.
  • The study challenges the prevailing assumption that HGNNs are universally superior for hyperbolic graphs, emphasizing the importance of task-specific considerations.
Abstract
This paper investigates the effectiveness of Hyperbolic Graph Neural Networks (HGNNs) in graph representation learning by introducing the concept of geometry-task alignment. While HGNNs are often assumed to be advantageous for tree-like or hierarchical graphs due to their ability to embed such structures with low distortion, the authors argue that the alignment between the input graph's geometry and the downstream task is a critical factor. Through theoretical analysis, synthetic experiments, and evaluations on real-world datasets, the study demonstrates that HGNNs outperform Euclidean models only when the task explicitly benefits from preserving the hyperbolic metric structure. The paper highlights that the hyperbolicity of the input graph alone is insufficient to justify the use of HGNNs, as task-specific requirements play a significant role in determining their utility.
Methodology
The authors conducted controlled experiments on synthetic regression tasks to evaluate the ability of HGNNs to recover low-distortion representations. They also analyzed real-world datasets for link prediction and node classification tasks, jointly assessing predictive performance and embedding distortion. Theoretical analyses were used to support empirical findings, particularly regarding the relationship between hidden dimensionality and distortion.
Results
HGNNs consistently outperformed Euclidean models in tasks where preserving the hyperbolic metric structure was critical, such as link prediction. However, for tasks like node classification, where metric preservation was less relevant, the performance advantage of HGNNs diminished. The study also showed that increasing hidden dimensionality in Euclidean models could reduce the performance gap with HGNNs in terms of distortion.
Implications
The findings suggest that the choice of geometry for graph representation learning should be guided not only by the structural properties of the input graph but also by the specific requirements of the downstream task. This has implications for the design and application of graph neural networks in areas such as knowledge graphs, biological networks, and social hierarchies, where hyperbolic structures are common. The study also encourages a more nuanced evaluation of HGNNs in future research, moving beyond the assumption that hyperbolic graphs automatically warrant hyperbolic models.
View on arXiv

IRIS: Implicit Reward-Guided Internal Sifting for Mitigating Multimodal Hallucination

Yuanshuai Li, Yuping Yan, Jirui Han, Fei Ming, Lingjuan Lv, Yaochu Jin
  • Hallucination in MLLMs arises from an over-reliance on linguistic priors, leading to insufficient grounding in visual evidence.
  • IRIS introduces a novel on-policy preference alignment framework that uses implicit rewards from the model's own log-probability space, avoiding the limitations of discrete external feedback.
  • The framework employs Rectified Visual Guidance (RVG) scoring to construct self-generated preference pairs, enabling iterative refinement of the model's policy.
  • IRIS achieves strong performance on hallucination benchmarks with minimal data (5.7k samples) and no external evaluators, demonstrating its efficiency and scalability.
  • The proposed method retains fine-grained preference differences and avoids distributional discrepancies, ensuring stable optimization under KL-divergence constraints.
Abstract
This paper addresses the challenge of hallucination in Multimodal Large Language Models (MLLMs), where generated text contradicts visual evidence due to an imbalance between linguistic and visual modalities. The authors propose IRIS (Implicit Reward-Guided Internal Sifting), a novel framework that eliminates reliance on costly external evaluators by leveraging continuous implicit rewards derived from the model's own log-probability space. IRIS operates in two stages: an initial supervised fine-tuning (SFT) phase for visual consistency, followed by iterative preference optimization using self-generated preference pairs guided by Rectified Visual Guidance (RVG) scoring. This approach ensures that the alignment process remains grounded in the model's intrinsic generative distribution, addressing fine-grained conflicts between modalities. Experiments demonstrate that IRIS achieves competitive performance on hallucination benchmarks using only 5.7k samples, without requiring external feedback, making it an efficient and scalable solution for improving multimodal alignment.
Methodology
IRIS operates in two stages: (1) a supervised fine-tuning (SFT) phase to calibrate the model's latent distribution for visual consistency, and (2) an iterative preference optimization phase where the model generates candidate responses and uses its own implicit rewards to construct preference pairs via Rectified Visual Guidance (RVG) scoring. These pairs are optimized using multimodal Direct Preference Optimization (DPO) to refine the model's policy while maintaining alignment with its intrinsic generative distribution.
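The preference-optimization stage builds on the standard DPO objective over sequence log-probabilities, which is easy to state in code. The snippet below is the vanilla DPO loss, not IRIS itself; the RVG scoring that constructs the preference pairs and the multimodal specifics are not reproduced.

import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Inputs are summed token log-probabilities of the chosen/rejected responses under
    # the trained policy and a frozen reference model (the KL anchor).
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()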
Results
IRIS achieves competitive performance on key hallucination benchmarks using only 5.7k samples, significantly reducing reliance on external evaluators. The approach demonstrates improved visual grounding and reduced hallucination rates compared to methods that rely on discrete external feedback. Additionally, IRIS ensures stable optimization under KL-divergence constraints and retains fine-grained preference differences, leading to more accurate multimodal alignment.
Implications
The proposed IRIS framework has significant implications for the development of more reliable and efficient multimodal AI systems. By eliminating the need for costly external evaluators and reducing data requirements, IRIS provides a scalable solution for mitigating hallucinations in MLLMs. This approach could enhance the performance of applications in areas such as visual question answering, image captioning, and other vision-language tasks, ultimately improving the trustworthiness and usability of multimodal AI systems.
View on arXiv

JTok: On Token Embedding as another Axis of Scaling Law via Joint Token Self-modulation

Yebin Yang, Huaijin Wu, Fu Guo, Lin Yao, Xiaohan Qin, Jingzhi Wang, Debing Zhang, Junchi Yan
  • Introduces token-indexed parameters as a new scaling axis for LLMs, complementing dense and sparse scaling approaches.
  • Proposes JTok and JTok-M architectures that enhance Transformer layers with token-specific modulation vectors, incurring minimal computational overhead.
  • Demonstrates significant downstream task performance improvements (+4.1 on MMLU, +8.3 on ARC, +8.9 on CEval) and 35% compute savings compared to vanilla MoE models.
  • Validates the scalability of token-indexed parameters, showing predictable log-linear scaling behavior similar to dense parameters.
  • Efficient implementation ensures low training throughput loss (<7%) and negligible inference memory overhead, making the approach practical for real-world applications.
Abstract
This paper introduces token-indexed parameters as a novel scaling axis for large language models (LLMs), addressing inefficiencies in traditional dense scaling and sparse Mixture-of-Experts (MoE) architectures. The authors propose two architectures, Joint-Token (JTok) and Mixture of Joint-Token (JTok-M), which augment Transformer layers with token-specific modulation vectors retrieved from auxiliary embedding tables. These vectors modulate the backbone via lightweight element-wise operations, incurring negligible FLOPs overhead while enhancing model capacity. Extensive experiments demonstrate that JTok-M improves validation loss and downstream task performance across various model scales, including dense and MoE backbones. The approach achieves comparable model quality with 35% less compute relative to vanilla MoE architectures, fundamentally shifting the quality-compute Pareto frontier. Additionally, token-indexed parameters exhibit predictable power-law scaling behavior, confirming their scalability. Efficient implementation ensures minimal training throughput loss and negligible inference overhead, making the method practical for large-scale deployment.
Methodology
The authors propose JTok and JTok-M architectures that augment Transformer layers with token-specific modulation vectors retrieved from learned embedding tables. JTok applies lightweight Hadamard products to modulate the MLP residual, while JTok-M generalizes this by maintaining a pool of token-indexed modulators and using a router to select sparse mixtures per token. These modules are implemented with table lookups and element-wise operations, ensuring minimal computational overhead. Extensive experiments were conducted on dense and MoE backbones ranging from 650M to 61B parameters, with iso-compute analysis validating the efficiency and scalability of the approach.
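The modulation itself is lightweight enough to show directly. The sketch below looks up a per-token vector from an auxiliary embedding table and gates the MLP output with a Hadamard product before the residual addition; it is a minimal JTok-style illustration under assumed dimensions and initialization, and the JTok-M router over a pool of modulators is omitted.

import torch
import torch.nn as nn

class TokenModulatedMLP(nn.Module):
    def __init__(self, vocab_size, d_model, d_ff):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.modulators = nn.Embedding(vocab_size, d_model)   # token-indexed parameters
        nn.init.ones_(self.modulators.weight)                 # start as identity modulation

    def forward(self, hidden, token_ids):
        # hidden: [batch, seq, d_model]; token_ids: [batch, seq]
        # A table lookup plus an element-wise product adds negligible FLOPs.
        return hidden + self.mlp(hidden) * self.modulators(token_ids)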
Results
JTok-M consistently improves validation loss and downstream task performance across multiple benchmarks, including +4.1 on MMLU, +8.3 on ARC, and +8.9 on CEval. Iso-compute analysis shows that JTok-M achieves comparable model quality with 35% less compute compared to vanilla MoE architectures. Token-indexed parameters exhibit predictable log-linear scaling behavior, confirming their scalability. The efficient implementation ensures training throughput loss of less than 7% and inference latency increase of ≤7.3%, with no additional GPU memory footprint.
Implications
The proposed token-indexed scaling axis offers a practical and efficient method to enhance LLMs without increasing computational costs, making it suitable for large-scale deployment in resource-constrained environments. It has potential applications in improving model generalization, data efficiency, and downstream task performance across various domains, including natural language understanding, reasoning, and generation tasks. Additionally, the predictable scaling behavior provides a roadmap for future model expansion and optimization.
View on arXiv

Learning Half-Spaces from Perturbed Contrastive Examples

Aryan Alavi Razavi Ravari, Farnam Mansouri, Yuxin Chen, Valentio Iverson, Adish Singla, Sandra Zilles
  • Introduces a perturbed contrastive example oracle parameterized by a noise function f, which governs the quality of contrastive examples based on distance from the decision boundary.
  • Analyzes active and passive learning scenarios for one-dimensional thresholds and linear half-spaces under uniform distributions.
  • Shows that perturbed contrastive examples can reduce sample complexity compared to learning without contrastive examples, while avoiding trivialization of learning.
  • Provides theoretical results characterizing sample complexity and error bounds as functions of the noise function f.
  • Highlights the practical relevance of perturbed contrastive examples for real-world machine learning applications.
Abstract
This paper investigates learning linear half-spaces using a perturbed contrastive example oracle, extending the idealized model introduced by Mansouri et al. (2025). The oracle provides labeled examples paired with contrastive examples of opposite labels, but in this work, the contrastive examples are perturbed based on a noise function f that depends on the distance of the queried point from the decision boundary. The study explores two settings: fixed and stochastic perturbation magnitudes. The authors analyze active and passive learning scenarios for one-dimensional thresholds and linear half-spaces under a uniform distribution on bounded domains. They demonstrate that under certain conditions on the noise function f, perturbed contrastive examples can significantly reduce sample complexity compared to learning without contrastive examples, while avoiding the trivialization of learning observed in the idealized model. The paper provides theoretical characterizations of sample complexity and error bounds, contributing to the development of more practical models for learning from contrastive examples.
Methodology
The authors extend the idealized contrastive example model by introducing a noise function f that perturbs the contrastive examples based on the distance of the queried point from the decision boundary. They analyze the sample complexity and error bounds for learning one-dimensional thresholds and linear half-spaces under uniform distributions, considering both active and passive learning settings. Two perturbation mechanisms are studied: fixed and stochastic magnitudes.
Results
The study demonstrates that perturbed contrastive examples can significantly improve learning efficiency by reducing sample complexity and expected error compared to learning without contrastive examples. The improvements depend on the choice of the noise function f, which governs the permissible perturbation. Unlike the idealized model, the perturbed oracle does not trivialize learning, making it more applicable to practical scenarios.
Implications
The proposed model bridges the gap between theoretical and practical learning from contrastive examples, offering a more realistic framework for applications such as recommender systems, NLP, causal inference, and program synthesis. By incorporating perturbations, the model aligns better with real-world constraints, potentially enabling more efficient and interpretable machine learning systems.
View on arXiv

Local Exponential Stability of Mean-Field Langevin Descent-Ascent in Wasserstein Space

Geuntaek Seo, Minseop Shin, Pierre Monmarché, Beomjun Choi
  • The paper proves the local exponential stability of the mean-field Langevin descent-ascent (MFL-DA) dynamics for nonconvex-nonconcave payoffs in Wasserstein space.
  • The authors establish a coercivity estimate for entropy near equilibrium using spectral analysis, revealing a local convex-concave structure.
  • The results provide explicit convergence rates and establish stability whenever the initialization is sufficiently close to equilibrium in the Wasserstein metric.
  • The study addresses open questions about local stability and convergence rates posed by prior work, but global convergence remains unresolved.
  • The findings contribute to the understanding of Wasserstein geometry in optimization and gradient flows.
Abstract
This paper investigates the mean-field Langevin descent-ascent (MFL-DA) dynamics, a coupled optimization framework in the space of probability measures, for entropically regularized two-player zero-sum games. The authors address an open question posed by Wang and Chizat (COLT 2024) regarding the long-term behavior of MFL-DA for nonconvex-nonconcave payoffs. They prove that the unique mixed Nash equilibrium of the system is locally exponentially stable, meaning that if the initialization is sufficiently close to the equilibrium in the Wasserstein metric, the dynamics converge to the equilibrium at an exponential rate. The analysis leverages a coercivity estimate for entropy near equilibrium, derived via spectral analysis of the linearized operator, which reveals a local displacement convex-concave structure. This work resolves questions about local stability and convergence rates, leaving global convergence as an open challenge.
Methodology
The authors analyze the MFL-DA dynamics, which describe the evolution of probability measures in Wasserstein space, using tools from spectral analysis and Wasserstein geometry. They derive a coercivity estimate for entropy near equilibrium and demonstrate that this induces a local displacement convex-concave structure. The analysis is supported by formal gradient flow techniques and connections to functional inequalities like the log-Sobolev inequality.
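In the standard formulation of these dynamics (written here up to the paper's exact domain, regularization, and constants), each player runs noisy gradient descent or ascent on the payoff averaged against the other player's current law:

\[
\mathrm{d}X_t = -\,\nabla_x\!\Big(\int f(X_t,y)\,\mathrm{d}\nu_t(y)\Big)\mathrm{d}t + \sqrt{2\tau}\,\mathrm{d}B_t,
\qquad
\mathrm{d}Y_t = +\,\nabla_y\!\Big(\int f(x,Y_t)\,\mathrm{d}\mu_t(x)\Big)\mathrm{d}t + \sqrt{2\tau}\,\mathrm{d}B'_t,
\]

with \(\mu_t=\mathrm{Law}(X_t)\), \(\nu_t=\mathrm{Law}(Y_t)\), payoff \(f\), and temperature \(\tau>0\) arising from the entropic regularization.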
Results
The paper proves that the unique mixed Nash equilibrium of the MFL-DA dynamics is locally exponentially stable. Specifically, if the initialization is sufficiently close to the equilibrium in the Wasserstein metric, the system converges to the equilibrium at an exponential rate. The convergence is quantified explicitly, and the results extend to stronger senses of convergence, such as relative entropy and Nikaido-Isoda error.
Implications
The results have implications for optimization in machine learning, particularly in settings involving two-player zero-sum games such as generative adversarial networks (GANs). The findings also advance the theoretical understanding of Wasserstein gradient flows and entropic regularization, potentially improving the robustness and tractability of minimax problems in nonconvex-nonconcave settings.
View on arXiv

LocalV: Exploiting Information Locality for IP-level Verilog Generation

Hanqi Lyu, Di Huang, Yaoyu Zhu, Kangcheng Liu, Bohan Dou, Chongxiao Li, Pengwei Jin, Shuyao Cheng, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen
  • LocalV addresses three major challenges in IP-level Verilog generation: long-document handling, long-code generation, and complex debugging processes.
  • The framework is based on the information locality hypothesis, which posits that hardware modules can be implemented correctly using localized document segments rather than the entire specification.
  • LocalV employs a multi-step workflow, including document partitioning, task planning, localized code generation, interface-consistent merging, and AST-guided debugging.
  • Experiments on the REALBENCH benchmark show that LocalV achieves a pass rate of 45.0%, significantly outperforming state-of-the-art models.
  • The framework demonstrates scalability and robustness for real-world hardware design tasks, bridging the gap between academic benchmarks and industrial requirements.
Abstract
This paper introduces LocalV, a novel multi-agent framework designed to address the challenges of generating Register-Transfer Level (RTL) Verilog code for industrial IP-level hardware design. Traditional methods for RTL code generation are labor-intensive and error-prone, while existing Large Language Models (LLMs) and agent-based systems struggle with the complexity of real-world tasks. The authors identify three key challenges: handling long and detailed specifications, generating long and syntactically correct code, and managing the complex debugging cycles required for functional verification. LocalV leverages the inherent modularity and information locality of hardware design to decompose the long-document to long-code generation problem into smaller, manageable tasks. The framework includes hierarchical document partitioning, localized code generation, interface-consistent merging, and an Abstract Syntax Tree (AST)-guided debugging process. Experiments on the REALBENCH benchmark demonstrate that LocalV significantly outperforms state-of-the-art models, achieving a pass rate of 45.0% compared to 21.6%, showcasing its effectiveness in handling real-world IP-level Verilog generation tasks.
Methodology
LocalV utilizes a multi-agent framework that decomposes the long-document to long-code generation problem into smaller, localized tasks. The workflow includes hierarchical document partitioning to create manageable fragments, task planning to assign these fragments to sub-tasks, localized code generation for each sub-task, merging of code fragments with interface consistency, and AST-guided debugging to trace errors back to specific document segments. This modular approach leverages the inherent locality in hardware design to improve scalability and accuracy.
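At a high level the workflow reads like a short pipeline, sketched below; every callable is a hypothetical placeholder standing in for a stage described above, not an API from the LocalV release.

def localv_pipeline(spec, partition, plan, generate, merge, verify, debug):
    fragments = partition(spec)                  # hierarchical document partitioning
    tasks = plan(fragments)                      # map fragments to module-level sub-tasks
    modules = [generate(t) for t in tasks]       # localized Verilog generation per sub-task
    design = merge(modules)                      # interface-consistent merging
    errors = verify(design)
    while errors:                                # AST-guided debugging traces failures back
        design = debug(design, errors, fragments)  # to the relevant document segments
        errors = verify(design)
    return design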
Results
LocalV achieves a pass rate of 45.0% on the REALBENCH benchmark, a significant improvement over the 21.6% pass rate of state-of-the-art models. The framework demonstrates superior performance in both syntactic and semantic correctness, particularly for long and complex IP-level Verilog generation tasks.
Implications
LocalV has the potential to revolutionize the RTL code generation process in digital hardware design by automating labor-intensive tasks and improving accuracy. Its modular and scalable approach could be applied to other domains requiring long-document to long-code generation, such as software engineering and system design. Additionally, its debugging pipeline could enhance verification workflows in hardware development.
View on arXiv

Localized, High-resolution Geographic Representations with Slepian Functions

Arjun Rao, Ruth Crasto, Tessa Ooms, David Rolnick, Konstantin Klemmer, Marc Rußwurm
  • Slepian functions are used to create geographic encoders that concentrate representational capacity within specific regions of interest, improving performance on localized tasks.
  • A hybrid Slepian-Spherical Harmonics encoder is introduced to balance local detail and global context, addressing the tradeoff in geospatial machine learning.
  • The proposed encoders outperform traditional methods across five diverse tasks, including classification, regression, and image-augmented prediction.
  • Slepian encoders are computationally efficient and memory-efficient compared to spherical harmonics, making them scalable to high-resolution applications.
  • The methodology is extendable to multiple localized regions and temporal data, broadening its applicability.
Abstract
This paper introduces a novel geographic location encoder based on Slepian functions to address the limitations of existing positional encoding methods in geospatial machine learning. Traditional encoders, such as spherical harmonics, distribute representational capacity uniformly across the globe, which limits their ability to capture fine-grained, localized patterns. The proposed Slepian-based encoder concentrates representational capacity within specific regions of interest (ROIs), enabling high-resolution and computationally efficient geographic representations. Additionally, the authors propose a hybrid Slepian-Spherical Harmonics (SH) encoder to balance local and global context, preserving global smoothness while enhancing local detail. The paper demonstrates the effectiveness of these encoders across five tasks, including classification, regression, and image-augmented prediction, where they outperform baseline methods. The proposed methods are computationally efficient, memory-friendly, and adaptable to both spatial and temporal data. The authors also provide open-source code for reproducibility.
Methodology
The authors leverage Slepian functions, which are band-limited basis functions that concentrate energy within a specific spatial or temporal region, as the foundation for their geographic location encoders. They also develop a hybrid Slepian-Spherical Harmonics encoder to combine localized high-resolution detail with global context. The encoders are evaluated on five tasks spanning classification, regression, and image-augmented prediction, using neural networks with the proposed positional encodings. Comparisons are made against baseline methods, including spherical harmonics and other positional encoders.
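For background, spherical Slepian functions arise from the classical spatiospectral concentration problem: among functions bandlimited to degree L, find those whose energy is maximally concentrated in the region of interest R,

\[
\lambda \;=\; \frac{\int_{R} g^{2}\,\mathrm{d}\Omega}{\int_{S^{2}} g^{2}\,\mathrm{d}\Omega} \;\to\; \max,
\qquad g=\sum_{l=0}^{L}\sum_{m=-l}^{l} g_{lm}\,Y_{lm},
\]

which reduces to the eigenvalue problem

\[
\sum_{l',m'} D_{lm,\,l'm'}\, g_{l'm'} \;=\; \lambda\, g_{lm},
\qquad
D_{lm,\,l'm'} \;=\; \int_{R} Y_{lm}\,Y_{l'm'}\,\mathrm{d}\Omega .
\]

The leading eigenvectors are the well-concentrated Slepian functions used as basis features; how the paper wires them (and the hybrid Slepian-SH variant) into a location encoder is not reproduced here.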
Results
The Slepian-based encoders consistently outperform baseline methods across all five tasks, demonstrating superior performance in capturing fine-grained spatial patterns. The hybrid Slepian-Spherical Harmonics encoder effectively balances local and global representation, addressing the tradeoff inherent in geospatial machine learning. Additionally, the proposed methods are computationally efficient and require less memory compared to spherical harmonics, making them suitable for high-resolution applications.
Implications
The proposed Slepian-based encoders have significant implications for geospatial machine learning, particularly in applications requiring high-resolution, localized predictions, such as disease outbreak modeling, ecological pattern analysis, and economic activity forecasting. The hybrid encoder's ability to balance local and global context could also benefit tasks that require both fine-grained detail and broader spatial awareness. Furthermore, the computational efficiency and scalability of these methods make them practical for real-world deployment in resource-constrained settings.
View on arXiv

Lossless Embedding Compression via Spherical Coordinates

Han Xiao
  • Achieves 1.5× lossless compression for unit-norm embeddings, outperforming prior methods by 25%.
  • Utilizes spherical coordinates to exploit the geometric structure of unit-norm vectors, reducing exponent entropy.
  • Applicable across diverse embedding types (text, image, multi-vector) without requiring training.
  • Fully lossless within float32 precision, ensuring bit-exact reconstruction.
  • Reduces storage requirements significantly, e.g., compressing a ColBERT index of 1 million documents from 240 GB to 160 GB.
Read More
Abstract
This paper introduces a novel lossless compression method for unit-norm embeddings, leveraging spherical coordinates to achieve a 1.5× compression ratio, which is 25% better than the previous state-of-the-art. The approach exploits the geometric properties of unit-norm vectors, which lie on the surface of a high-dimensional hypersphere, causing their spherical angular coordinates to concentrate around π/2. This concentration reduces the entropy of IEEE 754 exponents, enabling efficient entropy coding. Unlike prior methods that focus on lossy quantization or ignore the hyperspherical structure of embeddings, this technique is fully lossless within float32 precision and requires no training. The method was evaluated across 26 configurations, including text, image, and multi-vector embeddings, demonstrating consistent compression improvements. It is particularly beneficial for applications requiring bit-exact reconstruction, such as embedding caches, API serialization, and archival storage.
Methodology
The proposed method converts Cartesian coordinates of unit-norm embeddings into spherical coordinates, where angular values concentrate around π/2. This transformation reduces the entropy of IEEE 754 exponents. The pipeline involves spherical transformation, transposition to group same-angle values, byte shuffling to separate exponents, and entropy coding using zstd. Reconstruction reverses these steps with negligible error matching float32 machine epsilon.
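As a rough illustration of that pipeline, the sketch below converts unit-norm rows to hyperspherical angles, groups and byte-shuffles them, and entropy-codes the result. It is a minimal sketch, not the paper's implementation; the exact byte layout and zstd settings are assumptions.

```python
# Minimal sketch of the spherical-coordinate compression pipeline.
# Assumes x is a (n_vectors, dim) float32 array of unit-norm embeddings.
import numpy as np
import zstandard as zstd

def to_spherical(x: np.ndarray) -> np.ndarray:
    """Convert unit-norm rows of x to (dim - 1) hyperspherical angles.

    For unit vectors the angles concentrate around pi/2, lowering the
    entropy of their float32 exponents.
    """
    n, d = x.shape
    angles = np.empty((n, d - 1), dtype=np.float32)
    # tail[:, k] = norm of x[:, k:], computed via a reversed cumulative sum
    tail = np.sqrt(np.cumsum(x[:, ::-1] ** 2, axis=1))[:, ::-1]
    for k in range(d - 2):
        angles[:, k] = np.arctan2(tail[:, k + 1], x[:, k])
    angles[:, d - 2] = np.arctan2(x[:, d - 1], x[:, d - 2])  # keeps the final sign
    return angles

def compress(angles: np.ndarray, level: int = 19) -> bytes:
    # Transpose so values of the same angle index sit together, then
    # byte-shuffle so exponent bytes form long, low-entropy runs for zstd.
    raw = np.ascontiguousarray(angles.T).view(np.uint8)
    shuffled = raw.reshape(-1, 4).T.copy()   # group byte 0s, byte 1s, ...
    return zstd.ZstdCompressor(level=level).compress(shuffled.tobytes())
```

Reconstruction inverts these steps; per the paper, the round trip is lossless within float32 precision.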
Results
The method consistently achieved 1.5× compression across 26 embedding configurations spanning text, image, and multi-vector embeddings. For example, it reduced the storage of a ColBERT index of 1 million documents from 240 GB to 160 GB. Reconstruction error was negligible, matching float32 precision limits.
Implications
This compression technique has significant implications for large-scale AI systems, enabling efficient storage and transmission of embeddings without loss of precision. It is particularly useful for applications requiring exact reconstruction, such as embedding caches, API serialization, and archival storage. The method's general applicability across embedding types makes it a versatile solution for modern AI pipelines.
View on arXiv

MATRIX: A Multimodal Benchmark and Post-Training Framework for Materials Science

Delia McGrath, Curtis Chong, Rohil Kulkarni, Gerbrand Ceder, Adeesh Kolluru
  • MATRIX is the first multimodal benchmark for materials science that integrates both text and experimental imagery.
  • Post-training on aligned image-text pairs improves experimental interpretation by 10–25% and text-only reasoning by 5–16%.
  • The benchmark evaluates four task families: foundational theory reasoning, research-level reasoning, hypothesis generation, and experimental interpretation.
  • MATRIX demonstrates transferable gains to other scientific domains, including improvements on ScienceQA and PubMedQA.
  • The dataset and models are publicly available to support further research in multimodal scientific reasoning.
Read More
Abstract
The MATRIX paper introduces a novel multimodal benchmark and post-training framework designed to evaluate and enhance scientific reasoning in materials science. Named MATRIX (Materials Analysis for Theory, Reasoning, and Images from eXperiments), the benchmark integrates both textual and visual data to assess foundational theory, research-level reasoning, hypothesis generation, and experimental interpretation. Unlike previous materials science benchmarks that focus primarily on text-based tasks, MATRIX incorporates experimental imagery such as SEM micrographs, XRD patterns, EDS spectra, and TGA curves, enabling a more comprehensive evaluation of multimodal reasoning. The authors demonstrate that post-training on aligned image-text pairs improves performance on experimental interpretation tasks by 10–25% and text-only reasoning tasks by 5–16%. Additionally, these gains generalize to other scientific domains, as evidenced by improvements on ScienceQA and PubMedQA benchmarks. The MATRIX dataset and models are made publicly available to facilitate further research.
Methodology
The authors constructed the MATRIX benchmark using data from postgraduate-level materials science coursework and open-access research papers. Tasks are divided into four categories: foundational theory reasoning, research-level reasoning, hypothesis generation, and experimental interpretation. The benchmark includes both text-based and multimodal tasks, with experimental imagery paired with textual descriptions. Post-training experiments were conducted using aligned image-text pairs to evaluate the impact of multimodal supervision on reasoning performance. The framework was also tested on out-of-domain benchmarks like ScienceQA and PubMedQA to assess generalizability.
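For intuition, a multimodal item in such a benchmark might look roughly like the following. This is a hypothetical example, with field names and content invented for illustration rather than taken from the released dataset.

```python
# Hypothetical MATRIX-style benchmark item; schema and values are illustrative only.
example_item = {
    "task_family": "experimental_interpretation",   # one of the four task families
    "modalities": ["text", "image"],
    "image_path": "figures/sem_micrograph_placeholder.png",  # e.g. an SEM micrograph
    "question": (
        "Based on the SEM image, what does the grain morphology suggest "
        "about the sintering conditions of this ceramic sample?"
    ),
    "choices": ["Under-sintered", "Optimally sintered", "Over-sintered"],
    "answer": "Over-sintered",
}
```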
Results
Post-training on aligned image-text pairs led to significant improvements in experimental interpretation tasks (10–25%) and text-only reasoning tasks (5–16%). The approach also demonstrated consistent gains on out-of-domain benchmarks, including ScienceQA and PubMedQA, highlighting the generalizability of the multimodal post-training framework. These results underscore the importance of cross-modal representational transfer and the role of visual grounding in enhancing scientific reasoning.
Implications
The MATRIX benchmark and post-training framework have the potential to advance the development of multimodal foundation models for scientific reasoning. By integrating experimental imagery with textual data, MATRIX enables more comprehensive evaluations of scientific tasks, particularly in materials science. The demonstrated generalizability to other domains suggests that this approach could be applied to a wide range of scientific disciplines, improving hypothesis generation, experimental interpretation, and theory-driven reasoning in multimodal contexts.
View on arXiv

Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum

Navdeep Kumar, Tehila Dahan, Lior Cohen, Ananyabrata Barua, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor
  • The paper achieves an optimal sample complexity of O(ϵ⁻²) for single-timescale actor-critic algorithms, closing the gap between the theoretical lower bound and previous state-of-the-art results.
  • A combination of STORM (Stochastic Recursive Momentum) and a replay buffer is introduced to effectively reduce variance in critic updates, addressing challenges from nonstationary occupancy measures.
  • The methodology is compatible with existing deep learning architectures and requires only minor modifications, ensuring practical applicability.
  • The theoretical analysis involves advanced techniques such as Lyapunov functions and ODE-tracking frameworks to handle the interdependent dynamics of actor and critic updates.
  • The results provide significant improvements in sample efficiency for reinforcement learning algorithms, making them more suitable for large-scale applications.
Read More
Abstract
This paper establishes an optimal sample complexity of O(ϵ⁻²) for achieving an ϵ-optimal global policy using a single-timescale actor-critic algorithm in infinite-horizon discounted Markov decision processes (MDPs) with finite state-action spaces. This improves upon the previous state-of-the-art sample complexity of O(ϵ⁻³). The authors introduce a novel combination of STORM (Stochastic Recursive Momentum) for variance reduction in critic updates and a replay buffer mechanism to address challenges arising from nonstationary occupancy measures. The proposed approach is compatible with existing deep learning architectures and requires minimal modifications, making it practical for real-world reinforcement learning applications. The paper also provides a detailed theoretical analysis using Lyapunov functions and extends the ODE-tracking framework to demonstrate the optimal convergence rate.
Methodology
The authors use a single-timescale actor-critic framework where both actor and critic updates occur at similar rates. They incorporate STORM for variance reduction in critic updates and introduce a replay buffer to manage variance arising from nonstationary occupancy measures. The analysis employs time-dependent learning rates for actor, critic, and momentum updates, along with Lyapunov functions and an extended ODE-tracking framework to establish convergence properties.
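A minimal sketch of a STORM-style variance-reduced critic update is shown below; the critic gradient function, step sizes, and array types are assumptions, and the actor update, learning-rate schedules, and buffer management are omitted.

```python
# STORM-style recursive momentum for the critic gradient (minimal sketch).
# grad_fn(params, batch) is assumed to return the stochastic critic gradient
# as an array; params and gradients support elementwise arithmetic.

def storm_step(params, prev_params, d_prev, batch, grad_fn, a=0.1, beta=0.01):
    g_new = grad_fn(params, batch)               # gradient at the current iterate
    g_old = grad_fn(prev_params, batch)          # gradient at the previous iterate, same sample
    d = g_new + (1.0 - a) * (d_prev - g_old)     # recursive momentum estimator
    new_params = params - beta * d               # critic step with step size beta
    return new_params, params, d                 # also return state for the next call
```

Evaluating the same minibatch at both the current and previous parameters is what lets the correction term cancel variance; in the paper's setting that minibatch would be drawn from the replay buffer to cope with the nonstationary occupancy measure.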
Results
The paper demonstrates that the proposed single-timescale actor-critic algorithm achieves an optimal sample complexity of O(ϵ⁻²) for computing an ϵ-optimal global policy. This represents a significant improvement over the previous best complexity of O(ϵ⁻³). The combination of STORM and the replay buffer effectively reduces variance in critic updates, ensuring faster convergence.
Implications
The findings have important implications for reinforcement learning, particularly in improving the efficiency of actor-critic algorithms for large-scale problems. The compatibility with deep learning architectures makes the approach suitable for practical applications in areas such as robotics, autonomous systems, and game AI. Additionally, the theoretical advancements may inspire further research into single-timescale frameworks and variance reduction techniques.
View on arXiv

Parallel Stochastic Gradient-Based Planning for World Models

Michael Psenka, Michael Rabbat, Aditi Krishnapriyan, Yann LeCun, Amir Bar
  • GRASP introduces parallel optimization of intermediate states ('virtual states') with soft dynamics constraints, enabling efficient long-horizon planning.
  • The planner mitigates gradient sensitivity issues in high-dimensional visual world models by stopping gradients through state inputs and focusing on action-input gradients.
  • Stochastic updates via Langevin-style noise promote exploration and help escape local minima during optimization.
  • GRASP achieves up to +10% higher success rates compared to existing methods like CEM and GD, with less than half the computational cost.
  • The approach is validated on visual world models trained on D4RL and DeepMind control suite benchmarks.
Read More
Abstract
This paper introduces GRASP (Gradient RelAxed Stochastic Planner), a novel gradient-based planning method for learned world models that addresses challenges in long-horizon control tasks from visual inputs. World models simulate environment dynamics using raw sensory data, but planning with them is often hindered by high-dimensional state spaces and local minima. GRASP leverages differentiability in world models while introducing stochasticity and parallel optimization of intermediate states, termed 'virtual states,' to improve robustness and computational efficiency. The method avoids sensitive gradients through state inputs by focusing optimization on action-input gradients and incorporates Langevin-style stochastic updates to facilitate exploration and escape from unfavorable basins. GRASP outperforms traditional planning algorithms like the Cross-Entropy Method (CEM) and vanilla gradient descent (GD) in terms of success rate and computational efficiency, demonstrating its effectiveness in visual world models across various benchmarks.
Methodology
The authors propose a gradient-based planner that decouples temporal dynamics into parallel-optimized intermediate states ('virtual states') rather than relying on sequential rollouts. The method incorporates stochastic updates to intermediate states and modifies gradient structures to focus optimization on action-input gradients. A dense one-step goal loss is applied across the trajectory to ensure convergence towards the target state. The planner intermittently applies gradient descent steps to refine stochastically optimized trajectories.
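In spirit, the optimization loop might resemble the PyTorch-style sketch below; world_model(state, action) -> next_state, the loss weighting, and the noise schedule are assumptions, not the authors' code.

```python
# Minimal sketch of parallel 'virtual state' planning with a soft dynamics
# constraint, stop-gradient on state inputs, and Langevin-style noise.
import torch

def grasp_like_plan(world_model, s0, goal, action_dim, T=16, steps=300,
                    lr=0.05, noise_scale=0.01):
    states = goal.detach().repeat(T, 1).clone().requires_grad_(True)  # virtual states
    actions = torch.zeros(T, action_dim, requires_grad=True)
    opt = torch.optim.Adam([states, actions], lr=lr)
    for _ in range(steps):
        prev = torch.cat([s0.unsqueeze(0), states[:-1]], dim=0)
        pred = world_model(prev.detach(), actions)   # gradients flow only through actions
        dyn_loss = ((states - pred) ** 2).mean()     # soft dynamics constraint
        goal_loss = ((states - goal) ** 2).mean()    # dense one-step goal loss
        opt.zero_grad()
        (dyn_loss + goal_loss).backward()
        opt.step()
        with torch.no_grad():                        # Langevin-style exploration noise
            states.add_(noise_scale * torch.randn_like(states))
    return actions.detach()
```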
Results
GRASP demonstrates superior performance in long-horizon planning tasks, achieving up to +10% higher success rates compared to CEM and GD while reducing computational costs by more than half. Experiments on visual world models trained on D4RL and DeepMind control suite benchmarks validate its robustness and efficiency.
Implications
GRASP has significant implications for real-world applications requiring robust and efficient planning in high-dimensional environments, such as robotics, autonomous navigation, and simulation-based testing in medical procedures or complex systems. Its ability to handle visual inputs and long-horizon tasks makes it particularly suitable for tasks involving learned world models in dynamic and visually complex domains.
View on arXiv

PyGALAX: An Open-Source Python Toolkit for Advanced Explainable Geospatial Machine Learning

Pingping Wang, Yihong Yuan, Lingcheng Li, Yongmei Lu
  • PyGALAX integrates AutoML and SHAP-based XAI to analyze spatial heterogeneity in regression and classification tasks.
  • It improves upon the GALAX framework by adding automatic bandwidth selection and flexible kernel functions.
  • The toolkit supports both fixed and adaptive bandwidth approaches, enabling optimal spatial scale selection for diverse datasets.
  • PyGALAX is designed for accessibility and reproducibility, making advanced geospatial machine learning methods available to a broader research community.
  • It provides parallel processing capabilities for efficient analysis of large spatial datasets.
Read More
Abstract
PyGALAX is an open-source Python toolkit designed to enhance geospatial machine learning by integrating automated machine learning (AutoML) and explainable artificial intelligence (XAI) techniques. It builds upon the GALAX framework, providing tools for analyzing spatial heterogeneity in both regression and classification tasks. PyGALAX introduces key improvements such as automatic bandwidth selection, flexible kernel function options, and SHAP-based explainability, enabling researchers to model complex, non-linear spatial relationships while maintaining interpretability. The toolkit is designed to address limitations in traditional geographically weighted regression (GWR) methods and existing geospatial machine learning tools, which often lack flexibility and transparency. PyGALAX supports diverse applications, including urban analytics, environmental monitoring, and public health, by offering a user-friendly, reproducible, and scalable solution for spatial analysis.
Methodology
PyGALAX automates the GALAX framework by integrating geographically weighted AutoML, which optimizes machine learning models (e.g., Random Forest, XGBoost) for each spatial location. It incorporates SHAP-based explainability to interpret model predictions and spatial patterns. The toolkit includes features such as automatic bandwidth selection using Incremental Spatial Autocorrelation (ISA) or performance-based optimization, flexible kernel functions for spatial weighting, and support for both regression and classification tasks. Parallel processing is also implemented for handling large datasets efficiently.
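To make the geographically weighted idea concrete, the sketch below fits a locally weighted model at one focal location and explains it with SHAP; this is not PyGALAX's API, and the kernel, bandwidth, and estimator are placeholders.

```python
# Minimal sketch: geographically weighted fit at a focal location plus SHAP
# explanations. Kernel, bandwidth, and estimator are placeholder choices.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

def local_model_with_shap(X, y, coords, focal_xy, bandwidth=50.0):
    # Gaussian kernel weights by distance to the focal location (fixed bandwidth)
    dist = np.linalg.norm(coords - np.asarray(focal_xy), axis=1)
    weights = np.exp(-0.5 * (dist / bandwidth) ** 2)

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X, y, sample_weight=weights)        # geographically weighted fit

    explainer = shap.TreeExplainer(model)         # per-feature attributions
    shap_values = explainer.shap_values(X)
    return model, shap_values
```

An adaptive-bandwidth variant would instead set the kernel width from the distance to the k-th nearest neighbour of the focal location.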
Results
PyGALAX operationalizes the GALAX framework as a Python package, enhancing its usability and flexibility. It enables researchers to model spatially varying relationships with improved accuracy and interpretability compared to traditional GWR methods. The toolkit's ability to handle both regression and classification tasks expands its applicability to a wide range of geospatial research areas. By automating bandwidth and kernel selection, PyGALAX ensures robust and reproducible spatial modeling workflows.
Implications
PyGALAX has significant implications for fields such as geography, urban planning, environmental science, and public health. Its ability to model complex spatial relationships with transparency and adaptability makes it a valuable tool for evidence-based decision-making. The toolkit's open-source nature and user-friendly design lower the barrier to entry for advanced geospatial machine learning, enabling broader adoption and fostering innovation in spatial analysis.
View on arXiv

Semi-supervised CAPP Transformer Learning via Pseudo-labeling

Dennis Gross, Helge Spieker, Arnaud Gotlieb, Emmanuel Stathatos, Panorios Benardos, George-Christopher Vosniakos
  • The paper proposes a semi-supervised learning approach for high-level CAPP using pseudo-labeling to address data scarcity.
  • An oracle is trained to validate transformer predictions and selectively augment the training dataset with correct sequences.
  • The methodology improves transformer generalization without requiring manual labeling during test time.
  • Experiments on simulated datasets show consistent accuracy gains compared to baseline and random augmentation methods.
  • The approach is designed to be practical for real-world manufacturing environments with limited labeled data.
Read More
Abstract
This paper addresses the challenge of limited labeled data availability in high-level Computer-Aided Process Planning (CAPP) by proposing a semi-supervised learning approach using pseudo-labeling. High-level CAPP involves generating manufacturing process plans based on part specifications, which traditionally relied on expert knowledge and rule-based systems. Recent advancements have applied transformer models to this task, but their performance suffers in data-scarce environments. The authors introduce a method where an oracle, trained on labeled data and the transformer’s behavior, evaluates the correctness of predictions on unseen parts. Correct predictions are selectively added to the training set for one-shot retraining, improving the model's generalization. Experiments on simulated datasets demonstrate consistent accuracy improvements over baseline methods, showcasing the effectiveness of this approach in low-resource manufacturing scenarios. The proposed methodology reduces manual labeling efforts and mitigates risks associated with incorrect pseudo-labels, making it practical for industrial applications.
Methodology
The authors use a GPT-2 style transformer model trained on labeled CAPP data to generate manufacturing process plans. A learned oracle, formulated as a binary classifier, evaluates the correctness of predictions based on features extracted from the transformer’s outputs, such as confidence patterns, uncertainty measures, and temporal dynamics. Correct predictions are added to the training set for one-shot retraining, while incorrect predictions can be flagged for manual inspection. The oracle ensures selective pseudo-labeling to improve model performance without reinforcing errors.
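Schematically, the selective pseudo-labeling loop could look like the sketch below; the object interfaces (predict, extract_features, predict_proba, retrain) and the acceptance threshold are assumptions, not the authors' code.

```python
# Minimal sketch of oracle-gated pseudo-labeling for a CAPP sequence model.
# All object interfaces and the helper extract_features are assumed.

def selective_pseudo_label(transformer, oracle, unlabeled_parts, labeled_set,
                           extract_features, threshold=0.9):
    accepted, flagged = [], []
    for part in unlabeled_parts:
        plan, trace = transformer.predict(part)      # predicted process plan + decoding trace
        feats = extract_features(trace)              # confidence, uncertainty, temporal stats
        p_correct = oracle.predict_proba(feats)      # oracle's belief the plan is correct
        if p_correct >= threshold:
            accepted.append((part, plan))            # pseudo-label kept for retraining
        else:
            flagged.append(part)                     # candidate for manual inspection
    transformer.retrain(labeled_set + accepted)      # one-shot retraining
    return transformer, flagged
```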
Results
Experiments conducted on small-scale datasets with simulated ground truth demonstrate consistent accuracy improvements over baseline methods. The selective pseudo-labeling approach outperforms random augmentation and enhances the generalization capabilities of the CAPP transformer in data-scarce scenarios.
Implications
This methodology has significant implications for manufacturing industries, enabling the deployment of high-level CAPP systems in environments with limited labeled data. It reduces reliance on manual labeling, improves model accuracy, and offers a scalable solution for generating manufacturing process plans across diverse production contexts. The approach could also be extended to other sequence prediction tasks in low-resource domains.
View on arXiv

Single-Edge Node Injection Threats to GNN-Based Security Monitoring in Industrial Graph Systems

Wenjie Liang, Ranhui Yan, Jia Cai, You-Gan Wang
  • The paper formalizes a resource-constrained node injection threat model tailored to industrial graph-based monitoring systems such as IIoT, CPS, and smart grids.
  • SEGIA introduces a stealth-aware single-edge injection framework that combines pruning-aware optimization, reverse feature synthesis, and similarity regularization.
  • Experimental results show SEGIA achieves at least 25% higher attack success rates compared to baseline methods under smaller edge budgets.
  • The study highlights residual risks in industrial GNN deployments despite homophily-oriented sanitization defenses.
  • The authors emphasize the importance of lightweight admission validation and neighborhood-consistency monitoring to enhance system-level security.
Read More
Abstract
This paper investigates the vulnerability of graph neural networks (GNNs) used in industrial monitoring systems to node injection attacks under constrained resources. The authors propose the Single-Edge Graph Injection Attack (SEGIA), a novel attack framework where each injected node connects to the graph through a single edge, minimizing detectability while maximizing impact. SEGIA leverages a pruning-aware surrogate model, multi-hop neighborhood sampling, and reverse graph convolution-based feature synthesis with similarity regularization to evade homophily-based defenses and edge pruning mechanisms. The study demonstrates that SEGIA achieves significantly higher attack success rates compared to existing methods, even under strict edge budgets, highlighting critical risks in industrial GNN deployments. The findings underscore the need for lightweight validation mechanisms and neighborhood-consistency monitoring to mitigate such threats.
Methodology
The authors developed SEGIA, a single-edge node injection attack framework, which integrates a pruned Simple Graph Convolution (PrSGC) surrogate model, multi-hop neighborhood sampling for optimization under partial graph knowledge, and reverse graph convolution-based feature synthesis with similarity regularization. This approach preserves local homophily and minimizes detectability while anticipating edge-pruning defenses.
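At its core, single-edge injection reduces to optimizing one injected feature vector under a similarity constraint. The generic sketch below assumes a differentiable stand-in surrogate(X, A) that returns node logits, and omits the paper's PrSGC surrogate, pruning-awareness, and multi-hop neighborhood sampling.

```python
# Generic single-edge node-injection sketch: optimize the injected node's
# features to push a target node toward an attacker-chosen class while
# staying close to its neighborhood (homophily-preserving regularizer).
import torch
import torch.nn.functional as F

def single_edge_injection(surrogate, X, A, target, wrong_class,
                          sim_weight=1.0, steps=100, lr=0.01):
    n = X.shape[0]
    neigh_mean = X[A[target] > 0].mean(dim=0, keepdim=True)
    x_inj = neigh_mean.clone().requires_grad_(True)     # start from the neighborhood mean
    opt = torch.optim.Adam([x_inj], lr=lr)
    y_adv = torch.tensor([wrong_class])
    for _ in range(steps):
        X_new = torch.cat([X, x_inj], dim=0)
        A_new = torch.zeros(n + 1, n + 1)
        A_new[:n, :n] = A
        A_new[n, target] = A_new[target, n] = 1.0        # the single injected edge
        logits = surrogate(X_new, A_new)
        atk_loss = F.cross_entropy(logits[target:target + 1], y_adv)
        sim_loss = ((x_inj - neigh_mean) ** 2).mean()    # stealth / similarity term
        opt.zero_grad()
        (atk_loss + sim_weight * sim_loss).backward()
        opt.step()
    return x_inj.detach()
```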
Results
SEGIA demonstrated at least 25% higher attack success rates compared to baseline methods across multiple datasets and defense mechanisms, even under constrained edge budgets. The attack effectively induced misclassification and operational risks in industrial graph systems while evading detection.
Implications
The findings reveal critical vulnerabilities in industrial GNN-based monitoring systems, emphasizing the need for improved security measures such as lightweight admission validation and neighborhood-consistency monitoring. These insights are particularly relevant for IIoT, CPS, and smart-grid environments, where system-level risks can translate into operational failures or safety hazards.
View on arXiv

Test-time Generalization for Physics through Neural Operator Splitting

Louis Serrano, Jiequn Han, Edouard Oyallon, Shirley Ho, Rudy Morel
  • Introduces a novel test-time generalization strategy for neural operators using operator splitting, enabling zero-shot generalization to unseen PDE dynamics.
  • Leverages a pretrained DISCO framework and a beam search algorithm to compose known operators for approximating OOD dynamics without modifying model weights.
  • Demonstrates state-of-the-art performance on challenging OOD tasks, including parameter extrapolation and novel combinations of physical effects.
  • Enables system identification and zero-shot PDE parameter estimation by decomposing unknown dynamics into compositions of known operators.
  • Highlights test-time computation as a key avenue for building flexible and generalizable neural operators.
Read More
Abstract
This paper addresses the challenge of out-of-distribution (OOD) generalization in neural operators for solving partial differential equations (PDEs). Neural operators, while effective at learning solution maps for PDEs, often fail to generalize to unseen dynamics, such as novel initial conditions or new combinations of physical effects. The authors propose a novel test-time generalization method that does not require modifying pretrained model weights. The approach builds on the DISCO framework, which encodes neural operators into a shared latent space, and introduces a neural operator splitting strategy. At test time, the method uses a beam search to compose pretrained operators to approximate unseen dynamics. This enables zero-shot generalization to OOD scenarios, such as parameter extrapolation and novel combinations of physical phenomena. The method also facilitates system identification by expressing unknown dynamics as compositions of known operators. The proposed framework achieves state-of-the-art results in zero-shot generalization tasks, outperforming existing adaptive neural operator and transformer-based methods.
Methodology
The proposed method builds on the DISCO framework, which trains neural operators across different dynamics and encodes them into a shared latent space. At test time, the method employs a beam search to identify compositions of pretrained operators that approximate the unseen dynamics. Operator splitting is used during both the search and rollout phases to approximate the sum of physical terms through successive compositions. This approach avoids modifying pretrained weights and adapts dynamically based on the proximity of test dynamics to the training distribution.
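The test-time search can be pictured as a beam search over short compositions of pretrained operators, scored by one-step rollout error on the observed trajectory. The sketch below is generic, with assumed operator and trajectory interfaces; it is not the DISCO implementation.

```python
# Minimal sketch: beam search over compositions of pretrained one-step operators.
# Each operator maps a state to the next state; a composition approximates the
# unseen dynamics via operator splitting. Interfaces are assumptions.
import numpy as np

def rollout_error(ops, trajectory):
    """Mean one-step prediction error of applying the operators in sequence."""
    err = 0.0
    for u_t, u_next in zip(trajectory[:-1], trajectory[1:]):
        pred = u_t
        for op in ops:                       # successive compositions (splitting)
            pred = op(pred)
        err += float(np.mean((pred - u_next) ** 2))
    return err / max(len(trajectory) - 1, 1)

def beam_search_composition(operators, trajectory, depth=3, beam_width=4):
    beams = [([], float("inf"))]
    for _ in range(depth):
        candidates = []
        for ops, _ in beams:
            for op in operators:
                new_ops = ops + [op]
                candidates.append((new_ops, rollout_error(new_ops, trajectory)))
        beams = sorted(candidates, key=lambda c: c[1])[:beam_width]
    return beams[0]                          # best composition and its error
```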
Results
The method achieves state-of-the-art zero-shot generalization on two challenging OOD scenarios: parameter extrapolation and novel combinations of physical phenomena. It outperforms existing adaptive neural operator methods and transformer-based architectures. Additionally, the framework enables accurate system identification and parameter estimation for unseen PDEs, demonstrating its robustness and flexibility in diverse nonlinear PDE tasks.
Implications
This work has significant implications for advancing the generalization capabilities of neural operators in scientific computing and physics-based simulations. The ability to generalize to unseen dynamics without requiring additional training data or fine-tuning makes the approach particularly valuable for applications where data is scarce or expensive to obtain. Potential applications include climate modeling, fluid dynamics, and other domains where PDEs play a critical role in understanding complex systems.
View on arXiv