AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 53. Update frequency: 8h. Days of history: 7.
CascadeFormer: Depth-Tapered Transformers Motivated by Gradient Fan-in Asymmetry
NLP
Large Language Models
Efficient ML
- Introduction of Gradient Fan-in Asymmetry (GFA) as a structural explanation for layer redundancy in deep transformers.
- Development of CascadeFormer, which tapers model width with depth to improve efficiency without sacrificing performance.
- CascadeFlow Pruning (CFP) leverages training gradients for effective layer pruning, outperforming standard methods.
- Empirical validation of GFA through correlational and interventional studies on models up to 1.2B parameters.
Summary
The paper introduces CascadeFormer, a novel transformer architecture that addresses the inefficiencies of deep transformers, which often exhibit diminishing returns in deeper layers. The authors propose the concept of Gradient Fan-in Asymmetry (GFA) to explain why deeper layers contribute less to model performance, attributing this to a structural bottleneck in gradient diversity rather than just gradient magnitude. CascadeFormer adapts the width of the model according to depth, aligning with the natural flow of information, which leads to improved efficiency. Additionally, the authors present CascadeFlow Pruning (CFP), a method that prunes layers based on accumulated training gradients, outperforming traditional heuristics without requiring extensive post hoc analysis. Empirical evidence supports the GFA hypothesis, demonstrating that deeper layers receive less informative gradients, which can be mitigated by structural adjustments in the model. The findings indicate that optimizing model architecture based on gradient flow can enhance performance and efficiency in large-scale models.
Methodology
The authors conducted empirical studies on transformer models trained from scratch, analyzing the relationship between per-layer gradient norms and functional importance. They performed two types of interventions: an ablative test that manipulated gradient norms and a constructive test that increased path counts through layer repetition. These methods were used to validate the GFA hypothesis and to develop the CascadeFormer architecture and CascadeFlow Pruning technique.
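The depth-tapering idea can be illustrated with a small sketch. The linear schedule, the function name `tapered_widths`, and the rounding to a multiple of 64 are assumptions made for illustration; the summary does not state which schedule the paper actually uses.

```python
def tapered_widths(num_layers, first_width, last_width, multiple_of=64):
    """Hypothetical linear taper: layer width shrinks from first_width to last_width."""
    if num_layers < 1:
        raise ValueError("num_layers must be >= 1")
    widths = []
    for layer in range(num_layers):
        # Fraction of the way through the stack (0.0 at the first layer, 1.0 at the last).
        frac = layer / (num_layers - 1) if num_layers > 1 else 0.0
        raw = first_width + frac * (last_width - first_width)
        # Round to the nearest multiple so widths stay hardware-friendly (assumption).
        rounded = int(round(raw / multiple_of)) * multiple_of
        widths.append(max(multiple_of, rounded))
    return widths


widths = tapered_widths(num_layers=5, first_width=1024, last_width=512)
print(widths)
```

A uniform baseline would correspond to `first_width == last_width`, which makes the two designs easy to compare at a matched parameter budget.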
Results
CascadeFormer achieved comparable perplexity to a uniform baseline while reducing latency by 8.6% and increasing throughput by 9.4%. CascadeFlow Pruning outperformed traditional pruning heuristics on perplexity while maintaining competitive downstream accuracy.
Implications
The findings suggest that optimizing transformer architectures based on gradient flow can lead to more efficient models, which is particularly relevant for large-scale applications in NLP and other domains. The methods proposed could be applied to enhance the performance of existing models and inform future architectural designs.
Kolmogorov Arnold networks (KAN) for aerodynamic prediction: a comparison with MLPs and GNNs
Theory
Efficient ML
Graph Learning
- KANs adapt activation functions, offering a new approach to neural network architecture.
- In aerodynamic prediction tasks, KANs show comparable performance to MLPs but are outperformed by GNNs.
- KANs have faster training times due to lower complexity but suffer from training instabilities.
- Hyperparameter optimization is crucial for improving KAN performance.
Summary
This paper investigates the performance of Kolmogorov Arnold Networks (KANs) in the context of aerodynamic prediction, specifically in predicting surface pressure distributions over subsonic and transonic airfoils. KANs are a novel neural network architecture that adapts activation functions rather than the coefficients of affine transformations, as seen in traditional architectures like multilayer perceptrons (MLPs). The authors compare KANs against MLPs and graph neural networks (GNNs) to evaluate their effectiveness in surrogate modeling for fluid dynamics. Results indicate that while KANs perform well, they are marginally inferior to MLPs and significantly less effective than GNNs, which achieved the best performance. KANs exhibit faster training times due to their lower complexity but suffer from training instabilities and require careful hyperparameter optimization. This study contributes to the ongoing debate regarding the supremacy of KANs over MLPs and highlights their potential and limitations in practical applications.
Methodology
The authors developed a KAN-based surrogate model for predicting pressure coefficients over airfoils and compared its performance with MLPs and GNNs. They replicated previous results for MLPs and GNNs on the same task and evaluated the models based on their ability to interpolate across varying physical parameters and flight conditions.
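To make the contrast with MLPs concrete, the sketch below shows a layer in which every edge carries its own learnable one-dimensional function and each output is the sum of those functions over the inputs. It assumes piecewise-linear edge functions on a shared grid, which is a simplification chosen for brevity; the names `edge_fn` and `kan_layer` are hypothetical and nothing here reproduces the paper's model.

```python
from bisect import bisect_right


def edge_fn(x, grid, values):
    """Piecewise-linear function given by learnable `values` at the `grid` knots."""
    if x <= grid[0]:
        return values[0]
    if x >= grid[-1]:
        return values[-1]
    k = bisect_right(grid, x) - 1              # index of the knot at or left of x
    t = (x - grid[k]) / (grid[k + 1] - grid[k])
    return (1 - t) * values[k] + t * values[k + 1]


def kan_layer(x, grid, edge_values):
    """edge_values[j][i] holds the knot values of phi_{j,i}; output_j = sum_i phi_{j,i}(x_i)."""
    return [
        sum(edge_fn(xi, grid, vals) for xi, vals in zip(x, row))
        for row in edge_values
    ]


grid = [-1.0, 0.0, 1.0]
# One output unit, two inputs: the first edge encodes |x|, the second encodes x.
edge_values = [[[1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]]
out = kan_layer([-0.5, 0.5], grid, edge_values)
print(out)
```

Training would adjust the knot values themselves, whereas an MLP keeps the activation fixed and adjusts the weights of the affine map feeding it.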
Results
The study found that all three models (KANs, MLPs, and GNNs) achieved good results in predicting pressure coefficients, with GNNs performing the best. KANs ranked third, showing marginally inferior performance compared to MLPs. While KANs trained faster, they exhibited training instabilities and poorer generalization in regions with strong gradients.
Implications
The findings suggest that while KANs are promising for certain applications in fluid dynamics, their limitations in stability and performance compared to MLPs and GNNs must be addressed. This research could influence the development of more efficient surrogate models in computational fluid dynamics and other engineering applications.
RSPC: A Benchmark for Modeling Stress and Psychiatric Conditions in Digitally Mediated Relationships using Psychiatrist Annotations
NLP
Large Language Models
- Introduction of the RSPC dataset linking psychiatric conditions with relational stressors.
- Benchmarking of transformer models and LLMs for mental health tasks.
- Identification of distinct model capabilities based on task requirements.
- Strong associations found between anxiety disorders and relational uncertainty.
Summary
The paper introduces the Relational Stress and Psychiatry Corpus (RSPC), a novel dataset aimed at understanding mental health conditions within the context of digitally mediated relationships, specifically long-distance relationships (LDRs). Unlike previous studies that often treat mental health as an isolated phenomenon, this research emphasizes the relational context by analyzing 1,799 Reddit posts annotated by psychiatrists for various psychiatric conditions, relational stressors, and relationship phases. The authors benchmark seven fine-tuned transformer models and five large language models (LLMs) on tasks such as multi-label disorder classification, relational trigger detection, and temporal phase prediction. The findings reveal significant differences in model performance across tasks, with Claude-3-Haiku achieving the highest disorder classification performance and GPT-4o excelling in relational trigger detection. The study also uncovers strong correlations between anxiety disorders and chronic relational uncertainty, highlighting the importance of considering relational dynamics in mental health modeling. Overall, RSPC sets a new standard for NLP tasks that incorporate relational context, advocating for a shift from individual-centric to context-aware mental health assessments.
Methodology
The study employs a clinically grounded annotation framework developed in collaboration with licensed psychiatrists. Each Reddit post was annotated for diagnostic categories aligned with DSM-5-TR and ICD-11 criteria. The authors benchmarked multiple transformer models and LLMs across various tasks, including multi-label classification and relational trigger detection, to assess model performance.
Results
The results indicate that Claude-3-Haiku achieved the best performance in disorder classification (Macro-F1 = 0.538), while GPT-4o excelled in relational trigger detection (Macro-F1 = 0.519). The analysis also revealed a significant link between anxiety disorders and chronic relational uncertainty, underscoring the relevance of relational contexts in psychiatric inference.
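For readers unfamiliar with the metric, Macro-F1 in a multi-label setting is the unweighted mean of the per-label F1 scores. The sketch below computes it by hand; the label names and toy predictions are invented for illustration, and scoring a label with no positives as 0.0 is an assumed convention.

```python
def macro_f1(y_true, y_pred, labels):
    """y_true and y_pred are lists of label sets, one set per post."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label with no true or predicted positives scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["anxiety", "depression"]  # hypothetical label set
y_true = [{"anxiety"}, {"depression"}, {"anxiety", "depression"}, set()]
y_pred = [{"anxiety"}, {"anxiety"}, {"depression"}, set()]
score = macro_f1(y_true, y_pred, labels)
print(round(score, 4))
```

Because every label counts equally, rare conditions weigh as much as common ones, which is one reason Macro-F1 values on imbalanced clinical labels tend to look low.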
Implications
The RSPC dataset and findings have significant implications for mental health research, suggesting that incorporating relational dynamics can enhance the understanding and modeling of psychiatric conditions. This approach may lead to more effective interventions and support systems for individuals in digitally mediated relationships.
Physics-guided Convolutional Neural Network for Domain Growth Prediction in Systems with Conserved Kinetics
Theory
Efficient ML
- Introduction of an attention-based, physics-guided CNN for modeling phase separation dynamics.
- The model incorporates conservation constraints to ensure physical consistency during long-term predictions.
- Demonstrates accurate predictions of domain growth and preservation of mixture composition.
- Validates the model against known growth laws, confirming its effectiveness for critical and off-critical mixtures.
Summary
This paper presents a novel approach to modeling the spatiotemporal evolution of phase separation in binary mixtures governed by the Cahn–Hilliard equation using an attention-based, physics-guided convolutional neural network (CNN). The authors highlight the limitations of traditional numerical solvers for nonlinear partial differential equations (PDEs) and propose their model as an efficient surrogate that maintains physical consistency. The model incorporates a conservation constraint directly into the loss function and utilizes an attention mechanism to capture global patterns in the evolving microstructure. The trained model demonstrates stability and accuracy over long-time predictions for both critical and off-critical mixtures, effectively preserving the mixture composition and accurately capturing domain growth consistent with the Lifshitz–Slyozov law. The results indicate that the proposed framework can be extended to other complex dynamical systems, showcasing its potential for modeling systems with conserved kinetics.
Methodology
The authors developed a physics-guided convolutional neural network inspired by the residual U-Net architecture. The model integrates a conservation constraint into the loss function and employs an attention mechanism to enhance the learning of global patterns in the microstructure evolution. The training process involves generating datasets based on the Cahn–Hilliard equation and evaluating the model's performance on both training and validation datasets.
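One simple way to place a conservation constraint in a loss function is to penalize any drift in the spatial mean of the predicted field. The sketch below assumes a squared-difference penalty and a flattened field; the function name `conserved_loss` and the `weight` parameter are hypothetical, and the paper's exact formulation may differ.

```python
def conserved_loss(pred, target, weight=1.0):
    """Mean squared error plus a penalty on drift of the mean order parameter."""
    n = len(pred)
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    # For conserved kinetics the spatial mean of the field should not change.
    mean_drift = sum(pred) / n - sum(target) / n
    return mse + weight * mean_drift ** 2


target = [1.0, -1.0, 1.0, -1.0]
pred_conserving = [0.5, -0.5, 0.5, -0.5]   # same mean as the target
pred_drifting = [1.5, -0.5, 1.5, -0.5]     # same pointwise error, shifted mean
loss_a = conserved_loss(pred_conserving, target)
loss_b = conserved_loss(pred_drifting, target)
print(loss_a, loss_b)
```

Both predictions have the same pointwise error, but only the second is penalized for changing the mixture composition.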
Results
The proposed model successfully predicts the long-term evolution of phase separation in binary mixtures, maintaining stability and accuracy. It captures the growth of domain size consistent with the Lifshitz–Slyozov domain-growth law and preserves the order parameter throughout the evolution, demonstrating its effectiveness as a surrogate model for systems with conserved kinetics.
Implications
This work has significant implications for the modeling of complex dynamical systems in various scientific fields, including materials science and biology. The ability to accurately predict phase separation dynamics can enhance the understanding of material properties and processes, potentially leading to advancements in the design of new materials and the optimization of industrial processes.
Blackwell Approachability and Gradient Equilibrium are Equivalent
Theory
Optimization
- GEQ is algorithmically equivalent to Blackwell Approachability, allowing for the use of GEQ oracles in BA problems.
- The paper identifies necessary and sufficient conditions for achieving GEQ, expanding its theoretical foundation.
- Efficient reductions between GEQ and other frameworks like regret minimization and calibration are established.
- The results imply that GEQ algorithms are as powerful as classical regret minimization and calibration algorithms.
Summary
This paper establishes a fundamental equivalence between Gradient Equilibrium (GEQ) and Blackwell Approachability (BA) in the context of online optimization. GEQ is a novel framework that generalizes first-order stationarity from offline optimization and addresses problems such as online conformal prediction. The authors demonstrate that any BA problem can be solved using a GEQ oracle without significant loss in error rate, and vice versa. This equivalence clarifies the relationship between GEQ and other online learning frameworks, including regret minimization and calibration. The paper also identifies necessary and sufficient conditions for achieving GEQ and presents efficient reductions between various GEQ formulations with different decision set constraints. By linking GEQ to established frameworks, the authors provide a pathway for transferring guarantees from regret minimization to GEQ, enhancing the understanding of these concepts in the online learning landscape.
Methodology
The authors employed black-box oracle reductions to demonstrate the equivalence between GEQ and BA. They analyzed the structural connections between these frameworks and established necessary and sufficient conditions for GEQ. The methodology included theoretical proofs and algorithmic constructions that facilitate the transfer of guarantees from one framework to another.
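As a rough intuition for the target that GEQ sets, the sketch below assumes gradient equilibrium means the running average of the gradients at the played iterates tends to zero. That reading, the squared losses, and the use of plain online gradient descent are all assumptions for illustration, not the reductions constructed in the paper. With a constant step size, the update rule alone gives sum_t g_t = (x_1 - x_{T+1}) / step, so bounded iterates force the average gradient toward zero.

```python
def ogd_average_gradient(targets, step=0.1, x0=0.0):
    """Run online gradient descent on losses 0.5 * (x - y_t)^2 and average the gradients."""
    x = x0
    grad_sum = 0.0
    for y in targets:
        g = x - y            # gradient of 0.5 * (x - y)^2 at the current iterate
        grad_sum += g
        x -= step * g        # constant-step gradient update
    return grad_sum / len(targets), x


targets = [1.0, -1.0] * 500
avg_grad, x_final = ogd_average_gradient(targets, step=0.1, x0=0.0)
print(avg_grad, x_final)
```

The average gradient is small even though the targets keep alternating.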
Results
The main result is a black-box oracle reduction that allows any algorithm solving BA problems to be adapted to GEQ problems without asymptotic loss in error rate. The authors also clarify the connections between GEQ, regret minimization, and calibration, showing that GEQ can achieve guarantees similar to those of these established frameworks.
Implications
The equivalence between GEQ and BA suggests that techniques and algorithms developed for one framework can be effectively applied to the other, potentially leading to improved algorithms for online optimization problems. This work may influence future research in online learning, particularly in applications requiring robust decision-making against adaptive adversaries.
Target-Aware Bandit Allocation for Scalable Surrogate Optimization in Chemical Space
Optimization
- Introduction of BOBA, a bandit-guided surrogate optimization framework.
- Elimination of full-library inference by adaptive computation allocation.
- Demonstration of the importance of uncertainty-aware bandit strategies.
- Establishment of a tunable tradeoff between optimization performance and inference cost.
Summary
The paper addresses the challenge of efficiently identifying high-utility candidates in large discrete spaces, particularly in the context of drug discovery, where the evaluation of molecular candidates is costly. The authors introduce BOBA (Bayesian Optimization with BAndits), a framework that optimizes surrogate models without the need for full-library inference. BOBA employs a multi-armed bandit approach to allocate computational resources adaptively across partitions of the action space, focusing on promising areas while maintaining exploration. The study highlights the importance of uncertainty-aware bandit strategies and meaningful partitioning of the action space. Experiments demonstrate that BOBA achieves a favorable tradeoff between optimization performance and inference cost, enabling effective candidate selection from ultra-large libraries. The findings suggest that as library sizes increase, the efficiency of BOBA becomes more pronounced, paving the way for scalable virtual screening in drug discovery.
Methodology
The methodology involves a combination of structure-aware partitioning of the action space, bandit-based allocation of computational resources across these partitions, and surrogate-guided optimization within each partition. The approach decouples global and local search to enhance efficiency in candidate selection.
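The allocation step can be sketched with a standard UCB1 bandit, where each partition is an arm and the surrogate score of the drawn candidate is the reward. UCB1 is substituted here as a familiar uncertainty-aware strategy; the summary does not say which bandit rule BOBA uses. The function name `ucb_allocate`, the exploration constant `c`, and the toy partitions are hypothetical.

```python
import math


def ucb_allocate(partitions, surrogate, budget, c=0.5):
    """Score at most `budget` candidates, picking the partition to draw from via UCB1."""
    counts = [0] * len(partitions)
    totals = [0.0] * len(partitions)
    cursors = [0] * len(partitions)   # next unscored candidate in each partition
    best = None
    for t in range(1, budget + 1):
        def ucb(i):
            if cursors[i] >= len(partitions[i]):
                return -math.inf      # partition exhausted
            if counts[i] == 0:
                return math.inf       # try every partition once
            bonus = c * math.sqrt(2 * math.log(t) / counts[i])
            return totals[i] / counts[i] + bonus

        arm = max(range(len(partitions)), key=ucb)
        if cursors[arm] >= len(partitions[arm]):
            break                     # every partition is exhausted
        candidate = partitions[arm][cursors[arm]]
        cursors[arm] += 1
        score = surrogate(candidate)  # the only surrogate call for this step
        counts[arm] += 1
        totals[arm] += score
        if best is None or score > best[0]:
            best = (score, candidate)
    return best, counts


# Toy library: candidates are their own scores; partition 1 is the promising region.
partitions = [[0.1, 0.2, 0.1, 0.15] * 5, [0.8, 0.9, 0.85, 0.95] * 5]
best, counts = ucb_allocate(partitions, surrogate=lambda s: s, budget=10)
print(best, counts)
```

Only 10 of the 40 candidates are scored, and most of that budget goes to the better partition, which is the inference saving the framework aims for.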
Results
The empirical results indicate that BOBA significantly reduces inference costs while retaining optimization performance. The framework was tested on real-world drug discovery data, showing that the tradeoff between optimization performance and inference cost becomes increasingly favorable as the size of the molecular library grows. The number of partitions was identified as a key parameter influencing the balance between inference savings and bandit regret.
Implications
The findings have significant implications for drug discovery, particularly in the context of virtual screening of ultra-large molecular libraries. BOBA provides a practical solution for efficiently navigating the vast chemical space, potentially accelerating the identification of therapeutic leads.
Quantization in Federated Learning: Methods, Challenges and Future Directions
Federated Learning
Efficient ML
- Introduces a novel taxonomy of quantization methods specific to Federated Learning.
- Analyzes the interaction between quantization and core FL behaviors such as client drift and convergence stability.
- Identifies open research gaps and provides design guidelines for deploying quantized FL.
- Establishes quantization as a critical factor affecting the efficiency and robustness of FL systems.
Summary
This paper presents a comprehensive survey on quantization techniques in Federated Learning (FL), addressing the critical challenges of communication efficiency, device heterogeneity, and non-IID data training. The authors introduce a novel taxonomy of quantization methods tailored for FL, focusing on dimensions such as client heterogeneity, aggregation consistency, and privacy integration. The paper analyzes the interplay between quantization and key FL behaviors, including client drift and convergence stability, while identifying research gaps and providing design guidelines for practitioners. The findings underscore the importance of quantization as a fundamental component that influences the performance and practicality of FL systems, particularly in resource-constrained environments like mobile and IoT devices.
Methodology
The authors conducted a systematic review of existing quantization techniques in the context of Federated Learning, categorizing them along FL-specific dimensions. They analyzed the implications of these techniques for communication efficiency, model accuracy, and system robustness, and used the review to identify gaps and propose future research directions.
Results
The survey reveals that quantization significantly enhances communication efficiency in FL by reducing the payload size of model updates. However, it also introduces challenges such as potential accuracy degradation and instability in convergence, particularly under non-IID conditions. The authors provide insights into how different quantization strategies can be optimized for various FL scenarios.
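The payload reduction comes from sending low-bit integer codes instead of full-precision floats. The sketch below shows plain uniform min-max quantization of a client update. It is a generic baseline chosen for illustration, not a method attributed to the survey, and the helper names are hypothetical.

```python
def quantize(update, bits=8):
    """Map each float in `update` to an integer code in [0, 2**bits - 1]."""
    lo, hi = min(update), max(update)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in update]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Server-side reconstruction of the approximate update."""
    return [lo + c * scale for c in codes]


update = [-0.5, 0.0, 0.25, 0.5]
codes, lo, scale = quantize(update, bits=4)
restored = dequantize(codes, lo, scale)
max_err = max(abs(a - b) for a, b in zip(update, restored))
print(codes, max_err)
```

The reconstruction error is bounded by half a quantization step, and it is this error that can accumulate into the accuracy and convergence issues noted above.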
Implications
The findings of this paper have significant implications for the design and implementation of Federated Learning systems, particularly in mobile and IoT environments where communication resources are limited. By understanding the trade-offs associated with quantization, practitioners can better deploy FL models that maintain privacy while optimizing performance and resource usage.
Topology-Informed Neural Networks for Flood Detection in Optical and Synthetic Aperture Radar Imagery
Computer Vision
Time Series
Interpretability
- Introduces Topological Data Analysis (TDA) for flood detection, enhancing interpretability and robustness.
- Utilizes the SEN12-FLOOD dataset, which includes both optical and SAR imagery for comprehensive flood monitoring.
- Achieves a notable accuracy improvement in flood detection by combining topological features with traditional neural network architectures.
- Demonstrates the effectiveness of transfer learning and a lightweight Gaussian topological embedding in improving model performance.
Summary
This paper addresses the challenge of flood detection using optical and Synthetic Aperture Radar (SAR) imagery, particularly in the context of cloud cover that obscures optical data. The authors leverage the SEN12-FLOOD dataset, which provides coregistered time series data from Sentinel-1 SAR and Sentinel-2 multispectral imagery. They introduce a novel approach that incorporates Topological Data Analysis (TDA) into neural networks to enhance flood detection accuracy. By extracting topological features from images and integrating them with traditional convolutional neural networks (CNNs) and gated recurrent units (GRUs), the authors demonstrate that these topological descriptors provide meaningful flood signals and improve interpretability. The study shows that combining topological and convolutional features leads to a significant increase in detection accuracy, achieving 98.9% compared to a baseline of 95.7%. This work highlights the potential of TDA in remote sensing applications, particularly in safety-critical scenarios where understanding model decisions is crucial.
Methodology
The authors systematically evaluate topological descriptors for flood detection by extracting features from the SEN12-FLOOD dataset. They improve upon existing GRU baselines by incorporating transfer learning and propose a lightweight Gaussian topological embedding. The topological features are integrated with convolutional features to enhance the neural network's performance.
Results
The integration of topological features with convolutional neural networks and gated recurrent units resulted in an accuracy of 98.9% for flood detection, 3.2 percentage points above the baseline accuracy of 95.7%. This demonstrates the effectiveness of combining local and global feature representations in improving flood detection systems.
Implications
The findings suggest that incorporating topological data analysis into machine learning models can enhance the interpretability and robustness of flood detection systems, which is critical for emergency response and disaster management. This approach may be applicable to other domains requiring complex data analysis and decision-making.
EVOM: Agentic Meta-Evolution of Actor-Critic Architectures for Reinforcement Learning
Reinforcement Learning
Large Language Models
Optimization
- EVOM automates the design of actor-critic architectures, addressing high evaluation costs and open-ended design challenges.
- The framework utilizes a bi-level optimization approach with an inner loop for weight training and an outer loop for architecture evolution.
- An LLM-based design agent generates and refines architecture programs, enhancing the search process.
- Experimental results show EVOM outperforms existing methods, including manual designs and LLM-guided searches.
Summary
The paper introduces EVOM, an innovative framework for automating the design of actor-critic architectures in reinforcement learning (RL). Traditional methods rely on manually designed architectures, which can be inefficient and suboptimal. EVOM addresses the challenges of high evaluation costs and the open-ended nature of architecture design by framing the architecture search as a bi-level optimization problem. The inner loop employs a low-fidelity proximal policy optimization (PPO) for training weights, while the outer loop utilizes a large language model (LLM)-based design agent to iteratively refine architecture programs. This approach allows for the generation of diverse and effective architectures without being constrained by predefined structures. Experimental results demonstrate that EVOM significantly outperforms manually designed architectures, LLM-guided random searches, and the state-of-the-art MLES method on benchmark environments such as Ant-v4 and HalfCheetah-v4. Ablation studies confirm the necessity of both the meta-evolution loop and the LLM design agent for achieving superior performance.
Methodology
The methodology involves a bi-level optimization framework where the inner loop uses low-fidelity PPO for training candidate architectures, while the outer loop employs an LLM-based design agent to evolve architecture programs through mutation and crossover. This dual-loop structure allows for efficient exploration of the architecture space.
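The two-loop structure can be sketched as follows. To keep it self-contained, random mutation stands in for the LLM-based design agent, crossover is omitted, and a closed-form toy score stands in for low-fidelity PPO training. All function names and the toy objective are hypothetical.

```python
import random


def evolve(evaluate, mutate, seed_arch, generations=20, population=4, rng=None):
    """Outer loop proposes architectures; `evaluate` plays the role of the inner loop."""
    rng = rng or random.Random(0)
    best_arch, best_score = seed_arch, evaluate(seed_arch)
    history = [best_score]
    for _ in range(generations):
        # Outer loop: propose variants of the current best architecture.
        candidates = [mutate(best_arch, rng) for _ in range(population)]
        for arch in candidates:
            score = evaluate(arch)   # inner loop: cheap, low-fidelity evaluation
            if score > best_score:
                best_arch, best_score = arch, score
        history.append(best_score)
    return best_arch, best_score, history


def toy_mutate(arch, rng):
    """Stand-in for the design agent: nudge one hidden-layer size."""
    i = rng.randrange(len(arch))
    new = list(arch)
    new[i] = max(8, new[i] + rng.choice([-16, 16]))
    return tuple(new)


def toy_evaluate(arch):
    """Stand-in for a low-fidelity training return; peaks at hidden sizes (64, 64)."""
    return -sum((h - 64) ** 2 for h in arch)


seed = (128, 32)
best_arch, best_score, history = evolve(toy_evaluate, toy_mutate, seed)
print(best_arch, best_score)
```

Replacing `toy_mutate` with a call to a program-writing agent and `toy_evaluate` with a short training run would recover the shape of the loop described above.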
Results
EVOM demonstrated superior performance compared to manually designed architectures and other baseline methods on the Ant-v4 and HalfCheetah-v4 environments. The framework's design led to improved learning stability and final performance, as confirmed by ablation studies highlighting the contributions of the meta-evolution loop and the LLM design agent.
Implications
The findings suggest that automated architecture design can significantly enhance the efficiency and effectiveness of reinforcement learning systems. EVOM's approach could be applied to other areas of machine learning where architecture optimization is critical, potentially leading to more robust and adaptable models.
fTNN: a tensor neural network for fractional PDEs
Theory
Optimization
Efficient ML
- Introduction of fTNN, a tensor neural network for solving fractional PDEs.
- Development of a deterministic integration framework for the fractional Laplacian.
- Construction of boundary-singularity-aware trial functions for better accuracy.
- Design of a spatiotemporally separable neural network for time-dependent PDEs.
Summary
The paper introduces fTNN, a deterministic tensor neural network method designed to solve fractional partial differential equations (PDEs) involving the fractional Laplacian on bounded domains. The authors focus on the fractional Poisson equation and time-dependent fractional advection-diffusion equations. The fTNN method employs a geometry-adapted integration split that divides the fractional Laplacian into three parts: a singular near field, a regular interior far field, and an analytical exterior far field. Each part is treated with specific quadrature techniques, resulting in a fully deterministic integration framework. To address low-regularity solutions, the authors construct boundary-singularity-aware trial functions and propose strategies for selecting leading exponents and evaluating loss functions based on the singularity structure. For time-dependent PDEs, a spatiotemporally separable neural network is designed to factorize the residual into low-dimensional integrals, integrated with an alternating neural network subspace optimization strategy for efficient training. Numerical experiments demonstrate that fTNN achieves high accuracy on benchmark problems, outperforming existing methods like fPINN and Monte Carlo approaches, especially in scenarios with strong boundary singularities and long-time simulations.
Methodology
The fTNN method employs a geometry-adapted integration split to decompose the fractional Laplacian into three contributions, using Gauss-Jacobi quadrature for singular integrals, Gauss quadrature for regular integrals, and deterministic angular quadrature for angular variables. Boundary-singularity-aware trial functions are constructed, and a spatiotemporally separable neural network is designed for time-dependent problems, integrated with an alternating optimization strategy.
Results
The numerical experiments indicate that the fTNN framework achieves high accuracy across various benchmarks, significantly improving upon existing methods such as fPINN and Monte Carlo baselines, particularly in cases with strong boundary singularities and during long-time simulations.
Implications
The fTNN method provides a robust framework for solving complex fractional PDEs, which could have applications in fields such as anomalous transport modeling, nonlocal diffusion processes, and other scientific computing scenarios where fractional derivatives are relevant.
Mesh-RL: Coupled subgrid reinforcement learning
Reinforcement Learning
- Mesh-RL introduces a spatial domain-decomposition framework for reinforcement learning.
- The framework enforces boundary-consistent TD updates for improved value propagation.
- Mesh-RL accelerates learning without modifying the reward function or requiring explicit planning.
- Empirical results show significant improvements in convergence speed and learning stability.
Summary
The paper introduces Mesh-RL, a novel reinforcement learning framework designed to address the challenges of slow temporal-difference (TD) reward propagation in large or sparse-reward environments. Inspired by finite element methods and domain decomposition theory, Mesh-RL partitions the environment into overlapping subgrids and enforces boundary-consistent TD updates. This approach allows for localized learning while ensuring coherent value propagation across the entire state space. Unlike hierarchical or model-based methods, Mesh-RL accelerates long-range credit assignment without altering the reward function or introducing explicit planning mechanisms. The authors evaluate Mesh-RL on hazard-dense grid-world environments with varying geometries and mesh resolutions, demonstrating that it consistently enhances convergence speed, cumulative reward, and learning stability across Q-learning, SARSA, and Dyna-Q algorithms. Higher mesh resolutions are shown to sustain exploration, prevent premature convergence, and significantly speed up value propagation to distant states. Overall, Mesh-RL presents a principled method for improving sample efficiency in sparse-reward settings by integrating boundary-consistency techniques from scientific computing into reinforcement learning.
Methodology
The methodology involves partitioning the environment into overlapping subgrids and implementing boundary-aware TD updates that propagate value information across these partitions. The authors conduct experiments on various grid-world environments to evaluate the performance of Mesh-RL compared to traditional TD learning methods.
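The partitioning step can be sketched as below: a grid world is covered by square tiles that share a band of cells, so values learned near a tile edge are visible to its neighbor. The tile size, overlap width, and function name are assumptions, and the sketch does not cover how the paper enforces consistency on the shared cells.

```python
def overlapping_subgrids(height, width, tile, overlap):
    """Return half-open boxes (row_start, row_end, col_start, col_end) covering the grid."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap

    def starts(n):
        s = list(range(0, max(n - tile, 0) + 1, stride))
        if s[-1] + tile < n:
            s.append(n - tile)   # extra tile flush with the far edge
        return s

    return [
        (r, min(r + tile, height), c, min(c + tile, width))
        for r in starts(height)
        for c in starts(width)
    ]


boxes = overlapping_subgrids(height=7, width=7, tile=4, overlap=1)


def owners(row, col):
    """Indices of all subgrids that contain the given cell."""
    return [i for i, (r0, r1, c0, c1) in enumerate(boxes)
            if r0 <= row < r1 and c0 <= col < c1]


print(boxes, owners(3, 3))
```

Cells with more than one owner form the interfaces where boundary-consistent updates would apply.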
Results
The results indicate that Mesh-RL consistently outperforms standard Q-learning, SARSA, and Dyna-Q in terms of convergence speed, cumulative reward, and overall learning stability. The framework's ability to maintain exploration and accelerate value propagation is particularly notable at higher mesh resolutions.
Implications
This work suggests that Mesh-RL can be effectively applied in environments where reward signals are sparse or delayed, potentially improving the efficiency of reinforcement learning algorithms in real-world applications such as robotics, game playing, and autonomous systems.
Discovering Millions of Interpretable Features with Sparse Autoencoders
NLP
Large Language Models
Interpretability
- Introduction of Qwen3-Instruct SAE, a comprehensive suite of Sparse Autoencoders for the Qwen3 model family.
- Layer-wise SAEs are trained at key activation sites to enhance interpretability.
- Evaluation reveals distinct sparsity-fidelity trade-offs across different layers.
- Demonstrated utility in a case study for steering model behavior.
Summary
This paper presents Qwen3-Instruct SAE, a suite of Sparse Autoencoders (SAEs) designed to extract interpretable features from the Qwen3 instruction-tuned model family, which includes models of varying sizes (1.7B, 4B, and 8B parameters). The authors address the computational challenges associated with training SAEs and the limited availability of open-source models. They systematically train layer-wise SAEs at key activation sites (residual streams, MLP outputs, and attention outputs) for the smaller models and a subset of layers for the largest model. The evaluation of these SAEs is conducted through reconstruction and model recovery metrics, revealing a trade-off between sparsity and fidelity. A case study demonstrates the practical utility of the Qwen3-Instruct SAE in steering model behavior towards refusal responses, showcasing its potential for behavioral interventions and enhancing interpretability in large language models. The release of this comprehensive suite aims to facilitate further research in feature discovery and mechanistic interpretability within the AI community.
Methodology
The authors trained Sparse Autoencoders on the Qwen3 instruction-tuned model family, focusing on three key activation sites: residual streams, MLP outputs, and attention outputs. They employed layer-wise training for the smaller models and a subset of layers for the largest model, followed by systematic evaluation using reconstruction and model recovery metrics.
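As a rough illustration of what a sparse autoencoder computes on an activation vector, here is a minimal NumPy sketch. The dimensions, the ReLU encoder, the L1 penalty, and the names `sae_forward`, `sae_loss`, and `l1_coeff` are assumptions for illustration; the summary does not say which SAE variant or training objective Qwen3-Instruct SAE uses.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU feature activations
    x_hat = f @ W_dec + b_dec                          # linear reconstruction
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse features."""
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return mse + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_model, d_feat = 16, 64            # hypothetical sizes
x = rng.normal(size=(8, d_model))   # stand-in for residual-stream activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_feat))
W_dec = rng.normal(scale=0.1, size=(d_feat, d_model))
b_enc = np.zeros(d_feat)
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat)
l0 = np.mean(np.count_nonzero(f, axis=-1))  # average number of active features
```

The sparsity-fidelity trade-off mentioned in the summary corresponds to the tension between the two terms of `sae_loss`: a larger `l1_coeff` lowers `l0` but typically raises reconstruction error.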
Results
The evaluation of Qwen3-Instruct SAE indicated distinct trade-offs between sparsity and fidelity across different layers and components. The case study demonstrated that selected SAE features could effectively steer the instruction-tuned Qwen3 models towards specific behaviors, such as refusal responses.
Implications
The findings suggest that Sparse Autoencoders can significantly enhance the interpretability of large language models, providing a foundation for future research into feature discovery and behavioral interventions. The release of Qwen3-Instruct SAE as an open-source resource will facilitate broader exploration of mechanistic interpretability in AI.
Algorithmic Foundations of Deep Learning: Complexity-Theoretic Rates and a Characterization of Universal Approximation
Theory
- Introduces a complexity-theoretic perspective on neural network expressivity.
- Establishes a circuit-to-neural-network compilation theorem linking computational complexity to neural network architecture.
- Characterizes universal approximation in feedforward neural networks based on the presence of non-affine nonlinearities.
- Demonstrates improved approximation rates for complex functions compared to classical approximation theory.
Summary
This paper addresses the limitations of classical neural network approximation theory, which primarily focuses on regularity and fails to distinguish between intuitively simple and complex functions. The authors propose a quantitative circuit-to-neural-network compilation theorem, emphasizing that neural networks should be viewed as models of computation rather than merely flexible basis functions. They demonstrate that if a function is computable by a real-valued circuit, it can also be approximated by a neural network with controlled parameters. Furthermore, the paper characterizes universal approximation in feedforward neural networks, showing that a model is universal if it includes a non-affine nonlinearity. This characterization extends beyond traditional multilayer perceptron settings. The authors illustrate their theory by providing universal approximation guarantees for various function classes and demonstrating that neural networks can emulate numerical algorithms without architecture-specific considerations. Their findings indicate that neural networks can achieve significantly improved approximation rates compared to classical methods, particularly in complex scenarios.
Methodology
The authors develop a theoretical framework that connects the complexity of functions computed by circuits to the architecture of neural networks. They analyze the expressivity of neural networks through a complexity-theoretic lens, focusing on the depth and nonlinearity of the networks. They also derive results for various function classes and approximation guarantees.
Results
The paper presents several key results, including universal approximation guarantees for continuous functions, minimax-optimal approximation rates for Besov classes, and logarithmic-error complexity for holomorphic functions. It also demonstrates that neural networks can emulate algorithms like Newton-Raphson root finding with significantly fewer parameters than traditional methods.
Implications
The findings suggest that neural networks can be more efficient in approximating complex functions than previously thought, which could lead to advancements in deep learning applications. The theoretical insights may also inform the design of more effective neural network architectures and training strategies.
A General Framework for Learning Algebraic Properties from Cayley Graphs using Graph Neural Networks
Graph Learning
- Development of a property-independent GNN framework for learning algebraic properties from Cayley graphs.
- Successful case studies on abelianity, nilpotency, and solvability of finite groups.
- Expanded dataset includes a broader range of finite groups for comprehensive evaluation.
- Demonstrates the potential of GNNs to extract significant algebraic information from graph representations.
Summary
This paper presents a novel framework that extends the application of Graph Neural Networks (GNNs) to learn algebraic properties of finite groups from their Cayley graph representations. Building on previous work that focused on predicting the solvability of finite groups, the author generalizes this approach to create a property-independent methodology applicable to various algebraic classification tasks, specifically abelianity, nilpotency, and solvability. The framework employs a consistent GNN architecture and training pipeline, allowing for the investigation of how well algebraic structures can be inferred from graph representations. The study includes a significantly expanded dataset of finite groups, drawn from multiple families, and demonstrates that the proposed framework can effectively learn and distinguish between the specified algebraic properties. The findings indicate that substantial algebraic information is encoded in Cayley graphs and can be extracted using GNNs, thus providing a foundation for future research at the intersection of group theory, graph theory, and machine learning.
Methodology
The methodology involves a four-stage learning pipeline: group generation, Cayley graph construction, property labeling, and GNN modeling. Each finite group is represented by its Cayley graph, which serves as input to a GNN classifier. The framework maintains a consistent architecture and training process across different algebraic properties.
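The first three stages of such a pipeline (group generation, Cayley graph construction, property labeling) can be sketched in plain Python for small permutation groups. The representation of group elements as tuples and all function names are assumptions made for this example; the GNN stage is omitted.

```python
def compose(p, q):
    """Return the permutation p∘q (apply q first, then p)."""
    return tuple(p[i] for i in q)

def generate_group(generators):
    """Close a set of permutation generators under composition."""
    n = len(generators[0])
    identity = tuple(range(n))
    elements, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(g, s)
            if h not in elements:
                elements.add(h)
                frontier.append(h)
    return elements

def cayley_edges(elements, generators):
    """Directed edges g -> g*s for every element g and generator s."""
    return {(g, compose(g, s)) for g in elements for s in generators}

def is_abelian(elements):
    """Label: True if every pair of elements commutes."""
    return all(compose(a, b) == compose(b, a) for a in elements for b in elements)

# Cyclic group Z_4, generated by a 4-cycle
z4_gens = [(1, 2, 3, 0)]
z4 = generate_group(z4_gens)
# Symmetric group S_3, generated by a transposition and a 3-cycle
s3_gens = [(1, 0, 2), (1, 2, 0)]
s3 = generate_group(s3_gens)
```

Here `cayley_edges(z4, z4_gens)` and `cayley_edges(s3, s3_gens)` would be the graph inputs, and `is_abelian` supplies the label for one of the three classification tasks.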
Results
The results show that the GNN framework successfully learns to distinguish between the algebraic properties of abelianity, nilpotency, and solvability from Cayley graphs. The expanded dataset allows for a more robust evaluation of the framework's effectiveness in recovering algebraic structures.
Implications
The proposed framework opens new avenues for applying graph representation learning to algebraic problems, potentially enhancing the understanding of group theory and its applications in various fields such as combinatorics and computer science.
Data-Free Reservoir Features for Efficient Long-Horizon Cold-Start Continual Learning
Computer Vision
Efficient ML
- CIRCLE introduces a data-free frozen-feature design for cold-start exemplar-free class-incremental learning.
- The model combines fixed random feature extraction with an ensemble of streaming linear discriminant analysis heads.
- CIRCLE shows superior performance at long task horizons (50-500 task splits) compared to traditional methods.
- The approach eliminates the need for replay, task-boundary information, and backbone backpropagation.
Summary
This paper addresses the challenges of cold-start exemplar-free class-incremental learning (CS-EFCIL), where models must learn new classes without replaying past data or relying on pre-trained representations. Traditional methods either train the backbone throughout the learning stream, leading to computationally expensive drift compensation, or freeze the backbone after the first task, resulting in biased features. The authors propose a novel approach called CIRCLE, which utilizes a fixed bidirectional two-dimensional reservoir feature extractor and streaming linear discriminant analysis (SLDA) heads. CIRCLE avoids the pitfalls of traditional methods by never fitting the feature extractor to image data, thus eliminating the need for backpropagation and drift estimation. The model combines multiple random reservoir instantiations into feature ensembles and averages the outputs of independent SLDA heads, allowing for a tunable bias-variance tradeoff. Experimental results demonstrate that CIRCLE is competitive on standard datasets like CIFAR-100 and ImageNet, significantly outperforming existing CS-EFCIL baselines at higher task splits while requiring less training time.
Methodology
CIRCLE employs a fixed bidirectional two-dimensional reservoir feature extractor, adapted from BiRC2D, and utilizes streaming linear discriminant analysis (SLDA) heads for classification. The model averages softmax outputs from multiple random reservoir instantiations, enabling sample-wise training without the need for replay or task-boundary information.
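The combination of a frozen random feature map with streaming LDA heads can be sketched as follows. This is not the CIRCLE implementation: the reservoir is replaced by a plain random projection with a tanh nonlinearity, and the shrinkage value, feature sizes, and toy data are assumptions.

```python
import numpy as np

class StreamingLDA:
    """Running class means plus a pooled within-class covariance, one sample at a time."""
    def __init__(self, dim, n_classes, shrinkage=1e-2):
        self.means = np.zeros((n_classes, dim))
        self.counts = np.zeros(n_classes)
        self.scatter = np.zeros((dim, dim))
        self.shrinkage = shrinkage

    def fit_one(self, z, y):
        c = self.counts[y]
        delta = z - self.means[y]
        self.scatter += (c / (c + 1.0)) * np.outer(delta, delta)  # Welford-style update
        self.means[y] += delta / (c + 1.0)
        self.counts[y] += 1

    def predict_proba(self, Z):
        n = max(self.counts.sum() - 1.0, 1.0)
        dim = self.scatter.shape[0]
        cov = (1 - self.shrinkage) * self.scatter / n + self.shrinkage * np.eye(dim)
        W = np.linalg.solve(cov, self.means.T)        # (dim, n_classes)
        b = -0.5 * np.sum(self.means.T * W, axis=0)   # per-class bias
        logits = Z @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)

def make_random_features(in_dim, out_dim, seed):
    """Fixed random feature map that is never fitted to data (stand-in for a reservoir)."""
    W = np.random.default_rng(seed).normal(scale=0.5, size=(in_dim, out_dim))
    return lambda X: np.tanh(X @ W)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
X = rng.normal(size=(400, 5)) + np.where(y[:, None] == 1, 2.0, -2.0)  # toy two-class data
extractors = [make_random_features(5, 32, seed) for seed in range(3)]
heads = [StreamingLDA(32, 2) for _ in extractors]
for x_i, y_i in zip(X[:300], y[:300]):            # sample-wise training, no replay
    for phi, head in zip(extractors, heads):
        head.fit_one(phi(x_i[None, :])[0], y_i)
# Ensemble prediction: average the softmax outputs of the independent heads
probs = np.mean([h.predict_proba(phi(X[300:])) for phi, h in zip(extractors, heads)], axis=0)
accuracy = np.mean(probs.argmax(axis=1) == y[300:])
```

Because each head only stores means, counts, and a scatter matrix, new classes can be added at any point in the stream without revisiting earlier samples.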
Results
CIRCLE outperforms strong CS-EFCIL baselines at 50, 100, and 500 task splits on datasets such as CIFAR-100, TinyImageNet, ImageNet-Subset, and ImageNet-1k. It achieves competitive performance at 10-20 task splits while training significantly faster than methods that rely on trained backbones.
Implications
The findings suggest that CIRCLE can be effectively applied in scenarios where data privacy is a concern or where it is impractical to store past data, such as in real-time learning systems or edge devices. The model's efficiency and performance at long task horizons could also benefit applications in robotics and autonomous systems.
Hallucination in World Models is Predictable and Preventable
Generative Models
Reinforcement Learning
Robotics
- Introduction of MMBench2, a large dataset for visual world modeling with ground-truth actions and rewards.
- Identification of three distinct hallucination modes in generative world models.
- Development of three predictors for detecting hallucinations without additional training.
- Implementation of a coverage-aware training method that enhances rollout fidelity.
Summary
This paper addresses the issue of hallucination in modern generative world models, which produce visually fluent but often inaccurate rollouts that deviate from ground-truth dynamics. The authors hypothesize that hallucination occurs primarily in low-coverage regions of the state-action space and can be both predicted and mitigated using lightweight data-centric signals. To validate this hypothesis, they introduce MMBench2, a comprehensive dataset comprising 427 hours of visual world modeling data across 210 tasks, complete with ground-truth actions and rewards. The study identifies three distinct modes of hallucination—perceptual, action-marginalized, and scene-diverging—each linked to different stages of the model pipeline. The authors develop three predictors that can detect these hallucinations in real-time and propose a coverage-aware sampling technique for training, along with a curiosity-driven online data collection method. These approaches enable the adaptation of a pretrained world model to new environments with minimal real-world data, demonstrating that hallucination is fundamentally a coverage issue that can be addressed through targeted data collection and training strategies.
Methodology
The authors created MMBench2, a dataset for visual world modeling, and trained a 350M-parameter world model. They characterized hallucinations into three modes and developed predictors to detect them. They employed coverage-aware sampling during training and curiosity rewards for targeted online data collection to mitigate hallucinations.
Results
The study found that hallucination in world models is largely a coverage problem, with the proposed methods allowing adaptation to unseen environments using as few as 50 real trajectories. The predictors effectively identified hallucinations, leading to improved model performance and rollout fidelity.
Implications
The findings suggest that addressing data coverage can significantly enhance the reliability of generative world models. This has implications for various applications in robotics and AI planning, where accurate predictions of future states are crucial for decision-making.
Equivariance and Augmentation for Bayesian Neural Networks
Theory
- The paper establishes a theoretical framework for understanding how data augmentation induces equivariance in variational Bayesian inference.
- Three novel symmetrization techniques are introduced to improve the equivariance properties of BNNs.
- The study shows that starting from an invariant prior allows the variational distribution to maintain its invariance during training.
- Numerical experiments validate the theoretical results, with orbit expansion outperforming baseline methods.
Summary
This paper investigates the role of symmetries in deep learning, particularly in the context of Bayesian Neural Networks (BNNs) trained with variational inference on augmented data. The authors explore the balance between imposing symmetry constraints through equivariant neural networks and learning symmetries from augmented training data. They derive conditions under which exact equivariance can be achieved and introduce three novel symmetrization techniques—geometric averaging, projection, and orbit expansion—to enhance the effects of data augmentation. Theoretical results demonstrate that when training starts from an invariant prior, the variational distribution remains invariant throughout training under mild assumptions. The authors validate their theoretical findings through extensive numerical experiments, showing that the orbit expansion method significantly outperforms baseline models in terms of both equivariance and overall performance.
Methodology
The authors analyze BNNs trained on augmented data using variational distributions from the exponential family. They derive theoretical conditions for equivariance and introduce symmetrization techniques to enhance model performance. The methodology includes extensive numerical experiments to validate the theoretical findings.
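The basic idea behind symmetrization can be shown with plain group averaging, which makes any function invariant by averaging it over the orbit of its input. This is a generic construction under an assumed symmetry group (90-degree rotations of a square image), not the paper's geometric averaging, projection, or orbit expansion, whose details the summary does not give.

```python
import numpy as np

def orbit(x):
    """All four 90-degree rotations of a square image (the C4 orbit of x)."""
    return [np.rot90(x, k) for k in range(4)]

def symmetrize(f):
    """Average f over the orbit so the result is invariant to C4 rotations."""
    return lambda x: np.mean([f(g_x) for g_x in orbit(x)], axis=0)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
f = lambda x: float(np.sum(w * x))   # arbitrary scalar model with no built-in symmetry
f_sym = symmetrize(f)

x = rng.normal(size=(4, 4))
x_rot = np.rot90(x)                  # a transformed copy of the same input
```

Evaluating `f_sym(x)` and `f_sym(x_rot)` gives the same value up to floating-point rounding, whereas `f(x)` and `f(x_rot)` generally differ.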
Results
The results indicate that the proposed symmetrization method, orbit expansion, significantly improves both the equivariance and overall performance of BNNs compared to baseline models. The theoretical bounds on equivariance error were validated through numerical experiments, confirming the effectiveness of the proposed techniques.
Implications
The findings have potential implications for various applications in deep learning, particularly in fields requiring robust models that can handle symmetries, such as medical imaging and scientific computing. The techniques developed could enhance the performance of BNNs in small data regimes, making them more applicable in real-world scenarios.
Embedding Foundation Model Predictions in Discrete-Choice Models with Structural Guarantees
Theory
- Introduces a two-stage adapter for embedding foundation model predictions in discrete-choice models.
- Preserves structural guarantees of multinomial logit models while enhancing prediction accuracy.
- Demonstrates significant accuracy improvements (up to 12.8 percentage points) across various datasets.
- Maintains cost monotonicity and produces realistic willingness-to-pay estimates.
Summary
This paper addresses the limitations of tabular foundation models in discrete-choice prediction tasks, where predictions often violate economic principles. The authors propose a two-stage adapter that incorporates foundation model predictions into a multinomial logit framework while preserving structural guarantees. In the first stage, structural coefficients are estimated using maximum likelihood with sign constraints, while in the second stage, a neural correction is applied to the foundation model's predictions. The method ensures that the marginal rate of substitution is preserved, thus providing a mathematically sound basis for value-of-time calculations. The proposed adapter demonstrates significant improvements in accuracy across multiple datasets and foundation models, while maintaining cost monotonicity and producing plausible willingness-to-pay estimates. The results indicate that the adapter can effectively bridge the gap between high predictive accuracy and the structural integrity required in economic modeling.
Methodology
The methodology involves a two-stage process: first, fitting the structural coefficients of a multinomial logit model using maximum likelihood with sign constraints; second, applying a neural correction to the foundation model's predictions while keeping the structural coefficients fixed. This approach ensures the preservation of the marginal rate of substitution and maintains the economic integrity of the model.
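A small numerical sketch shows why fixing the structural coefficients keeps the economics intact. All numbers are hypothetical, and the correction is modelled here as a per-alternative offset that does not depend on cost or time, which is an assumption about how the neural correction enters the utility.

```python
import numpy as np

def choice_probs(cost, time, beta_cost, beta_time, correction):
    """Multinomial logit probabilities with an additive correction to each utility."""
    v = beta_cost * cost + beta_time * time + correction
    v = v - v.max()                       # numerical stability
    p = np.exp(v)
    return p / p.sum()

beta_cost, beta_time = -0.10, -0.05       # hypothetical, sign-constrained to be negative
cost = np.array([5.0, 8.0, 3.0])
time = np.array([30.0, 15.0, 45.0])
correction = np.array([0.2, -0.1, 0.0])   # stand-in for the stage-two neural correction

p_base = choice_probs(cost, time, beta_cost, beta_time, correction)

cost_up = cost.copy()
cost_up[0] += 1.0                         # raise the cost of alternative 0 only
p_up = choice_probs(cost_up, time, beta_cost, beta_time, correction)

value_of_time = beta_time / beta_cost     # marginal rate of substitution, cost per time unit
```

With a negative `beta_cost`, raising an alternative's cost can only lower its choice probability, and `value_of_time` depends solely on the two structural coefficients, so the correction cannot distort it.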
Results
The proposed adapter achieved an average test accuracy improvement of 6.4 percentage points over the multinomial logit model and up to 12.8 percentage points in certain cases. It maintained 100% cost monotonicity and produced values of time consistent with established transportation economics. The accuracy gains were statistically significant across all evaluated datasets and foundation models.
Implications
The findings suggest that the proposed adapter can be widely applied in economic modeling and policy decision-making, particularly in transportation and consumer behavior contexts. By ensuring both high accuracy and structural integrity, this approach can enhance the reliability of predictions used in significant economic investments and interventions.
Decision-Aligned Evaluation of Uncertainty Quantification
Theory
- Introduces decision-alignment as a formal criterion for evaluating UQ metrics.
- Identifies that many traditional UQ metrics are misaligned with downstream decision utilities.
- Proposes prior-weighted utility metrics that better capture the value of models for decision-making.
- Demonstrates through experiments that prior-weighted metrics align more closely with real-world decision utilities.
Summary
This paper addresses the evaluation of uncertainty quantification (UQ) in machine learning, highlighting that traditional metrics such as negative log-likelihood (NLL) and expected calibration error (ECE) do not necessarily correlate with the utility of decisions made based on these uncertainties. The authors introduce a novel criterion called decision-alignment, which assesses how well UQ metrics align with downstream decision-making utilities. They demonstrate that many existing UQ metrics are either misaligned with common decision problems or reflect flawed prior beliefs about the tasks. To address these issues, the authors propose prior-weighted utility metrics, a new class of proper scoring rules designed to provide decision-aligned evaluations. Through extensive benchmark experiments and real-world case studies, they show that these new metrics consistently align with actual decision utilities, unlike conventional metrics, thereby revealing significant flaws in current UQ evaluation protocols and suggesting a principled way to enhance UQ evaluation towards decision relevance.
Methodology
The authors conducted a systematic analysis of common UQ metrics using the decision-alignment framework. They proposed prior-weighted utility metrics and validated their effectiveness through benchmark experiments in classification and regression tasks, comparing the performance of these new metrics against traditional ones.
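For reference, the two conventional metrics named in the summary can be computed alongside a simple decision utility, as in the sketch below. The ten-bin ECE, the toy predictions, and the `expected_utility` helper are assumptions for illustration; the paper's prior-weighted utility metrics are not reproduced here.

```python
import numpy as np

def nll(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

def expected_utility(probs, labels, utility):
    """Average realized utility when acting Bayes-optimally under the predicted probabilities.

    utility[a, y] is the payoff of taking action a when the true label is y.
    """
    actions = (probs @ utility.T).argmax(axis=1)
    return utility[actions, labels].mean()

probs = np.array([[0.95, 0.05], [0.65, 0.35], [0.25, 0.75], [0.15, 0.85]])
labels = np.array([0, 1, 1, 1])
ece_value = ece(probs, labels)
nll_value = nll(probs, labels)
zero_one_utility = expected_utility(probs, labels, np.eye(2))  # identity payoff = accuracy
```

Swapping `np.eye(2)` for an asymmetric payoff matrix changes which predictions matter, which is the kind of dependence on the decision problem that NLL and ECE do not see.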
Results
The results indicated that prior-weighted utility metrics reliably aligned with real downstream utilities, while conventional metrics like NLL and ECE often failed to do so. This finding underscores the inadequacy of existing UQ evaluation methods and the necessity for metrics that are directly relevant to decision-making.
Implications
The proposed decision-aligned evaluation framework has significant implications for the development and deployment of UQ methods in safety-critical applications. By ensuring that UQ metrics are aligned with decision-making utilities, practitioners can improve the reliability and effectiveness of machine learning models in real-world scenarios.
Finding Stationary Points by Comparisons
Optimization
Theory
- Developed an algorithm for finding ϵ-stationary points using a comparison oracle with improved query complexity.
- Introduced a quantum algorithm that reduces query complexity significantly in the quantum oracle model.
- Demonstrated the relevance of the findings to practical optimization problems in machine learning.
- Identified the need for further research on lower bounds in the context of comparison-based optimization.
Summary
This paper addresses the challenge of finding stationary points of non-convex functions using a comparison oracle, which only reveals which of two points has the larger function value. The authors develop an algorithm that can find an ϵ-stationary point with Õ(n²/ϵ^1.5) queries, leveraging a subroutine to estimate the normalized Hessian. Additionally, they explore a quantum comparison oracle model, introducing a quantum algorithm that achieves the same goal with Õ(n/ϵ^1.5) queries. The significance of this work lies in its potential applications in optimization problems where traditional gradient-based methods are infeasible, particularly in scenarios involving limited feedback, such as preference-based reinforcement learning. The paper also highlights the need for further exploration of lower bounds in comparison settings, as the current understanding of dimension-dependent complexities remains limited.
Methodology
The authors propose an algorithm that utilizes a comparison oracle to determine which of two points has a larger function value, allowing for the identification of stationary points without direct function evaluations. The algorithm involves estimating the normalized Hessian and operates under both classical and quantum models, with the latter allowing for superposition queries.
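To make the oracle model concrete, the sketch below wraps a function so that only pairwise comparisons are visible and runs a naive sign-based descent on top of it. This is a toy baseline, not the paper's algorithm (it does not estimate the normalized Hessian), and the step sizes and function names are assumptions.

```python
import numpy as np

def make_oracle(f):
    """Comparison oracle: reports only whether f(x) > f(y), never the values."""
    calls = {"n": 0}
    def is_larger(x, y):
        calls["n"] += 1
        return f(x) > f(y)
    return is_larger, calls

def sign_descent(is_larger, x0, step=0.5, probe=1e-3, iters=50):
    """Descend using per-coordinate comparisons to recover gradient signs."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        signs = np.empty_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = probe
            signs[i] = 1.0 if is_larger(x + e, x - e) else -1.0
        candidate = x - step * signs
        if is_larger(x, candidate):   # accept only if the oracle prefers the candidate
            x = candidate
        else:
            step *= 0.5               # otherwise shrink the step
    return x

f = lambda x: float(np.sum(x ** 2))
x0 = np.array([3.0, -2.0, 1.5])
is_larger, calls = make_oracle(f)
x_final = sign_descent(is_larger, x0)
```

Each iteration of this baseline spends one query per coordinate plus one acceptance query, which is the kind of dimension-dependent cost the query-complexity bounds are about.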
Results
The algorithm guarantees that one of the queried points is an ϵ-stationary point with a query complexity of Õ(n²/ϵ^1.5) for the classical model and Õ(n/ϵ^1.5) for the quantum model. The results show an improved dependence on ϵ compared to existing methods, although the n² dependence raises questions about optimality in high dimensions.
Implications
The findings have significant implications for optimization in machine learning, particularly in areas where access to function evaluations is limited, such as preference-based reinforcement learning. The results could enhance the efficiency of algorithms used in various applications, including tensor decomposition and matrix completion.
Rethinking Training & Inference for Forecasting: Linking Winner-Take-All back to GMMs
Robotics
Time Series
Theory
- Identifies a fundamental mismatch in training objectives for trajectory forecasting models.
- Proposes two effective post-hoc treatments to improve mode probability assignments.
- Demonstrates that the WTA training objective can lead to over-segmentation of trajectory predictions.
- Provides a unified perspective on GMMs and K-means clustering in the context of forecasting.
Summary
This paper addresses the challenges in trajectory forecasting for autonomous driving, particularly the issue of uninformative posteriors over forecast modes that complicate mode pruning. The authors identify a mismatch between the modeling of forecasters as conditional Gaussian mixture models (GMMs) and the winner-take-all (WTA) loss used for training, which leads to over-segmentation of the trajectory space and instability in mode assignments. To rectify this, they propose two post-hoc treatments: (1) test-time posterior-weighted merging to aggregate nearby candidate trajectories, and (2) a one-step expectation-maximization (EM) update that replaces hard labels with soft responsibilities, allowing for better sharing of probability mass across neighboring modes. These methods enhance the informativeness of mode posteriors and improve forecast accuracy on standard displacement metrics without requiring retraining. The paper provides a theoretical framework linking GMMs and K-means clustering, offering insights into the strengths and weaknesses of the WTA approach in trajectory forecasting.
Methodology
The authors analyze the limitations of existing trajectory forecasting models that utilize a WTA loss for training. They introduce two corrective measures: a test-time posterior-weighted merging technique to combine similar trajectory predictions and a one-step EM update to transition from hard to soft assignments of probabilities, thereby enhancing the model's output without retraining.
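The difference between hard winner-take-all labels and soft EM responsibilities can be seen on a toy example with three candidate trajectories, two of which nearly coincide. The isotropic Gaussian with fixed `sigma`, the uniform prior, and the toy trajectories are assumptions; the paper's exact update is not given in the summary.

```python
import numpy as np

def squared_distances(y, means):
    """Squared L2 distance from trajectory y to each of the K mode trajectories."""
    return np.sum((means - y) ** 2, axis=tuple(range(1, means.ndim)))

def wta_assignment(y, means):
    """Hard winner-take-all label: one-hot on the closest mode."""
    onehot = np.zeros(len(means))
    onehot[squared_distances(y, means).argmin()] = 1.0
    return onehot

def responsibilities(y, means, log_prior, sigma=1.0):
    """E-step: soft assignment of y to K isotropic Gaussian modes."""
    log_r = log_prior - 0.5 * squared_distances(y, means) / sigma ** 2
    log_r -= log_r.max()
    r = np.exp(log_r)
    return r / r.sum()

# Three modes of shape (timesteps, 2); modes 0 and 1 are nearly identical
means = np.array([[[0.0, 0.0], [1.0, 0.0]],
                  [[0.0, 0.0], [1.0, 0.1]],
                  [[0.0, 0.0], [0.0, 5.0]]])
y = np.array([[0.0, 0.0], [1.0, 0.04]])   # observed trajectory near modes 0 and 1
log_prior = np.log(np.full(3, 1.0 / 3.0))

hard = wta_assignment(y, means)
soft = responsibilities(y, means, log_prior)
```

The hard label gives all mass to mode 0, while the soft responsibilities split it almost evenly between modes 0 and 1. As `sigma` shrinks toward zero the soft assignment approaches the hard one, which is the usual link between Gaussian mixtures and K-means.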
Results
The proposed methods yield more informative and accurately ranked mode posteriors, leading to improved performance on displacement metrics in trajectory forecasting tasks. The enhancements are achieved without the need for retraining the underlying models, demonstrating the effectiveness of the proposed corrections.
Implications
The findings suggest that aligning training objectives with probabilistic inference can significantly enhance the performance of trajectory forecasting models in autonomous driving applications. This work may influence future research directions in motion forecasting and improve the safety and reliability of autonomous vehicle systems.
Error-Conditioned Neural Solvers
Optimization
Theory
Efficient ML
- ENS utilizes the PDE residual as an input rather than an optimization target, enabling better error correction.
- The framework achieves significant improvements in prediction accuracy, especially in ill-conditioned scenarios.
- ENS demonstrates robustness to initialization and generalizes effectively under distribution shifts.
- Theoretical and empirical analysis shows that existing hybrid methods can yield inaccurate solutions despite low residuals.
Summary
The paper introduces Error-Conditioned Neural Solvers (ENS), a novel framework for solving partial differential equations (PDEs) that addresses the limitations of traditional neural surrogate models. These models often treat PDE solving as a statistical regression problem, leading to issues with constraint violations and poor extrapolation capabilities. Existing hybrid methods attempt to incorporate physical correctness by minimizing the PDE residual but suffer from high computational costs and instability. The authors demonstrate that a low PDE residual can be an unreliable indicator of reconstruction accuracy, particularly in ill-conditioned systems. ENS improves upon these methods by using the PDE residual as a direct input to the network, allowing it to learn an update policy that iteratively corrects its predictions based on the spatial structure of its own errors. The framework shows significant improvements in prediction accuracy across various PDE families, achieving up to a 10x improvement in turbulent Kolmogorov flow scenarios while maintaining lower computational costs compared to hybrid methods. ENS also exhibits robustness to initialization and generalizes well under distribution shifts, making it particularly effective in challenging conditions where traditional methods struggle.
Methodology
The ENS framework incorporates the PDE residual field as a direct input to the neural network at each iteration, allowing the model to learn how to correct its predictions based on its own error structure. This approach avoids the pitfalls of minimizing the residual directly and is trained under reconstruction supervision, leading to iterative refinement of the solution without explicit numerical optimization.
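The control flow of residual-conditioned refinement can be sketched on a one-dimensional Poisson problem. The learned network is replaced here by a hand-written damped-Jacobi correction, so `update_policy` is only a placeholder, and the grid size and iteration count are arbitrary assumptions.

```python
import numpy as np

n = 32
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)
# Discrete 1-D Poisson operator for -u'' = f with zero boundary values
A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h ** 2
f = np.pi ** 2 * np.sin(np.pi * x)

def residual(u):
    """PDE residual field of the current guess."""
    return f - A @ u

def update_policy(u, r):
    """Placeholder for the learned network: a damped-Jacobi correction from the residual."""
    return (2.0 / 3.0) * r / np.diag(A)

u = np.zeros(n)
history = []
for _ in range(200):
    r = residual(u)                  # the residual is an input, not a loss
    history.append(np.linalg.norm(r))
    u = u + update_policy(u, r)      # refine the prediction from its own error field
```

In ENS the correction is learned under reconstruction supervision rather than fixed in advance; the sketch only shows where the residual enters the loop.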
Results
ENS outperforms existing methods across four families of PDEs, achieving the highest prediction accuracy in most settings, with improvements of up to an order of magnitude in challenging cases. The framework also maintains low PDE residuals alongside low reconstruction errors, demonstrating its effectiveness in ill-conditioned regimes where traditional methods falter.
Implications
The development of ENS has significant implications for the field of computational physics and engineering, providing a more efficient and accurate approach to solving PDEs. Its ability to generalize under distribution shifts and improve accuracy in ill-conditioned scenarios could enhance simulations in various applications, including fluid dynamics and material science.
Structure Before Collapse: Transient semantic geometry in next-token prediction
NLP
Large Language Models
Theory
- Neural Collapse theory predicts that one-hot supervision erases latent semantic structure, yet language models learn such structure.
- The authors apply Representational Similarity Analysis (RSA) to track semantic structure in learned representations.
- Three synthetic languages are identified where models recover structured semantic geometry despite one-hot supervision.
- The semantic geometry is transient, emerging early in training before collapsing to a symmetric ETF.
Summary
This paper investigates the phenomenon of Neural Collapse (NC) in the context of next-token prediction in language models, where one-hot labels are predominantly used for training. The authors highlight an apparent contradiction: NC theory suggests that one-hot supervision should erase latent semantic structure from model representations, yet language models learn structured semantic geometry even under one-hot regimes. They introduce controlled synthetic languages with known latent structures and analyze how models trained with gradient descent recover semantic geometry early in training before eventually collapsing to the symmetric state predicted by NC theory. The findings reveal that this semantic structure is transient, emerging early in training and disappearing as the model converges to a symmetric equiangular tight frame (ETF). The paper also proposes a simplified mathematical model to explain this transient alignment and uses Representational Similarity Analysis (RSA) to track the evolution of learned representations.
Methodology
The authors created controlled synthetic toy languages with known latent semantic structures and trained transformer models on these languages. They tracked how representations evolve during gradient-descent training and used Representational Similarity Analysis (RSA) to assess the presence of latent semantic structures.
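A minimal version of RSA, together with the simplex ETF that Neural Collapse predicts, might look like the sketch below. Cosine dissimilarity and Pearson correlation are assumptions chosen to keep the example dependency-free (rank correlations are also common in RSA), and the random "latent" vectors stand in for a known semantic structure.

```python
import numpy as np

def rdm(reps):
    """Representational dissimilarity matrix: 1 - cosine similarity between rows."""
    normed = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T

def rsa_score(reps_a, reps_b):
    """Pearson correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices(reps_a.shape[0], k=1)
    return np.corrcoef(rdm(reps_a)[iu], rdm(reps_b)[iu])[0, 1]

def simplex_etf(k):
    """K unit vectors with pairwise cosine -1/(K-1) (simplex equiangular tight frame)."""
    m = np.eye(k) - np.ones((k, k)) / k
    return m / np.linalg.norm(m, axis=1, keepdims=True)

rng = np.random.default_rng(0)
latent = rng.normal(size=(6, 8))                    # stand-in for a known latent structure
Q = np.linalg.qr(rng.normal(size=(8, 8)))[0]        # random orthogonal matrix
score_rotated = rsa_score(latent, latent @ Q)       # rotation preserves all cosines

etf = simplex_etf(6)
etf_cosines = (etf @ etf.T)[np.triu_indices(6, k=1)]
```

Because every off-diagonal entry of the ETF's dissimilarity matrix is identical, a fully collapsed representation carries no pairwise structure for RSA to correlate with, which is why any semantic geometry must show up before collapse.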
Results
The study found that language models recover structured semantic geometry early in training, despite being trained with one-hot labels. This geometry is transient and eventually collapses to a symmetric ETF configuration as training progresses, aligning with NC theory predictions.
Implications
The findings suggest that language models can learn meaningful semantic structure even under one-hot supervision, which may inform future model designs and training strategies. Understanding this transient phase could lead to improved performance on language tasks and offer insight into the mechanisms by which models learn language.
Batch-Invariant Spectral Intelligence for Robust and Explainable Insect Authentication
Interpretability
- Introduction of the Batch-Invariant Spectral Network (BISN) for insect authentication.
- BISN effectively suppresses batch-specific spectral variations before learning species features.
- Achieved a mean accuracy of 0.93 in classifying insect species across different batches.
- Utilized explainable AI to confirm model reliance on relevant biochemical absorption regions.
Batch-Invariant Spectral Intelligence for Robust and Explainable Insect Authentication
Summary
This paper addresses the challenge of reliable species authentication for edible insects, which is crucial for ensuring food safety and regulatory compliance. The authors introduce the Batch-Invariant Spectral Network (BISN), an innovative framework that combines a learnable preprocessing module with an entropy-regularised adversarial objective to mitigate batch-specific spectral variations in near-infrared (NIR) spectroscopy data. Unlike traditional methods that adapt after feature extraction, BISN suppresses batch effects before learning species-specific features, enhancing robustness against production batch variations. The framework was evaluated using 2,700 spectral measurements from three insect species across different production batches, achieving a mean leave-one-batch-out accuracy of 0.93, significantly outperforming existing methods. Additionally, explainable AI techniques were employed to demonstrate that the model's decisions are consistently based on relevant biochemical features, linking predictive performance to known insect biochemistry. The BISN framework thus provides a promising solution for automated insect species authentication in industrial settings, ensuring both accuracy and interpretability.
Methodology
The BISN framework integrates a learnable preprocessing module initialized with Savitzky–Golay filtering and employs an entropy-regularised adversarial objective to eliminate batch-specific information from spectral data. This approach contrasts with traditional domain adaptation methods by addressing batch effects prior to feature extraction.
Results
The BISN framework demonstrated a mean leave-one-batch-out accuracy of 0.93 (standard deviation 0.04) on spectral data from three insect species, outperforming the strongest baseline by 4% (p < 10^-6). Explainable AI insights indicated that the model's predictions were consistently based on lipid and protein absorption regions.
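The leave-one-batch-out protocol behind the reported mean and standard deviation can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and the use of a population standard deviation is an assumption, since the summary does not say which estimator was used.

```python
def leave_one_batch_out(batch_ids):
    """Yield (held_out_batch, train_indices, test_indices) per batch."""
    for held_out in sorted(set(batch_ids)):
        train = [i for i, b in enumerate(batch_ids) if b != held_out]
        test = [i for i, b in enumerate(batch_ids) if b == held_out]
        yield held_out, train, test

def mean_and_std(values):
    """Mean and population standard deviation of per-fold scores."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5

# Each sample is tagged with the production batch it came from.
splits = list(leave_one_batch_out(["a", "a", "b", "c", "c"]))
```

Each fold tests on a batch the model never saw during training, so the score measures robustness to batch shift rather than in-batch fit.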
Implications
The findings suggest that BISN can significantly enhance the reliability of automated insect species authentication in industrial food processing, contributing to food safety, allergen control, and regulatory compliance. The model's interpretability also aids in understanding the biochemical basis of species discrimination.
Designing Reward Signals for Portable Query Generation: A Case Study in Industrial Semantic Job Search
Reinforcement Learning
NLP
Optimization
- Introduces a novel RLAIF framework for generating portable job search queries.
- Identifies and addresses the issue of reward-hacking leading to verbatim copying.
- Demonstrates that robust reward shaping is more impactful than the choice of RL optimizer.
- Implements a rule-based reward floor to improve query generation quality.
Designing Reward Signals for Portable Query Generation: A Case Study in Industrial Semantic Job Search
Summary
This paper addresses the challenges of generating effective job search queries in low-bandwidth interfaces, such as those used in job-search platforms like LinkedIn. The authors propose an end-to-end Reinforcement Learning from AI Feedback (RLAIF) framework aimed at producing portable job search queries that abstract away seeker-specific identifiers while maintaining generalizable qualifications. The study identifies a significant issue with reward signals in RLAIF, where optimization can lead to undesirable behaviors, such as verbatim copying of input profiles. Through empirical experiments, the authors demonstrate that robust reward shaping is crucial for performance, overshadowing the choice of optimization algorithm. They introduce a deterministic rule-based reward floor to mitigate the verbatim copying issue, leading to a notable improvement in query quality. The findings emphasize the importance of reward signal design in RLAIF systems, suggesting that effective reward engineering can significantly enhance the performance of query generation models.
Methodology
The authors developed an RLAIF framework that focuses on reward signal formulation rather than model architecture. They conducted empirical experiments to evaluate the impact of different optimization mechanics and reward shaping strategies, including the introduction of a deterministic rule-based reward floor to prevent verbatim copying.
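One way a deterministic rule-based reward floor against verbatim copying could look is sketched below. The n-gram overlap rule, the threshold, the floor value, and all names are assumptions for illustration; the summary does not describe the authors' actual rule.

```python
def ngram_set(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def copy_ratio(profile, query, n=3):
    """Fraction of the query's n-grams that appear verbatim in the profile."""
    query_ngrams = ngram_set(query.lower().split(), n)
    if not query_ngrams:
        return 0.0
    profile_ngrams = ngram_set(profile.lower().split(), n)
    return len(query_ngrams & profile_ngrams) / len(query_ngrams)

def shaped_reward(model_reward, profile, query, max_copy=0.5, floor=0.0):
    """Override the learned reward with a fixed floor when copying is detected."""
    if copy_ratio(profile, query) > max_copy:
        return floor
    return model_reward

profile = "senior data engineer at acme corp building spark pipelines"
copied = shaped_reward(0.9, profile, profile)
portable = shaped_reward(0.8, profile, "data engineering roles using spark")
```

Because the rule is deterministic, the policy cannot exploit it the way it can exploit a learned reward model: a copied query receives the floor regardless of what the reward model scores it.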
Results
The study found that the performance of critic-free optimizers was heavily influenced by the robustness of the reward shaping. The introduction of a rule-based reward floor resulted in a +0.147 quality improvement in query generation. Overall, the training-time reward model was shown to inflate performance gains by 2.4x, highlighting the critical role of reward engineering in RLAIF systems.
Implications
The findings suggest that effective reward signal design is essential for improving the quality of automated query generation systems in job search platforms. This has implications for enhancing user experience and ensuring equitable access to job opportunities, particularly for underrepresented groups in the job market.
Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA
Multimodal
Large Language Models
Computer Vision
- Proposes a novel training-based framework for verbalized confidence calibration in Medical VQA.
- Introduces a composite loss function that combines multiple calibration and regularization techniques.
- Demonstrates a reduction in calibration error of over 60% and an improvement in discrimination of over 26% across benchmarks.
- Validates the method on two different model architectures, showing robustness and effectiveness.
Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA
Summary
This paper addresses the issue of overconfidence in multimodal large language models (MLLMs) used for Medical Visual Question Answering (VQA). Traditional verbalized confidence calibration methods, designed for text-only models, fail to account for the multimodal nature of medical image understanding. The authors propose a novel training-based framework that fine-tunes MLLMs to enhance their calibration. This framework employs a composite loss function that integrates a Brier-style calibration term, an anchor regularizer to prevent extreme confidence values, a contrastive image-text alignment term, and a KL-based model stabilization term. The alignment signal is derived from a 2 × 2 factorial perturbation design that assesses the model's dependence on visual versus textual inputs. The methodology is validated across three Medical VQA benchmarks and two model architectures, demonstrating significant improvements in calibration and discrimination while maintaining predictive accuracy. The results indicate that the proposed method outperforms existing calibration techniques, confirming the necessity of each loss function component through ablation studies.
Methodology
The authors developed a training-based framework that fine-tunes MLLMs using a composite loss function. This function includes a Brier-style calibration term, an anchor regularizer, a contrastive alignment term from a factorial perturbation design, and a KL divergence regularizer to stabilize the model during training. The perturbation design assesses the model's reliance on visual and textual inputs.
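A simplified version of such a composite loss might be sketched as below. The weights, the anchor value of 0.5, and the squared-distance form of the anchor term are assumptions, and the contrastive alignment term is omitted because the summary does not specify it.

```python
import math

def brier(confidence, correct):
    """Brier-style term: squared gap between stated confidence and outcome."""
    return (confidence - float(correct)) ** 2

def anchor_penalty(confidence, anchor=0.5):
    """Penalize confidence values that drift far from an anchor (assumed form)."""
    return (confidence - anchor) ** 2

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def composite_loss(confidence, correct, p_model, p_reference,
                   w_anchor=0.1, w_kl=0.1):
    """Weighted sum of calibration, anchor, and stabilization terms."""
    return (brier(confidence, correct)
            + w_anchor * anchor_penalty(confidence)
            + w_kl * kl_divergence(p_model, p_reference))

same = [0.5, 0.5]
overconfident_wrong = composite_loss(0.99, False, same, same)
cautious_wrong = composite_loss(0.60, False, same, same)
```

In this sketch a wrong answer stated with 0.99 confidence is penalized more than the same wrong answer stated with 0.60 confidence, which is the behavior a calibration objective is meant to encourage.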
Results
The proposed method reduced calibration error by over 60% and improved discrimination by over 26% across three Medical VQA benchmarks. It outperformed prompting-based, sampling-based, and other training-based approaches, and ablation experiments confirmed that each component of the composite loss function contributes to the result.
Implications
The findings suggest that improved verbalized uncertainty calibration can enhance the reliability of MLLMs in clinical applications, allowing clinicians to better assess model-generated suggestions and focus verification efforts on uncertain cases. This advancement could lead to more efficient and trustworthy medical AI systems.
Recovering Governing Equations from Solution Data: Identifiability Bounds for Linear and Nonlinear ODEs
Theory
- Introduces Hausdorff distance as a metric for comparing ODEs, capturing worst-case separation of solutions.
- Establishes identifiability bounds for linear and nonlinear ODEs, detailing conditions for unique identification.
- Provides quantitative analysis of sample complexity, determining the number of observations needed for reliable recovery.
- Fills a significant gap in the theoretical understanding of learning governing equations from solution data.
Recovering Governing Equations from Solution Data: Identifiability Bounds for Linear and Nonlinear ODEs
Summary
This paper addresses the challenge of recovering governing ordinary differential equations (ODEs) from observed solution data, a critical issue in scientific machine learning. The authors highlight the lack of theoretical foundations regarding the unique and stable identification of ground-truth ODEs from multiple observations. To bridge this gap, they introduce the Hausdorff distance as a metric for comparing differential equations, which captures the worst-case separation between solution trajectories. The paper establishes identifiability bounds for a broad class of ODEs, including linear and nonlinear equations with Lipschitz continuous vector fields. By analyzing the Hausdorff distance, the authors derive sample complexity bounds that quantify the number of solution observations required for reliable recovery of governing equations. The findings provide a theoretical basis for understanding when uniqueness in identifying governing equations is guaranteed and offer insights into the complexity of learning tasks in this domain.
Methodology
The authors formalize the identification problem by measuring the distance between ODEs using the Hausdorff distance on their solution sets. They derive upper and lower bounds on this distance for various classes of ODEs and analyze the implications for sample complexity and metric entropy, which quantifies the richness of the ODE classes in terms of Hausdorff distance.
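For intuition, the Hausdorff distance between two solution sets can be approximated on finitely many sampled trajectories. The sketch below assumes scalar linear ODEs x' = rate * x, a sup-norm over a time grid, and a small set of initial values; these are illustrative choices, not the paper's construction.

```python
import math

def sup_distance(u, v):
    """Sup-norm distance between two trajectories on the same time grid."""
    return max(abs(a - b) for a, b in zip(u, v))

def hausdorff(set_a, set_b):
    """Hausdorff distance between two finite sets of trajectories."""
    def directed(xs, ys):
        # Worst case over xs of the distance to the closest element of ys.
        return max(min(sup_distance(x, y) for y in ys) for x in xs)
    return max(directed(set_a, set_b), directed(set_b, set_a))

def linear_solutions(rate, initial_values, times):
    """Closed-form solutions x(t) = x0 * exp(rate * t) of x' = rate * x."""
    return [[x0 * math.exp(rate * t) for t in times] for x0 in initial_values]

times = [0.0, 0.5, 1.0]
fast = linear_solutions(-1.0, [1.0, 2.0], times)
slow = linear_solutions(-0.5, [1.0, 2.0], times)
gap = hausdorff(fast, slow)
```

A strictly positive distance means the two equations produce distinguishable solution sets on this grid, while a distance of zero means the sampled data cannot tell them apart.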
Results
The paper presents specific identifiability bounds for different classes of ODEs, including linear and Lipschitz ODEs. It also provides examples of lower and upper bounds on the Hausdorff distance, illustrating the conditions under which distinct equations can be distinguished from solution data. The analysis reveals the necessary sample sizes for reliably recovering governing equations.
Implications
The findings have significant implications for scientific machine learning, particularly in fields where understanding governing equations is crucial, such as physics and engineering. The theoretical foundations laid out in this work can guide future research on learning differential equations and enhance the development of algorithms for system identification.
Multipath Adaptive Gated Bottleneck Latent ODE with Raman Data Fusion for Cell Culture Process Forecasting
Time Series
- Introduction of a novel adaptive framework for bioprocess forecasting.
- Combination of GB-Latent ODE and MP-JIT-FT for generating multiple plausible future trajectories.
- Integration of Raman spectroscopy data to enhance model training and forecasting accuracy.
- Demonstrated superior performance on real-world bioreactor data compared to traditional methods.
Multipath Adaptive Gated Bottleneck Latent ODE with Raman Data Fusion for Cell Culture Process Forecasting
Summary
This paper addresses the challenges of forecasting mammalian cell culture processes, which are critical for biopharmaceutical manufacturing. The authors propose a novel adaptive framework that combines a Gated Bottleneck Latent Ordinary Differential Equation (GB-Latent ODE) with Multi-Path Just-In-Time Fine-Tuning (MP-JIT-FT). This approach allows for the generation of multiple plausible future trajectories based on historical data, rather than a single averaged forecast. The GB-Latent ODE enhances the standard Latent ODE by incorporating learnable gating mechanisms and a bottleneck structure to better handle high-dimensional sparse inputs. Additionally, the framework integrates Raman spectroscopy data through a machine-learning soft sensor, which enriches the sparse offline measurements, improving the robustness of the model. The proposed method was evaluated on 38 fed-batch bioreactor runs across 14 different conditions, demonstrating superior performance compared to a global Latent ODE baseline on 8 out of 9 target variables. The results indicate that the multi-path forecasting approach is particularly beneficial when early trajectory patterns diverge, and the Raman data fusion significantly enhances model performance when early dynamics are indicative of later behavior.
Methodology
The methodology involves the development of a Gated Bottleneck Latent ODE that incorporates variable-wise gating and a mask-aware bottleneck to manage high-dimensional sparse inputs. The Multi-Path Just-In-Time Fine-Tuning retrieves similar historical trajectories, clusters them into candidate regimes, and fine-tunes separate models for each regime to produce multiple forecasts. Additionally, Raman spectroscopy data is fused into the model to create pseudo-observations that enhance the training process.
Results
The proposed framework achieved the best average rank on forecasting accuracy and outperformed a global Latent ODE baseline on 8 out of 9 target variables across 38 fed-batch bioreactor runs. The analysis using local-divergence metrics indicated that the multi-path forecasting approach was most effective when early trajectory patterns diverged, while Raman data fusion significantly improved performance when early dynamics were representative of later behavior.
Implications
The findings suggest that the adaptive framework can significantly improve forecasting accuracy in biopharmaceutical manufacturing, allowing for timely interventions in cell culture processes. This could lead to more efficient production cycles and reduced risks of off-specification batches, ultimately enhancing the quality and reliability of biopharmaceutical products.
Cross-Head Attention Uplift Network with Inverse Propensity Score under Unobserved Confounding
Theory
- Introduction of CHAUN for improved uplift modeling through attention mechanisms.
- Theoretical proof of ITE identifiability with true propensity scores despite unobserved confounding.
- RA-IPS method to optimize propensity weights and mitigate selection bias.
- Empirical validation showing up to 25.6% improvement in QINI scores over state-of-the-art models.
Cross-Head Attention Uplift Network with Inverse Propensity Score under Unobserved Confounding
Summary
This paper addresses the challenges of uplift modeling, which is essential for estimating individual treatment effects (ITE) in the presence of unobserved confounding variables. The authors propose the Cross-Head Attention Uplift Network (CHAUN) and a Robust Adversarial Inverse Propensity Score (RA-IPS) method to enhance the model's ability to leverage inter-group similarities and mitigate biases from unobserved confounders. CHAUN utilizes shared feature embeddings and cross-head attention mechanisms to create treatment-specific and control-specific representations, thereby improving the modeling of inter-group correlations. The theoretical foundation of the paper demonstrates that access to true propensity scores can ensure ITE identifiability, even when unobserved confounders are present. In practical scenarios where true propensity scores are unavailable, RA-IPS optimizes propensity weights adversarially within constrained uncertainty sets to reduce bias. The effectiveness of CHAUN and RA-IPS is validated through experiments on public datasets and a production e-commerce dataset, showing significant improvements in QINI scores compared to existing uplift models.
Methodology
The methodology involves the development of the Cross-Head Attention Uplift Network (CHAUN) which employs shared feature embeddings and attention mechanisms to create dynamic treatment-specific and control-specific representations. Additionally, the Robust Adversarial Inverse Propensity Score (RA-IPS) method is introduced to optimize propensity weights in the presence of unobserved confounders.
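For reference, the standard inverse propensity score estimate of an average treatment effect, which methods such as RA-IPS build on, can be sketched as follows. This shows plain IPS only; the adversarial weight optimization is not reproduced, and the function name is hypothetical.

```python
def ips_uplift(treated, outcomes, propensities):
    """Horvitz-Thompson style estimate of the average treatment effect.

    treated[i] is 1 if unit i received treatment, else 0.
    propensities[i] is the probability that unit i is treated.
    """
    total = 0.0
    for t, y, e in zip(treated, outcomes, propensities):
        if t:
            total += y / e            # reweight treated outcomes
        else:
            total -= y / (1.0 - e)    # reweight control outcomes
    return total / len(outcomes)

# Randomized toy data: treated units convert, control units do not.
effect = ips_uplift([1, 1, 0, 0], [1.0, 1.0, 0.0, 0.0], [0.5] * 4)
```

The estimate is only as good as the propensities fed into it, which is why misspecified propensities under unobserved confounding motivate a robust variant.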
Results
The experiments conducted on public datasets (CRITEO-UPLIFT, LAZADA) and a production e-commerce dataset demonstrate that CHAUN significantly outperforms existing uplift models, achieving relative improvements of up to 25.6% in QINI scores. Furthermore, RA-IPS enhances robustness, outperforming standard IPS by 5.4% under conditions of unobserved confounding.
Implications
The proposed methods have significant implications for real-world applications in causal inference tasks, particularly in fields such as marketing, healthcare, and any domain where treatment effects need to be estimated accurately despite the presence of unobserved confounders.
Epiphany-Aware KV Cache Eviction Without the Attention Matrix
Large Language Models
Efficient ML
NLP
- EPIKV scores tokens based on internal representation changes rather than attention weights, improving eviction quality.
- The method allows for a 16× longer feasible context length compared to traditional attention-based scoring methods.
- EPIKV matches or exceeds the performance of existing attention-based baselines on MATH-500 and AIME-2024 benchmarks.
- The approach runs up to 2.8× faster than attention-based eviction methods at equal budget.
Epiphany-Aware KV Cache Eviction Without the Attention Matrix
Summary
This paper addresses the limitations of existing key-value (KV) cache eviction methods in reasoning models, particularly those that rely on attention weights, which can be noisy and memory-intensive. The authors introduce a novel cache eviction method called EPIKV, which utilizes an 'epiphany score' to evaluate the importance of tokens based on changes in the model's internal representation during the forward pass, without the need for an attention matrix. This approach allows for a significant increase in feasible context length during inference, scaling up to 16 times longer than traditional methods. The authors demonstrate that EPIKV matches or exceeds the performance of leading attention-based methods on benchmark tasks while also being faster and more memory-efficient. The findings suggest that EPIKV can effectively enhance the deployment of reasoning models in production environments.
Methodology
The authors developed EPIKV by analyzing the changes in hidden states at specific layers of a reasoning model during the forward pass. They identified two bands of layers that correlate positively and negatively with token importance, allowing them to derive a more reliable importance signal without relying on the attention matrix. The method was implemented in standard FlashAttention stacks, ensuring compatibility with existing inference architectures.
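The general idea of scoring tokens from hidden-state changes rather than attention weights can be sketched as below. The scoring formula (change in a positively correlated band minus change in a negatively correlated band), the L2 norm, and every name here are assumptions made for illustration; the summary does not give the actual epiphany score.

```python
import math

def l2_change(before, after):
    """Magnitude of the hidden-state change across a band of layers."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(before, after)))

def token_scores(positive_band, negative_band):
    """Per-token score from (hidden_in, hidden_out) pairs for each band."""
    return [l2_change(*pos) - l2_change(*neg)
            for pos, neg in zip(positive_band, negative_band)]

def tokens_to_keep(scores, budget):
    """Indices of the highest-scoring tokens, returned in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Three tokens with 2-dimensional hidden states (toy values).
positive = [([0, 0], [0, 0]), ([0, 0], [3, 4]), ([0, 0], [1, 0])]
negative = [([0, 0], [1, 0]), ([0, 0], [0, 0]), ([0, 0], [0, 0])]
scores = token_scores(positive, negative)
kept = tokens_to_keep(scores, budget=2)
```

Nothing in this sketch requires the attention matrix, which is what allows a scheme of this kind to run alongside fused attention kernels that never materialize it.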
Results
EPIKV achieved a performance score of 72% on the MATH-500 benchmark and 37% on the AIME-2024 benchmark, outperforming or matching the best attention-based methods while significantly reducing memory usage and increasing processing speed.
Implications
The findings suggest that EPIKV could be widely adopted in production environments for reasoning models, enabling more efficient memory management and faster inference times. This could lead to improved scalability and performance in applications requiring long-context reasoning.
Automating Potential-based Reward Shaping with Vision Language Model Guidance
Reinforcement Learning
NLP
Multimodal
- Introduction of VLM-PBRS, a framework that automates potential-based reward shaping using VLM feedback.
- Utilization of smaller, cost-effective VLMs to generate preference labels, reducing computational burden.
- Empirical validation showing improved sample efficiency and robustness to reward hacking in RL environments.
- Demonstration of the relationship between VLM label accuracy and learning efficiency.
Automating Potential-based Reward Shaping with Vision Language Model Guidance
Summary
This paper addresses the challenges of sparse rewards in reinforcement learning (RL) by introducing a novel framework called VLM-PBRS, which automates potential-based reward shaping (PBRS) using feedback from vision language models (VLMs). Sparse rewards can hinder exploration and learning efficiency, often leading to reward hacking when naive reward shaping is applied. PBRS offers a theoretically sound approach by preserving optimal policies while providing richer learning signals, but it typically requires a carefully designed potential function. The authors propose to learn this potential function directly from VLM-generated preferences over image pairs, thus eliminating the need for expert-designed reward shaping terms. The methodology employs smaller, computationally efficient VLMs to generate preference labels, which, despite being less accurate, still enhance learning speed. The effectiveness of VLM-PBRS is empirically validated in the Meta-World and Franka Kitchen environments, demonstrating improved sample efficiency and robustness against reward hacking. The study highlights the relationship between the accuracy of VLM preference labels and the efficiency of the learning process, marking a significant advancement in automating reward shaping in RL.
Methodology
The VLM-PBRS framework learns a potential function for PBRS by querying a lightweight vision language model for preferences over image pairs. This approach leverages the policy invariance of PBRS, allowing the use of less accurate preference labels from smaller VLMs, which are computationally efficient compared to larger models. The method is validated through experiments in specific RL environments.
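The policy invariance mentioned above follows from the form of the shaping term, F(s, s') = gamma * Phi(s') - Phi(s), whose discounted sum telescopes along any trajectory. The sketch below checks this numerically with made-up potential values; in VLM-PBRS the potential would instead be learned from VLM preferences.

```python
def shaping_term(phi_s, phi_next, gamma):
    """Potential-based shaping reward for a transition s -> s'."""
    return gamma * phi_next - phi_s

def discounted_shaping_sum(potentials, gamma):
    """Discounted sum of shaping terms along a trajectory of potentials."""
    return sum((gamma ** t) * shaping_term(potentials[t], potentials[t + 1], gamma)
               for t in range(len(potentials) - 1))

# Hypothetical potentials Phi(s_0), ..., Phi(s_3) along one trajectory.
potentials = [0.0, 0.2, 0.5, 1.0]
gamma = 0.9
total = discounted_shaping_sum(potentials, gamma)
# Telescoped form: gamma^T * Phi(s_T) - Phi(s_0).
telescoped = gamma ** 3 * potentials[-1] - potentials[0]
```

Since the total depends only on the start and end potentials, an inaccurate potential changes how fast the agent learns but not which policy is optimal, which is why noisy labels from small VLMs remain usable.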
Results
The empirical results indicate that VLM-PBRS significantly enhances sample efficiency and demonstrates robustness against reward hacking. The experiments conducted in the Meta-World and Franka Kitchen environments confirm that even with less accurate preference labels, the learning process is accelerated, showcasing the effectiveness of the proposed framework.
Implications
The findings suggest that VLM-PBRS can streamline the reward shaping process in reinforcement learning, making it more accessible and less reliant on expert knowledge. This could lead to broader applications of RL in complex environments where manual reward design is impractical.
Reasoning Quality Emerges Early: Data Curation for Reasoning Models
NLP
Large Language Models
Efficient ML
- Introduces a new method for data curation that relies on initial reasoning tokens rather than strong reasoning models.
- Demonstrates that challenging examples can be identified based on loss metrics from early reasoning tokens.
- Establishes a correlation between loss patterns and gradient similarity during fine-tuning.
- Achieves up to 1.7% performance improvement over existing baselines while being 91% more token efficient.
Reasoning Quality Emerges Early: Data Curation for Reasoning Models
Summary
This paper presents a novel approach to data curation for supervised fine-tuning (SFT) of reasoning models, particularly large language models (LLMs). The authors argue that traditional methods for curating high-quality SFT data are costly and often yield suboptimal results, as they rely on strong reasoning models to filter examples based on diversity and difficulty. Instead, the authors propose a method that identifies diverse and challenging reasoning examples using only the initial reasoning tokens. They demonstrate that difficult problems can be detected by analyzing the loss of the first 100 reasoning tokens evaluated at a perturbed checkpoint of the pretrained model. Furthermore, they establish that examples with similar loss patterns over their first 1,000 tokens induce similar gradients during training. The proposed Token-Efficient Model Perturbation (TEMP) method allows for efficient curation of SFT datasets that are both diverse and challenging, leading to improved training outcomes for reasoning models. The effectiveness of this approach is validated through extensive experiments on the Qwen2.5-7B and Llama3.1-8B models, showing significant improvements in performance while being more token efficient.
Methodology
The authors utilize a two-step process for data selection: first, they identify challenging examples by analyzing the loss of the first 100 tokens at a perturbed checkpoint. Then, they cluster examples based on the loss values of their first 1,000 tokens across several perturbed checkpoints to ensure diversity. This method, termed Token-Efficient Model Perturbation (TEMP), allows for efficient curation of a high-quality SFT dataset.
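The two-step selection can be illustrated with a small sketch. Here a greedy distance filter stands in for the clustering step, which is a deliberate simplification rather than the authors' procedure, and the data layout, threshold, and names are hypothetical.

```python
def early_loss(token_losses, n_tokens=100):
    """Mean loss over the first n_tokens reasoning tokens."""
    head = token_losses[:n_tokens]
    return sum(head) / len(head)

def profile_distance(a, b):
    """Largest gap between two loss profiles of equal length."""
    return max(abs(x - y) for x, y in zip(a, b))

def select_examples(examples, budget, min_distance):
    """Pick hard examples whose loss profiles differ from those already chosen.

    examples: list of (example_id, token_losses, loss_profile) tuples, where
    loss_profile holds losses measured at several perturbed checkpoints.
    """
    ranked = sorted(examples, key=lambda e: early_loss(e[1]), reverse=True)
    chosen = []
    for example in ranked:
        if all(profile_distance(example[2], c[2]) >= min_distance for c in chosen):
            chosen.append(example)
        if len(chosen) == budget:
            break
    return [e[0] for e in chosen]

examples = [
    ("a", [3.0, 3.0], [1.0, 1.0]),
    ("b", [2.9, 2.9], [1.0, 1.05]),
    ("c", [1.0, 1.0], [0.2, 0.1]),
    ("d", [0.5, 0.5], [0.21, 0.1]),
]
picked = select_examples(examples, budget=2, min_distance=0.5)
```

Example "b" is nearly as hard as "a" but has an almost identical loss profile, so the filter skips it in favor of the more distinct "c".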
Results
The proposed TEMP method outperforms existing baselines by up to 1.7% in performance metrics while achieving a 91% increase in token efficiency. Extensive experiments were conducted using the Qwen2.5-7B and Llama3.1-8B models on the M23K medical reasoning and OpenThoughts-Math datasets.
Implications
The findings suggest that early identification of challenging reasoning examples can significantly enhance the efficiency and effectiveness of training large language models. This has potential applications in various reasoning-intensive tasks across mathematics, programming, and scientific domains.
Explaining Temporal Graph Neural Networks via Feature-induced Information Flow
Graph Learning
Interpretability
Time Series
- Introduction of a novel Event Relevance (ER) method for explaining ETGNNs.
- Extension of the Normalized Relevance Measure (NRM) framework to handle complex neural architectures.
- Demonstration of superior qualitative and quantitative performance over existing explanation methods.
- Focus on capturing the entire information flow, including intermediate event-induced variables.
Explaining Temporal Graph Neural Networks via Feature-induced Information Flow
Summary
This paper addresses the challenge of explainability in Event-based Temporal Graph Neural Networks (ETGNNs), which are used in various applications such as social network analysis and epidemic tracing. Existing explanation methods primarily focus on the final stages of the network, neglecting the upstream information flow that is crucial for understanding long-range temporal dependencies. The authors propose a novel attribution method called Event Relevance (ER), which analyzes the entire information flow through all event-associated variables, including intermediate features that mediate interactions between nodes. This method builds on the Normalized Relevance Measure (NRM) framework, allowing for explicit quantification of information flow from event embeddings and facilitating higher-order analysis of event interactions. The authors extend the NRM framework with a modular decomposition procedure to simplify the application of their method to complex ETGNN architectures. The proposed approach is evaluated on synthetic datasets for epidemic tracing and social dynamics, as well as a real-world dataset of political events, demonstrating superior performance in generating human-interpretable explanations compared to existing methods.
Methodology
The authors developed the Event Relevance (ER) method based on the Normalized Relevance Measure (NRM) framework, which allows for the quantification of information flow through all event-associated variables. They introduced a modular decomposition procedure to simplify the relevance definition in complex ETGNN architectures, enabling a comprehensive analysis of the information flow and interactions among events.
Results
The proposed ER method consistently outperformed existing explanation approaches in both qualitative and quantitative evaluations. The experiments conducted on synthetic datasets and a real-world political event dataset showed that ER provides more human-interpretable explanations and captures the entire information flow more faithfully than previous methods.
Implications
The findings suggest that the proposed ER method can enhance the interpretability of ETGNNs in high-stakes applications, where understanding model predictions is critical. This could lead to better decision-making in fields such as public health, social dynamics, and political forecasting.
SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning
NLP
Large Language Models
Efficient ML
- Introduces Hankel Reduced-order Model (HRM) adapters for parameter-efficient fine-tuning.
- HRM outperforms LoRA variants on LongBench tasks with significant accuracy improvements.
- Demonstrates consistent performance across diverse configurations in state-tracking and language modeling.
- Provides a computationally efficient method for integrating temporal memory into frozen transformer models.
SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning
Summary
This paper investigates the effectiveness of parameter-efficient fine-tuning (PEFT) methods, focusing on state space model (SSM) adapters for tasks requiring sequential state accumulation. The author introduces the Hankel Reduced-order Model (HRM) adapter, an SSM-based residual module initialized through Balanced Truncation of empirical Hankel Gramians. The HRM adapter admits an exact FFT-based parallel scan, achieving computational efficiency comparable to Low-Rank Adaptation (LoRA) across varying context lengths. Evaluations on the Mistral-7B model show that HRM significantly outperforms LoRA variants on LongBench tasks, with substantial improvements in accuracy and ROUGE-1 scores. The HRM adapter also performs consistently better across various configurations in synthetic state-tracking and character-level language modeling tasks. Additionally, gate analysis indicates that HRM learns to modulate recurrence, presenting a robust alternative to low-rank adaptation for long-context sequence modeling.
Methodology
The paper proposes the HRM adapter, which is initialized using Balanced Truncation of Hankel Gramians. It leverages the time-invariance of the system matrix to enable efficient FFT-based computations. The HRM adapter is integrated into the architecture of pre-trained models, allowing for the addition of temporal recurrent states while keeping the backbone model frozen.
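Time-invariance is what makes the FFT route exact: a linear recurrence with fixed parameters is equivalent to a causal convolution with a fixed kernel. The sketch below demonstrates this for a diagonal single-input, single-output SSM using NumPy; the diagonal state matrix and all names are simplifying assumptions, not the HRM adapter itself.

```python
import numpy as np

def ssm_recurrence(a, b, c, u):
    """Sequential scan: h_t = a * h_{t-1} + b * u_t, y_t = c . h_t."""
    h = np.zeros_like(a)
    outputs = []
    for u_t in u:
        h = a * h + b * u_t
        outputs.append(float(c @ h))
    return np.array(outputs)

def ssm_fft(a, b, c, u):
    """Same output via causal convolution with kernel k_t = sum_i c_i a_i^t b_i."""
    length = len(u)
    powers = a[None, :] ** np.arange(length)[:, None]   # shape (length, state)
    kernel = powers @ (c * b)
    n = 2 * length                                      # zero-pad to avoid wrap-around
    y = np.fft.irfft(np.fft.rfft(kernel, n) * np.fft.rfft(u, n), n)
    return y[:length]

rng = np.random.default_rng(0)
a = np.array([0.9, 0.5, -0.3])      # diagonal state matrix (stable)
b = np.array([1.0, 0.5, 2.0])
c = np.array([0.3, -1.0, 0.7])
u = rng.standard_normal(64)
```

The recurrence takes one step per token, whereas the FFT version processes the whole sequence at once in O(L log L) time, which is the property that keeps such an adapter cheap at long context lengths.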
Results
In iso-parametric evaluations on the Mistral-7B model, HRM achieved a +34.8% relative accuracy improvement on the QuALITY task and a +71.6% relative ROUGE-1 improvement on the QMSum task compared to LoRA variants. HRM also demonstrated superior performance across 18 configurations in synthetic state-tracking and character-level language modeling tasks.
Implications
The findings suggest that HRM adapters can enhance the performance of large language models on tasks requiring long-context understanding, making them suitable for applications in natural language processing where sequential state accumulation is critical. This approach could lead to more efficient fine-tuning methods for various downstream tasks without the need for extensive retraining.
GEOALIGN: Geometric Rollout Curation for Robust LLM Reinforcement Learning
Reinforcement Learning
Large Language Models
- Introduction of GEOALIGN to address directional inconsistency in online RL for LLMs.
- GEOALIGN operates as a lightweight plug-in for rollout curation, enhancing training stability.
- The method improves performance in dialogue alignment and mathematical reasoning tasks.
- GEOALIGN demonstrates resilience against controlled reward corruption.
Summary
The paper introduces GEOALIGN, a novel approach to improve the stability and performance of online reinforcement learning (RL) for aligning large language models (LLMs) with reward signals. The authors identify a critical issue termed 'directional inconsistency,' where a small number of high-reward rollouts can lead to conflicting update directions that destabilize training. GEOALIGN addresses this by implementing a lightweight rollout curation module that operates during policy optimization. It forms preference pairs from rollouts, learns a projector to concentrate reward-ordered directions, and identifies directionally inconsistent rollouts to rectify them with stable alternatives. The method is designed to be efficient, requiring only forward passes and adding minimal computational overhead. The authors evaluate GEOALIGN on tasks involving dialogue alignment and mathematical reasoning, demonstrating significant improvements in both final performance and training stability compared to existing robust RL methods. The findings suggest that leveraging latent directional consensus can serve as an effective reliability signal for online LLM reinforcement learning.
Methodology
GEOALIGN curates rollouts in four steps: it forms within-prompt preference pairs, learns a projector to distill reward-ordered directions into a concentrated manifold, builds a batch-wise consensus prototype, and identifies directionally inconsistent rollouts and rectifies them with stable alternatives. The process requires only forward passes, which keeps the computational overhead low.
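The curation steps can be sketched roughly as follows. This is an illustration under strong assumptions, not the authors' method: the learned projector is omitted (embeddings are used as-is), the embeddings are hypothetical two-dimensional features, and the consistency score (advantage times cosine alignment with the prototype) and the zero threshold are choices made for this example only.

```python
import numpy as np

def flag_inconsistent(emb, rewards, threshold=0.0):
    """Flag rollouts whose direction disagrees with the batch consensus.

    emb: (n, d) rollout embeddings for one prompt (hypothetical features).
    rewards: (n,) scalar rewards.
    """
    n = len(rewards)
    # Within-prompt preference pairs: direction from lower- to higher-reward rollout.
    pairs = [emb[i] - emb[j]
             for i in range(n) for j in range(n)
             if rewards[i] > rewards[j]]
    if not pairs:
        return np.zeros(n, dtype=bool)  # no reward ordering, nothing to flag
    dirs = np.array(pairs)
    dirs = dirs / (np.linalg.norm(dirs, axis=1, keepdims=True) + 1e-12)
    # Consensus prototype: normalized mean of the unit preference directions.
    prototype = dirs.mean(axis=0)
    prototype = prototype / (np.linalg.norm(prototype) + 1e-12)
    # Assumed score: a high-reward rollout should sit along the prototype,
    # a low-reward rollout against it.
    centered = emb - emb.mean(axis=0)
    cos = centered @ prototype / (np.linalg.norm(centered, axis=1) + 1e-12)
    advantage = rewards - rewards.mean()
    return advantage * cos < threshold

emb = np.array([[-2.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [2.0, 0.0], [-3.0, 0.1]])
rewards = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
flags = flag_inconsistent(emb, rewards)
```

In this toy batch, reward grows along the first axis for the first four rollouts, while the last rollout has the highest reward but lies in the opposite direction, so it is the only one flagged. How flagged rollouts are replaced with stable alternatives is not shown here.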
Results
GEOALIGN outperformed several robust RL baselines, including PF-PPO, PAR, PODS, and Seed-GRPO, in both dialogue alignment and mathematical reasoning tasks. The method not only improved final performance but also reduced training oscillation, demonstrating enhanced stability under conditions of reward corruption.
Implications
The findings from this research suggest that incorporating geometric considerations into rollout curation can significantly enhance the robustness of online reinforcement learning for LLMs. This has potential applications in various domains where LLMs are deployed, particularly in scenarios involving noisy or misspecified rewards.
AIGP: An LLM-Based Framework for Long-Term Value Alignment in E-Commerce Pricing
NLP
Large Language Models
Reinforcement Learning
- AIGP integrates LLM-based reasoning with long-term business value alignment for dynamic pricing.
- The framework employs a Long-Term Value Estimator (LTVE) trained via offline reinforcement learning.
- AIGP achieved a +13.21% increase in GMV and +7.59% in ROI over traditional pricing models.
- The model provides interpretable pricing rationales, enhancing decision transparency.
Summary
The paper introduces AIGP, a novel framework designed to enhance dynamic pricing in e-commerce by leveraging Large Language Models (LLMs) for long-term value alignment. Traditional pricing models often lack interpretability and fail to utilize unstructured data effectively, leading to misalignment with long-term business objectives such as Gross Merchandise Value (GMV) and Return on Investment (ROI). AIGP addresses these issues by integrating LLMs with a Long-Term Value Estimator (LTVE) that uses offline reinforcement learning to evaluate pricing actions based on historical data. The framework employs supervised fine-tuning for knowledge distillation, ensuring efficient deployment while maintaining high-quality outputs. AIGP was validated through extensive offline evaluations and large-scale online A/B tests on Tao Factory, demonstrating significant improvements in GMV, ROI, and milestone achievement rates compared to traditional baselines, while also providing interpretable pricing rationales.
Methodology
AIGP utilizes a combination of Large Language Models (LLMs) and a Long-Term Value Estimator (LTVE) trained through offline reinforcement learning. The framework employs supervised fine-tuning for knowledge distillation to ensure efficient deployment. It generates pricing decisions by integrating structured data, domain knowledge, and textual context, while automating preference pair selection for Direct Preference Optimization (DPO).
Results
The deployment of AIGP on Tao Factory resulted in a +13.21% increase in GMV, +7.59% increase in ROI, and +8.20% improvement in milestone achievement rates over a 14-day period compared to the production baseline. The framework also provided interpretable pricing rationales.
Implications
AIGP's approach to dynamic pricing can significantly enhance e-commerce platforms' ability to align pricing strategies with long-term business goals, improve decision transparency, and effectively utilize unstructured data. This framework could be applied to various industries requiring dynamic pricing strategies, leading to more sustainable business practices.
When Does Quality-Aware Multimodal Fusion Matter? A Leakage-Safe Diagnostic for Decision-Level Dependence
Multimodal
- Introduces a leakage-safe diagnostic to assess the influence of quality scores on multimodal predictions.
- Finds that permuting reliability scores does not significantly degrade model performance in most cases.
- Demonstrates that quality-aware fusion is effective only when quality signals accurately predict the reliability of modalities.
- Highlights the importance of distinguishing between correlation and causation in multimodal system performance.
Summary
This paper investigates the role of quality-aware multimodal fusion in decision-making processes of multimodal systems. The authors propose a novel diagnostic method to determine whether reliability scores of modalities influence model predictions or merely correlate with performance. By permuting reliability scores across test examples while keeping the model and inputs fixed, they assess the impact on prediction accuracy. Experiments conducted on the StressID dataset for stress recognition and the CMU-MOSEI dataset for sentiment analysis reveal that shuffling reliability scores does not significantly affect performance, indicating minimal reliance on these scores during inference. However, in scenarios where quality signals accurately identify the correct modality, substantial performance improvements are observed. This suggests that quality-aware fusion is only beneficial when the quality estimates can reliably predict the correctness of unimodal inputs.
Methodology
The authors propose a diagnostic method that separates modality evidence, availability, and quality signals. They evaluate the model's performance by comparing predictions made with original quality scores (Clean-Q) against those made with shuffled quality scores (Broken-Q). This approach isolates the effects of quality from missingness, allowing for a clear assessment of decision-level dependence on quality signals.
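The Clean-Q versus Broken-Q comparison can be sketched as a permutation test. The code below is a hedged illustration on synthetic data, not the paper's pipeline: the softmax-weighted fusion rule, the function names, and the positive-control dataset (where the quality score always identifies the correct modality) are all assumptions made for the example.

```python
import numpy as np

def quality_weighted_predict(logits, quality):
    """Toy quality-aware fusion: softmax over quality scores weights each modality.

    logits: (n, m, c) per-modality class logits; quality: (n, m) scores.
    """
    w = np.exp(quality - quality.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    fused = (w[:, :, None] * logits).sum(axis=1)
    return fused.argmax(axis=1)

def clean_vs_broken(logits, quality, labels, n_perm=100, seed=0):
    """Accuracy with original scores vs. scores permuted across test examples."""
    rng = np.random.default_rng(seed)
    clean = float(np.mean(quality_weighted_predict(logits, quality) == labels))
    broken = []
    for _ in range(n_perm):
        perm = rng.permutation(len(labels))  # model and inputs stay fixed
        preds = quality_weighted_predict(logits, quality[perm])
        broken.append(np.mean(preds == labels))
    return clean, float(np.mean(broken))

# Positive control: exactly one modality is right, and quality says which one.
rng = np.random.default_rng(1)
n = 400
labels = rng.integers(0, 2, size=n)
good = rng.integers(0, 2, size=n)          # index of the reliable modality
idx = np.arange(n)
logits = np.zeros((n, 2, 2))
logits[idx, good, labels] = 2.0            # reliable modality votes for the true class
logits[idx, 1 - good, 1 - labels] = 2.0    # unreliable modality votes for the wrong class
quality = np.full((n, 2), -3.0)
quality[idx, good] = 3.0
clean_acc, broken_acc = clean_vs_broken(logits, quality, labels)
```

In this control the gap between `clean_acc` and `broken_acc` is large because the decision truly depends on the quality signal. A gap near zero on real data would indicate that the model's predictions barely use the scores.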
Results
The experiments show that shuffling native quality signals results in negligible changes in performance across the StressID and CMU-MOSEI datasets, despite the potential for improved routing. In contrast, positive controls demonstrate significant performance gaps when quality signals are aligned with unimodal correctness, indicating that the model's decisions rely on the quality estimates only when they accurately reflect modality reliability.
Implications
The findings suggest that multimodal systems should focus on improving the reliability of quality estimates to enhance decision-making. This work provides a framework for evaluating the effectiveness of quality-aware fusion methods and encourages further research into the conditions under which quality signals can be reliably utilized in multimodal contexts.
High-Probability PL-SGD with Markovian Noise: Optimal Mixing and Tail Dependence
Optimization
Theory
- Establishes a linear dependence on mixing time for high-probability PL-SGD, closing the gap with previous quadratic bounds.
- Introduces a lag-blocking argument to derive uniform high-probability guarantees under geometric mixing.
- Extends results to heavy-tailed Markovian gradients with a new clipped block method that addresses Markovian bias.
- Demonstrates optimality of results through matching lower bounds for both light-tailed and heavy-tailed scenarios.
Summary
This paper investigates first-order optimization methods for smooth objectives that satisfy the Polyak-Łojasiewicz (PL) condition, particularly in scenarios where gradient samples are generated by an exogenous Markov chain. The authors address a gap in existing high-probability bounds for Stochastic Gradient Descent (SGD) under Markovian noise, where previous results suggested a quadratic dependence on mixing time. By employing a lag-blocking argument, they establish a uniform high-probability guarantee with a leading stochastic term that scales linearly with mixing time. This result is shown to be optimal through a matching lower bound derived from a quadratic objective influenced by a persistent two-state Markov chain. The framework is further extended to handle heavy-tailed Markovian gradients, leading to the design of a clipped block method that effectively mitigates Markovian bias while utilizing all samples. The paper concludes by characterizing the optimal polynomial dependence on mixing time for light-tailed PL-SGD and the heavy-tail exponent in robust regimes, providing significant insights into the behavior of SGD under Markovian conditions.
Methodology
The authors utilize a lag-blocking argument to derive high-probability bounds for SGD under Markovian noise. They analyze the behavior of the algorithm through a combination of stochastic analysis and concentration inequalities, addressing the challenges posed by the adaptive nature of Markovian noise. The study also includes the design of a clipped block method for heavy-tailed gradients, ensuring that all samples are utilized effectively while controlling for bias.
Results
The main results include a uniform high-probability bound for PL-SGD that scales as $\tilde{O}(t_{\mathrm{mix}}/(k + K_0))$, demonstrating linear dependence on mixing time. Additionally, the authors establish a matching lower bound of $\Omega(\sigma^2 t_{\mathrm{mix}}/k)$ for a quadratic objective with a persistent two-state Markov chain. For heavy-tailed gradients, the proposed algorithm achieves a high-probability stochastic error of $\tilde{O}(\sigma_p^2 (t_{\mathrm{mix}}/T)^{2(p-1)/p})$, with a corresponding lower bound.
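The role of mixing time can be seen in a small simulation. The sketch below is an assumption-laden toy rather than the paper's construction: it runs SGD on the quadratic f(x) = x^2 / 2 (which satisfies the PL condition) with gradient noise driven by a symmetric two-state Markov chain, and compares a slowly mixing chain with a fast one. The step size 1/(k + 1), the noise scale, and the function name are illustrative choices.

```python
import numpy as np

def pl_sgd_markov(flip_prob, steps=2000, runs=200, sigma=1.0, seed=0):
    """Mean final suboptimality of SGD on f(x) = x^2 / 2 under two-state Markov noise.

    The chain takes values in {-1, +1} and flips with probability flip_prob,
    so a small flip_prob means a persistent, slowly mixing chain.
    """
    rng = np.random.default_rng(seed)
    s = rng.choice([-1.0, 1.0], size=runs)  # start from the stationary distribution
    x = np.ones(runs)
    for k in range(steps):
        grad = x + sigma * s                # noisy gradient sample
        x = x - grad / (k + 1)              # step size 1 / (k + 1)
        flips = rng.random(runs) < flip_prob
        s = np.where(flips, -s, s)
    return float(np.mean(0.5 * x ** 2))     # average of f(x) - f*

slow_mixing_error = pl_sgd_markov(flip_prob=0.05)
fast_mixing_error = pl_sgd_markov(flip_prob=0.5)
```

With this step size the iterate is minus the running average of the noise, so the persistent chain, whose samples are strongly correlated, leaves a visibly larger error after the same number of steps.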
Implications
The findings have significant implications for the design and analysis of optimization algorithms in scenarios where gradient samples are generated by Markov processes. This work enhances the understanding of SGD's performance under Markovian noise, which is relevant in various applications such as decentralized optimization, reinforcement learning, and online system identification.
Graph Neural Networks Applications Across Domains: All Insights You Need
Graph Learning
- GNNs have become the default model for data with relational structures, moving beyond niche applications.
- The paper categorizes twelve application domains, detailing graph construction methods and architecture performance.
- Common challenges across domains include issues with heterophily, temporal graphs, and deployment discrepancies.
- Over-smoothing, robustness, and explainability are highlighted as critical factors for GNN adoption.
Summary
This survey paper provides a comprehensive overview of Graph Neural Networks (GNNs) and their applications across various domains. It emphasizes the evolution of GNNs from a specialized technique to a standard model for relational data. The paper organizes the field around a unified design space, deriving both spectral and spatial formulations from fundamental principles and linking expressive power to the Weisfeiler-Leman hierarchy. The author examines twelve application domains, including recommendation systems, social networks, knowledge graphs, drug discovery, healthcare, computer vision, and more. For each domain, the paper discusses graph construction choices, dominant architectures, and the validity of reported gains. The survey identifies recurring challenges such as heterophily, temporal graph complexities, and the gap between top-performing models in benchmarks versus real-world deployment. Additionally, it addresses critical issues like over-smoothing, robustness, and explainability as constraints influencing GNN adoption. The paper concludes by evaluating the emerging concept of graph foundation models and their integration with language models, suggesting that while promising, the evidence for a paradigm shift remains inconclusive.
Methodology
The paper employs a survey methodology, synthesizing existing literature on GNNs and organizing findings into a coherent framework. It derives mathematical foundations and architectural principles, while also conducting a comparative analysis across various application domains to identify patterns and challenges.
Results
The survey reveals that while GNNs show promise across multiple domains, challenges such as heterophily and temporal dynamics often limit their effectiveness. It also highlights that the architectures that perform best in benchmarks do not always translate to practical deployment success.
Implications
The findings suggest that while GNNs have broad applicability, careful consideration of graph construction and architecture selection is crucial for real-world applications. The insights on challenges and constraints can guide future research and development in GNNs, particularly in enhancing their robustness and explainability.
Heavy-Ball Q-Learning with Residual Weighting Correction
Reinforcement Learning
Theory
Optimization
- Introduces a corrected heavy-ball Q-learning method with theoretical guarantees for faster convergence.
- Utilizes a switched linear system representation to analyze Q-learning dynamics.
- Establishes conditions for acceleration based on the common eigenvector of mean mappings.
- Extends the method to linear function approximation with analogous results.
Summary
This paper introduces a corrected heavy-ball Q-learning method aimed at enhancing the convergence speed of reinforcement learning (RL) algorithms. The author establishes theoretical conditions under which this method converges faster than standard Q-learning. The approach is based on a switched linear system (SLS) representation of Q-learning algorithms, utilizing the joint spectral radius (JSR) of the associated switching families to analyze the dynamics. The proposed method modifies the traditional heavy-ball Q-learning recursion to ensure that the mean mappings share a common eigenvector, which facilitates a tractable analysis and provides a certified rate of convergence. The paper also extends the findings to Q-learning with linear function approximation, maintaining similar convergence and acceleration guarantees. This novel perspective on heavy-ball momentum in Q-learning offers insights into the geometry of the algorithm and its potential for acceleration, particularly in scenarios where the active greedy policy changes along the trajectory. The analysis is primarily deterministic, though it acknowledges the possibility of stochastic implementations.
Methodology
The methodology involves modifying the heavy-ball Q-learning recursion to ensure common eigenvector dynamics, analyzed through a switched linear system framework. The joint spectral radius is employed to derive convergence and acceleration conditions.
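For orientation, a plain heavy-ball recursion on the Bellman operator looks like the sketch below. This is the uncorrected, deterministic (model-based) form on a small random MDP; it does not include the paper's residual weighting correction, and the MDP, discount factor, and momentum value are hypothetical.

```python
import numpy as np

def bellman(Q, P, R, gamma):
    """(TQ)(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') * max_a' Q(s', a')."""
    return R + gamma * P @ Q.max(axis=1)

def heavy_ball_q_iteration(P, R, gamma, beta, iters):
    """Q_{k+1} = T(Q_k) + beta * (Q_k - Q_{k-1}); beta = 0 is standard Q-value iteration."""
    Q_prev = np.zeros_like(R)
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q_next = bellman(Q, P, R, gamma) + beta * (Q - Q_prev)
        Q_prev, Q = Q, Q_next
    return Q

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
P = rng.random((n_states, n_actions, n_states))
P = P / P.sum(axis=2, keepdims=True)     # rows are transition distributions
R = rng.random((n_states, n_actions))    # rewards in [0, 1)
Q_standard = heavy_ball_q_iteration(P, R, gamma=0.8, beta=0.0, iters=500)
Q_momentum = heavy_ball_q_iteration(P, R, gamma=0.8, beta=0.05, iters=500)
gap = np.max(np.abs(Q_standard - Q_momentum))
```

The values here satisfy gamma + 2 * beta < 1, which is enough for a simple sup-norm argument to guarantee that the momentum iteration reaches the same fixed point. Whether momentum actually accelerates convergence is the question the paper's joint-spectral-radius analysis addresses; this sketch does not demonstrate it.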
Results
The paper demonstrates that the corrected heavy-ball Q-learning method can converge faster than standard Q-learning under specific conditions, providing a theoretical certificate for acceleration. The extension to linear function approximation yields similar convergence guarantees.
Implications
The findings suggest that incorporating heavy-ball momentum into Q-learning can improve learning efficiency when the paper's acceleration conditions hold, particularly in settings where the active greedy policy changes along the trajectory. This approach could lead to more effective RL algorithms in practical applications.
Localizing RL-Induced Tool Use to a Single Crosscoder Feature
NLP
Large Language Models
Reinforcement Learning
- Introduces Dedicated Feature Crosscoders (DFC) to isolate RL-specific features for tool use.
- Demonstrates a +31.1% improvement in tool correctness through RL fine-tuning.
- Identifies capability spillover, allowing frozen models to benefit from RL fine-tuning without retraining.
- Shows that steering a single A-exclusive feature can significantly enhance tool-calling performance.
Summary
This paper investigates how reinforcement learning (RL) fine-tuning alters the internal representations of language models, specifically focusing on tool use capabilities. The authors introduce Dedicated Feature Crosscoders (DFC) to isolate a compact set of RL-specific features that enhance tool-calling abilities in the Qwen2.5-3B model. Through a systematic hyperparameter sweep involving 48 crosscoder variants, they demonstrate that RL fine-tuning significantly improves tool correctness and allows for a passive transfer of tool-calling abilities to a frozen base model. The study identifies a phenomenon termed 'capability spillover,' where the frozen model's activations can achieve improved tool-correctness without additional fine-tuning. The findings suggest that the DFC architecture effectively concentrates RL-induced capabilities into a minimal feature set, enabling runtime behavioral control of agentic language models. The paper also provides evidence that specific features can be steered to maximize tool-calling performance, indicating a pathway for enhanced interpretability and control in language models.
Methodology
The authors employed a systematic hyperparameter sweep of 48 crosscoder variants, including both standard Crosscoders and DFCs, to evaluate the impact of RL fine-tuning on the Qwen2.5-3B model. They utilized a partitioned dictionary approach to enforce exclusivity among features and applied a training objective that combined mean squared error with sparsity constraints. The study involved training on a dataset of general-domain samples and ToolRL instruction-output pairs.
Results
The results indicate that the DFC architecture successfully isolates RL-induced features, leading to a +31.1% improvement in tool correctness. Additionally, the frozen base model exhibited a +6.8% increase in tool correctness due to capability spillover. Steering a single A-exclusive feature resulted in a +65.0% increase in tool correctness, demonstrating the effectiveness of feature-level interventions.
Implications
The findings suggest that DFC-based model diffing can be a powerful tool for identifying and modulating representations introduced by RL fine-tuning. This has significant implications for mechanistic interpretability and the development of agentic language models capable of controlled behavior in real-time applications.
Asymptotically Optimal Learning for Parametric Prophet Inequalities
Theory
Optimization
- Characterization of optimal full-information asymptotic competitive ratios for parametric families.
- Development of a confidence-based dynamic-programming policy for online learning.
- Achieving optimal competitive ratios using only online observations without offline samples.
- Derivation of distribution-specific convergence rates for various reward distributions.
Summary
This paper investigates learning in prophet inequalities where rewards are drawn from an exponential-type parametric family with an unknown parameter θ. The authors first characterize the optimal full-information asymptotic competitive ratio for this family, revealing that in the unbounded-support case, the limit is governed by the endpoint-growth parameter, while in the bounded-support case, it converges to 1. They propose a confidence-based dynamic-programming policy for online learning that estimates the unknown distribution parameter from online observations alone, achieving the same optimal asymptotic competitive ratio as if the parameter were known. The paper also derives distribution-specific convergence rates for canonical examples, demonstrating that the proposed policy matches the convergence rates of full-information optimal policies for exponential and bounded-support distributions, and provides a Pareto-specific convergence guarantee for heavy-tailed rewards. Numerical experiments validate the performance of the algorithm, showcasing its effectiveness in achieving asymptotic optimality without requiring offline samples.
Methodology
The authors propose a confidence-based dynamic-programming policy that first estimates the unknown distribution parameter through an exploration phase. It constructs upper confidence bounds and applies dynamic programming thresholds based on these estimates. This approach leverages the parametric structure of the reward distributions to achieve optimal performance.
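The dynamic-programming thresholds are easy to write down when the parameter is known. The sketch below assumes exponential rewards with known rate `theta` and uses the identity E[max(X, v)] = v + exp(-theta * v) / theta for v >= 0. The plug-in estimator is a simplification for illustration only; the paper's policy uses confidence bounds, which are not reproduced here, and all function names are hypothetical.

```python
import math

def exp_dp_thresholds(theta, n):
    """Backward-induction thresholds for n i.i.d. Exp(theta) rewards.

    Returns (thresholds, value): accept observation t (0-indexed) when it is at
    least thresholds[t], the expected reward from continuing past it.
    """
    values = [0.0]  # with no observations left, the continuation value is 0
    for _ in range(n):
        v = values[-1]
        values.append(v + math.exp(-theta * v) / theta)  # E[max(X, v)]
    thresholds = [values[n - 1 - t] for t in range(n)]
    return thresholds, values[n]

def estimate_rate(samples):
    """Maximum-likelihood estimate of the exponential rate (plug-in, no confidence bound)."""
    return len(samples) / sum(samples)

thresholds, value = exp_dp_thresholds(theta=1.0, n=3)
rate_hat = estimate_rate([0.5, 1.5, 1.0])
```

The thresholds decrease over time and reach zero at the final observation, which is always accepted. A learning policy in this spirit would replace `theta` with an estimate from early observations, at the cost of passing over those observations.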
Results
The paper establishes the asymptotic competitive ratios for both unbounded and bounded-support cases, with the unbounded case yielding a specific limit based on the endpoint-growth parameter. The proposed online learning policy achieves these optimal ratios using only online data, and the convergence rates for specific distributions (exponential, Pareto, and bounded-support) match those of full-information policies.
Implications
The findings suggest that it is possible to effectively learn and make optimal decisions in online settings with unknown reward distributions, which has significant implications for applications in areas such as online advertising, pricing strategies, and labor market dynamics.
A Multi-Fidelity Convolutional Autoencoder-Transfer Learning Framework for Guided-Wave-Based Damage Diagnosis Using Large Simulated and Limited Experimental Datasets
Efficient ML
Time Series
Theory
- Introduces a multi-fidelity transfer learning framework for GWSHM.
- Utilizes lightweight physics-based simulations to generate synthetic datasets.
- Achieves superior damage localization and sizing accuracy compared to CNN-based methods.
- Demonstrates strong generalization capabilities on unseen data.
Summary
This paper presents a novel multi-fidelity transfer learning framework aimed at enhancing guided wave-based structural health monitoring (GWSHM) for damage diagnosis in engineering structures. The authors address the challenge of limited labeled experimental data and the high computational cost associated with generating extensive high-fidelity simulation datasets. The proposed framework integrates lightweight physics-based simulations, convolutional autoencoder (CAE)-based deep feature learning, and a feed-forward neural network to accurately localize and size damage in plate-like structures equipped with piezoelectric transducers. A one-dimensional time-domain spectral element model is utilized to create a large synthetic dataset for pretraining, while transfer learning is employed to adapt the model to experimental data using minimal labeled samples. The results demonstrate that the CAE-based transfer learning framework significantly outperforms traditional CNN-based approaches in terms of damage localization accuracy, achieving R² scores above 0.93 for localization and 0.99 for sizing. The model also exhibits strong generalization capabilities, maintaining high predictive accuracy on unseen data, thus establishing the framework as a practical solution for real-world GWSHM applications.
Methodology
The methodology involves the integration of a one-dimensional time-domain spectral element model for generating synthetic datasets, a convolutional autoencoder for deep feature learning, and a feed-forward neural network for damage localization and sizing. Transfer learning is applied to adapt the model to experimental data with limited labeled samples.
Results
The proposed framework achieved R² scores exceeding 0.93 for damage localization and 0.99 for damage sizing, indicating excellent predictive performance. The model demonstrated high accuracy on previously unseen data, confirming its robustness and effectiveness in real-world scenarios.
Implications
The findings suggest that the multi-fidelity transfer learning framework can significantly enhance the accuracy and efficiency of damage diagnosis in structural health monitoring, making it a viable solution for practical engineering applications where data is limited.
Effective Covariance Dynamics in Solvable High-Dimensional GANs
Generative Models
Theory
Optimization
- Introduces effective covariance dynamics for multi-feature GANs with structured latent covariance.
- Establishes a solvable region in learning-rate and noise space for successful GAN training.
- Demonstrates a signal-boosting mechanism where weak coordinates can be lifted above the learnability threshold.
- Validates theoretical findings through numerical simulations and empirical experiments on benchmark datasets.
Summary
This paper investigates a solvable high-dimensional model of generative adversarial networks (GANs) where a linear generator learns from data characterized by structured latent covariance. The authors extend previous analyses of GANs by considering class-dependent, correlated, and non-zero-mean latent structures, moving beyond the conventional assumption of diagonal latent covariance. They demonstrate that the dynamics of the training process can be captured by deterministic ordinary differential equations (ODEs) governed by an effective covariance matrix. The study reveals that learning begins when the leading effective eigenvalue crosses a threshold, while full recovery requires all effective modes to remain within a specific interval determined by learning rates and noise levels. A notable finding is the signal-boosting mechanism, where low-rank correlations can enhance weak directions, making them learnable, while excessively strong correlations can destabilize the recovery process. The authors validate their theoretical findings through numerical simulations and experiments on datasets such as MNIST, FashionMNIST, and CIFAR-10, showing that informed generator covariance significantly improves alignment with the data-driven reference subspace.
Methodology
The authors derive high-dimensional macroscopic ordinary differential equations (ODEs) for GAN training, incorporating class-dependent and correlated latent structures. They analyze the stability of the training dynamics through eigenvalue analysis of the effective covariance matrix and validate their findings with numerical simulations and experiments on standard datasets.
Results
The study finds that the dynamics of GAN training can be effectively modeled using ODEs that account for structured latent covariance. The stability analysis reveals specific conditions under which learning can commence and be sustained, highlighting the importance of the effective eigenvalue in the learning process. Numerical simulations align with the theoretical predictions, and experiments demonstrate improved performance in recovering data-driven subspaces with informed generator covariance.
Implications
This work has significant implications for the design and training of GANs, particularly in scenarios involving complex data distributions with correlated features. The insights into covariance dynamics can inform better training strategies and improve the performance of GANs in generating high-quality, class-conditional samples.
How Good Can Linear Models Be for Time-Series Forecasting?
Time Series
- Ridge regression, when carefully tuned, can outperform complex models like transformers and MLPs in time-series forecasting.
- Optimal lookback periods are highly dataset-specific and often non-monotonic with respect to forecast horizons.
- Local normalization strategies consistently yield better forecasting accuracy than global normalization.
- The study reveals that different time series within the same dataset may require distinct hyperparameter settings.
Summary
This paper challenges the prevailing notion that larger model architectures, such as transformers, are necessary for effective time-series forecasting. Instead, the authors argue that significant improvements can be achieved through careful preprocessing and hyperparameter tuning of simpler models, specifically Ridge regression. The study systematically investigates the effects of context length, normalization strategies, regularization, and data augmentation across eight standard benchmarks. Key findings include the dataset-specific nature of optimal lookback periods, the advantage of local normalization over global normalization, and the variability of optimal hyperparameters across different time series within the same dataset. The optimized Ridge regression models outperform previous linear forecasting methods and exceed the performance of transformer and MLP baselines on six out of eight benchmarks. The authors also introduce SearchCast, a reproducible pipeline for hyperparameter search, which can aid future research in time-series forecasting.
Methodology
The authors employed Ridge regression as a testbed for their experiments, conducting a systematic hyperparameter search over context length, normalization strategies, regularization, and data augmentation across eight standard time-series forecasting benchmarks. This approach allowed them to analyze the impact of preprocessing choices on forecasting performance.
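A minimal version of this setup, closed-form Ridge regression on lookback windows with per-window (local) normalization, can be sketched in NumPy as follows. This is not the SearchCast pipeline: the helper names, the synthetic series, and the hyperparameter values (`lookback=48`, `horizon=12`, `alpha=1e-3`) are assumptions for illustration, and no hyperparameter search is performed.

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Slice a 1-D series into (input window, future target) pairs."""
    n = len(series) - lookback - horizon + 1
    X = np.stack([series[i:i + lookback] for i in range(n)])
    Y = np.stack([series[i + lookback:i + lookback + horizon] for i in range(n)])
    return X, Y

def local_normalize(X, Y=None):
    """Normalize each window by its own mean and std; targets reuse the window's stats."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True) + 1e-8
    Yn = None if Y is None else (Y - mu) / sd
    return (X - mu) / sd, Yn, mu, sd

def fit_ridge(X, Y, alpha):
    """Closed-form Ridge solution W = (X^T X + alpha I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def forecast(W, window):
    """Normalize one window, predict, then undo the normalization."""
    Xn, _, mu, sd = local_normalize(window[None, :])
    return ((Xn @ W) * sd + mu)[0]

t = np.arange(500, dtype=float)
series = np.sin(2 * np.pi * t / 24) + 0.01 * t   # seasonal signal with a trend
lookback, horizon = 48, 12
X, Y = make_windows(series[:400], lookback, horizon)
Xn, Yn, _, _ = local_normalize(X, Y)
W = fit_ridge(Xn, Yn, alpha=1e-3)
pred = forecast(W, series[440:488])
max_error = np.max(np.abs(pred - series[488:500]))
```

Because each window is normalized by its own statistics, the model trained on the first 400 points still forecasts well at a later level of the trend. A global normalization fitted on the training range would see test inputs outside the range it was fitted on.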
Results
The optimized Ridge regression models achieved superior performance compared to prior linear forecasting methods and surpassed transformer and MLP baselines on six out of eight datasets. The study identified that the optimal lookback period is often non-monotonic and varies significantly across different datasets, challenging conventional assumptions about time-series forecasting.
Implications
The findings suggest that simpler models, when properly tuned, can be highly effective for time-series forecasting, potentially reducing the need for complex architectures. This has implications for resource allocation in model training and the design of future forecasting systems. Additionally, the insights gained from hyperparameter optimization can inform the development of more sophisticated models.
Implementation of reinforcement learning in chemical reaction networks: application to phototaxis as curiosity-driven exploration
Reinforcement Learning
Robotics
Theory
- Integration of reinforcement learning with biochemical reaction dynamics for modeling phototaxis.
- Formulation of phototaxis as a subjective POMDP to address sensory ambiguity.
- Use of inverse reinforcement learning to derive phototactic policies from experimental data.
- Demonstration that tumbling serves as an information-acquisition strategy rather than random noise.
Summary
This paper explores the integration of reinforcement learning (RL) within the context of chemical reaction networks (CRNs) to model phototaxis in unicellular algae. The authors propose a framework that connects a Partially Observable Markov Decision Process (POMDP) with biochemical reaction dynamics, allowing navigation to be modeled as an information-driven sensorimotor process. The study emphasizes that organisms such as Chlamydomonas use the run-tumble mechanism not merely as random motor noise but as a strategy for resolving sensory ambiguity and optimizing reward through active exploration. By applying inverse reinforcement learning (IRL) to 30 recorded trajectories, the authors infer a behavioral objective consistent with observed phototactic behavior and benchmark their model against standard stochastic simulation algorithms. The results demonstrate that the proposed model reproduces empirical data on alignment-to-light distributions, highlighting the role of tumbling as an adaptive information-seeking behavior in cellular navigation. This work bridges the gap between statistical models of inference and biochemical processes, suggesting that intracellular networks can implement components of adaptive behavior.
Methodology
The authors formulated a POMDP to model phototaxis, incorporating a memoryless Bayesian update mechanism to handle hidden environmental variables. They utilized inverse reinforcement learning on experimental trajectories to infer behavioral objectives and compared the dynamics of their model with standard stochastic simulation algorithms using alignment-distribution diagnostics. The implementation was realized through CRN-ODEs to simulate the internal dynamics of the model.
Results
The proposed model successfully reproduced the empirical alignment-to-light distribution observed in Chlamydomonas trajectories, demonstrating a close match with the dynamics derived from standard stochastic simulation algorithms. The findings indicate that the run-tumble alternation is an effective strategy for information acquisition, supporting the hypothesis that biochemical networks can facilitate adaptive behavior in navigation.
Implications
This research provides insights into how simple biochemical processes can underpin complex adaptive behaviors in living organisms. It suggests potential applications in synthetic biology and the design of bio-inspired algorithms for navigation and exploration tasks in robotics and artificial intelligence.
Reinforcement Learning without Ground-Truth Solutions can Improve LLMs
Reinforcement Learning
Large Language Models
Optimization
- RiVER enables training of LLMs without ground-truth solutions using score-based optimization.
- The framework addresses challenges of scale and frequency dominance in reinforcement learning.
- Significant performance improvements were observed in both ALE ratings and exact-solution benchmarks.
- Calibrated reward shaping enhances the effectiveness of feedback in training LLMs.
Summary
This paper introduces the Ranking-induced VERifiable framework (RiVER) for training Large Language Models (LLMs) using reinforcement learning without the need for ground-truth solutions. Traditional Reinforcement Learning with Verifiable Rewards (RLVR) relies on known answers to assign rewards, which limits its effectiveness in scenarios where optimal solutions are unknown or computationally intractable. RiVER addresses this limitation by utilizing deterministic execution feedback as continuous-valued supervision for score-based optimization tasks. The authors identify two main challenges in applying group-relative RL to continuous rewards: scale dominance, where uncalibrated score magnitudes distort policy updates, and frequency dominance, where frequently sampled suboptimal solutions overshadow rare high-quality candidates. RiVER mitigates these issues through calibrated reward shaping that emphasizes top-ranked solutions while maintaining bounded feedback for other valid outputs. The framework was tested on 12 AtCoder Heuristic Contest tasks and evaluated on various benchmarks, leading to significant improvements in LLM performance. Notably, RiVER enhanced the performance of Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4% in ALE ratings, respectively, and also improved performance on exact-solution benchmarks like LiveCodeBench and USACO by 2.4% and 3.5% on average. These findings suggest that score-based optimization tasks, when combined with proper reward calibration, can effectively train LLMs in the absence of ground-truth solutions.
Methodology
RiVER employs a ranking-induced approach to reinforcement learning, utilizing deterministic execution feedback for training LLMs. It incorporates instance-wise comparisons to mitigate scale dominance and applies winner-heavy reward shaping to emphasize high-quality solutions while providing bounded feedback for other candidates.
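The idea of ranking-induced, winner-heavy shaping can be illustrated with a small sketch. This is not RiVER's actual reward function, which the summary does not specify: the function name winner_heavy_rewards, the parameters top_reward and rest_cap, and the linear decay over ranks are all assumptions made for illustration. The sketch only shows how ranking candidates within a single task instance removes the dependence on raw score magnitude while keeping feedback for non-winning valid outputs bounded.

```python
import numpy as np

def winner_heavy_rewards(scores, valid, top_reward=1.0, rest_cap=0.3):
    """Map raw scores for one task instance to bounded, rank-based rewards.

    Hypothetical shaping: the best distinct score gets `top_reward`, other
    valid candidates get a reward in (0, rest_cap) that decays with rank,
    and invalid candidates get 0.
    """
    scores = np.asarray(scores, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    rewards = np.zeros_like(scores)
    idx = np.flatnonzero(valid)
    if idx.size == 0:
        return rewards
    # Rank by distinct score values so tied candidates share a rank.
    uniq = np.unique(scores[idx])  # ascending distinct scores
    rank = (uniq.size - 1) - np.searchsorted(uniq, scores[idx])  # 0 = best
    m = uniq.size
    # Only the ordering is used, so the raw score scale cannot dominate.
    rewards[idx] = np.where(rank == 0, top_reward, rest_cap * (m - rank) / m)
    return rewards
```

Because only the within-instance ordering is used, scores of [10, 1e6, 10] and [10, 11, 10] produce identical rewards, which is the scale-invariance property the calibration step is meant to provide.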
Results
RiVER improved the ALE ratings of Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4%, respectively, and achieved average improvements of 2.4% and 3.5% on exact-solution benchmarks such as LiveCodeBench and USACO, demonstrating the effectiveness of score-based optimization in training LLMs.
Implications
The findings suggest that reinforcement learning frameworks can be adapted for tasks lacking ground-truth solutions, potentially broadening the applicability of LLMs in various optimization and algorithm design scenarios.
Sketched Linear Contrastive Learning: Approximation, Optimization, and Statistical Scaling
Theory
Optimization
Multimodal
- Introduces a theoretical framework for understanding scaling laws in contrastive learning.
- Derives a risk decomposition that clarifies the contributions of approximation, optimization, and sampling errors.
- Demonstrates that contrastive learning requires learning interactions between two views, affecting scaling behavior.
- Provides an explicit scaling law related to sketch dimension, sample size, and optimization horizon.
Summary
This paper investigates scaling laws in contrastive representation learning, specifically through a sketched linear model under a paired Gaussian latent-variable framework. The authors derive a risk decomposition that includes irreducible risk, approximation error, gradient descent (GD) bias, GD variance, and a cross term, which is controlled by bias and variance but does not affect the upper-bound scaling. The main theorem presents an explicit scaling law concerning sketch dimension, sample size, and effective optimization horizon. The findings indicate that the contrastive learning paradigm requires learning interactions between two views, which alters the scaling behavior of optimization and finite-sample noise compared to standard linear regression. This work provides foundational theoretical insights into the scaling behavior of contrastive learning, offering guidance for balancing model size, data, and computational resources.
Methodology
The authors analyze a Gaussian paired-view model and a bilinear contrastive score trained on sketched inputs. They utilize a Gaussian-negative quadratic contrastive surrogate to maintain the core contrastive structure while allowing for analytical tractability. The study employs empirical gradient descent for training and focuses on understanding the scaling behavior through a controlled Gaussian framework.
Results
The main results include the derivation of a scaling law that describes how risk decomposes into stable power-law terms influenced by model size, data size, and compute. The findings reveal that optimization and variance terms are shaped by the interactions between two views, which is a novel aspect compared to traditional linear regression.
Implications
The theoretical insights from this study can guide future research in contrastive learning, particularly in optimizing model architectures and training strategies. The scaling laws can help practitioners forecast the benefits of additional compute and data, ultimately improving the efficiency of training contrastive models.
Learning Probabilistic Filters with Strictly Proper Scoring Rules
Theory
Time Series
Optimization
- Introduction of the Proper Scoring Ensemble Filter (PSEF) for Bayesian filtering.
- Uses a transformer-based analysis map that processes forecast ensembles and observations.
- Training based on strictly proper scoring rules enhances probabilistic accuracy.
- Theoretical analysis proves that, under certain assumptions, the population objective is minimized by the true Bayesian filtering distribution.
Summary
This paper introduces the Proper Scoring Ensemble Filter (PSEF), a novel ensemble data assimilation method designed to learn the Bayesian filtering distribution from synthetic state-observation trajectories. The PSEF addresses the challenge of approximating the true filtering distribution, which is often unavailable for supervised learning, by utilizing a transformer-based analysis map that processes forecast ensembles and observations. The training employs strictly proper scoring rules, specifically the energy score, to ensure that the model is rewarded for probabilistic accuracy across the entire distribution rather than just the ensemble mean. The authors provide a theoretical foundation for the PSEF, proving that under certain assumptions, the population objective is minimized by the true Bayesian filtering distribution. They also derive the finite-ensemble empirical objective used in training and relate it to the population objective. Numerical experiments demonstrate that the PSEF effectively approximates complex filtering distributions, including nonlinear, non-Gaussian, and multi-modal posteriors, outperforming classical and other learning-based methods. The findings suggest that for Gaussian problems, a correction to the Ensemble Kalman Filter (EnKF) is beneficial, while for highly non-Gaussian scenarios, an end-to-end approach is superior.
Methodology
The PSEF employs a permutation-invariant, transformer-based analysis map to approximate the filtering distribution. The training is based on strictly proper scoring rules, specifically the energy score, allowing the model to learn from synthetic state-observation trajectories generated by the forecast model. The authors establish a theoretical framework linking the population objective to the empirical training objective.
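For readers unfamiliar with the energy score, the following sketch computes its standard empirical form for one ensemble and one reference state: the mean distance from ensemble members to the reference, minus half the mean pairwise distance between members. The function name energy_score is an assumption, and the average here runs over all member pairs including i = j; the finite-ensemble objective derived in the paper may use a different normalization.

```python
import numpy as np

def energy_score(ensemble, truth):
    """Empirical energy score (lower is better).

    ensemble: array of shape (m, d), one row per ensemble member
    truth:    array of shape (d,), the reference state
    """
    ensemble = np.asarray(ensemble, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # Accuracy term: average distance from members to the reference state.
    accuracy = np.linalg.norm(ensemble - truth, axis=1).mean()
    # Spread term: average pairwise distance between members (all m*m pairs).
    diffs = ensemble[:, None, :] - ensemble[None, :, :]
    spread = np.linalg.norm(diffs, axis=2).mean()
    return accuracy - 0.5 * spread
```

The spread term is what rewards calibrated uncertainty: an ensemble that collapses onto its mean loses the spread credit, so the score depends on the whole distribution rather than only on the ensemble mean.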
Results
The PSEF successfully approximates challenging filtering distributions, including nonlinear and multi-modal posteriors, demonstrating stronger performance in data assimilation tasks compared to classical methods and other learning-based approaches. The results indicate that the PSEF is particularly effective for non-Gaussian problems, while also providing benefits for Gaussian scenarios when corrections to the EnKF are applied.
Implications
The PSEF framework has significant implications for data assimilation in various fields, including meteorology and dynamical systems, where accurate state estimation from noisy observations is crucial. The ability to learn from synthetic data opens new avenues for applying machine learning in complex, real-world scenarios where ground-truth filtering distributions are not available.
Otter Weather: Skillful and Computationally Efficient Medium-Range Weather Forecasting
Efficient ML
Time Series
- Otter Weather democratizes high-performance weather forecasting by significantly reducing training costs.
- The deterministic model outperforms traditional NWP by 9.6% at a 24-hour lead time while using less than 3.5 A100-days for training.
- Otter-XL achieves a 9.7% improvement in probabilistic forecasting over the IFS ENS baseline with a 30 A100-day budget.
- The model demonstrates a 100-fold reduction in compute compared to resource-intensive architectures.
Summary
The paper presents Otter Weather, a novel spatiotemporal forecasting model aimed at improving medium-range weather forecasting through enhanced efficiency and accessibility. Traditional Numerical Weather Prediction (NWP) models, while accurate, require extensive computational resources, limiting their use. Otter Weather addresses this by utilizing a 2D Swin-UNet Transformer architecture that significantly reduces training costs while maintaining high predictive skill. The deterministic version of Otter outperforms the best NWP baseline by 9.6% at a 24-hour lead time, requiring less than 3.5 A100-days for training. This represents a 2x efficiency gain over existing lightweight AI models and a 100-fold reduction in compute compared to more resource-intensive architectures. The paper also extends these findings to probabilistic forecasting, where Otter-XL achieves a 9.7% improvement in Continuous Ranked Probability Score (CRPS) over the IFS ENS baseline, demonstrating nearly double the predictive skill of comparable models at similar compute budgets. The model's efficiency and performance suggest its applicability beyond weather forecasting, potentially benefiting various scientific domains.
Methodology
The authors developed Otter Weather using a 2D Swin-UNet Transformer architecture, leveraging modern advancements from language modeling and optimizing training with the Muon optimizer. The model was evaluated on ERA5 reanalysis data at 1.5° resolution using standard WeatherBench protocols.
Results
The deterministic Otter model outperformed the best NWP baseline by 9.6% at a 24-hour lead time, requiring less than 3.5 A100-days for training. Otter-XL achieved a 9.7% improvement in CRPS over the IFS ENS baseline, demonstrating nearly double the predictive skill of lightweight models at similar compute budgets.
Implications
The advancements made by Otter Weather could democratize access to high-performance weather forecasting, enabling under-resourced institutions to develop competitive models. Additionally, the model's efficiency may facilitate rapid iteration and deployment in various scientific fields.
Finding the Time to Think: Learning Planning Budgets in Real-Time RL
Reinforcement Learning
- Introduction of variable-delay real-time RL, allowing agents to choose deliberation time based on state.
- Development of a lightweight gating policy that selects state-dependent planning budgets.
- Empirical characterization of the trade-off between planning quality and inference time.
- Demonstration of the gating policy's effectiveness across multiple real-time environments.
Summary
This paper addresses the challenges of decision-making in real-time reinforcement learning (RL) environments, where the environment progresses while the agent deliberates. Standard RL formulations implicitly assume that the environment waits while the agent decides; in real-time settings that assumption fails because the environment keeps evolving during deliberation. The authors introduce a novel framework called variable-delay real-time RL, which allows the agent to choose how long to deliberate based on the current state. They propose a lightweight gating policy that operates on top of a frozen planner, enabling the agent to adaptively select planning budgets at each decision point. This approach balances the trade-off between planning quality and inference time, addressing the meta-reasoning challenges inherent in real-time decision-making. The authors validate their method across various environments, demonstrating that their gating policy outperforms fixed-budget and heuristic baselines and transfers to a real-time hardware setup without retraining.
Methodology
The authors employ a gating policy trained using Proximal Policy Optimization (PPO) on top of a frozen AlphaZero planner. This policy determines how long the agent should deliberate at each decision point, allowing for adaptive planning in real-time environments. The methodology includes empirical evaluations in various games such as Pac-Man, Tetris, and Snake, as well as clock-based environments.
Results
The gating policy significantly outperformed fixed-budget and heuristic baselines across five different environments. The empirical results confirmed that the adaptive selection of planning budgets leads to better performance in real-time decision-making scenarios. Additionally, the learned policy was successfully transferred to a real-time hardware setup, demonstrating its robustness and effectiveness.
Implications
This research has implications for developing more efficient RL agents capable of operating in real-time environments, such as robotics, gaming, and any application where timely decision-making is critical. The ability to adaptively allocate planning resources can enhance the performance of agents in dynamic settings.
Transformer-Based Classification of Bacterial Raman Spectra with LOOCV
Theory
- Transformer models significantly outperform conventional machine learning methods in classifying bacterial Raman spectra.
- The study utilized a nested leave-one-replicate-out cross-validation framework for rigorous model evaluation.
- Transformers demonstrated robust performance on raw Raman spectra without preprocessing.
- Improved class separation was observed in the latent feature space learned by the transformer model.
Summary
This study investigates the application of transformer-based models for the classification of bacterial Raman spectra, utilizing a nested leave-one-replicate-out cross-validation (LOOCV) framework. The research compares the performance of the transformer model against conventional machine learning methods, including PCA and ICA combined with classifiers such as LDA, SVM, and Random Forest. A dataset comprising 5,417 single-cell Raman spectra from six bacterial species was employed, with measurements taken across nine independent replicates. The results demonstrate that the transformer model consistently outperformed traditional methods, achieving superior classification accuracy and improved class separation in the learned latent feature space. Notably, the transformer maintained its performance even when applied directly to raw Raman spectra without preprocessing, indicating its robustness across different measurement replicates. The findings underscore the potential of transformer-based approaches in Raman spectral classification and highlight the necessity of replicate-aware validation for realistic model evaluation.
Methodology
The study employed a nested leave-one-replicate-out cross-validation framework to evaluate the transformer model's performance against traditional methods. The dataset consisted of single-cell Raman spectra from six bacterial species, with each spectrum represented by 584 Raman shift variables. Various conventional machine learning techniques, including PCA and ICA combined with LDA, SVM, and Random Forest classifiers, were also utilized for comparison.
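The splitting scheme can be sketched as follows, assuming each spectrum carries an integer or string replicate label. The function name nested_loro_splits and its return layout are illustrative assumptions, not the study's code. The outer loop holds out one replicate for testing; the inner loop holds out each remaining replicate in turn for model selection, so no spectrum from the test replicate ever influences hyperparameter choices.

```python
import numpy as np

def nested_loro_splits(replicate_ids):
    """Yield nested leave-one-replicate-out splits.

    Each item is (test_rep, outer_train_idx, test_idx, inner_splits), where
    inner_splits is a list of (inner_train_idx, inner_val_idx) pairs drawn
    only from the outer training replicates.
    """
    replicate_ids = np.asarray(replicate_ids)
    reps = np.unique(replicate_ids)
    for test_rep in reps:
        is_test = replicate_ids == test_rep
        test_idx = np.flatnonzero(is_test)
        outer_train_idx = np.flatnonzero(~is_test)
        inner_splits = []
        for val_rep in reps[reps != test_rep]:
            is_val = replicate_ids == val_rep
            inner_val_idx = np.flatnonzero(is_val)
            inner_train_idx = np.flatnonzero(~is_test & ~is_val)
            inner_splits.append((inner_train_idx, inner_val_idx))
        yield test_rep, outer_train_idx, test_idx, inner_splits
```

With nine replicates, as in the dataset described here, this yields nine outer folds with eight inner splits each.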
Results
The transformer model achieved the highest classification performance across independent test replicates, significantly outperforming all conventional approaches. It also showed improved class separation in the latent feature space compared to PCA and ICA-based representations. The model's ability to maintain performance on raw spectra without preprocessing further demonstrated its robustness.
Implications
The findings suggest that transformer-based models can enhance the classification of Raman spectra in biomedical applications, potentially leading to advancements in bacterial identification and diagnostics. The emphasis on replicate-aware validation could improve the reliability of machine learning models in practical applications.
The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators
Theory
Optimization
NLP
- Introduction of the Red Queen Gödel Machine (RQGM) for recursive self-improvement with evolving evaluators.
- Controlled utility evolution allows dynamic adaptation of evaluation criteria across epochs.
- Empirical results show RQGM improves performance in coding tasks, scientific writing, and proof grading.
- Co-evolved evaluators provide cheaper and more effective evaluation signals compared to static benchmarks.
Summary
The paper introduces the Red Queen Gödel Machine (RQGM), an innovative framework for self-improving agents that incorporates the concept of co-evolving evaluators. Unlike traditional self-improvement methods that rely on static evaluation criteria, the RQGM allows for dynamic evaluation that evolves alongside the agents. This is achieved through controlled utility evolution, where the search process is divided into epochs with fixed evaluation criteria within each epoch, while allowing the utility to change at epoch boundaries. The authors demonstrate the effectiveness of the RQGM through empirical studies in three domains: coding tasks, scientific paper writing and reviewing, and proof writing and grading. The results show that the RQGM outperforms previous state-of-the-art methods by utilizing a co-evolved code reviewer, leading to improved test pass rates and reduced token usage. Furthermore, in the context of paper writing, co-evolved agents achieved significantly higher acceptance rates, while co-evolved graders demonstrated improved accuracy in evaluations. The findings suggest that integrating evolving evaluators into the self-improvement loop can enhance performance and adaptability in complex tasks.
Methodology
The RQGM framework organizes the self-improvement process into epochs, where a fixed evaluator assesses agents within each epoch, while allowing for the utility to evolve at epoch boundaries. Co-evolution of evaluators is implemented, where evaluators improve alongside the task agents they guide. The framework is empirically tested across coding, paper writing, and proof grading tasks.
Results
The RQGM achieved a held-out pass rate of 71.7% in coding tasks, surpassing the previous state-of-the-art of 69.9%. In scientific paper writing, co-evolved agents increased acceptance rates from 21.8% to 40.5%. Co-evolved graders also showed a 9% increase in ground-truth accuracy compared to static baselines.
Implications
The RQGM framework has significant implications for the development of self-improving agents in various domains, particularly in tasks that lack direct benchmarks. It opens avenues for more adaptive and efficient evaluation methods in machine learning, potentially enhancing automated scientific discovery and reasoning tasks.