AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 62
Update frequency: 8h
Days of history: 7
Models Can Model, But Can't Bind: Structured Grounding in Text-to-Optimization
NLP
Large Language Models
Optimization
- Text-to-optimization requires both modeling and binding capabilities, with binding identified as the primary bottleneck.
- Text2Opt-Bench is introduced as a comprehensive benchmark for evaluating text-to-optimization models across diverse problem categories.
- The BIND method significantly improves model performance by allowing programmatic binding of data.
- Training binding-specific models outperforms traditional end-to-end approaches, highlighting the importance of focused training on binding tasks.
Summary
This paper addresses the challenges in text-to-optimization tasks, which require two distinct capabilities: modeling and binding. While modeling involves selecting the appropriate optimization structure, binding pertains to grounding every coefficient and parameter in the problem data. The authors introduce Text2Opt-Bench, a scalable benchmark that encompasses a variety of solver-verified optimization problems across 12 categories, including linear and nonlinear formulations. The study reveals that as instance data increases, the accuracy of models significantly declines, a phenomenon termed the 'effective binding limit.' To mitigate this issue, the authors propose BIND, an inference-time approach that externalizes numeric data to structured files, allowing models to bind data programmatically. This method notably enhances the performance of models like GPT-5-Nano and GPT-5, achieving accuracy improvements from 59.1% to 82.4% and 86.2% to 95.8%, respectively. Furthermore, the paper demonstrates that training models specifically for binding yields superior results compared to traditional end-to-end supervised fine-tuning and reinforcement learning, establishing that binding is a critical bottleneck in text-to-optimization tasks.
Methodology
The authors developed Text2Opt-Bench, a scalable benchmark of verified optimization problems, and evaluated over 10 models using this benchmark. They introduced BIND, an inference-time method that externalizes numeric data to structured files. Additionally, they conducted experiments to compare the performance of binding-specific models against traditional supervised fine-tuning and reinforcement learning approaches.
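The idea of binding data programmatically rather than copying numbers into the model text can be illustrated with a minimal sketch. This is not the authors' BIND implementation: the JSON layout, the field names, and the `bind_model` helper are all hypothetical, and the model structure (one linear objective, one resource constraint) is chosen only for illustration.

```python
import json

# Hypothetical instance data, as it might appear in an externalized file.
DATA_JSON = """
{
  "products": ["chairs", "tables"],
  "profit": {"chairs": 30.0, "tables": 50.0},
  "hours": {"chairs": 2.0, "tables": 4.0},
  "hours_available": 40.0
}
"""


def bind_model(data):
    """Fill a fixed model structure with coefficients looked up from data.

    Structure (the "modeling" part):
        maximize   sum_p profit[p] * x[p]
        subject to sum_p hours[p] * x[p] <= hours_available
    The lookups below are the "binding" part.
    """
    products = data["products"]
    objective = [data["profit"][p] for p in products]
    constraint_row = [data["hours"][p] for p in products]
    return products, objective, constraint_row, data["hours_available"]


data = json.loads(DATA_JSON)
products, c, a, b = bind_model(data)
```

Because every coefficient is read by key, the same code stays valid as the instance grows; no number has to be transcribed by hand.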
Results
The introduction of BIND led to significant accuracy improvements for models, with GPT-5-Nano's accuracy rising from 59.1% to 82.4% and GPT-5's from 86.2% to 95.8%. The study also found that a 1.5B parameter binding specialist model outperformed a 7B parameter end-to-end model across various optimization categories.
Implications
The findings suggest that enhancing binding capabilities in text-to-optimization models can lead to more effective solutions in operations research and industrial decision-making. The development of specialized models for binding may pave the way for improved performance in complex optimization tasks.
ASAP: Attention Sink Anchored Pruning
Computer Vision
Efficient ML
Multimodal
- ASAP leverages the attention sink as a geometric anchor for efficient token reduction.
- The method employs Radial Diffusion Clustering based on diffusion distances to the attention sink.
- ASAP achieves state-of-the-art performance without the need for fine-tuning.
- The framework addresses the limitations of existing token pruning methods that rely on local metrics.
Summary
The paper introduces ASAP (Attention Sink Anchored Pruning), a framework designed to address the computational cost that Vision Transformers (ViTs) incur at high resolutions due to the quadratic complexity of self-attention. Existing token reduction methods often rely on local metrics, which can preserve uninformative tokens while discarding salient features, a failure mode the paper associates with the 'attention sink.' ASAP proposes a training-free approach that models the information flow in ViTs as a Lazy Random Walk, identifying the attention sink as a significant accumulator of probability mass. By calculating the diffusion distance to this sink within a cumulative transition matrix, ASAP partitions tokens using Radial Diffusion Clustering and reduces background redundancy through Transition Weight Pooling. The framework is evaluated on image, video, and vision-language tasks, achieving up to 48% higher throughput while maintaining or exceeding baseline accuracy.
Methodology
ASAP models the information flow of ViTs as a Lazy Random Walk, treating the attention sink as a mass-accumulating attractor. It computes the diffusion distance to the sink from a cumulative transition matrix to perform Radial Diffusion Clustering, allowing for a single-shot token reduction that compresses background redundancy while preserving important foreground structures.
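A rough sketch of the distance-to-sink computation follows. It rests on several assumptions that are not from the paper: a single row-stochastic attention matrix stands in for the cumulative transition matrix, the lazy walk is taken as `(I + A) / 2`, the sink is chosen as the column with the most accumulated mass, and the diffusion distance is a plain Euclidean distance between rows of the t-step matrix. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[j] * b[j][c] for j in range(inner)) for c in range(cols)]
            for row in a]


def lazy_walk(attn):
    """Lazy random walk: stay put with probability 1/2, else follow attention."""
    n = len(attn)
    return [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def diffusion_distance_to_sink(attn, steps):
    """Return (sink index, distance of every token to the sink)."""
    p = lazy_walk(attn)
    pt = p
    for _ in range(steps - 1):
        pt = matmul(pt, p)
    n = len(pt)
    # Assumption: the sink is the token that accumulates the most mass.
    mass = [sum(pt[i][j] for i in range(n)) for j in range(n)]
    sink = max(range(n), key=lambda j: mass[j])
    dists = [math.sqrt(sum((pt[i][j] - pt[sink][j]) ** 2 for j in range(n)))
             for i in range(n)]
    return sink, dists


# Toy attention: token 0 attracts most of the attention.
attn = [[0.8, 0.1, 0.1],
        [0.6, 0.3, 0.1],
        [0.2, 0.1, 0.7]]
sink, dists = diffusion_distance_to_sink(attn, steps=2)
```

Tokens could then be grouped into bands by their distance to the sink, which is one reading of "radial" clustering; the paper's actual clustering and pooling rules are not reproduced here.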
Results
ASAP outperforms existing training-free methods on image, video, and vision-language evaluations (ImageNet-1K, Kinetics-400, and LLaVA), achieving throughput improvements of up to 48% while maintaining or exceeding baseline accuracy.
Implications
The proposed framework has significant implications for improving the efficiency of Vision Transformers, making them more viable for high-resolution tasks in real-time applications. It also opens avenues for further research into token reduction strategies that leverage global information structures.
Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift
Theory
- Introduces the Expectation Consistency Condition for confidence calibration under covariate shifts.
- Proposes Expectation Consistency Loss (ECL) for robust confidence calibration across different scenarios.
- Demonstrates that ECL maintains sample complexity comparable to existing calibration methods.
- Validates the proposed method on various datasets, showcasing its effectiveness in real-world applications.
Summary
This paper addresses the challenge of confidence calibration in classification models under covariate shift, a common scenario where the training and test data distributions differ. Traditional calibration methods assume independent and identically distributed (i.i.d.) data, which limits their effectiveness in real-world applications. The authors introduce the Expectation Consistency Condition, which provides a necessary and sufficient condition for confidence calibration under covariate shifts, revealing that such shifts do not inherently lead to miscalibration. Building on this condition, they propose the Expectation Consistency Loss (ECL), an unsupervised domain adaptation loss that supports various calibration types, including canonical, class-wise, and top-label calibration. The paper also establishes that ECL has comparable sample complexity to existing methods like Expected Calibration Error (ECE) and presents a mini-batch training scheme for efficient implementation. The effectiveness of ECL is validated through experiments on both simulated and real-world datasets exhibiting covariate shifts.
Methodology
The authors derive the Expectation Consistency Condition to identify the requirements for confidence calibration under covariate shifts. They then develop the Expectation Consistency Loss (ECL) as an unsupervised domain adaptation loss, compatible with multiple calibration strategies. A mini-batch training scheme is also introduced to facilitate efficient computation of ECL.
Results
The proposed ECL method was validated on both simulated and real-world datasets, demonstrating improved confidence calibration performance compared to traditional methods, particularly in scenarios involving covariate shifts.
Implications
The findings suggest that confidence calibration can be effectively achieved even under distribution shifts, which is crucial for applications in safety-critical fields such as medical diagnosis and autonomous systems. The ECL method can enhance the reliability of machine learning models in real-world applications where data distributions may vary.
Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals
Reinforcement Learning
Large Language Models
Optimization
- Identifies hard clipping as a major source of instability in RLVR training.
- Proposes Near-boundary Stochastic Rescue (NSR) to recover lost signals beyond clipping thresholds.
- Demonstrates that NSR outperforms deterministic gradient decay methods.
- Validates the approach across various model sizes (7B to 30B parameters) and architectures.
Summary
This paper addresses the challenges of training stability and convergence in Reinforcement Learning with Verifiable Rewards (RLVR), particularly focusing on the limitations imposed by hard clipping mechanisms in GRPO-style objectives. The authors identify that the rigid clipping decisions lead to the loss of informative signals that lie just beyond the clipping threshold, which contributes to instability during training. To mitigate this issue, they propose a novel approach called Near-boundary Stochastic Rescue (NSR), which stochastically retains tokens that fall slightly outside the clipping boundary. This method transforms the binary clipping decision into a probabilistic process, allowing for the recovery of valuable learning signals that would otherwise be discarded. The authors validate NSR through extensive experiments across various model sizes and architectures, demonstrating that it significantly enhances training stability and outperforms existing strong baselines such as DAPO and GSPO. The findings suggest that addressing the clipping decision is crucial for improving the performance of RLVR setups, and NSR serves as a minimal yet effective plug-and-play solution.
Methodology
The authors conducted a systematic analysis of GRPO-style clipping objectives to diagnose the issues caused by hard clipping. They proposed NSR, which employs stochastic sampling to retain tokens near the clipping boundary, thus allowing for the recovery of informative signals. The effectiveness of NSR was evaluated through extensive experiments involving various model sizes and architectures.
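To make the contrast between a hard clip and a probabilistic one concrete, here is a minimal sketch. The linear decay of the keep probability, the `margin` parameter, and both function names are assumptions for illustration, not the rule from the paper; the sketch also ignores the advantage sign that a real GRPO-style objective takes into account.

```python
import random


def keep_probability(ratio, eps, margin):
    """Probability of keeping a token's gradient signal.

    Inside the clip range [1 - eps, 1 + eps] the token is always kept.
    Outside it, the probability decays linearly to zero over `margin`
    (hypothetical schedule), instead of dropping to zero at the boundary.
    """
    lo, hi = 1.0 - eps, 1.0 + eps
    if lo <= ratio <= hi:
        return 1.0
    overshoot = (lo - ratio) if ratio < lo else (ratio - hi)
    return max(0.0, 1.0 - overshoot / margin)


def rescue_mask(ratios, eps=0.2, margin=0.1, seed=0):
    """Sample a keep/drop decision per token from its keep probability."""
    rng = random.Random(seed)
    return [rng.random() < keep_probability(r, eps, margin) for r in ratios]


# In range, slightly outside, far outside.
probs = [keep_probability(r, 0.2, 0.1) for r in (1.0, 1.25, 1.5)]
mask = rescue_mask([1.0, 2.0])
```

Under a hard clip, the second ratio (1.25) would always be discarded; here it survives roughly half the time.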
Results
The implementation of NSR led to substantial improvements in training stability and performance, consistently outperforming strong baselines like DAPO and GSPO. The experiments confirmed that the stochastic mechanism of NSR is more effective than deterministic approaches, providing robust optimization across different setups.
Implications
The findings of this paper have significant implications for the development of more stable and effective RLVR systems, particularly in scaling reasoning capabilities of large language models. The NSR approach can be integrated into existing RLVR frameworks to enhance their performance and reliability.
An Evidence Hierarchy for Bayesian Object Classification via OSINT-Aided Heterogeneous Sensor Fusion
Multimodal
- Introduction of a formal evidence hierarchy for sensor fusion in CBRNE threat classification.
- Integration of OSINT data to enhance contextual evidence in the classification process.
- Development of a Bayesian maximum a posteriori (MAP) classifier that utilizes all levels of evidence.
- Demonstrated robustness against clutter and prior mismatch in simulated scenarios.
Summary
This paper addresses the challenges of detecting, localizing, and classifying CBRNE (chemical, biological, radiological, nuclear, and explosive) threats through a novel approach to heterogeneous sensor fusion. The authors propose a context-aware, domain knowledge-enhanced Bayesian sensor fusion method that incorporates an evidence hierarchy to improve classification accuracy. The evidence hierarchy categorizes information into three levels: direct evidence (intrinsic object characteristics), indicative evidence (correlational evidence), and contextual evidence (environmental cues). By integrating open-source intelligence (OSINT) data, the proposed method enhances the fusion process, allowing for more robust classification in cluttered environments. The methodology is evaluated through simulated scenarios, demonstrating significant improvements in classification accuracy, achieving up to 95%. This work highlights the importance of contextual information in multi-sensor fusion systems and provides a structured framework for future research in CBRNE threat classification.
Methodology
The authors developed an evidence hierarchy that categorizes evidence into direct, indicative, and contextual levels. They integrated OSINT inputs to provide contextual information, which was then used in a Bayesian MAP classifier to improve threat type classification. The methodology was evaluated through simulations to assess its effectiveness in various operational scenarios.
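A Bayesian MAP classifier over three evidence levels can be sketched as below. The sketch assumes the levels are conditionally independent given the class, which may not match the paper's model; the class names, priors, and likelihood values are invented purely for illustration.

```python
import math


def map_classify(prior, likelihoods_by_level):
    """Return the MAP class and the normalized posterior.

    prior: {class: P(class)}
    likelihoods_by_level: list of {class: P(evidence at this level | class)}
    Assumption: evidence levels are conditionally independent given the class.
    """
    log_post = {c: math.log(p) for c, p in prior.items()}
    for level in likelihoods_by_level:
        for c in log_post:
            log_post[c] += math.log(level[c])
    # Normalize in log space for numerical stability.
    top = max(log_post.values())
    z = sum(math.exp(v - top) for v in log_post.values())
    post = {c: math.exp(v - top) / z for c, v in log_post.items()}
    return max(post, key=post.get), post


# Illustrative numbers only.
prior = {"radiological": 0.1, "benign": 0.9}
direct = {"radiological": 0.7, "benign": 0.2}      # intrinsic object characteristics
indicative = {"radiological": 0.6, "benign": 0.3}  # correlational evidence
contextual = {"radiological": 0.8, "benign": 0.2}  # environmental or OSINT cues
label, post = map_classify(prior, [direct, indicative, contextual])
```

In this toy case the contextual term is what overcomes the strong benign prior, which mirrors the role the summary assigns to OSINT-derived context.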
Results
The proposed fusion approach demonstrated a classification accuracy of up to 95% in simulated environments, showcasing its robustness to high clutter rates and prior mismatch. The integration of contextual evidence significantly improved the performance of the classification system compared to traditional methods.
Implications
The findings suggest that incorporating contextual information through OSINT can greatly enhance the performance of multi-sensor fusion systems in CBRNE threat detection and classification. This approach could be applied in various security and defense applications, improving situational awareness and response capabilities in complex environments.
ARC-STAR: Auditable Post-Hoc Correction for PDE Foundation Models
Efficient ML
Theory
- ARC-STAR introduces a three-stage correction framework for PDE foundation models, focusing on spatial error concentration.
- The framework preserves the pretrained model without fine-tuning, making it auditable and budget-aware.
- Empirical results show that ARC-STAR reduces velocity rollout error by at least 36× compared to raw predictions across all tested regimes.
Summary
The paper presents ARC-STAR, a novel framework for correcting predictions made by frozen partial differential equation (PDE) foundation models, which often exhibit significant errors when deployed in unfamiliar flow regimes. The authors identify that errors in predictions are spatially concentrated rather than uniformly distributed, which challenges the common approach of applying uniform corrections across the entire field. ARC-STAR addresses this by organizing the correction process into three stages: a global corrector to mitigate broad solver bias, a blockwise local refiner to address residual errors, and a label-free scoring mechanism to prioritize high-risk areas during deployment. This framework is designed to be auditable, allowing for separate evaluation of each correction stage, and budget-aware, enabling targeted refinement without the need for retraining the underlying model. The empirical findings demonstrate that ARC-STAR significantly reduces prediction errors across various benchmarks, outperforming existing methods and providing insights into the spatial characteristics of prediction errors in PDE models.
Methodology
The ARC-STAR framework consists of a global corrector that addresses broad biases in predictions, followed by a blockwise local refiner that focuses on high-error regions. The correction process is structured to operate on a frozen PDE model, with separate training and evaluation for each stage to allow for clear attribution of error reduction. A label-free scoring system is employed to determine which blocks require refinement based on their risk levels, optimizing computational resources.
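The staged, budget-aware flow described above can be sketched as follows. The correctors are passed in as plain callables and the risk scores are given as inputs; how ARC-STAR actually trains its correctors or computes its label-free scores is not shown, and every name here is hypothetical.

```python
def select_blocks(risk_scores, budget):
    """Pick the `budget` highest-risk block indices (no labels needed)."""
    ranked = sorted(range(len(risk_scores)),
                    key=lambda i: risk_scores[i], reverse=True)
    return sorted(ranked[:budget])


def correct_field(raw_blocks, global_corrector, local_refiner,
                  risk_scores, budget):
    """Apply the global stage everywhere, the local stage only where risky.

    The frozen model's raw output is never modified in place, so each
    stage's contribution can be measured separately.
    """
    corrected = [global_corrector(b) for b in raw_blocks]
    for i in select_blocks(risk_scores, budget):
        corrected[i] = local_refiner(corrected[i])
    return corrected


# Toy stand-ins: blocks are single numbers, correctors are simple shifts.
raw = [10, 20, 30]
out = correct_field(raw,
                    global_corrector=lambda b: b - 1,
                    local_refiner=lambda b: b - 5,
                    risk_scores=[0.1, 0.9, 0.5],
                    budget=1)
```

Raising `budget` trades compute for accuracy by refining more blocks, without touching the pretrained model.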
Results
ARC-STAR achieved a reduction in velocity rollout error by at least 36× over raw Poseidon predictions across all ten benchmark-regime cells. The global correction stage reduced raw host error by 91–99%, while the local refinement stage further decreased the remaining error by up to 94.4%. The framework demonstrated superior performance in seven out of ten regime cells compared to existing methods.
Implications
The findings suggest that targeted correction strategies can significantly enhance the performance of pretrained PDE models in real-world applications, particularly in fields such as fluid dynamics and engineering simulations. The ability to audit and allocate computational resources efficiently could lead to more robust and reliable predictive modeling in complex systems.
Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents
Large Language Models
Reinforcement Learning
Optimization
- Introduction of Memory-R2 framework for training memory-augmented LLM agents.
- LoGo-GRPO algorithm enhances credit assignment fairness through local and global optimization.
- Shared-parameter architecture facilitates joint optimization of memory formation and evolution.
- Progressive curriculum learning stabilizes long-horizon RL training.
Summary
The paper introduces Memory-R2, a training framework designed for long-horizon memory-augmented large language model (LLM) agents. The challenge addressed is the non-stationarity introduced by memory in multi-session environments, which complicates trajectory-level comparisons and credit assignment in reinforcement learning (RL). Memory-R2 employs a core algorithm called LoGo-GRPO that combines local and global group-relative optimization to ensure fairer credit assignment by comparing trajectories from identical intermediate memory states while preserving end-to-end learning from long-horizon rewards. Additionally, it optimizes the entire memory lifecycle by utilizing a shared-parameter architecture for a fact extractor and a memory manager, allowing for joint optimization of memory formation and evolution. The framework also incorporates a progressive curriculum that gradually increases the training horizon, enhancing stability and data efficiency in training. Overall, Memory-R2 significantly improves the training paradigm for memory-augmented LLM agents, enabling more effective long-horizon interactions.
Methodology
Memory-R2 combines three components: the LoGo-GRPO algorithm for credit assignment, a shared-parameter architecture for memory management, and a progressive training curriculum. Local rerollouts from the same intermediate memory state ensure fair comparisons, and memory operations are optimized as a multi-step decision process.
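Why comparing trajectories from identical memory states matters can be seen in a small sketch of group-relative advantages. This shows only the generic group normalization used by GRPO-style methods, applied per memory state; how LoGo-GRPO mixes local and global terms is not reproduced, and the function names and sample values are hypothetical.

```python
import statistics
from collections import defaultdict


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one comparison group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def local_advantages(samples):
    """samples: list of (memory_state_id, reward).

    Rollouts are compared only against others that started from the
    same intermediate memory state.
    """
    groups = defaultdict(list)
    for idx, (state, _) in enumerate(samples):
        groups[state].append(idx)
    adv = [0.0] * len(samples)
    for idxs in groups.values():
        values = group_relative_advantages([samples[i][1] for i in idxs])
        for i, v in zip(idxs, values):
            adv[i] = v
    return adv


samples = [("s1", 1.0), ("s1", 0.0), ("s2", 5.0), ("s2", 3.0)]
local = local_advantages(samples)
pooled = group_relative_advantages([r for _, r in samples])
```

With pooled normalization, both rollouts from state `s1` receive negative advantages simply because `s1` is a harder starting point; the local grouping credits the better of the two.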
Results
The results indicate that Memory-R2 significantly improves the accuracy and efficiency of memory-augmented LLM agents in long-horizon multi-session settings. The framework's design leads to more precise credit assignment and better overall performance in tasks requiring memory management.
Implications
Memory-R2 could enhance LLM agents in applications that require long-term memory and interaction, such as conversational agents, personal assistants, and other domains where maintaining context over extended interactions is critical.
From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal k-Sparse GLMs
Optimization
Efficient ML
Theory
- Introduction of a hybrid CPU-GPU framework for optimizing k-sparse GLMs.
- Implementation of batched processing for multiple BnB nodes on GPUs, overcoming sequential processing limitations.
- Development of GPU-efficient routines and a padding strategy to handle irregular data structures.
- Demonstration of significant speedups (1-2 orders of magnitude) and zero optimality gap on challenging instances.
Summary
This paper addresses the challenges of optimizing cardinality-constrained generalized linear models (GLMs) using GPUs, particularly focusing on the limitations of traditional branch and bound (BnB) methods that process nodes sequentially. The authors propose a novel CPU-GPU framework that allows for the simultaneous processing of multiple BnB nodes in batches on GPUs, significantly improving computational efficiency. Key innovations include a hybrid framework where the CPU manages the irregular tree-search logic while the GPU handles batched numerical computations. The authors introduce GPU-efficient routines and a padding strategy to manage irregular data structures, enabling effective use of GPU resources. They also demonstrate that relaxed indicator variables can be accurately derived from the relaxed coefficient vector, facilitating efficient support and variable selection directly on the GPU. Empirical results show that their approach achieves one to two orders of magnitude speedup compared to traditional methods, certifying optimality on complex instances. Additionally, the framework can be adapted to collect the Rashomon set of near-optimal k-sparse GLMs, enhancing model selection and variable importance analysis.
Methodology
The authors developed a modular CPU-GPU framework that separates the management of the BnB search tree (handled by the CPU) from the numerical computations (executed on the GPU). They utilized GPU-efficient routines for matrix operations and implemented a padding strategy to standardize irregular node data structures. The framework allows for the simultaneous processing of multiple nodes, reducing the need for frequent CPU-GPU data transfers.
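A padding strategy of this kind can be sketched in a few lines. The representation (each node as a list of active feature indices) and the helper name are assumptions; the paper's actual data layout is not specified in this summary.

```python
def pad_batch(supports, pad_value=-1):
    """Pad variable-length index lists into a rectangular batch plus mask.

    supports: one list of active feature indices per branch-and-bound node.
    Returns (padded, mask); mask is 1 for real entries and 0 for padding,
    so batched kernels can ignore the filler values.
    """
    width = max(len(s) for s in supports)
    padded = [list(s) + [pad_value] * (width - len(s)) for s in supports]
    mask = [[1] * len(s) + [0] * (width - len(s)) for s in supports]
    return padded, mask


# Three nodes with different numbers of free features.
padded, mask = pad_batch([[0, 2], [1], [0, 1, 3]])
```

Once every node has the same shape, a single batched call can process all of them, which is what removes the one-node-at-a-time bottleneck.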
Results
The proposed framework demonstrated substantial performance improvements, achieving speedups of one to two orders of magnitude over traditional single-node processing methods. It successfully certified optimality on difficult sparse GLM instances where previous methods left significant optimality gaps.
Implications
The framework has potential applications in various fields requiring efficient optimization of sparse models, such as scientific research, finance, and medical diagnostics. It also facilitates advanced statistical analyses, including variable importance assessments and model selection based on multiple criteria.
Relational Linear Properties in Language Models: An Empirical Investigation
NLP
Large Language Models
Interpretability
- Introduces a novel probing method using Kullback-Leibler divergence to evaluate relational linearity in language models.
- Demonstrates that relational linearity varies across different models and layers, with specific patterns in how linguistic information is represented.
- Finds that the phrasing of relational queries significantly affects the observed linearity.
- Shows that language models encode relational properties in a largely linear manner, particularly for tense and truthfulness.
Summary
This paper investigates the concept of relational linearity in language models, which posits that the unembedding of an object can be predicted from the embedding of its subject using a linear transformation. The authors introduce a novel probing method based on Kullback-Leibler divergence to empirically test this hypothesis across various language models. They analyze how relational linearity manifests in the latent representations of models like Llama-3.1 and Gemma-2, focusing on its variation across different layers and the impact of paraphrased relational queries. The findings reveal that relational linearity is present in language models, with significant variations depending on the model and the specific relational properties being examined. The study also highlights the efficiency of their probing method compared to previous approaches, which relied on less precise Jacobian approximations.
Methodology
The authors developed a probing procedure that evaluates relational linearity by comparing model embeddings when prompted with both context and context-query pairs. This method utilizes Kullback-Leibler divergence to assess the linear relationship between subject and object embeddings, avoiding the use of crude Jacobian approximations from previous studies.
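For reference, the Kullback-Leibler divergence at the core of the probe can be computed as below. Which two distributions the paper compares is not fully specified in this summary; the sketch assumes one is the model's own output distribution and the other comes from a linear approximation, and both example distributions are made up.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support.

    Terms with p_i == 0 contribute nothing; q_i must be positive
    wherever p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a three-word vocabulary.
model_dist = [0.7, 0.2, 0.1]
linear_probe_dist = [0.6, 0.3, 0.1]

gap = kl_divergence(model_dist, linear_probe_dist)
self_gap = kl_divergence(model_dist, model_dist)
```

A small `gap` would indicate that the linear map reproduces the model's behavior for that relation; a value of zero means the two distributions coincide.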
Results
The experiments revealed that relational linearity is present in language model representations, with notable differences across models and layers. Specifically, the study found that relational properties are more pronounced in middle layers, while surface features emerge in earlier layers. Additionally, the way queries are phrased significantly influences the results of linear probing, indicating that relational linearity is sensitive to linguistic variations.
Implications
The findings suggest that understanding relational linearity can enhance the interpretability of language models and improve their performance in tasks involving entity relations. This could have applications in natural language understanding, knowledge extraction, and model fine-tuning.
ConTact: Contact-First Antibody CDR Design via Explicit Interface Reasoning
Graph Learning
- CONTACT architecture separates contact identification from sequence prediction, improving model performance.
- Introduces a contact-gated injection mechanism that selectively routes antigen information to relevant CDR positions.
- Achieves significant improvements in structural quality and epitope awareness over existing CDR design methods.
Summary
The paper introduces CONTACT, a novel architecture for antibody complementarity-determining region (CDR) design that addresses the limitations of existing methods by explicitly separating the tasks of contact identification and amino acid selection. Traditional approaches conflate these tasks, leading to ineffective use of antigen information. CONTACT decomposes the CDR design process into three stages: learning surface complementarity fingerprints, predicting CDR-antigen contacts, and injecting contact-gated antigen features into the sequence prediction. This architecture employs a distance-biased cross-attention mechanism to prioritize spatially relevant features and utilizes a contact-weighted cross-entropy loss to focus learning on critical binding positions. The authors demonstrate that CONTACT significantly outperforms existing methods on the CHIMERA-BENCH dataset, achieving superior structural quality, epitope awareness, and competitive sequence recovery, thereby validating the effectiveness of the contact-first design paradigm.
Methodology
The methodology involves a three-stage process: first, learning surface complementarity fingerprints to characterize the binding environment; second, predicting which CDR positions will contact the antigen; and third, injecting local antigen features into the CDR representation based on predicted contact confidence. A distance-biased cross-attention module is used to encode geometric priors, and a contact-weighted cross-entropy loss is applied to emphasize learning at binding-critical positions.
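A contact-weighted cross-entropy can be sketched as follows. The weighting formula (`1 + boost * contact probability`) and the `boost` parameter are assumptions chosen for illustration; the paper's exact weighting is not given in this summary.

```python
import math


def contact_weighted_ce(probs_true, contact_probs, boost=2.0):
    """Weighted average of per-position negative log-likelihoods.

    probs_true: predicted probability of the correct amino acid per position.
    contact_probs: predicted probability that each position contacts the antigen.
    Positions likely to be in contact get a larger weight (hypothetical rule).
    """
    weights = [1.0 + boost * c for c in contact_probs]
    losses = [-math.log(p) for p in probs_true]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Position 0 is predicted well; position 1 is predicted poorly.
probs_true = [0.9, 0.2]
plain = contact_weighted_ce(probs_true, [0.0, 0.0])
weighted = contact_weighted_ce(probs_true, [0.0, 1.0])
```

When the poorly predicted position is also a likely contact, the weighted loss rises above the plain average, pushing the model to fix binding-critical residues first.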
Results
CONTACT achieved the best RMSD of 1.63 Å (7% improvement over the next-best baseline), an F1 score of 0.79 for epitope awareness (10% improvement over GNN baselines), and competitive sequence recovery metrics, demonstrating its effectiveness compared to eleven other CDR-H3 design baselines.
Implications
The findings suggest that explicitly separating contact identification from sequence prediction can lead to more effective antibody design, potentially impacting therapeutic antibody development and enhancing the understanding of antibody-antigen interactions.
Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation
Efficient ML
Theory
Optimization
- Partial fusion allows for a flexible trade-off between ensemble accuracy and computational cost.
- The method aggregates only the most similar neurons between networks, improving efficiency.
- Generalized pruning offers a similar flexibility by allowing for neuron isolation and linear combinations.
- The proposed methods show competitive performance on benchmark datasets like MNIST.
Summary
This paper introduces a novel approach called partial fusion of neural networks, which provides a flexible trade-off between the computational costs associated with ensemble methods and the accuracy of weight aggregation techniques. Traditional ensembles improve performance but require significant computational resources, while weight aggregation is less costly but often yields lower accuracy. The authors propose a method that aggregates only the weights of neurons that are most similar across different networks, utilizing partial optimal transport to achieve this. This approach allows for a more efficient model that retains higher accuracy than simple weight aggregation while being less resource-intensive than full ensembles. The paper also discusses generalized pruning, which allows for the isolation, deletion, and linear combination of neurons based on similarity, yielding benefits similar to those of partial fusion. The proposed methods are evaluated on neural networks trained on the MNIST dataset, demonstrating that partial fusion can achieve accuracy levels closer to ensembles while maintaining a lower parameter count.
Methodology
The authors utilize partial optimal transport to match and aggregate weights of similar neurons from different neural networks. They also explore generalized pruning, which allows for flexible neuron manipulation, including isolation and linear combinations, to enhance model performance.
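The trade-off can be illustrated with a simplified stand-in for the matching step. Note the substitution: the paper uses partial optimal transport, whereas this sketch uses greedy one-to-one matching on cosine similarity with a threshold. It also handles only incoming weight vectors of a single layer and ignores how outgoing weights would need to be adjusted. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def partial_fuse(neurons_a, neurons_b, threshold=0.9):
    """Average only sufficiently similar neuron pairs; keep the rest as-is.

    Greedy matching is used here in place of partial optimal transport.
    """
    pairs = sorted(((cosine(u, v), i, j)
                    for i, u in enumerate(neurons_a)
                    for j, v in enumerate(neurons_b)), reverse=True)
    used_a, used_b, fused = set(), set(), []
    for sim, i, j in pairs:
        if sim < threshold:
            break
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        fused.append([(a + b) / 2 for a, b in zip(neurons_a[i], neurons_b[j])])
    kept = [u for i, u in enumerate(neurons_a) if i not in used_a]
    kept += [v for j, v in enumerate(neurons_b) if j not in used_b]
    return fused + kept


net_a = [[1.0, 0.0], [0.0, 1.0]]
net_b = [[1.0, 0.0], [-1.0, 0.0]]
layer = partial_fuse(net_a, net_b)
```

The result has three neurons: fewer than the four an ensemble would carry, more than the two left by full weight averaging. Moving the threshold slides the model between those two extremes.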
Results
The experiments demonstrate that the partial fusion method achieves test accuracies closer to those of ensembles while only requiring approximately 1.45 times the parameters of the original networks, compared to 2 times for traditional ensembles. Generalized pruning also shows similar performance benefits, indicating its effectiveness as an alternative approach.
Implications
The proposed methods can significantly reduce the computational burden of deploying neural network ensembles while maintaining high accuracy, making them suitable for applications in resource-constrained environments. This could lead to more efficient model deployment in various machine learning tasks.
Beyond Euclidean Proximity: Repairing Latent World Models with Horizon-Matched Trajectory Reachability Metrics
Reinforcement Learning
Robotics
Optimization
- Introduction of Trajectory Reachability Metrics (TRM) to enhance terminal candidate scoring in latent world models.
- Horizon-aware supervision is critical for training metrics that align with long-horizon planning tasks.
- Mechanistic evidence shows that TRM improves candidate ordering and decision-making compared to raw latent proximity metrics.
- TRM significantly boosts performance on benchmark tasks, demonstrating its effectiveness across different models.
Summary
This paper addresses the limitations of latent world models in model-based reinforcement learning, particularly the reliance on Euclidean distance for terminal-cost evaluation, which can misguide action selection. The authors introduce Trajectory Reachability Metrics (TRM), a post-hoc terminal-ranking method that enhances the decision-making process by training a pairwise distance metric from logged trajectory data. TRM is designed to be horizon-aware, ensuring that the training distribution aligns with the long-horizon candidate ranking problem. The method was tested on the TwoRoom benchmark, where traditional latent planning achieved only 7.0% success, while TRM significantly improved performance to 97.0%. The study also provides mechanistic insights into why TRM outperforms raw latent metrics, demonstrating that the XY position can be accurately decoded, yet raw latent Mean Squared Error (MSE) fails to rank candidates effectively. The findings suggest that TRM can serve as a valuable auxiliary cost in continuous manipulation tasks, leading to improved candidate ordering and decision-making in planning.
Methodology
The authors propose TRM as a post-hoc terminal-ranking layer that replaces or augments the traditional Euclidean distance metric in latent world models. TRM is trained using a pairwise distance metric derived from logged trajectory data, focusing on horizon-aware supervision to ensure that the training distribution matches the long-horizon candidate ranking problem. The method is evaluated through controlled experiments on the TwoRoom benchmark and other tasks, using Same-Candidate Selection Audits (SCSA) to analyze candidate ordering improvements.
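The horizon-matched supervision described above can be illustrated with a small sampling routine. This is only a sketch under stated assumptions: the function names, the use of the step gap as the training target, and the absolute-difference stand-in for the learned metric are all hypothetical and are not taken from the paper.

```python
import random

def sample_horizon_pairs(trajectories, horizon, n_pairs, seed=0):
    """Sample (state_i, state_j, gap) triples whose step gap is at most `horizon`.

    Assumption: the step gap between two logged states is used as the
    supervision target for a pairwise reachability metric.
    """
    rng = random.Random(seed)
    usable = [t for t in trajectories if len(t) >= 2]
    pairs = []
    for _ in range(n_pairs):
        traj = rng.choice(usable)
        max_gap = min(horizon, len(traj) - 1)
        gap = rng.randint(1, max_gap)
        start = rng.randint(0, len(traj) - 1 - gap)
        pairs.append((traj[start], traj[start + gap], gap))
    return pairs

def rank_candidates(candidates, goal, metric):
    """Order terminal candidates by a pairwise metric to the goal (smallest first)."""
    return sorted(candidates, key=lambda c: metric(c, goal))

# Toy logged data: two short trajectories of 1-D latent states.
logged = [[0.0, 0.1, 0.3, 0.6, 1.0], [2.0, 1.5, 1.1]]
pairs = sample_horizon_pairs(logged, horizon=4, n_pairs=100)

# Stand-in metric; a trained TRM-style model would take its place.
ranked = rank_candidates([0.9, 0.2, 0.5], goal=1.0, metric=lambda a, b: abs(a - b))
```

Sampling gaps all the way up to the planning horizon is the point of the sketch: a metric trained only on short gaps would never see pairs that resemble the long-horizon candidates it is later asked to rank.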
Results
In experiments, raw latent planning with LeWorldModel (LeWM) achieved only 7.0% mean success on the TwoRoom benchmark, while the full-horizon TRM method reached 97.0%. A PLDM baseline also improved, from 32.7% to 84.0% success. A short-horizon TRM variant reached only 35.0%, indicating the importance of horizon-aware training.
Implications
The findings suggest that TRM can significantly enhance the performance of latent world models in reinforcement learning by providing a more accurate metric for decision-making. This has potential applications in various planning and manipulation tasks, where accurate candidate selection is crucial for success. The approach may also inform future research on improving model-based control systems and hybridizing metrics in complex environments.
Physics-Informed Generative Solver: Bridging Data-Driven Priors and Conservation Laws for Stable Spatiotemporal Field Reconstruction
Generative Models
Theory
Time Series
- Introduces a physics-informed generative framework for spatiotemporal field reconstruction.
- Employs Martingale-Regularized Score Matching to stabilize generative priors.
- Utilizes Physics-Informed Implicit Score Sampling for inference, ensuring physical consistency.
- Demonstrates effectiveness in acoustic systems and generalizes to chaotic flows and meteorological fields.
Summary
This paper presents a novel framework called the Physics-Informed Generative Solver (PIGS) designed to address the challenge of reconstructing spatiotemporal fields from sparse measurements, a common issue in physical sciences. Traditional data-driven methods often fail to respect the underlying physical laws, leading to inaccurate reconstructions. The authors propose a two-step approach: during training, they utilize Martingale-Regularized Score Matching (MRSM) to create a stable generative prior by coupling denoising score matching with a Score Fokker–Planck Equation constraint. This ensures that the generative model adheres to a reverse martingale property, promoting dynamical stability. During inference, the Physics-Informed Implicit Score Sampling (PI-ISS) method projects generated samples towards the physical manifold by incorporating conservation-law residuals without embedding physical penalties during training. This separation allows for flexible reconstructions from extremely sparse and incomplete data while maintaining physical consistency. The framework is validated in acoustic systems, where it successfully generates coupled pressure and particle velocity fields, and is shown to generalize to chaotic flows and large-scale meteorological fields. The work bridges the gap between generative AI and first-principles science, offering a robust solution for high-dimensional inverse problems.
Methodology
The methodology involves a two-step process: (1) Training with Martingale-Regularized Score Matching (MRSM) to create a stable generative prior by coupling denoising score matching with a Score Fokker–Planck Equation constraint, and (2) Inference using Physics-Informed Implicit Score Sampling (PI-ISS) to project samples towards the physical manifold by back-propagating conservation-law residuals.
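As a rough illustration of the inference-time idea only, the toy below nudges a sample toward a conservation constraint by taking gradient steps on the squared residual. It is not PI-ISS: the score-based sampling steps are omitted, and the constraint (a fixed total), the step size, and all names are assumptions made for this sketch.

```python
def residual(u, total):
    """Toy conservation-law residual: the field values should sum to `total`."""
    return sum(u) - total

def guide(u, total, step, n_steps):
    """Gradient descent on 0.5 * residual**2 with respect to the sample `u`.

    For this residual, the gradient with respect to every component is the
    residual itself, so each step shifts all components by the same amount.
    """
    u = list(u)
    for _ in range(n_steps):
        r = residual(u, total)
        u = [ui - step * r for ui in u]
    return u

sample = [1.0, 2.0, 3.0, 4.0]  # stand-in for a generated sample
corrected = guide(sample, total=8.0, step=0.1, n_steps=20)
```

Because the physics enters only through this correction at inference, the same trained prior could in principle be reused with a different residual, which is the separation the summary describes.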
Results
The proposed framework successfully reconstructs coupled pressure and particle velocity fields from sparse measurements in acoustic systems, effectively transforming sparse physical arrays into dense virtual arrays and suppressing spatial aliasing. It also demonstrates generalizability to chaotic Kolmogorov flows and large-scale meteorological fields under extreme data sparsity.
Implications
The findings suggest that the PIGS framework can be applied to fields that require solving high-dimensional inverse problems from sparse data, such as meteorology, fluid dynamics, and other physical sciences.
Discovering Entity-Conditioned Lag Heterogeneity: A Lag-Gated Neural Audit Framework for Panel Time Series
Time Series
- Introduces AC-GATE for entity-conditioned heterogeneous lag discovery in panel time series.
- Establishes a layered audit protocol for evaluating model outputs beyond predictive accuracy.
- Demonstrates the ability of AC-GATE to recover true lag structures in synthetic data.
- Shows that AC-GATE generates non-degenerate effective lags in real-world data.
Summary
This paper addresses the challenge of auditing how different entities respond to historical signals over varying time horizons in panel time series data. Traditional methods often fail to provide auditable, entity-specific lag summaries. The author proposes AC-GATE, an Adaptive-Conditioning Encoder with a Scale-Invariant Lag Gate, which formulates entity-conditioned heterogeneous lag discovery as a temporal panel mining task. AC-GATE utilizes observable entity-level proxies to condition lag-weight distributions, allowing effective lags to be structural outputs rather than mere post-hoc explanations. The evaluation framework includes a layered audit protocol that separates predictive calibration from lag discovery, employing both synthetic and real-world country-level panel data for testing. The results demonstrate that AC-GATE successfully recovers heterogeneous lag structures in synthetic data and generates meaningful effective lags in real-world scenarios, highlighting its potential for enhancing interpretability in panel time series analysis.
Methodology
The methodology involves formulating the problem of entity-conditioned heterogeneous lag discovery as a temporal panel mining task. AC-GATE employs an Adaptive-Conditioning Encoder with a Scale-Invariant Lag Gate to model lag distributions based on observable entity-level proxies. The evaluation is conducted through a layered audit protocol that distinguishes between predictive calibration and lag discovery, using both synthetic and real-world datasets.
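One way to picture a lag gate whose effective lag is a structural output is a softmax over candidate lags with logits conditioned on an entity-level proxy. The sketch below assumes a single scalar proxy and linear logits; the parameters `w` and `b` are hypothetical, and the scale-invariance mechanism of AC-GATE is not modeled.

```python
import math

def lag_weights(proxy, w, b):
    """Softmax over candidate lags; logits are linear in the entity proxy (assumption)."""
    logits = [w_l * proxy + b_l for w_l, b_l in zip(w, b)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def effective_lag(weights):
    """Expected lag under the gate distribution, with lags numbered from 1."""
    return sum((lag + 1) * p for lag, p in enumerate(weights))

# Hypothetical gate parameters for three candidate lags.
w = [-1.0, 0.0, 1.0]
b = [0.0, 0.0, 0.0]
slow_entity = effective_lag(lag_weights(2.0, w, b))
fast_entity = effective_lag(lag_weights(-2.0, w, b))
```

Here the effective lag is read directly off the gate weights, so it can be audited per entity without a separate post-hoc explanation step.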
Results
The results indicate that AC-GATE effectively recovers heterogeneous lag structures in synthetic panels and produces structured effective lags in two real-world country-level panels. The evaluation confirms the model's capability to generate auditable outputs that reflect entity-specific responses to historical signals.
Implications
The findings suggest that AC-GATE can significantly improve the interpretability and auditability of panel time series models, making it a valuable tool for researchers and analysts in fields such as economics, environmental studies, and any domain where understanding entity-specific temporal responses is crucial.
When to Switch, Not Just What: Transition Quality Prediction in Clash Royale
Reinforcement Learning
- Frequent strategy switching in Clash Royale is associated with lower win rates.
- The Zero Switching Cost Assumption overlooks the behavioral costs of switching strategies.
- The Transition Quality Predictor (TQP) reformulates strategy recommendations into a structured decision-making process.
- The TQP includes mechanisms to identify when and for whom switching strategies is beneficial.
Summary
This paper investigates the phenomenon of strategy switching in competitive gaming, specifically in Clash Royale, where players often change strategies after losing streaks. Analyzing 926,334 match records from 34,619 players, the authors found that frequent switching is inversely related to win rates, challenging the assumption that switching strategies is always beneficial. They introduce the concept of the Zero Switching Cost Assumption, which overlooks the behavioral costs associated with switching strategies. To address this, the authors propose a new framework called the Transition Quality Predictor (TQP), which reformulates strategy recommendations into a three-stage decision-making process: WHO (identifying the player), WHEN (determining the optimal timing for a switch), and WHAT (selecting the best strategy). The TQP incorporates a PersonaGate to suppress recommendations for players who benefit from consistency, a TimingGate to identify beneficial switching moments, and ScoreFusion to rank strategies based on predicted transition quality. The paper also introduces the SwitchGap metric to evaluate the effectiveness of recommendations without assuming that observed player choices are optimal. The proposed pipeline demonstrates a significant improvement in recommendation quality, particularly benefiting loss-triggered switchers.
Methodology
The authors developed the Transition Quality Predictor (TQP) as a three-stage pipeline consisting of PersonaGate, TimingGate, and ScoreFusion. This approach integrates player behavioral analysis and win-rate predictions to optimize strategy recommendations based on individual player profiles and situational contexts.
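The three-stage structure can be sketched as two gates followed by a ranking step. Everything concrete here is hypothetical: the player fields, the gate rules, and the score table are placeholders for illustration, not the paper's learned components.

```python
def recommend(player, candidates, persona_gate, timing_gate, score):
    """Return a recommended strategy, or None when a gate suppresses the switch."""
    if not persona_gate(player):   # WHO: skip players who benefit from consistency
        return None
    if not timing_gate(player):    # WHEN: skip moments where switching is not beneficial
        return None
    # WHAT: rank candidates by predicted transition quality
    return max(candidates, key=lambda deck: score(player, deck))

# Hypothetical stand-ins for the learned components.
predicted_quality = {"deck_a": 0.48, "deck_b": 0.55, "deck_c": 0.51}
persona_gate = lambda p: not p["prefers_consistency"]
timing_gate = lambda p: p["loss_streak"] >= 2
score = lambda p, deck: predicted_quality[deck]

switcher = {"prefers_consistency": False, "loss_streak": 3}
steady = {"prefers_consistency": True, "loss_streak": 3}
pick = recommend(switcher, list(predicted_quality), persona_gate, timing_gate, score)
no_pick = recommend(steady, list(predicted_quality), persona_gate, timing_gate, score)
```

Returning no recommendation is a first-class outcome in this design, which is consistent with the low recommendation rate reported in the results.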
Results
The TQP achieved a SwitchGap improvement of +10.4 percentage points at a recommendation rate of 5.4%. It was particularly effective for loss-triggered switchers, who, despite being the lowest-performing group, benefited significantly from subtype-conditioned guidance.
Implications
The findings suggest that game recommendation systems should consider individual player behavior and the timing of strategy switches to enhance player performance. This approach could be applied to other competitive gaming environments and recommendation systems beyond gaming.
One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
Large Language Models
Optimization
Theory
- Introduction of Layerwise Learning Rate (LLR) for adaptive learning rates in Transformer layers.
- LLR is based on Heavy-Tailed Self-Regularization (HT-SR) theory, promoting balanced training across layers.
- Extensive experiments show LLR achieves up to 1.5× training speedup and improved zero-shot accuracy.
- LLR has low tuning overhead, making it practical for real-world applications.
Summary
This paper addresses the limitations of using a uniform learning rate (LR) across all layers in Transformer architectures, which are commonly used in large language models (LLMs). The authors propose a novel approach called Layerwise Learning Rate (LLR), which assigns distinct learning rates to individual layers based on their structural characteristics, specifically their heavy-tailedness as characterized by Heavy-Tailed Self-Regularization (HT-SR) theory. By analyzing the empirical spectral density of weight correlation matrices, LLR allocates larger learning rates to layers with weaker heavy-tailedness to accelerate their training while assigning smaller learning rates to layers with stronger heavy-tailedness. This method aims to promote balanced training across layers, leading to faster convergence and improved generalization. The authors conducted extensive experiments across various architectures (from LLaMA to GPT-nano), optimizers (AdamW and Muon), and parameter scales (60M–1B), demonstrating that LLR can achieve up to 1.5× training speedup and improve zero-shot accuracy from 47.09% to 49.02%. Notably, LLR requires minimal tuning overhead, as it can inherit nearly optimal LR settings from the uniform baseline.
Methodology
The methodology involves periodically measuring the heavy-tailedness of each layer's weight spectrum using empirical spectral density analysis. Based on this analysis, LLR assigns larger learning rates to layers with weaker heavy-tailedness and smaller rates to those with stronger heavy-tailedness. The adjustment is driven by spectral statistics, specifically using Power-Law fitting to compute PL_Alpha_Hill metrics for each layer.
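A minimal sketch of the spectral step, assuming the eigenvalues of each layer's weight correlation matrix are already available. The Hill estimator below is the standard power-law tail estimator; the choice of `k`, the linear mapping from alpha to a learning-rate scale, and the `[s_min, s_max]` range are assumptions for illustration rather than the paper's settings.

```python
import math

def pl_alpha_hill(eigs, k=None):
    """Hill estimate of the power-law exponent from the k largest eigenvalues.

    A smaller alpha indicates a heavier-tailed spectrum.
    """
    lam = sorted(eigs)
    n = len(lam)
    k = k if k is not None else n // 2
    tail_floor = lam[n - k - 1]
    log_sum = sum(math.log(lam[n - i] / tail_floor) for i in range(1, k + 1))
    return 1.0 + k / log_sum

def layerwise_lrs(alphas, base_lr, s_min=0.5, s_max=1.5):
    """Map alphas linearly to LR scales: larger alpha (weaker heavy tail) gets a larger LR."""
    lo, hi = min(alphas), max(alphas)
    if hi == lo:
        return [base_lr] * len(alphas)
    return [base_lr * (s_min + (a - lo) / (hi - lo) * (s_max - s_min)) for a in alphas]

def pareto_quantiles(alpha, n=100):
    """Deterministic synthetic spectrum with density exponent `alpha` (toy data)."""
    return [(1 - (i + 0.5) / n) ** (-1.0 / (alpha - 1.0)) for i in range(n)]

alpha_heavy = pl_alpha_hill(pareto_quantiles(2.0))  # strongly heavy-tailed layer
alpha_light = pl_alpha_hill(pareto_quantiles(4.0))  # weakly heavy-tailed layer
lrs = layerwise_lrs([alpha_heavy, alpha_light], base_lr=1e-3)
```

In a training loop these per-layer rates would be recomputed periodically and written into the optimizer's parameter groups; that wiring is left out here.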
Results
The results indicate that LLR significantly enhances training efficiency, achieving up to a 1.5× training speedup over uniform learning rates. Additionally, LLR improved zero-shot accuracy from 47.09% to 49.02% across various model architectures and optimizers, demonstrating its effectiveness in promoting balanced training.
Implications
The findings suggest that LLR can be a valuable tool for optimizing training in large language models, potentially leading to better performance and efficiency in various NLP applications. Its low tuning overhead makes it suitable for deployment in practical scenarios where rapid model iteration is necessary.
AMUSE: Anytime Muon with Stable Gradient Evaluation
Optimization
Computer Vision
Large Language Models
- AMUSE integrates Muon's rapid bulk progress with Schedule-Free averaging for stable training.
- The method requires no learning rate schedules and supports anytime training.
- AMUSE shows consistent improvements across various vision tasks and large language model pretraining.
- The approach effectively reduces oscillations in the loss landscape, enhancing optimization stability.
Summary
The paper introduces AMUSE (Anytime Muon with Stable Gradient Evaluation), a novel optimization method that combines the advantages of Muon and Schedule-Free optimization to enhance training efficiency in deep learning. Optimizers such as AdamW are typically paired with predefined learning rate schedules, which tie training to a total step count chosen in advance. AMUSE addresses this by integrating Muon's orthogonalized momentum updates with a time-varying interpolation coefficient that stabilizes gradient evaluations. This approach allows for rapid adaptation during training while gradually shifting towards a more stable averaged sequence, effectively reducing oscillations caused by high-curvature directions in the loss landscape. The authors analyze the river-valley loss landscape to understand the dynamics of Muon and its oscillatory behavior, leading to the design of AMUSE. The method is evaluated across various benchmarks, demonstrating significant improvements in performance and efficiency over existing methods, including Schedule-Free AdamW and Muon.
Methodology
AMUSE employs a time-varying interpolation coefficient to evaluate gradients, initially focusing on the fast Muon sequence for quick adaptation and gradually transitioning to a stable averaged sequence. This method leverages the river-valley loss landscape analysis to mitigate oscillations caused by high-curvature directions during training.
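The interpolation idea can be sketched with a Schedule-Free style loop in which the gradient is evaluated at a point that moves from the fast sequence toward the averaged sequence. This is not AMUSE itself: a plain gradient step stands in for Muon's orthogonalized momentum update, the linear ramp for the coefficient is hypothetical, and the objective is a toy quadratic with one flat and one steep direction.

```python
def grad(w, h):
    """Gradient of the quadratic 0.5 * sum(h_i * w_i**2)."""
    return [hi * wi for hi, wi in zip(h, w)]

def loss(w, h):
    return 0.5 * sum(hi * wi * wi for hi, wi in zip(h, w))

def schedule_free_run(w0, h, lr, n_steps, beta_max=0.9, ramp_steps=100):
    """Schedule-Free style iteration with a time-varying interpolation coefficient.

    z is the fast base sequence, x is its running average, and the gradient is
    evaluated at y = (1 - beta) * z + beta * x. beta ramps from 0 (evaluate at
    the fast sequence) up to beta_max (evaluate mostly at the average).
    """
    z = list(w0)
    x = list(w0)
    for t in range(n_steps):
        beta = beta_max * min(1.0, t / ramp_steps)
        y = [(1 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad(y, h)
        z = [zi - lr * gi for zi, gi in zip(z, g)]   # plain step, not a Muon update
        c = 1.0 / (t + 2)
        x = [(1 - c) * xi + c * zi for xi, zi in zip(x, z)]  # uniform running average
    return x

h = [1.0, 10.0]   # curvatures: a flat "river" direction and a steep "valley wall"
w0 = [1.0, 1.0]
x_final = schedule_free_run(w0, h, lr=0.05, n_steps=300)
```

No learning rate schedule appears anywhere in the loop, so the averaged iterate `x` can be read out at any step, which is the sense in which such methods support anytime training.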
Results
AMUSE outperforms both Schedule-Free AdamW and Muon, achieving Muon's final performance with significantly fewer training steps: 1.51× fewer steps for 720M LLM pretraining, 1.12× fewer for ImageNet with ResNet-50, and 3.08× fewer for ImageNet with ViT fine-tuning.
Implications
AMUSE's design allows for more efficient training of deep learning models without the need for explicit learning rate schedules, making it suitable for applications in various domains, including computer vision and natural language processing. Its anytime training capability could facilitate real-time applications and adaptive learning scenarios.
SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation
Generative Models
Optimization
Computer Vision
- SeqLoRA optimizes LoRA factors jointly while enforcing subspace orthogonality, addressing the expressiveness-interference trade-off.
- Theoretical analyses confirm convergence and reduced residual interference compared to traditional frozen-basis methods.
- SeqLoRA demonstrates superior identity preservation and scalability in multi-concept image generation tasks.
- The framework allows for efficient adaptation without the need for retraining or complex fusion processes.
Summary
The paper introduces SeqLoRA, a novel framework for parameter-efficient fine-tuning of text-to-image diffusion models, aimed at overcoming the challenges of representation interference when generating images with multiple personalized concepts. Traditional methods either rely on expensive post-hoc fusion or freeze adaptation subspaces, which limits expressiveness and concept fidelity. SeqLoRA employs a constrained continual learning approach that optimizes both LoRA factors through bilevel optimization while maintaining subspace orthogonality. The authors provide theoretical guarantees for convergence and demonstrate that learning the LoRA basis from data significantly reduces residual interference compared to frozen-basis methods. Experiments reveal that SeqLoRA excels in identity preservation and scalability, effectively managing up to 101 concepts without the need for costly fusion processes. The results indicate that SeqLoRA successfully mitigates attribute entanglement, preserving distinct identities across complex multi-concept scenarios.
Methodology
SeqLoRA utilizes a bilevel optimization framework to jointly optimize LoRA factors while ensuring orthogonality between concept representations. The authors model residual layer activations as a matrix sub-Gaussian process to derive bounds on catastrophic forgetting and establish convergence guarantees.
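The orthogonality constraint between concept subspaces can be illustrated with a plain projection step. This sketch is a simplification and an assumption: it removes from a new direction any component lying in the span of earlier concept directions using Gram-Schmidt, whereas SeqLoRA is described as enforcing orthogonality within a bilevel optimization. All names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(v, basis):
    """Remove from v its components along each orthonormal vector in `basis`."""
    w = list(v)
    for q in basis:
        c = dot(w, q)
        w = [wi - c * qi for wi, qi in zip(w, q)]
    return w

def orthonormalize(vectors, tol=1e-12):
    """Modified Gram-Schmidt; near-dependent vectors are dropped."""
    basis = []
    for v in vectors:
        w = project_out(v, basis)
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

# Directions already used by earlier concepts (toy 3-D example).
previous = orthonormalize([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
# Candidate direction for a new concept, constrained to the orthogonal complement.
new_direction = project_out([1.0, 2.0, 3.0], previous)
```

After the projection, updates for the new concept cannot move the model along directions reserved for earlier concepts, which is the interference the orthogonality constraint is meant to limit.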
Results
The experiments show that SeqLoRA outperforms existing methods in generating images with multiple concepts, achieving state-of-the-art results in identity preservation and scalability across up to 101 concepts. The method effectively reduces attribute interference, leading to coherent and distinct image generations.
Implications
SeqLoRA has significant implications for applications in personalized content generation, artistic creation, and any domain requiring the integration of multiple concepts in visual outputs. Its efficient adaptation capabilities can enhance user experience in creative tools and media production.
How Many Different Outputs Can a Transformer Generate?
Theory
- Transformers can only generate a finite set of output sequences, with many remaining inaccessible.
- The proportion of accessible sequences decays exponentially with sequence length beyond a critical threshold.
- An explicit formula is derived to predict thresholds for different transformer architectures.
- The findings provide a theoretical explanation for observed failures of transformers on simple tasks.
Summary
This paper investigates the limitations of transformer architectures in generating output sequences, focusing on the relationship between prompt length and the number of accessible sequences. The authors derive an upper bound on the number of different outputs a transformer can produce, demonstrating that the maximal length of accessible sequences grows linearly with prompt length, while the proportion of accessible sequences decays exponentially with sequence length beyond a critical threshold. The study provides a theoretical framework explaining why transformers struggle with simple tasks like copying sequences, revealing that many sequences remain fundamentally inaccessible regardless of model size or training. The authors also validate their theoretical predictions through empirical experiments across various transformer architectures, showing that their findings hold true even under unbounded context and computation time. This work highlights intrinsic architectural constraints that affect all transformers and suggests broader applicability to other models with similar limitations.
Methodology
The authors analyze the internal representation constraints of transformers, formalizing the concept of accessible sequences. They conduct a theoretical analysis to derive results regarding the relationship between prompt length and accessible sequences, and perform empirical experiments to validate their theoretical predictions across different transformer architectures.
Results
The study proves that the maximal length of accessible sequences grows linearly with prompt length, while the proportion of accessible sequences decays exponentially with sequence length beyond a certain threshold. The theoretical predictions closely match empirical observations across various transformer models.
Implications
The findings have significant implications for understanding the limitations of transformer models in sequence generation tasks, potentially guiding future architectural designs and training strategies to mitigate these constraints. The results may also inform the development of other models with bounded internal representations.
BioFormer: Rethinking Cross-Subject Generalization via Spectral Structural Alignment in Biomedical Time-Series
Time Series
- Introduces spectral drift as a new perspective on subject-specific variability in biomedical time-series data.
- Proposes the Frequency-Band Alignment Module (FBAM) for adaptive alignment of spectral structures.
- Implements Sample Conditional Layer Normalization (SCLN) to stabilize cross-subject representations.
- Demonstrates superior performance of BioFormer over 12 baseline methods with a 6% absolute F1-score improvement.
Summary
The paper addresses the challenge of cross-subject generalization in biomedical time-series (BTS) data, where models trained on data from certain subjects perform poorly on unseen subjects due to subject-specific variability. The authors introduce the concept of 'spectral drift' to characterize this variability, which manifests as consistent oscillatory structures in BTS signals that differ in magnitude and phase across subjects. To mitigate this variability, they propose BioFormer, which includes a Frequency-Band Alignment Module (FBAM) that generates modulation factors to align spectral structures across subjects. This module adjusts amplitude and phase in the Fourier domain to enhance task-relevant features while suppressing subject-dependent variations. Additionally, the authors implement Sample Conditional Layer Normalization (SCLN) to stabilize representations by inferring normalization parameters from intrinsic signal statistics rather than subject identity. The effectiveness of BioFormer is demonstrated through extensive experiments on six datasets, showing significant improvements over 12 baseline methods, particularly in F1-score metrics.
Methodology
The methodology involves the introduction of the Frequency-Band Alignment Module (FBAM) that models subject variability as structured spectral drift, performing adaptive frequency-band alignment in the Fourier domain. The FBAM utilizes cross-attention mechanisms to derive task-aware modulation coefficients for amplitude and phase adjustments. Additionally, Sample Conditional Layer Normalization (SCLN) is employed to calibrate sample-level feature statistics, mitigating residual distribution shifts without relying on subject identity.
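Amplitude and phase adjustment in the Fourier domain can be sketched with a small discrete Fourier transform. In this sketch the per-bin gains and phase shifts are simply given as inputs, whereas FBAM is described as deriving them through cross-attention; the function names and the naive O(n^2) transform are choices made for illustration.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def modulate(x, gain, phase_shift):
    """Scale amplitude and shift phase for bins 0..n//2, then mirror the result.

    The upper bins are filled by conjugate symmetry so that the output is a
    real signal. Phase shifts at the DC bin (and at the Nyquist bin for even n)
    should be 0 for an exactly real result.
    """
    n = len(x)
    X = dft(x)
    half = n // 2
    Y = [0j] * n
    for k in range(half + 1):
        Y[k] = X[k] * gain[k] * cmath.exp(1j * phase_shift[k])
    for k in range(half + 1, n):
        Y[k] = Y[n - k].conjugate()
    return idft(Y)

signal = [math.sin(2 * math.pi * t / 8) for t in range(8)]
unchanged = modulate(signal, gain=[1.0] * 5, phase_shift=[0.0] * 5)
doubled = modulate(signal, gain=[2.0] * 5, phase_shift=[0.0] * 5)
```

With unit gains and zero shifts the module reduces to the identity, so any learned modulation can be read as a deviation from passing the signal through unchanged.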
Results
The experiments conducted on six biomedical time-series datasets show that BioFormer outperforms 12 baseline models, achieving an absolute F1-score improvement of 6%. This indicates a significant enhancement in the model's ability to generalize across subjects, effectively addressing the challenges posed by subject-specific variability.
Implications
The findings suggest that explicitly modeling subject-specific variability can lead to more robust and generalizable models in biomedical applications, such as disease screening and monitoring. The proposed methods may also be applicable to other domains where cross-subject or cross-domain generalization is critical.
Harnesses for Inference-Time Alignment over Execution Trajectories
NLP
Large Language Models
Theory
- Harness design is framed as an inference-time alignment problem, focusing on workflow and guidance components.
- Optimal granularity in task decomposition must align with the agent's capabilities and retry budgets.
- Guidance improves performance only when it aligns with task evidence; misalignment can lead to hallucinations.
- Partial harnessing, which specifies only initial task stages, can outperform fully structured workflows.
Summary
This paper investigates the design of harnesses for large language model (LLM) agents, focusing on inference-time trajectory alignment to enhance long-term performance. The authors identify two main components of harness design: task decomposition, which breaks tasks into sub-goals, and guided execution, which influences action distributions during execution. They argue that while more elaborate harnesses might seem beneficial, they can lead to diminishing returns or even failures such as over-decomposition and hallucinated execution. The study introduces the concept of 'Partial Harnessing,' where only the initial steps of a task are specified, allowing the agent to handle the rest autonomously. This approach can yield higher success rates compared to fully structured workflows. The findings are validated through controlled experiments and benchmarks, emphasizing the importance of aligning harness design with the agent's capabilities and the task evidence.
Methodology
The authors conducted theoretical modeling of harness design, separating it into workflow and guidance components. They performed controlled synthetic experiments and evaluated real-world benchmarks (Terminal-Bench v2) to validate their findings regarding the effectiveness of partial harnessing and the alignment principles.
Results
The experiments demonstrated that partial harnessing can lead to higher pass rates compared to fully specified workflows. The study identified specific failure modes associated with over-decomposition and misaligned guidance, confirming the theoretical predictions about the importance of aligning harness design with agent capabilities.
Implications
The findings suggest that harness design should not only focus on adding structure but also consider when to stop adding it. This has implications for the development of more efficient LLM agents capable of executing complex tasks with greater autonomy and reliability.
Decomposing Ensemble Spread in Lorenz '96 With Learned Stochastic Parameterizations
Time Series
Theory
- The paper rigorously defines and decomposes sources of uncertainty in weather forecasting.
- It systematically compares various parameterization strategies, including novel machine learning approaches.
- Stochastic parameterizations with persistent structures improve early spread growth and spread-error consistency.
- The study provides insights into how different uncertainties interact in chaotic systems.
Summary
This paper addresses the inherent uncertainties in weather and climate forecasting, which arise from chaotic dynamics, imperfect initial conditions, and model deficiencies. The authors utilize the two-scale Lorenz 1996 (L96) system as a controlled testbed to systematically analyze and decompose the contributions of intrinsic variability, initial-condition perturbations, and stochastic model uncertainty to ensemble spread. They compare various ensemble configurations and parameterization strategies, including deterministic, autoregressive, Bayesian, and novel machine learning-based approaches. The findings indicate that ensemble perturbations do not increase long-term variance but rather influence the rate of trajectory decorrelation and exploration of the invariant measure. Stochastic parameterizations, especially those with temporally persistent structures, significantly enhance early spread growth and improve the consistency between spread and error. This work clarifies the interactions between different sources of uncertainty in chaotic systems and offers practical guidance for designing and evaluating stochastic parameterizations in weather and climate models.
Methodology
The authors developed a systematic approach to evaluate the contributions of internal variability, initial-condition uncertainty, and model uncertainty in the L96 system. They employed multiple ensemble configurations and analyzed the effects of different parameterization strategies on spread growth, forecast skill, and ensemble calibration using both stationary and dynamical diagnostics.
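To make the setup concrete, the sketch below integrates the slow variables of Lorenz '96 with the unresolved fast scale replaced by a parameterized subgrid term: a deterministic part plus temporally persistent AR(1) noise. The linear form of the deterministic part, its coefficients, the forward Euler integrator, and all parameter values are assumptions for illustration and are not the configurations studied in the paper.

```python
import math
import random

def l96_tendency(x, forcing, subgrid):
    """dX_k/dt = -X_{k-1} (X_{k-2} - X_{k+1}) - X_k + F - U_k on a periodic ring."""
    n = len(x)
    return [-x[k - 1] * (x[k - 2] - x[(k + 1) % n]) - x[k] + forcing - subgrid[k]
            for k in range(n)]

def ar1_step(e, phi, sigma, rng):
    """AR(1) noise with lag-1 correlation phi and stationary std sigma."""
    return [phi * ek + sigma * math.sqrt(1.0 - phi * phi) * rng.gauss(0.0, 1.0)
            for ek in e]

def run_member(x0, forcing, n_steps, dt, phi, sigma, seed, a0=0.0, a1=0.1):
    """One ensemble member; a0 and a1 are hypothetical subgrid coefficients."""
    rng = random.Random(seed)
    x = list(x0)
    e = [0.0] * len(x0)
    for _ in range(n_steps):
        subgrid = [a0 + a1 * xk + ek for xk, ek in zip(x, e)]
        dx = l96_tendency(x, forcing, subgrid)
        x = [xk + dt * dk for xk, dk in zip(x, dx)]  # forward Euler for brevity
        e = ar1_step(e, phi, sigma, rng)
    return x

x0 = [8.0] * 8
x0[0] += 0.01
# Same initial condition, different noise seeds: spread comes from model uncertainty only.
member_a = run_member(x0, forcing=8.0, n_steps=200, dt=0.005, phi=0.9, sigma=0.5, seed=1)
member_b = run_member(x0, forcing=8.0, n_steps=200, dt=0.005, phi=0.9, sigma=0.5, seed=2)
```

Holding the initial condition fixed while varying only the seed isolates the stochastic-model contribution to spread; perturbing `x0` with a shared seed would isolate the initial-condition contribution instead.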
Results
The study revealed that ensemble perturbations regulate the rate of decorrelation rather than increasing long-term variance. Stochastic parameterizations, particularly those with temporally persistent structures, were found to enhance early spread growth and improve the consistency between ensemble spread and forecast error.
Implications
The findings have significant implications for the design and evaluation of stochastic parameterizations in operational weather and climate models, potentially leading to improved forecast accuracy and reliability.
Integrable Elasticity via Neural Demand Potentials
Theory
Optimization
Interpretability
- ICDN formulates multiproduct elasticity estimation as a demand-first problem, allowing elasticities to be derived as derivatives of a learned demand surface.
- The model combines linear and nonlinear components to capture complex price interactions while maintaining analytical tractability.
- ICDN utilizes analytic derivatives of spline bases for efficient elasticity computation, avoiding the need for dense Jacobian evaluations.
- The approach incorporates SKU-specific contextual conditioning and attention-modulated interactions to enhance demand estimation.
Summary
This paper introduces the Integrable Context-Dependent Demand Network (ICDN), a novel neural model designed for estimating multiproduct retail demand. The ICDN learns log-demand as a smooth function of log-prices and contextual factors, enabling the exact derivation of elasticities from the demand surface. The authors highlight the challenges of estimating stable and economically plausible elasticities in a multiproduct context, where demand is influenced by various factors such as promotions and price correlations. The ICDN model addresses these challenges by decomposing price effects into interpretable linear components and flexible spline-based nonlinear refinements. The empirical evaluation on Dominick's beer dataset demonstrates that ICDN outperforms traditional benchmarks in terms of out-of-sample generalization and provides more stable elasticity estimates, particularly for cross-price effects. The methodology balances predictive accuracy with economic regularity, ensuring that the derived elasticities are coherent and reflect expected demand patterns.
Methodology
The ICDN model is structured to represent log-demand as a smooth function of log-prices and contextual covariates. It decomposes price responses into linear and nonlinear components, utilizing spline bases for flexibility. The model employs a shared encoder for contextual information and allows for independent learning of cross-price effects between products. Elasticities are computed analytically from the spline derivatives, ensuring scalability and transparency in multiproduct settings.
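The demand-first idea, where elasticities are derivatives of a log-demand surface with respect to log-prices, can be sketched for a single product with one linear term and one spline term per price. The truncated-power quadratic basis, the knot positions, and every coefficient below are assumptions for illustration; the paper's actual spline basis and context conditioning are not reproduced here.

```python
def log_demand(x, a, b, c, knots):
    """Log-demand of one product as a function of log-prices x (own and cross).

    Linear part: b[j] * x[j]. Nonlinear part: truncated-power quadratic splines
    c[j][m] * max(x[j] - knot_m, 0)**2, a stand-in basis chosen for this sketch.
    """
    total = a
    for j, xj in enumerate(x):
        total += b[j] * xj
        for m, kappa in enumerate(knots):
            total += c[j][m] * max(xj - kappa, 0.0) ** 2
    return total

def elasticities(x, b, c, knots):
    """Analytic d(log q) / d(log p_j): linear coefficient plus spline derivative."""
    return [b[j] + sum(2.0 * c[j][m] * max(xj - kappa, 0.0)
                       for m, kappa in enumerate(knots))
            for j, xj in enumerate(x)]

# Hypothetical parameters: own-price effect on x[0], cross-price effect on x[1].
a, b = 1.0, [-1.5, 0.4]
c = [[0.2, -0.1], [0.05, 0.0]]
knots = [-0.5, 0.0]
x = [0.3, -0.2]
eps = elasticities(x, b, c, knots)

# Central finite differences as an independent check on the analytic derivative.
h = 1e-6
fd = []
for j in range(len(x)):
    up = list(x); up[j] += h
    dn = list(x); dn[j] -= h
    fd.append((log_demand(up, a, b, c, knots) - log_demand(dn, a, b, c, knots)) / (2 * h))
```

Because each elasticity is a closed-form sum over basis derivatives, no automatic-differentiation pass over a dense Jacobian is needed, which matches the scalability argument in the summary.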
Results
Empirical results on the Dominick's Finer Foods dataset indicate that ICDN significantly improves out-of-sample generalization compared to traditional log-log benchmarks. The model yields more stable and economically plausible elasticity estimates, particularly for weakly identified cross-price effects, confirming the effectiveness of the demand-first approach.
Implications
The findings suggest that ICDN can be a valuable tool for retailers in optimizing pricing strategies and understanding demand dynamics across multiple products. The model's ability to provide stable elasticity estimates can enhance revenue management and promotional planning in complex retail environments.
A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction
Optimization
Interpretability
- Introduction of a log-driven AutoML framework for reproducible pipeline optimization.
- Evaluation of over 18,000 pipeline configurations reveals a structured search space.
- Key performance drivers identified include augmentation, model choice, and imbalance handling.
- Ensemble models achieve strong performance, with a Macro-F1 score of 0.88 on Pima and a Weighted-F1 score of 0.94 on Stroke.
Summary
This paper presents a novel AutoML framework named yvsoucom-iterkit, designed to enhance reproducibility and interpretability in healthcare risk prediction. The framework addresses challenges such as heterogeneous features, limited sample sizes, and severe class imbalance by formulating pipeline optimization as a fully reproducible configuration-level system. Each pipeline is represented as a traceable log entity, facilitating the analysis of component interactions and performance attribution. The authors conducted extensive experiments on the Pima Indians Diabetes and Stroke datasets, evaluating over 18,000 pipeline configurations. The results reveal a structured and partially redundant search space, where a small subset of components significantly influences performance. Key findings include the identification of crucial factors such as augmentation, model choice, and imbalance handling, which drive performance in different datasets. Ensemble models demonstrated strong and stable performance, with a Macro-F1 score of approximately 0.88 on the Pima dataset and a Weighted-F1 score of 0.94 on the Stroke dataset, although Macro-F1 on Stroke dropped to 0.67 under severe class imbalance. The framework's cross-seed analysis highlighted a trade-off between performance and robustness, showing that ensembles are more stable than SVM models. Overall, the study emphasizes the importance of joint optimization across preprocessing and modeling stages in AutoML for healthcare applications.
Methodology
The study employs a pipeline-centric AutoML framework that optimizes preprocessing, data augmentation, imbalance handling, and classification models within a unified configuration space. It utilizes a log-driven execution paradigm to ensure reproducibility and traceability of experiments, allowing for comprehensive analysis of pipeline configurations and their performance outcomes.
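To make the idea of a pipeline as a traceable log entity concrete, here is a minimal sketch. It is not the framework's actual API: the component names in `SEARCH_SPACE`, the `log_entry` helper, and the hashing scheme are all hypothetical. The point is only that a deterministic identifier derived from the full configuration and seed lets any run be looked up and repeated.

```python
import hashlib
import itertools
import json

# Hypothetical component choices; the summary does not list the real options.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "augmentation": ["none", "smote"],
    "imbalance": ["none", "class_weight"],
    "model": ["svm", "random_forest", "ensemble"],
}


def enumerate_configs(space):
    """Yield every combination of component choices as a dict."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def log_entry(config, seed):
    """Build a log record whose ID is a deterministic hash of config and seed."""
    payload = json.dumps({"config": config, "seed": seed}, sort_keys=True)
    run_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return {"run_id": run_id, "config": config, "seed": seed}


entries = [log_entry(c, seed=0) for c in enumerate_configs(SEARCH_SPACE)]
```

Because the identifier depends only on the configuration and seed, re-running the enumeration reproduces the same IDs, which is what makes per-component performance attribution across thousands of runs tractable.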
Results
The framework achieved strong performance metrics, with ensemble models yielding a Weighted-F1 score of 0.89 and a Macro-F1 score of 0.88 on the Pima dataset, and a Weighted-F1 score of 0.94 on the Stroke dataset. However, the Macro-F1 score on the Stroke dataset was lower at 0.67 due to class imbalance. The analysis revealed that many configurations in the AutoML search space were redundant, and a small subset of components was responsible for the majority of the performance.
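The gap between Weighted-F1 and Macro-F1 on an imbalanced dataset follows directly from how the two averages are defined. The sketch below uses synthetic labels (not the Stroke data) to show the effect: Weighted-F1 is dominated by the majority class, while Macro-F1 gives the minority class equal weight.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score for a single class, treating it as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores weighted by class frequency in y_true."""
    labels = sorted(set(y_true))
    n = len(y_true)
    return sum(
        f1_per_class(y_true, y_pred, l) * y_true.count(l) / n for l in labels
    )


# Synthetic, illustrative labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# 93 negatives correct, 2 false positives, 2 of 5 positives found, 3 missed.
y_pred = [0] * 93 + [1] * 2 + [1] * 2 + [0] * 3

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
```

On these synthetic labels the weighted score stays high even though fewer than half of the positives are recovered, which is the same qualitative pattern as the Stroke figures reported above.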
Implications
The proposed framework can significantly enhance the reproducibility and interpretability of machine learning models in healthcare risk prediction. Its ability to systematically explore and optimize preprocessing and modeling strategies can lead to more reliable predictions and better patient outcomes. Additionally, the framework's design allows for application in other domains involving categorical prediction tasks.
Generative Modeling by Value-Driven Transport
Generative Models
Reinforcement Learning
Optimization
- Introduces a new framework for generative modeling based on discrete-time stochastic control.
- Develops a primal-dual algorithm for efficiently computing value functions and VDT policies.
- Demonstrates that VDT policies can significantly reduce the number of generation steps required.
- Shows that the learned value functions can be symmetrically applied for transport in both directions.
Summary
This paper introduces a novel framework for generative modeling that leverages a discrete-time stochastic control approach to measure transport. By reformulating the dynamic optimal transport problem as a linear programming (LP) problem, the authors derive an efficient simulation-free primal-dual algorithm to compute approximately optimal value functions and the corresponding value-driven transport (VDT) policies. These VDT policies exhibit several advantages over existing methods, including faster simulation of transport paths and the ability to incorporate enhancements like conditional generation and unpaired data translation. The methodology is validated through extensive experiments, demonstrating strong performance and scalability, particularly in reducing the number of generation steps needed during inference. The authors argue that their approach, rooted in optimal control rather than traditional continuous-time methods, offers a fresh perspective on generative modeling, with potential applications in various control tasks within reinforcement learning.
Methodology
The authors reformulate the dynamic optimal transport problem as a discrete-time stochastic control problem, allowing them to derive a linear programming formulation. They then develop a primal-dual method to approximate the optimal value functions and transport policies, utilizing Wasserstein gradient descent for efficient updates without requiring simulated trajectories.
Results
The experiments conducted show that the proposed VDT policies outperform state-of-the-art methods in terms of robustness and scalability. The ability to reduce the number of generation steps by up to a factor of 10 during inference is a notable finding, indicating the efficiency of the learned transport paths.
Implications
The proposed framework could significantly enhance generative modeling tasks across various domains, including image generation and data translation. Its foundation in optimal control may also inspire new approaches in reinforcement learning and other control-related tasks.
EmoTrack: Robust Depression Tracking from Counseling Transcripts across Session Regimes
NLP
Large Language Models
- EmoTrack integrates structured clinical signals with turn-level semantics for improved depression tracking.
- LONGCOUNSEL-8 dataset introduces session-level PHQ-8 supervision for multi-session counseling evaluation.
- The framework effectively utilizes prior-session context while minimizing noise from historical data.
- EmoTrack shows significant performance improvements over existing single-session and longitudinal benchmarks.
Summary
The paper presents EmoTrack, a novel framework for predicting depression severity using counseling transcripts, addressing the challenges of robust PHQ-8 prediction across single-session and multi-session settings. The authors highlight the limitations of existing methods, which either rely heavily on labeled data or treat transcripts holistically without considering longitudinal context. To overcome these issues, they introduce LONGCOUNSEL-8, a multi-session counseling dataset that provides session-level PHQ-8 supervision, facilitating the evaluation of depression tracking under various conditions. EmoTrack combines clinical signals extracted from large language models (LLMs) with turn-level semantic embeddings and employs a compact cross-session memory to enhance prediction accuracy. The framework effectively integrates evidence from both current and prior sessions while mitigating noise from historical data. Experimental results demonstrate that EmoTrack significantly outperforms existing benchmarks, achieving a 13.5% relative reduction in mean absolute error (MAE) on the DAIC-WOZ dataset and maintaining competitive performance on the LONGCOUNSEL-8 dataset.
Methodology
EmoTrack employs a hybrid approach that extracts structured clinical features from transcripts using LLMs, combines them with frozen turn-level semantic embeddings, and trains symptom-specific predictors. It utilizes self-attention mechanisms to integrate evidence and employs a compact attention-based memory to incorporate prior-session context selectively.
Results
EmoTrack achieved a 13.5% relative reduction in mean absolute error (MAE) on the DAIC-WOZ dataset compared to the strongest baseline. It also demonstrated competitive performance on the LONGCOUNSEL-8 dataset, validating its effectiveness in both single-session and multi-session contexts.
Implications
The findings suggest that EmoTrack can enhance AI-driven mental health support systems by providing more accurate and context-aware assessments of depression severity, potentially leading to timely interventions and improved patient outcomes.
From Snapshots to Trajectories: Learning Single-Cell Gene Expression Dynamics via Conditional Flow Matching
Generative Models
Time Series
- Introduces Single-Cell Flow Matching (scFM) to model gene expression dynamics from sparse scRNA-seq data.
- Addresses challenges of ambiguous transitions and long-horizon prediction drift.
- Combines optimal transport alignment with generative modeling for improved temporal coherence.
- Demonstrates superior performance in trajectory reconstruction and distributional prediction.
Summary
The paper addresses the challenge of inferring continuous gene expression dynamics from sparse single-cell RNA sequencing (scRNA-seq) snapshots, which are collected at discrete time points. Existing methods struggle with ambiguous local transitions due to unpaired snapshots and face issues with long-horizon predictions that can lead to distribution drift. To overcome these challenges, the authors propose a novel framework called Single-Cell Flow Matching (scFM). This framework integrates optimal transport (OT) for aligning snapshot distributions with a generative model that learns time-dependent velocity fields. The scFM method computes entropically regularized OT couplings to create soft flow-matching targets, learns bidirectional velocity fields to enhance temporal coherence, and incorporates distribution-level alignment to anchor long-term predictions. The authors demonstrate that scFM significantly improves the accuracy of trajectory reconstruction and temporal coherence in gene expression dynamics, outperforming existing methods in both interpolation and extrapolation tasks on real-world datasets.
Methodology
The scFM framework employs a latent generative model that utilizes optimal transport to compute couplings between adjacent snapshots. It constructs soft flow-matching targets for learning velocity fields and refines these fields through bidirectional learning. Additionally, it incorporates distribution-level alignment and latent dynamic regularization to mitigate drift during long-term predictions.
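The coupling step can be sketched with a plain Sinkhorn iteration. This is a toy under stated assumptions, not the scFM implementation: the cells are one-dimensional points rather than latent embeddings, the regularization strength `eps` is arbitrary, and the "soft target" here is the barycentric displacement implied by each row of the coupling, which is one simple way to turn a soft coupling into a velocity target.

```python
import math


def sinkhorn(cost, a, b, eps=1.0, iters=500):
    """Entropically regularized OT coupling between histograms a and b."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def soft_targets(x0, x1, plan):
    """Expected displacement x1 - x0 under each row of the coupling."""
    targets = []
    for i, xi in enumerate(x0):
        row_mass = sum(plan[i])
        mean_x1 = sum(plan[i][j] * x1[j] for j in range(len(x1))) / row_mass
        targets.append(mean_x1 - xi)
    return targets


# Toy 1-D "cells" at two adjacent time points (illustrative values only).
x0 = [0.0, 1.0, 2.0]
x1 = [0.5, 1.5, 2.5]
cost = [[(p - q) ** 2 for q in x1] for p in x0]
a = [1 / 3] * 3
b = [1 / 3] * 3

plan = sinkhorn(cost, a, b)
velocities = soft_targets(x0, x1, plan)
```

Because the coupling is soft, each source cell contributes a weighted average over plausible successors rather than a single hard match, which is how unpaired snapshots can still yield usable regression targets.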
Results
Experiments on real-world scRNA-seq datasets show that scFM consistently outperforms existing methods, yielding improved distributional prediction performance for both temporal interpolation and extrapolation. The method also achieves more accurate trajectory reconstructions and temporally coherent visualizations, effectively recovering underlying gene expression dynamics even in the absence of intermediate time points.
Implications
The proposed scFM framework has significant implications for biological research, particularly in understanding cellular dynamics and transitions in complex biological systems. It can facilitate better modeling of developmental processes and disease progression by providing insights into gene expression changes over time.
The Signal in the Noise: OOD Detection Through Goodness-of-Fit Testing in Factorised Latent Spaces
Generative Models
Theory
- Introduces a novel framework for OOD detection based on goodness-of-fit testing in factorised latent spaces.
- Proposes the SITN method, which requires no OOD data and incurs minimal computational overhead.
- Demonstrates strict Type I error control and effective performance through comprehensive evaluations.
- Highlights the limitations of likelihood-based OOD detection methods and provides a solution that avoids their biases.
Summary
This paper addresses the challenge of out-of-distribution (OOD) detection using deep generative models, particularly focusing on Continuous Normalising Flows (CNFs). Previous methods relying on likelihoods for OOD detection have been shown to be unreliable, as they often assign higher likelihoods to OOD samples than to in-distribution samples. The authors propose a new method called Signal in the Noise (SITN), which leverages the properties of CNFs to detect OOD samples by analyzing their atypicality in a factorised latent space. SITN operates on a single-sample level, does not require access to OOD data, and maintains strict control over false positive rates. The method is validated through extensive evaluations on standard benchmarks and synthetic perturbations, demonstrating its effectiveness and highlighting its advantages over traditional likelihood-based methods.
Methodology
The authors leverage the diffeomorphic and mass-preserving properties of Continuous Normalising Flows to analyze the mapping of OOD samples to noise latents. They quantify atypicality at the single-sample level by testing the marginal normality and dimensional independence of latent vector elements, which are expected to behave as i.i.d. samples from a standard normal distribution.
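A simplified version of the marginal-normality check can be written with a one-sample Kolmogorov-Smirnov test. This is a stand-in chosen for illustration, not necessarily the test statistic SITN uses, and it covers only the marginal-normality half of the description (independence across dimensions is not tested here). The latent vectors are simulated with Gaussian draws rather than produced by a flow.

```python
import math
import random


def std_normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(values):
    """One-sample Kolmogorov-Smirnov distance to the standard normal."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = std_normal_cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def looks_atypical(latent, crit=1.358):
    """Flag a latent vector whose elements deviate from N(0, 1).

    crit / sqrt(n) is the asymptotic 5% critical value of the KS test,
    so the false positive rate is controlled at roughly 5% for large n.
    """
    return ks_statistic(latent) > crit / math.sqrt(len(latent))


rng = random.Random(0)
# Simulated latents: one matching the prior, one with a mean shift.
in_dist = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [rng.gauss(0.8, 1.0) for _ in range(2000)]

flag_shifted = looks_atypical(shifted)
```

The test needs only the single latent vector and the known prior, which mirrors the property that no OOD data is required to calibrate the detector.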
Results
The SITN method was evaluated against standard benchmarks and synthetic perturbations, showing superior performance in OOD detection compared to existing likelihood-based methods. The results indicated that SITN effectively controls false positive rates while maintaining high detection accuracy.
Implications
The proposed method has significant implications for deploying machine learning models in high-risk domains, such as healthcare, where reliable OOD detection is crucial. By enabling models to identify when they are faced with unreliable data, SITN can help prevent catastrophic failures in decision-making processes.
Beyond Scalar Objectives: Expert-Feedback-Driven Autonomous Experimentation for Scientific Discovery at the Nanoscale
Robotics
Optimization
Theory
- Introduction of deep-kernel pairwise learning (DKPL) to enhance autonomous experimentation.
- DKPL integrates expert feedback to evaluate experimental outputs without relying on scalar metrics.
- Demonstrated effectiveness in identifying nanoscale structures and characterizing ferroelectric domain walls.
- Addresses limitations of traditional Bayesian optimization frameworks in capturing complex scientific phenomena.
Summary
This paper presents a novel approach to autonomous experimentation in scientific discovery, particularly at the nanoscale, by developing a method called deep-kernel pairwise learning (DKPL). Traditional Bayesian optimization frameworks in self-driving laboratories rely on predefined scalar descriptors, which can limit their effectiveness in capturing complex phenomena. The authors argue that many important scientific insights are lost when raw experimental data is reduced to scalar metrics. DKPL addresses this limitation by incorporating expert feedback directly into the experimental process, allowing domain experts to evaluate the promise of experimental outputs without the constraints of scalar metrics. The method learns a latent utility function from expert judgments, guiding subsequent experiments more effectively. The authors demonstrate DKPL's capabilities through experiments on nanoscale structures and ferroelectric domain walls, showing that it can prioritize high-information measurement regions and distinguish between critical characteristics in materials. This work establishes a pathway for integrating human expertise into autonomous experimentation, enhancing the potential for scientific discovery beyond traditional scalar-driven approaches.
Methodology
The authors developed DKPL, a preference-driven active learning framework that utilizes pairwise comparisons of experimental outputs evaluated by experts. This method learns from expert judgments to construct a latent utility function, guiding the selection of subsequent experiments in a more informed manner than traditional scalar metrics.
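Learning a latent utility from pairwise judgments can be illustrated with a Bradley-Terry model. This is a deliberately reduced sketch: DKPL learns utility through a deep kernel over experimental outputs, whereas the code below assigns one free scalar per candidate. The comparison data, learning rate, and regularization strength are all hypothetical.

```python
import math


def fit_utilities(n_items, comparisons, lr=0.1, steps=2000, l2=0.01):
    """Fit latent utilities from (winner, loser) pairs via Bradley-Terry.

    Maximizes the L2-regularized log-likelihood with plain gradient ascent.
    """
    u = [0.0] * n_items
    for _ in range(steps):
        grad = [-l2 * ui for ui in u]
        for w, l in comparisons:
            # Probability the model assigns to the observed preference.
            p_w = 1.0 / (1.0 + math.exp(-(u[w] - u[l])))
            grad[w] += 1.0 - p_w
            grad[l] -= 1.0 - p_w
        u = [ui + lr * g for ui, g in zip(u, grad)]
    return u


# Hypothetical expert judgments over 4 candidate measurements,
# each written as (preferred, less preferred).
comparisons = [(2, 0), (2, 1), (3, 2), (3, 0), (1, 0), (3, 1)]

utilities = fit_utilities(4, comparisons)
# Greedy choice: measure next where the learned utility is highest.
next_experiment = max(range(4), key=lambda i: utilities[i])
```

The expert never supplies a numeric score; the ranking emerges from comparisons alone, which is the property that lets this kind of loop sidestep predefined scalar descriptors.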
Results
DKPL was shown to effectively learn and prioritize high-information measurement regions in a model dataset with known ground truth. In practical applications, it successfully distinguished between different domain-wall characteristics in ferroelectric materials, demonstrating its capability to handle complex, multidimensional data.
Implications
The integration of expert feedback into autonomous experimentation could significantly enhance the discovery process in materials science and other fields, allowing for exploration of complex phenomena that are not easily quantifiable. This approach may lead to more efficient and insightful scientific discoveries, particularly in areas where traditional metrics fall short.
What are the Right Symmetries for Formal Theorem Proving?
Theory
Large Language Models
- Introduces rewriting categories as a framework for modeling transformations in formal theorem proving.
- Defines proof equivariance and success invariance as critical symmetry properties for theorem provers.
- Demonstrates that LLM-based provers exhibit significant performance variability across equivalent formulations.
- Proposes a test-time aggregation method that improves robustness and proof success rates.
Summary
This paper addresses the sensitivity of formal theorem provers based on large language models (LLMs) to variations in problem representation, which can lead to significant differences in proof success rates for semantically equivalent statements. The authors introduce a category-theoretic framework called rewriting categories to formalize two key symmetry notions: proof equivariance, which ensures that proof distributions transform consistently under rewrites, and success invariance, which requires that equivalent statements have the same probability of being solved. The study finds that while state-based next-tactic provers naturally satisfy proof equivariance, LLM-based provers do not, leading to performance variability. To address this issue, the authors propose test-time methods that aggregate over equivalent rewritings of the input, demonstrating that these methods can recover success invariance in the sampling limit and improve robustness and performance under fixed inference budgets. The findings suggest that incorporating symmetry as an inductive bias is crucial for enhancing LLM-based theorem proving.
Methodology
The authors develop a category-theoretic framework called rewriting categories to formalize the concepts of proof equivariance and success invariance. They conduct empirical analyses using a benchmark of semantically equivalent reformulations of formal problems to evaluate the performance of LLM-based provers. Additionally, they propose a model-agnostic test-time procedure that aggregates equivalent rewritings to enhance proof success rates.
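The effect of aggregating over rewritings under a fixed budget can be worked out with simple probability. The sketch below assumes independent samples and uses made-up per-sample success rates for four equivalent statements of one theorem; the even budget split is one possible aggregation rule, not necessarily the one the authors use.

```python
def pass_at_budget_single(p, budget):
    """Success probability when every sample uses one formulation."""
    return 1.0 - (1.0 - p) ** budget


def pass_at_budget_aggregated(probs, budget):
    """Success probability when the budget is split evenly over rewrites."""
    per_rewrite, extra = divmod(budget, len(probs))
    fail = 1.0
    for i, p in enumerate(probs):
        n = per_rewrite + (1 if i < extra else 0)
        fail *= (1.0 - p) ** n
    return 1.0 - fail


# Hypothetical per-sample success rates of one prover on four
# semantically equivalent statements of the same theorem.
rewrite_probs = [0.30, 0.02, 0.10, 0.00]
budget = 16

single = [pass_at_budget_single(p, budget) for p in rewrite_probs]
aggregated = pass_at_budget_aggregated(rewrite_probs, budget)
```

With these numbers, the single-formulation success rate ranges from 0 to nearly 1 depending on which statement happens to be given, while the aggregated rate is the same for every input in the equivalence class. Aggregation does not beat the best single formulation here, but it removes the dependence on an arbitrary choice of representation.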
Results
The empirical analysis reveals that state-of-the-art LLM-based provers fail to maintain success invariance, showing large performance discrepancies across equivalent formulations. The proposed test-time aggregation method significantly improves the robustness and overall proof success of LLM-based provers under fixed inference budgets, recovering success invariance in the sampling limit.
Implications
The findings highlight the importance of symmetry in formal theorem proving and suggest that incorporating symmetry as an inductive bias can lead to more reliable and effective LLM-based theorem provers. This work opens avenues for further research into enhancing theorem proving systems by addressing representation sensitivity.
AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems
Large Language Models
Efficient ML
- AutoMCU shifts from proxy-driven hardware-aware search to a feasibility-first approach for neural network customization.
- It employs a hardware-in-the-loop mechanism to filter infeasible architecture candidates before training.
- The system integrates proposal, training, evaluation, and deployment stages in a closed-loop manner.
- AutoMCU significantly reduces customization time compared to traditional methods.
Summary
The paper introduces AutoMCU, a novel system designed to facilitate the deployment of neural networks on microcontroller units (MCUs) by addressing the significant constraints of memory, storage, and computation inherent in these devices. Traditional methods for neural network customization often rely on proxy metrics and involve extensive manual iterations, leading to inefficiencies and high development costs. AutoMCU leverages a feasibility-first approach using a large language model (LLM) within a multi-agent system framework. This system generates structured architecture candidates based on natural language task requirements and hardware specifications, filtering out infeasible designs through vendor toolchain feedback before training. It evaluates feasible models under a controlled protocol and verifies their deployability through backend-grounded analysis. Key innovations include hardware-in-the-loop architecture generation for early elimination of undeployable candidates and state-isolated multi-agent scheduling for efficient coordination across the proposal, training, evaluation, and deployment stages. Experimental results demonstrate that AutoMCU achieves competitive accuracy while significantly reducing customization time to 1-2 hours, compared to hundreds of GPU hours required by existing hardware-aware neural architecture search (HW-NAS) methods. Real-device deployments on STM32 microcontrollers further validate its practical applicability for edge intelligence.
Methodology
AutoMCU utilizes a large language model (LLM) to generate structured architecture candidates based on task requirements and hardware constraints. It incorporates a multi-agent system to coordinate the stages of proposal, training, evaluation, and deployment, ensuring that only feasible designs are pursued. The hardware-in-the-loop mechanism allows for early filtering of candidates based on vendor-specific backend analysis.
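The feasibility-first ordering can be sketched as a gate that runs before any training is scheduled. All names and numbers below are hypothetical: `check_feasible` is a stand-in for the vendor toolchain feedback, and the memory estimates are invented rather than measured on an STM32 target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    flash_kb: int  # estimated weight storage
    ram_kb: int    # estimated peak activation memory


@dataclass(frozen=True)
class Budget:
    flash_kb: int
    ram_kb: int


def check_feasible(cand, budget):
    """Stand-in for vendor toolchain feedback; returns (ok, reason)."""
    if cand.flash_kb > budget.flash_kb:
        return False, "flash overflow"
    if cand.ram_kb > budget.ram_kb:
        return False, "ram overflow"
    return True, "ok"


def filter_before_training(candidates, budget):
    """Split candidates into trainable ones and rejected ones with reasons."""
    feasible, rejected = [], []
    for cand in candidates:
        ok, reason = check_feasible(cand, budget)
        (feasible if ok else rejected).append((cand, reason))
    return [c for c, _ in feasible], rejected


budget = Budget(flash_kb=1024, ram_kb=320)  # hypothetical device limits
candidates = [
    Candidate("tiny_cnn", flash_kb=180, ram_kb=96),
    Candidate("wide_cnn", flash_kb=1500, ram_kb=200),
    Candidate("deep_cnn", flash_kb=700, ram_kb=512),
]

feasible, rejected = filter_before_training(candidates, budget)
```

Returning the rejection reason alongside each discarded candidate matters for a closed loop, since the proposing agent can use it to steer the next round of architectures.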
Results
Experiments conducted on CIFAR-10 and CIFAR-100 datasets demonstrate that AutoMCU achieves competitive accuracy while reducing the customization time to approximately 1-2 hours. This is a significant improvement over traditional HW-NAS methods, which can take hundreds of GPU hours. Additionally, comparisons with other methods like ColabNAS and GENIUS on NAS-Bench-201 highlight AutoMCU's effectiveness and stability.
Implications
The findings suggest that AutoMCU can streamline the process of deploying neural networks on resource-constrained devices, making it easier for developers to implement edge intelligence applications in various IoT scenarios. This could lead to broader adoption of neural networks in smart homes, industrial automation, healthcare, and agriculture.
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
NLP
Large Language Models
Reinforcement Learning
- OPSD can degrade reasoning performance by suppressing uncertainty in token-level supervision.
- DASD introduces entropy-routed supervision, pushing high-entropy tokens away from the teacher and pulling low-entropy tokens towards it.
- DASD achieves superior performance on mathematical reasoning benchmarks compared to traditional self-distillation methods.
- The proposed method preserves exploration while maintaining step-level execution accuracy.
Summary
This paper addresses the limitations of On-policy Self-Distillation (OPSD) in large language models (LLMs), particularly its tendency to suppress predictive uncertainty, which is crucial for complex reasoning tasks. The authors identify that OPSD's uniform teacher supervision across tokens leads to degradation in reasoning performance, especially in high-entropy situations where exploration is necessary. To mitigate this issue, they propose Direction-Adaptive Self-Distillation (DASD), which adapts the direction of supervision based on the entropy of tokens. High-entropy tokens are encouraged to explore alternative paths by moving away from the teacher's guidance, while low-entropy tokens are guided towards the teacher to stabilize execution. The effectiveness of DASD is demonstrated across six mathematical reasoning benchmarks, where it outperforms existing methods by balancing exploration and execution accuracy, particularly in challenging reasoning scenarios.
Methodology
The authors analyze the limitations of OPSD through token-level analysis and propose DASD, which employs an entropy-based approach to direct supervision. High-entropy tokens are treated differently from low-entropy tokens, allowing for a more nuanced training process that enhances both exploration and execution stability.
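A minimal sketch of entropy-routed supervision follows. Several details are assumptions, since the summary does not specify them: entropy is measured on the student distribution, a fixed `threshold` separates high from low entropy, and the supervision signal is a signed KL divergence from teacher to student. The token distributions are toy values.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0.0)


def routed_loss(student_dists, teacher_dists, threshold):
    """Signed distillation loss averaged over tokens.

    Low-entropy tokens get +KL (minimizing pulls the student toward the
    teacher); high-entropy tokens get -KL (minimizing pushes it away).
    """
    total = 0.0
    directions = []
    for s, t in zip(student_dists, teacher_dists):
        direction = -1.0 if entropy(s) > threshold else 1.0
        directions.append(direction)
        total += direction * kl(t, s)
    return total / len(student_dists), directions


# Toy next-token distributions over a 3-word vocabulary (illustrative).
student = [[0.97, 0.02, 0.01], [0.40, 0.35, 0.25]]
teacher = [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10]]

loss, directions = routed_loss(student, teacher, threshold=0.5)
```

In this example the first token is a confident execution step and is pulled toward the teacher, while the second is a high-entropy decision point where the sign flips so that the teacher does not collapse the student's alternatives.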
Results
DASD outperformed strong RLVR and self-distillation baselines across six mathematical reasoning benchmarks, achieving the best macro Avg@16 scores. The method demonstrated significant improvements in Pass@k, reasoning health, and generalization, particularly on more challenging benchmarks that require maintaining multiple solution paths.
Implications
The findings suggest that adaptive supervision strategies can enhance the reasoning capabilities of LLMs, making them more effective in complex problem-solving scenarios. This approach could be applied in various domains requiring advanced reasoning, such as education, automated theorem proving, and complex decision-making systems.
Vector Policy Optimization: Training for Diversity Improves Test-Time Search
Reinforcement Learning
Large Language Models
Optimization
- VPO focuses on generating diverse solutions rather than converging on a single optimal response.
- The algorithm exploits vector-valued rewards to encourage a range of high-quality trade-offs.
- Empirical results show VPO outperforms scalar RL baselines in test-time search scenarios.
- VPO enables solving problems that traditional methods like GRPO cannot address.
Summary
This paper introduces Vector Policy Optimization (VPO), a novel reinforcement learning (RL) algorithm designed to enhance the diversity of solutions generated by language models (LMs) during test-time search. Traditional post-training methods for LMs often focus on optimizing a single scalar reward, which can lead to low-entropy response distributions and a lack of diversity in outputs. VPO addresses this by explicitly training policies to anticipate and produce diverse solutions across multiple reward dimensions, leveraging the fact that many practical tasks involve vector-valued rewards. The authors argue that RL post-training should prioritize generating a diverse set of competent solutions, allowing downstream search mechanisms to exploit this diversity effectively. Through empirical evaluation across four tasks, VPO demonstrates superior performance compared to existing scalar RL baselines, particularly as the search budget increases. The findings suggest that optimizing for diversity should become a standard objective in post-training for LMs, especially in systems that utilize test-time search procedures.
Methodology
VPO combines multi-answer generation with stochastic reward scalarizations to train language models to produce sets of candidate solutions that span the Pareto frontier of different reward dimensions. This method allows the model to maintain a diverse distribution of outputs, which can be effectively utilized by downstream search mechanisms.
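The link between random scalarizations and the Pareto frontier can be shown on a handful of points. This sketch is illustrative only: the reward vectors are invented, weights are drawn uniformly from the simplex, and linear scalarization is assumed (the summary does not say which scalarization family VPO samples from).

```python
import random


def sample_simplex_weights(dim, rng):
    """Uniform weights on the probability simplex (normalized exponentials)."""
    draws = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward_vec, weights):
    """Linear scalarization of a vector-valued reward."""
    return sum(r * w for r, w in zip(reward_vec, weights))


def pareto_front(points):
    """Indices of points not dominated by any other point (maximization)."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            all(q[k] >= p[k] for k in range(len(p))) and q != p
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical (correctness, brevity) rewards for 5 sampled answers.
rewards = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]

rng = random.Random(0)
winners = set()
for _ in range(200):
    w = sample_simplex_weights(2, rng)
    best = max(range(len(rewards)), key=lambda i: scalarize(rewards[i], w))
    winners.add(best)
```

Each random weight vector rewards a different trade-off, so the answers that win across many draws lie on the Pareto frontier, while dominated answers never win. A policy trained against such draws is therefore pushed to keep several distinct trade-offs alive rather than one scalar optimum.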
Results
VPO achieved performance that matched or exceeded the strongest scalar RL baselines across four diverse tasks, with improvements in metrics such as pass@k and best@k. The performance gap widened with increased search budgets, and VPO successfully solved problems that GRPO could not at any candidate budget.
Implications
The findings suggest that incorporating diversity into the training of language models can significantly enhance their performance in search-augmented systems. This could lead to more effective AI applications in areas requiring complex decision-making and problem-solving, such as coding, navigation, and multi-hop reasoning.
On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents
Reinforcement Learning
Theory
Optimization
- Introduces a model-based algorithm (MB-OCE-VI) for risk-sensitive RL in discounted MDPs.
- Establishes PAC sample complexity bounds for learning optimal policies and value functions under recursive OCE.
- Characterizes conditions under which OCE measures are PAC-learnable.
- Provides lower bounds on sample complexity, highlighting the dependence on effective horizon.
Summary
This paper investigates risk-sensitive reinforcement learning (RL) in finite discounted Markov Decision Processes (MDPs) using a generative model. The authors focus on a family of risk measures known as optimized certainty equivalents (OCE), which includes significant measures like entropic risk, Conditional Value-at-Risk (CVaR), and mean-variance. The study aims to characterize the sample complexities associated with learning optimal state-action value functions and policies under recursive OCE. The authors present a model-based algorithm, Model-Based OCE Value Iteration (MB-OCE-VI), and derive PAC sample complexity bounds for both value and policy learning. They establish that OCE measures defined by utility functions with full domains are PAC-learnable, while those without are not. Additionally, the paper provides lower bounds for sample complexity, revealing the dependence on the effective horizon and improving existing bounds for CVaR. This work represents a significant advancement in understanding the sample complexity of recursive OCE in discounted MDPs and presents the first impossibility results for RL under OCEs.
Methodology
The authors propose a model-based approach for risk-sensitive RL using a generative model of the MDP. They derive PAC sample complexity bounds for both value and policy learning under recursive OCE, focusing on utility functions with full domains. The methodology includes establishing impossibility results for non-PAC learnable cases and deriving lower bounds for sample complexity.
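For readers unfamiliar with OCEs, the standard definition is OCE_u(X) = sup over lambda of { lambda + E[u(X - lambda)] }, where X is a random reward and u is a concave utility. The sketch below evaluates this on an empirical sample for two members of the family. It covers only the static, one-step risk measure, not the recursive formulation or the MB-OCE-VI algorithm, and the sample values and the ternary-search solver are choices made here for illustration.

```python
import math


def oce(samples, utility, iters=200):
    """Empirical OCE: sup over lam of lam + mean(utility(x - lam))."""

    def objective(lam):
        return lam + sum(utility(x - lam) for x in samples) / len(samples)

    lo, hi = min(samples), max(samples)
    for _ in range(iters):  # ternary search; the objective is concave in lam
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective(0.5 * (lo + hi))


def cvar_utility(alpha):
    """u(t) = min(t, 0) / alpha recovers CVaR of the lower reward tail."""
    return lambda t: min(t, 0.0) / alpha


def entropic_utility(beta):
    """u(t) = (1 - exp(-beta * t)) / beta recovers the entropic risk."""
    return lambda t: (1.0 - math.exp(-beta * t)) / beta


# Illustrative reward sample with equal weights.
rewards = [float(x) for x in range(10)]

cvar_20 = oce(rewards, cvar_utility(0.2))
entropic = oce(rewards, entropic_utility(1.0))
# Closed form for the entropic case, used as a cross-check.
entropic_closed = -math.log(sum(math.exp(-x) for x in rewards) / len(rewards))
```

With alpha = 0.2 the CVaR value equals the mean of the worst two of the ten rewards, and the entropic value agrees with its closed form, which shows how one variational formula covers both measures.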
Results
The paper presents exact PAC sample complexity bounds for the proposed algorithm, demonstrating optimal dependence on the size of the state-action space. It also establishes lower bounds that reveal the complexity of learning under recursive OCE, including explicit dependencies on the effective horizon and improvements over existing bounds for CVaR.
Implications
The findings have significant implications for high-stakes applications in finance, healthcare, and operations research, where risk-sensitive decision-making is crucial. The results can guide the development of more efficient RL algorithms that account for risk in uncertain environments.
No Epoch Like the Present: Robust Climate Emulation Requires Out-of-Distribution Generalisation
Time Series
- Climate emulation is fundamentally an out-of-distribution prediction task.
- Seasonal variations can effectively proxy long-term climate shifts for evaluation purposes.
- Current hybrid-ML emulators show significant performance degradation under realistic distribution shifts.
- Compositional generalisation is crucial for enhancing the robustness of climate emulators.
Summary
This paper addresses the challenge of climate emulation as an out-of-distribution (OOD) projection task, highlighting the inadequacies of current machine learning (ML) methods in predicting future climate scenarios. The authors confirm that climate change leads to significant shifts in atmospheric state distributions, rendering traditional evaluation protocols insufficient. They propose that seasonal variations can serve as effective proxies for these long-term shifts, allowing for a more realistic assessment of emulator robustness without the need for synthetic perturbations. A novel evaluation framework is introduced, leveraging these seasonal shifts to quantify the performance degradation of state-of-the-art hybrid-ML emulators under realistic distribution shifts. The study identifies compositional generalisation as a key requirement for improving robustness and demonstrates that physically motivated model decompositions can enhance OOD performance with minimal trade-offs against in-distribution accuracy. Overall, the work emphasizes the need for a paradigm shift in the design and evaluation of climate emulators to ensure their reliability in the face of an uncertain future.
Methodology
The authors conducted a systematic characterization of climate emulation as an OOD task, utilizing 40 years of observation-constrained reanalysis data. They established seasonal variation as a proxy for long-term climate shifts and developed a zero-overhead evaluation framework to assess emulator robustness. The performance of existing hybrid-ML emulators was quantitatively analyzed under these seasonal shifts, and the concept of compositional generalisation was explored to enhance robustness.
Results
The study found that current state-of-the-art hybrid-ML emulators exhibit significant degradation in performance when subjected to realistic seasonal distribution shifts. The introduction of physically motivated decompositions improved OOD performance, demonstrating that such approaches can lead to more robust climate emulators without severely compromising in-distribution accuracy.
Implications
The findings suggest that future climate emulators must be designed with an emphasis on robustness to OOD scenarios, particularly in the context of climate change. The proposed evaluation framework can guide the development of more reliable ML-driven climate models, which are crucial for effective climate policy and risk management.
Ex-GraphRAG: Interpretable Evidence Routing for Graph-Augmented LLMs
NLP
Large Language Models
Graph Learning
- Ex-GraphRAG introduces M-GNAN for exact node-level attribution in graph-augmented LLMs.
- The framework uncovers a semantic-structural mismatch in evidence routing, affecting multi-hop QA performance.
- Ex-GraphRAG matches the performance of traditional black-box GNN encoders while offering transparency.
- The findings have implications for retrieval pruning and context construction in LLMs.
Summary
The paper introduces Ex-GraphRAG, an innovative framework that enhances the interpretability of evidence routing in Graph-Augmented Large Language Models (LLMs). Traditional Graph Retrieval-Augmented Generation (GraphRAG) systems utilize Graph Neural Networks (GNNs) to encode subgraphs from knowledge graphs, but these encoders obscure the contributions of individual nodes, making it difficult to audit the evidence that influences model outputs. Ex-GraphRAG replaces the GNN encoder with a Multivariate Graph Neural Additive Network (M-GNAN), which allows for an exact decomposition of the encoder's output across nodes and feature groups. This advancement not only matches the performance of existing black-box encoders but also provides precise node-level attribution. The authors conduct an audit on a biomedical knowledge graph question-answering dataset, revealing a significant semantic-structural mismatch: the most influential nodes in the encoder's output are often structurally disconnected, relying on low-importance intermediary nodes for connectivity. This finding highlights the need for improved retrieval strategies and context construction in graph-augmented LLMs.
Methodology
The authors developed Ex-GraphRAG by integrating M-GNAN into the G-Retriever framework, allowing for an intrinsic decomposition of the encoder's output. They conducted audits on the STaRK-Prime dataset to evaluate the semantic-structural relationships within retrieved subgraphs and their impact on multi-hop question answering.
Results
Ex-GraphRAG demonstrated comparable performance to existing GNN encoders (GAT, GCN, GIN) while providing exact node-level attribution. The audit revealed that the removal of low-importance intermediary nodes could degrade multi-hop QA performance by up to 28%, indicating a critical mismatch between semantic importance and structural connectivity.
Implications
The findings suggest that improving the transparency of graph encoders can enhance the reliability of LLMs in high-stakes applications, such as biomedical question answering and regulatory compliance. The insights gained from the audits can inform better retrieval strategies and context construction, ultimately leading to more robust AI systems.
Cross-Species RSA Reveals Conserved Early Visual Alignment but Divergent Higher-Area Rankings Across Human fMRI and Macaque Electrophysiology
Computer Vision
- Early visual alignment is conserved across human and macaque visual systems.
- Local learning rules (STDP, PC) outperform backpropagation in macaque V1/V2 alignment.
- No detectable correlation in higher-area (IT) rankings across species.
- Model capacity and stimulus domain significantly affect higher-area alignment.
Summary
This paper investigates the generalizability of learning rules and brain alignment across species by comparing human fMRI data with macaque electrophysiology. The study extends previous findings that untrained convolutional neural networks (CNNs) align with human visual cortex (V1) by testing five learning rules against macaque data: backpropagation (BP), feedback alignment (FA), predictive coding (PC), spike-timing-dependent plasticity (STDP), and a random-weights baseline. Using Representational Similarity Analysis (RSA), the research finds that all models achieve higher alignment with macaque early visual cortex than with human fMRI, with STDP and PC showing the best performance. However, no correlation in learning rule rankings is observed at higher areas (IT) across species, suggesting that differences may arise from model capacity and stimulus domain rather than inherent properties of the learning rules. The study highlights the robustness of early visual alignment across species while indicating that higher-area alignment is influenced by model architecture and training data richness.
Methodology
The study employs Representational Similarity Analysis (RSA) to compare the alignment of various learning rules using identical model weights across species. It utilizes two macaque datasets for V1/V2 and IT, alongside human fMRI data, to evaluate the performance of five learning rules and a pretrained ResNet-50 model.
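As a rough illustration of how an RSA comparison is typically computed, the sketch below builds a representational dissimilarity matrix (RDM) for each system and rank-correlates their upper triangles. The function names, the correlation-distance RDM, and the Spearman comparison are common defaults assumed here, not details confirmed by the paper.

```python
import numpy as np

def rdm(responses):
    """Dissimilarity matrix: 1 - Pearson correlation between stimulus patterns.

    responses has shape (n_stimuli, n_units); the result is (n_stimuli, n_stimuli).
    """
    return 1.0 - np.corrcoef(responses)

def upper_triangle(matrix):
    """Flatten the strictly upper-triangular entries (each stimulus pair once)."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]

def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=float).ravel()
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    mean_rank = last_rank - (counts - 1) / 2.0
    return mean_rank[inverse]

def spearman(a, b):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def rsa_score(model_responses, brain_responses):
    """Alignment score: Spearman correlation between the two RDMs.

    Both inputs must cover the same stimuli in the same order; the number of
    units (model features, voxels, or recording sites) may differ.
    """
    return spearman(upper_triangle(rdm(model_responses)),
                    upper_triangle(rdm(brain_responses)))
```

Because only the RDMs are compared, a model layer with thousands of units can be scored against a few hundred voxels or electrodes without any fitted mapping between them.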
Results
The results indicate that all tested models achieve significantly higher alignment with macaque early visual cortex than with human fMRI data. STDP and PC yield the highest alignment scores in macaque V1/V2. However, the rankings of learning rules at the IT level show no correlation between species, and pretrained models outperform custom architectures in alignment metrics.
Implications
These findings suggest that while early visual processing mechanisms may be conserved across species, the complexity of higher-level visual processing may be more dependent on model architecture and training data. This could inform future research in computational neuroscience and the development of more effective neural network architectures for visual tasks.
Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification
NLP
Large Language Models
Interpretability
- Development of a reproducible audit pipeline for analyzing model failures.
- Identification of feature 17,491 as a correlate of failure, but not a causal factor.
- Demonstration of the importance of conducting controls to validate mechanistic claims.
- Highlighting the lexical confound in the IOI task that significantly affects accuracy.
Summary
This paper presents a reproducible audit of the sparse-autoencoder (SAE) features of GPT-2 Small in the context of the Indirect Object Identification (IOI) task. The authors analyze 300 prompts on which GPT-2 Small achieves an accuracy of 79.7%. They identify 146 features that meet a significance threshold, with feature 17,491, labeled 'cryptographic keys', being the strongest correlate of failure. This feature activates significantly more on failed trials in which the transferred object is 'the keys', and those prompts fail 93.3% of the time. The authors run three controls to validate their findings: a causal ablation showing that removing feature 17,491 does not restore accuracy, a representation baseline in which a logistic regression on the raw residual stream matches the predictive power of the top SAE features, and a seed-robustness check that reveals variability in the top features across runs. The main contribution is the model-agnostic audit pipeline, which surfaces named correlates, rather than any single feature. The study emphasizes the importance of distinguishing robust behavioral effects from incidental feature correlations in mechanistic interpretability.
Methodology
The authors employed a sparse-autoencoder (SAE) to analyze the activations of GPT-2 Small on the IOI task. They conducted statistical analyses to identify significant features and performed causal ablation, logistic regression, and seed robustness checks to validate their findings.
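The effect-size step of such an audit can be sketched as follows: for each SAE feature, compare its activation on failed and successful prompts with Cohen's d, then rank features by magnitude. The function names, array layout, and pooled-variance form of d are assumptions made for illustration; the paper's significance threshold and exact statistic are not given in this summary.

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (positive if group_a is larger).

    Both groups need at least two samples and a nonzero pooled variance.
    """
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) \
        / (len(a) + len(b) - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

def rank_features(activations, failed):
    """Rank features by |d| between failed and successful prompts.

    activations: (n_prompts, n_features) SAE feature activations (hypothetical layout).
    failed:      (n_prompts,) boolean mask, True where the model answered wrongly.
    Returns (feature indices sorted by decreasing |d|, d value per feature).
    """
    failed = np.asarray(failed, dtype=bool)
    scores = np.array([
        cohens_d(activations[failed, k], activations[~failed, k])
        for k in range(activations.shape[1])
    ])
    return np.argsort(-np.abs(scores)), scores
```

A large d only marks a feature as a correlate of failure. As the paper's controls stress, a causal claim needs a separate intervention such as ablation.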
Results
The audit revealed that feature 17,491 was the strongest correlate of failure, with a Cohen's d of +2.93, but ablating it did not restore accuracy, so it is a correlate rather than a cause. The logistic regression on the raw residual stream matched the predictive power of the top SAE features, and the failure rate remained consistent across different random seeds, indicating the robustness of the lexical confound.
Implications
The findings suggest that while certain features may correlate with model failures, they do not necessarily indicate causation. This has implications for the interpretability of language models and the methodologies used to analyze their behavior. The audit pipeline can be applied to other models and tasks to enhance understanding of model failures.
Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers
Graph Learning
Theory
- Graph tokenization is a fundamental aspect of transformer expressivity, affecting the model's ability to learn from graph data.
- Different tokenizations (spectral, random-walk, adjacency) impose distinct depth requirements for the same graph computation.
- Random-walk tokenization is lossy, while spectral tokenization is ill-conditioned for local tasks, limiting their effectiveness.
- Transformers cannot convert between tokenization families efficiently at limited depths, which restricts their adaptability.
Summary
This paper investigates the critical role of graph tokenization in the expressivity of transformers applied to graph-structured data. The authors analyze three primary tokenization methods: spectral, random-walk, and adjacency tokenizations, demonstrating that each induces different depth requirements for transformers. They establish that while spectral tokenization is lossless, it is ill-conditioned for local tasks, and random-walk tokenization is inherently lossy, making it impossible to recover the original graph structure. The study reveals that transformers cannot efficiently convert between tokenization families at limited depths, leading to significant implications for model performance. The authors complement their theoretical findings with empirical experiments, showing that different tasks benefit from distinct structural representations and that combining tokenizations can enhance performance.
Methodology
The authors conducted a theoretical analysis of graph tokenizations, proving depth separations and establishing impossibility results for converting between tokenization families. They also performed controlled experiments on synthetic and real-world tasks to validate their theoretical predictions.
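To make the three token families concrete, here is a minimal sketch of one common construction for each: adjacency rows, random-walk return probabilities, and Laplacian eigenvectors. These specific constructions and all function names are assumptions for illustration; the paper's formal definitions may differ.

```python
import numpy as np

def adjacency_tokens(adj):
    """One token per node: its row of the adjacency matrix."""
    return np.asarray(adj, dtype=float).copy()

def random_walk_tokens(adj, num_steps):
    """One token per node: probability of returning to the node after 1..num_steps steps.

    Assumes every node has at least one neighbor (no zero-degree rows).
    """
    adj = np.asarray(adj, dtype=float)
    transition = adj / adj.sum(axis=1, keepdims=True)
    power = np.eye(adj.shape[0])
    columns = []
    for _ in range(num_steps):
        power = power @ transition
        columns.append(np.diag(power).copy())
    return np.stack(columns, axis=1)

def spectral_tokens(adj, num_vectors):
    """Eigenvalues and per-node eigenvector coordinates of the graph Laplacian.

    Returns (eigenvalues, tokens) where tokens[i] holds node i's coordinates in
    the num_vectors eigenvectors with the smallest eigenvalues.
    """
    adj = np.asarray(adj, dtype=float)
    laplacian = np.diag(adj.sum(axis=1)) - adj
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    return eigenvalues[:num_vectors], eigenvectors[:, :num_vectors]
```

With all eigenpairs kept, the Laplacian (and hence the graph) can be rebuilt exactly as `V @ diag(lambda) @ V.T`, which is one way to read the claim that spectral tokens are lossless. Return probabilities alone carry no such guarantee.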
Results
The study found that different tokenizations lead to varying depth requirements for transformers, with random-walk tokenization being lossy and spectral tokenization being ill-conditioned for local tasks. The experiments confirmed that combining different tokenizations often resulted in improved performance across tasks.
Implications
The findings suggest that careful selection of graph tokenization is crucial for optimizing transformer performance on graph-structured data. This has implications for the design of graph learning models and their applications in various domains, including social network analysis, molecular chemistry, and recommendation systems.
Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization
NLP
Large Language Models
Optimization
- Introduces a principled analysis of uncertainty signals in policy optimization.
- Identifies two fundamental limitations of existing entropy-based measures: the anisotropic gap and the calibration gap.
- Proposes GCPO, which integrates geometry-aware measures and reward-based calibration.
- Demonstrates improved alignment and performance in GRPO-style training across multiple tasks.
Summary
This paper addresses the limitations of current entropy-based uncertainty measures in policy optimization for large language models (LLMs). The authors identify two critical gaps in existing methods: the anisotropic gap, where entropy measures fail to capture the geometric magnitude of semantic disagreement, and the calibration gap, where uncertainty is not aligned with reward informativeness. To overcome these issues, they propose a novel framework called Geometric-aware Calibrated Policy Optimization (GCPO), which integrates geometry-aware measures to better characterize uncertainty and regulate gradient variance. The framework employs Cosine Dispersion and Barycentric Transport to capture semantic disagreement and incorporates a Reward Dispersion module to align updates with reward quality. Extensive experiments demonstrate that GCPO consistently improves performance in post-training scenarios across various benchmarks, providing a principled approach to designing uncertainty signals that enhance learning while suppressing noise.
Methodology
The authors conducted both theoretical and empirical analyses to identify gaps in existing entropy-based uncertainty measures. They developed the GCPO framework, which incorporates geometry-aware measures and a reward-based calibration mechanism to improve the regulation of gradient variance and learning signal quality. Experiments were performed on various benchmarks to validate the effectiveness of the proposed method compared to traditional entropy-based approaches.
Results
The experiments showed that GCPO achieved stronger alignment with sample-level gradient variance and consistently outperformed entropy-based baselines in post-training performance across multiple tasks, including question answering and mathematical reasoning.
Implications
The findings suggest that designing uncertainty signals that are aligned with optimization dynamics can lead to more robust post-training strategies for LLMs. This has potential applications in improving reasoning and alignment in AI systems, enhancing their performance in various reasoning-oriented tasks.
Equilibrium Propagation and Hamiltonian Inference in the Diffusive Fitzhugh-Nagumo Model
Theory
- Extends Equilibrium Propagation to skew-gradient systems.
- Establishes equivalence between deep Energy-Based Models and Hamiltonian neural networks.
- Demonstrates applicability of EqProp for credit assignment in Fitzhugh-Nagumo networks.
- Derives a layer-wise Hamiltonian recurrence relation for inference.
Summary
This paper extends the Equilibrium Propagation (EqProp) framework to skew-gradient systems, establishing an equivalence between deep Energy-Based Models (EBMs) and Hamiltonian neural networks. The study focuses on networks of diffusively coupled Fitzhugh-Nagumo neurons, demonstrating that the stationary solutions of this model can be described using self-adjoint operators, allowing for the application of EqProp for credit assignment. Furthermore, it shows that Fitzhugh-Nagumo networks structured like deep residual networks possess a spatial Hamiltonian, enabling the use of Hamiltonian Echo Backpropagation (HEB) methods. The paper concludes by deriving a layer-wise Hamiltonian recurrence relation for inference in both deep Fitzhugh-Nagumo networks and EBMs, suggesting a novel approach to gradient estimation in biologically plausible neural networks.
Methodology
The paper employs theoretical analysis to extend the EqProp framework, applying it to the Fitzhugh-Nagumo model and deriving Hamiltonian recurrence relations. It utilizes concepts from gradient systems and Hamiltonian dynamics to demonstrate local credit assignment and inference capabilities.
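For orientation, one common textbook parameterization of diffusively coupled Fitzhugh-Nagumo units is written out below. The symbols ($v_i$, $w_i$, $I_i$, $D$, $A_{ij}$, $\varepsilon$, $a$, $b$) and this particular form are assumptions; the paper may use a different scaling or coupling.

```latex
\begin{aligned}
\dot{v}_i &= v_i - \tfrac{1}{3} v_i^{3} - w_i + I_i + D \sum_{j} A_{ij}\,(v_j - v_i), \\
\dot{w}_i &= \varepsilon\,(v_i + a - b\,w_i).
\end{aligned}
```

Here $v_i$ is the fast membrane-like variable of unit $i$, $w_i$ the slow recovery variable, $I_i$ an external input, and $D \sum_j A_{ij}(v_j - v_i)$ the diffusive coupling over a graph with adjacency $A$. The stationary solutions discussed in the summary are the states where both right-hand sides vanish.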
Results
The study shows that EqProp can be applied to the Fitzhugh-Nagumo model because its stationary solutions can be described using self-adjoint operators. It also establishes that these networks can use Hamiltonian dynamics for inference, suggesting a new approach to gradient estimation in neural networks.
Implications
The findings suggest that the methods derived could inform the development of more biologically plausible learning algorithms in neural networks, potentially bridging the gap between computational models and biological learning processes. This could lead to advancements in neuro-inspired computing architectures.
Prototype-Guided Classification Sub-Task Decoupling Framework: Enhancing Generalization and Interpretability for Multivariate Time Series
Time Series
- PDFTime decouples temporal representation learning from decision-making in time series classification.
- The framework utilizes a novel prototype-based classification head for structured, similarity-driven inference.
- PDFTime achieves state-of-the-art results on 80 out of 128 datasets in the UCR archive.
- The method enhances both generalization capabilities and interpretability of time series classification models.
Summary
The paper introduces PDFTime, a novel prototype-guided framework designed to enhance the generalization and interpretability of time series classification (TSC). Traditional TSC methods often utilize a direct feature-to-label mapping, which conflates feature extraction and decision-making, leading to challenges in both accuracy and interpretability. PDFTime addresses these issues by reformulating TSC as a multi-stage decision process that leverages learned prototypes to approximate class-conditional feature distributions in the latent space. This approach allows for progressive discrimination through classification sub-tasks of varying granularity, facilitating a more interpretable decision-making process. The authors demonstrate that PDFTime achieves state-of-the-art performance across multiple benchmarks, including the UEA and UCR datasets, significantly outperforming existing methods in both consistency and generalization. The framework not only improves classification accuracy but also provides transparent attribution of predictions to specific prototype matches, thereby enhancing interpretability in TSC.
Methodology
PDFTime reformulates time series classification as a multi-stage decision process, utilizing learned prototypes to approximate class-conditional feature distributions. It decomposes the classification task into progressively refined sub-tasks, enabling a structured and interpretable decision-making process. The prototype-based classification head organizes prototypes hierarchically, facilitating multi-granularity reasoning.
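The core idea of similarity-driven inference can be illustrated with a flat nearest-prototype head, sketched below. This is a deliberately simplified stand-in, not PDFTime's hierarchical, multi-granularity head; the function names, the negative squared Euclidean similarity, and the array layout are all assumptions.

```python
import numpy as np

def prototype_similarities(embedding, prototypes):
    """Similarity of one embedding to every prototype (negative squared distance).

    embedding:  (dim,) latent representation of one time series.
    prototypes: (n_classes, n_prototypes_per_class, dim) learned prototypes.
    Returns an array of shape (n_classes, n_prototypes_per_class).
    """
    diffs = np.asarray(prototypes, dtype=float) - np.asarray(embedding, dtype=float)
    return -(diffs ** 2).sum(axis=-1)

def classify(embedding, prototypes):
    """Predict the class of the closest prototype and report which one matched."""
    sims = prototype_similarities(embedding, prototypes)
    class_scores = sims.max(axis=1)              # best prototype within each class
    predicted = int(class_scores.argmax())
    matched = int(sims[predicted].argmax())      # index of the matched prototype
    return predicted, matched
```

Returning the matched prototype alongside the label is what makes the decision attributable: a prediction can be explained by pointing at the prototype it was closest to.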
Results
PDFTime achieved state-of-the-art performance on 80 out of 128 datasets in the UCR archive, significantly outperforming recent competitive baselines in terms of both consistency and generalization. The framework sets a new performance ceiling for the community in time series classification.
Implications
The proposed framework has significant implications for various applications of time series classification, including healthcare, human action recognition, and Internet of Things (IoT) systems. By enhancing interpretability, PDFTime can help practitioners understand model decisions better, leading to more reliable deployments in critical domains.
Quantitative coronary calcification analysis for prediction of myocardial ischemia using non-contrast CT calcium scoring
Interpretability
- Developed a machine learning framework for predicting myocardial ischemia from non-contrast CT calcium scoring.
- Utilized 74 variables including clinical data and calcium-omics features for analysis.
- Achieved high precision (98.9%) and significant improvement in predictive performance with calcium-omics features.
- Identified the number of calcified arteries as a strong predictor of myocardial ischemia despite its low SHAP ranking.
Summary
This study presents a novel machine learning framework aimed at predicting myocardial ischemia using non-contrast computed tomography calcium scoring (CTCS). The research analyzed data from 1,375 patients who underwent both CTCS and regadenoson stress cardiac positron emission tomography (PET) myocardial perfusion imaging. The study evaluated 74 variables, including clinical data, Agatston scores, and calcium-omics features. Using XGBoost and Shapley Additive exPlanations (SHAP) for feature selection, the model identified key predictors of ischemia. The final model included the Agatston score, eight calcium-omics features, and age, achieving a precision of 98.9%, sensitivity of 79.2%, and an F1 score of 87.7%. Notably, the number of calcified arteries, although ranked low in SHAP analysis, showed a strong association with myocardial ischemia in logistic regression. The findings suggest that incorporating calcium-omics features enhances predictive performance beyond traditional clinical variables and Agatston scoring, indicating a potential for improved cardiovascular risk stratification.
Methodology
The study employed a retrospective analysis of 1,375 patients, utilizing machine learning techniques including XGBoost for feature selection and logistic regression for predictive modeling. The analysis focused on clinical variables, Agatston scores, and calcium-omics features derived from CTCS scans.
Results
The final predictive model achieved a precision of 98.9%, sensitivity of 79.2%, and an F1 score of 87.7%. The addition of calcium-omics features significantly improved the model's predictive performance compared to models using only clinical variables or the Agatston score. The number of calcified arteries was found to have a strong association with myocardial ischemia.
Implications
The findings suggest that quantitative analysis of coronary calcifications using non-contrast CTCS can enhance the prediction of myocardial ischemia, potentially leading to more accessible and effective cardiovascular risk stratification methods.
Holomorphic Neural ODEs with Kolmogorov-Arnold Networks for Interpretable Discovery of Complex Dynamics
Interpretability
- Introduces Holomorphic KAN-ODE, combining KANs with Neural ODEs under Cauchy-Riemann regularization.
- Achieves high accuracy (R² > 0.95) in modeling six families of complex dynamical systems with significantly fewer parameters than MLPs.
- Successfully recovers symbolic governing equations and reconstructs fractal boundaries with up to 98.0% agreement.
- Demonstrates superior noise resilience and transfer learning capabilities compared to traditional MLPs.
Summary
This paper introduces the Holomorphic KAN-ODE framework, which integrates Kolmogorov-Arnold Networks (KANs) with Neural Ordinary Differential Equations (Neural ODEs) while incorporating Cauchy-Riemann equations as a differentiable regularization. The motivation behind this work is to accurately model complex dynamical systems governed by holomorphic maps, which exhibit fractal boundaries and sensitivity to initial conditions. Traditional Multi-Layer Perceptrons (MLPs) used in Neural ODEs fail to respect the complex-analytic geometry and do not provide interpretable governing equations. The proposed KAN-ODE framework replaces MLPs with KANs, which utilize learnable B-spline activations and allow for automatic symbolic regression. The authors evaluate the framework on six families of complex dynamical systems, demonstrating that it achieves high accuracy in modeling, symbolic equation recovery, and noise resilience. The results show that the KAN-ODE framework is not only more parameter-efficient than MLPs but also provides interpretable outputs, making it a valuable tool for the physics-informed discovery of holomorphic dynamics.
Methodology
The Holomorphic KAN-ODE framework employs Kolmogorov-Arnold Networks with learnable B-spline activations, integrating Cauchy-Riemann equations as a differentiable regularization to ensure holomorphic structure. The framework is evaluated on various complex dynamical systems, including polynomial and transcendental classes, through experiments that assess fractal boundary reconstruction, symbolic regression, and noise robustness.
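A Cauchy-Riemann regularizer can be sketched as a penalty on how far a learned vector field is from holomorphic. Writing the field as f = u + iv, the conditions u_x = v_y and u_y = -v_x are equivalent to ∂f/∂y = i ∂f/∂x, so the squared residual of that identity can serve as a loss term. The version below uses central finite differences as a stand-in for the automatic differentiation a training loop would use; the function name and arguments are hypothetical.

```python
import numpy as np

def cauchy_riemann_penalty(field, points, step=1e-4):
    """Mean squared Cauchy-Riemann residual of a complex-valued field.

    field:  callable mapping a complex array to a complex array (e.g. a learned
            velocity field); assumed to accept vectorized input.
    points: complex array of evaluation points.
    Returns mean |df/dy - i * df/dx|^2, which is zero for holomorphic fields.
    """
    points = np.asarray(points, dtype=complex)
    df_dx = (field(points + step) - field(points - step)) / (2.0 * step)
    df_dy = (field(points + 1j * step) - field(points - 1j * step)) / (2.0 * step)
    residual = df_dy - 1j * df_dx
    return float(np.mean(np.abs(residual) ** 2))
```

For example, `lambda z: z**2 + 0.3` yields a penalty near zero, while `np.conj` (which is nowhere holomorphic) yields a penalty of about 4.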
Results
The KAN-ODE framework achieved velocity-field R² values greater than 0.95 across all tested systems, identified all governing symbolic families, and reconstructed Julia set boundaries with up to 98.0% agreement. It exhibited only 4% mean squared error degradation under 10% observation noise, compared to 15.2× degradation for MLPs, and showed a 90.4% improvement in transfer learning from quadratic to cubic dynamics.
Implications
The findings suggest that KANs can serve as a powerful tool for the interpretable modeling of complex dynamical systems, providing insights into the underlying physics and enabling symbolic regression. This approach could be applied in fields such as fluid mechanics, chaos theory, and other areas where complex dynamics are prevalent.
Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction
NLP
Large Language Models
Theory
- AI text detectors amplify a pretrained typicality axis instead of learning a new AI-vs-human boundary.
- Raw projections from pretrained models can outperform fine-tuned models in discrimination tasks.
- A closed-form Jacobian predictor can effectively manipulate the typicality axis and improve detection rates.
- Calibration shifts account for a significant portion of bias in AI text detection, rather than learned representations.
Summary
This paper investigates the behavior of AI text detectors, specifically how they function not by creating a distinct boundary between AI-generated and human-written text, but by amplifying a pretrained typicality axis. The study demonstrates that raw projections from pretrained encoders can achieve high discrimination performance (AUROC scores of 0.806/0.944/0.834) across three architectures, often exceeding the performance of fine-tuned models. The author argues that fine-tuning does not construct a new decision boundary but rather modifies the existing geometry of the pretrained model, leading to a misclassification of fluent formal human text as more AI-like than it actually is. The paper also introduces a closed-form Jacobian predictor that effectively manipulates the typicality axis, achieving oracle-equivalence across multiple detectors and significantly improving detection rates. The findings suggest that calibration shifts, rather than learned representations, account for biases in AI text detection, with implications for classifier fairness across various applications.
Methodology
The study employs a combination of raw encoder projections, fine-tuning experiments, and the development of a closed-form Jacobian predictor to analyze the behavior of AI text detectors. It evaluates the performance of various architectures (ELECTRA, RoBERTa-base, DeBERTa-v3) and uses AUROC scores to measure discrimination capabilities across different text populations.
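The "raw projection" measurement can be sketched as: project each text's encoder embedding onto a single direction and compute the AUROC of that scalar score. The summary does not say how the typicality axis is obtained, so the class-mean difference used below is purely a placeholder assumption, as are the function names.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the Mann-Whitney statistic (ties count as half a win).

    labels: 1 for the positive class (here, AI-generated), 0 for the negative class.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

def mean_difference_direction(embeddings, labels):
    """Placeholder axis: unit vector from the human-text mean to the AI-text mean."""
    labels = np.asarray(labels).astype(bool)
    direction = embeddings[labels].mean(axis=0) - embeddings[~labels].mean(axis=0)
    return direction / np.linalg.norm(direction)

def projection_auroc(embeddings, labels, direction):
    """Discrimination achieved by one scalar: the projection onto `direction`."""
    return auroc(np.asarray(embeddings, dtype=float) @ direction, labels)
```

In a faithful replication the direction would be fixed from the pretrained encoder before seeing detector labels; fitting it on the evaluation set, as this toy helper allows, would inflate the score.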
Results
The paper reports AUROC scores of 0.806/0.944/0.834 for raw projections across three architectures, with fine-tuning often resulting in lower discrimination performance. The closed-form predictor achieves oracle-equivalence across three third-party detectors, improving the OpenAI detector's false positive rate by 57%. The findings confirm that calibration shifts are responsible for biases in AI text detection.
Implications
The results have significant implications for the design and evaluation of AI text detectors, suggesting that improvements should focus on addressing calibration shifts rather than solely relying on fine-tuning. This understanding can also inform fairness considerations in other classifier applications, potentially leading to more equitable AI systems.
Optimal Guarantees for Auditing Rényi Differentially Private Machine Learning
Theory
- Introduces a new auditing framework for Rényi differential privacy based on hypothesis testing.
- Establishes explicit non-asymptotic confidence intervals for RDP auditing using DV estimators.
- Proves optimal sample-complexity guarantees for auditing RDP, showing minimax optimality.
- Empirical results demonstrate significant improvements over prior black-box auditing methods.
Summary
This paper addresses the challenge of auditing machine learning algorithms that claim to provide Rényi differential privacy (RDP) guarantees. The authors introduce a novel auditing framework based on hypothesis testing that utilizes the Donsker-Varadhan (DV) variational estimator to directly estimate Rényi divergence between outputs of neighboring executions. They derive explicit non-asymptotic confidence intervals for RDP auditing, effectively separating statistical estimation error from algorithmic privacy leakage. The authors prove matching minimax lower bounds, establishing that their sample-complexity guarantees are optimal, thus providing the first optimal guarantees for auditing RDP via DV estimators. Empirical validation is conducted on DP-SGD in a black-box setting, demonstrating significant improvements in RDP lower bounds compared to existing methods, particularly at small and moderate Rényi orders where auditing is most challenging. This work not only advances the theoretical foundations of privacy auditing but also enhances practical auditing methods for machine learning systems.
Methodology
The authors utilize a hypothesis testing framework to audit RDP by estimating Rényi divergence using the Donsker-Varadhan variational representation. They derive non-asymptotic confidence intervals for class-restricted DV estimators and validate their theoretical findings through empirical audits on DP-SGD in a black-box setting.
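As a point of reference for the quantity being audited, here is a minimal sketch of the Rényi divergence between two equal-variance Gaussians, checked against direct numerical integration. It is not the paper's DV estimator; the function names and the Gaussian setting are assumptions chosen because the closed form alpha * (mu_p - mu_q)^2 / (2 * sigma^2) is well known.

```python
import math

def renyi_gaussian_closed_form(alpha, mu_p, mu_q, sigma):
    """D_alpha(N(mu_p, sigma^2) || N(mu_q, sigma^2)) for equal variances."""
    return alpha * (mu_p - mu_q) ** 2 / (2.0 * sigma ** 2)

def gaussian_logpdf(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def renyi_numeric(alpha, mu_p, mu_q, sigma, lo=-20.0, hi=20.0, steps=40000):
    """D_alpha = log(integral of p^alpha * q^(1 - alpha)) / (alpha - 1).

    Uses the midpoint rule and works in log space to avoid underflow.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        log_p = gaussian_logpdf(x, mu_p, sigma)
        log_q = gaussian_logpdf(x, mu_q, sigma)
        total += math.exp(alpha * log_p + (1.0 - alpha) * log_q) * h
    return math.log(total) / (alpha - 1.0)

if __name__ == "__main__":
    for alpha in (1.5, 2.0, 4.0):
        exact = renyi_gaussian_closed_form(alpha, 1.0, 0.0, 1.0)
        approx = renyi_numeric(alpha, 1.0, 0.0, 1.0)
        print(alpha, exact, approx)
```

An audit in the black-box setting would not have access to the densities and would instead estimate this divergence from samples, which is where the estimation error discussed above enters.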
Results
The proposed auditing framework yields optimal guarantees for RDP auditing, with empirical results showing substantial improvements in estimating RDP lower bounds across datasets like MNIST and CIFAR-10, particularly in challenging privacy regimes.
Implications
This work has significant implications for the reliability of privacy auditing in machine learning, enhancing trust in deployed systems and providing a robust framework for verifying privacy guarantees in practice. It can be applied to various machine learning algorithms that utilize differential privacy.
Hierarchical Variational Policies for Reward-Guided Diffusion
Generative Models
Computer Vision
Efficient ML
- Introduces a unified framework for test-time guidance in diffusion models using hierarchical variational policies.
- Develops Amortized HVP (AHVP) for efficient generation of reward-aligned samples with a single forward pass.
- Presents Semi-Amortized HVP (SHVP) that combines amortized proposals with test-time refinement for improved perceptual quality.
- Demonstrates superior quality-speed tradeoff on inverse problems, achieving over 5× faster inference than leading methods.
Summary
This paper presents a novel framework for adapting pretrained diffusion models to various downstream tasks, particularly inverse problems, by utilizing hierarchical variational policies. The authors address the challenge of expensive test-time guidance and optimization by introducing a lightweight stochastic policy that amortizes control during inference. This approach allows for few-step diffusion sampling, where larger step sizes facilitate faster inference while maintaining sample quality through structured control. The proposed Amortized Hierarchical Variational Policy (AHVP) achieves a significant quality-speed tradeoff, outperforming existing methods in terms of perceptual quality and inference speed. Additionally, the Semi-Amortized Hierarchical Variational Policy (SHVP) combines inexpensive amortized proposals with limited test-time optimization, achieving state-of-the-art results on several challenging inverse problems. The framework is modular and can be adapted to a variety of tasks with differentiable rewards, demonstrating its versatility and effectiveness in generating high-quality samples at reduced computational costs.
Methodology
The authors formulate test-time adaptation as a hierarchical variational model, where control is embedded in a lightweight stochastic policy. They employ variational inference to approximate the posterior over denoising trajectories, training the policy to steer the denoising process without requiring expensive optimization at each step. The method consists of a two-stage procedure: first learning an initial noise distribution that maximizes the reward, followed by training per-step stochastic controllers.
Results
The proposed methods (AHVP and SHVP) demonstrate a favorable quality-speed tradeoff across multiple inverse problems, achieving better perceptual quality with significantly reduced inference times compared to existing baselines. For instance, on 4× super-resolution tasks, AHVP achieves perceptual quality improvements while being over 5× faster than the best-performing baseline.
Implications
The proposed framework has potential applications in various generative tasks, particularly in scenarios requiring real-time processing or high-resolution outputs. It can be utilized in fields such as computer vision, image restoration, and other areas where efficient sample generation is critical.
Bandit Convex Optimization with Gradient Prediction Adaptivity
Optimization
Theory
- Introduces Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT) for improving regret bounds in BCO.
- Establishes a fundamental lower bound for prediction-adaptive regret in BCO.
- Demonstrates that the variance of gradient estimation can obscure the benefits of accurate predictions.
- Develops adaptive algorithms that do not require prior knowledge of prediction error or time horizon.
Summary
This paper explores the field of Bandit Convex Optimization (BCO), focusing on the potential of optimistic gradient predictions to enhance worst-case regret guarantees in a prediction-adaptive manner. The authors establish a negative result indicating that under a single-point feedback protocol, a lower bound of Ω(√T) regret persists, even when the cumulative prediction error is small. To address this, they introduce the Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT) algorithm, which utilizes a novel variance-reduced gradient estimator that scales with the prediction error rather than the gradient norm. This approach yields a regret bound of O(√(d E[S_T])), where d is the decision dimension. The authors also provide an information-theoretic lower bound of Ω(√(E[S_T])), demonstrating that TP-VR-OPT is optimal up to a factor of √d. Furthermore, they develop adaptive variants that do not require prior knowledge of E[S_T] or the horizon T, and extend their framework to non-stationary environments, achieving dynamic regret guarantees that adapt to both cumulative prediction error and comparator path length.
Methodology
The authors propose a new algorithm, TP-VR-OPT, which employs a variance-reduced gradient estimator that adjusts based on the prediction error. This method is designed for the two-point feedback setting, allowing for improved regret bounds. They also derive theoretical guarantees for both static and dynamic regret, analyzing the performance of their approach in various scenarios, including non-stationary environments.
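To make the idea of an estimator that "scales with the prediction error rather than the gradient norm" concrete, the following is a minimal sketch that uses a gradient prediction as a control variate inside a standard two-point estimator. This construction is an assumption for illustration and is not taken from the paper; `two_point_estimate`, `hint`, and the quadratic test function are hypothetical names.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Direction drawn uniformly from the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def two_point_estimate(f, x, delta, u, hint=None):
    """Two-point gradient estimate of f at x along unit direction u.

    hint=None gives the classic estimator d * slope * u.
    With a prediction `hint`, only the residual (slope - <hint, u>) is
    randomized, so the noise shrinks as the prediction improves.
    """
    dim = len(x)
    x_plus = [xi + delta * ui for xi, ui in zip(x, u)]
    x_minus = [xi - delta * ui for xi, ui in zip(x, u)]
    slope = (f(x_plus) - f(x_minus)) / (2.0 * delta)  # approximates <grad f(x), u>
    if hint is None:
        return [dim * slope * ui for ui in u]
    hint_slope = sum(mi * ui for mi, ui in zip(hint, u))
    return [mi + dim * (slope - hint_slope) * ui for mi, ui in zip(hint, u)]

# Hypothetical test function: f(x) = sum(0.5 * a_i * x_i^2 + b_i * x_i).
A = [1.0, 2.0, 3.0]
B = [0.5, -1.0, 2.0]

def quadratic(x):
    return sum(0.5 * a * xi * xi + b * xi for a, b, xi in zip(A, B, x))

def quadratic_grad(x):
    return [a * xi + b for a, b, xi in zip(A, B, x)]

if __name__ == "__main__":
    rng = random.Random(0)
    x = [1.0, -1.0, 0.5]
    u = random_unit_vector(len(x), rng)
    print(two_point_estimate(quadratic, x, 0.1, u))
    print(two_point_estimate(quadratic, x, 0.1, u, hint=quadratic_grad(x)))
```

When the prediction equals the true gradient, the randomized residual vanishes and the estimate is exact for any direction, which is the behaviour a prediction-adaptive bound relies on.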
Results
The TP-VR-OPT algorithm achieves a regret bound of O(√(d E[S_T])), which is significantly better than the traditional O(√T) bounds under certain conditions. The authors also present an information-theoretic lower bound that confirms the optimality of their approach up to a factor of √d. Additionally, the adaptive variants of the algorithm maintain performance without requiring prior knowledge of the prediction error or time horizon.
Implications
The findings suggest that incorporating gradient prediction into bandit convex optimization can lead to significantly improved performance, particularly in environments where the loss functions exhibit predictable patterns. This work opens avenues for more efficient online learning algorithms that can adapt to varying levels of predictability in real-world applications.
Energy-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention
NLP
Large Language Models
Theory
- Introduction of Energy-Gated Attention (EGA) as a modification to transformer attention.
- EGA improves validation loss significantly with minimal parameter overhead.
- The method is grounded in turbulence theory and signal processing principles.
- Identifies learned wavelet packets as a promising direction for future research.
Summary
This paper introduces Energy-Gated Attention (EGA), a novel modification to the standard transformer attention mechanism that incorporates spectral salience as an inductive bias. Traditional transformers treat all tokens equally, failing to account for their intrinsic informational content. Drawing inspiration from turbulent fluid dynamics, where coherent structures dominate energy distribution, EGA posits that tokens with higher informational density should receive more attention. The method involves gating value aggregation based on the spectral energy of key token embeddings, which is computed through a learned linear projection. EGA improves validation loss on the TinyShakespeare and Penn Treebank datasets, reducing it by 0.103 and 0.101, respectively, with minimal parameter overhead (< 0.26%). The study also reveals that the optimal energy direction is data-adaptive and identifies learned wavelet packets as a promising area for future exploration. The findings suggest that EGA could enhance long-context efficiency in transformers, although empirical validation of this hypothesis is left for future work.
Methodology
EGA modifies the standard transformer attention mechanism by incorporating a learned energy gate that adjusts the attention weights based on the spectral energy of token embeddings. This is achieved through a linear projection that identifies the dominant spectral mode of the embedding field, allowing the model to focus on high-energy, informative tokens while suppressing low-energy background tokens.
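The gating step can be illustrated with a small sketch. The specific choices here (squared projection as the energy, a sigmoid gate, and renormalizing the gated weights) are assumptions for illustration; the summary does not specify EGA's exact formula, and `energy_gated_weights` is a hypothetical name.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def energy_gated_weights(query, keys, embeddings, energy_direction):
    """Attention weights multiplied by a per-token energy gate.

    energy_direction plays the role of the learned linear projection;
    here it is simply passed in as a fixed vector.
    """
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    attn = softmax(scores)
    # Energy of each token: squared projection onto the energy direction.
    energies = [sum(w * x for w, x in zip(energy_direction, emb)) ** 2 for emb in embeddings]
    # Sigmoid gate: tokens with more energy pass more of their value.
    gates = [1.0 / (1.0 + math.exp(-e)) for e in energies]
    gated = [a * g for a, g in zip(attn, gates)]
    total = sum(gated)
    return [g / total for g in gated]  # renormalization is an assumption

if __name__ == "__main__":
    embeddings = [[2.0, 0.0], [0.1, 0.0]]
    weights = energy_gated_weights([0.0, 0.0], embeddings, embeddings, [1.0, 0.0])
    print(weights)
```

With a zero query, plain softmax attention is uniform, so any difference between the two weights comes from the energy gate alone.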
Results
EGA reduced validation loss by 0.103 on the TinyShakespeare dataset and by 0.101 on the Penn Treebank dataset, demonstrating consistent performance across different datasets and initializations. The method incurs less than 0.26% additional parameter overhead and shows no measurable increase in computational cost.
Implications
The introduction of EGA could lead to more efficient transformer models that better capture the salient information in text, potentially improving performance in various NLP tasks. The findings also suggest avenues for further research into adaptive attention mechanisms and the role of spectral analysis in language modeling.
Three Costs of Amortizing Gaussian Process Inference with Neural Processes
Theory
Efficient ML
Generative Models
- Decomposes KL divergence between GP and LNP into three interpretable components.
- Identifies the decay rates of the bottleneck term based on kernel types and representation dimensions.
- Characterizes label contamination as a persistent cost in neural process uncertainty estimation.
- Offers architectural recommendations for variance prediction and aggregation methods.
Summary
This paper investigates the costs associated with amortizing Gaussian process (GP) inference using latent neural processes (LNP). Traditional GP inference is computationally expensive, scaling cubically with the number of context points, which limits its application in real-time scenarios. Neural processes offer a solution by learning a mapping from context sets to predictive distributions in linear time. However, this amortization introduces approximation errors that are not present in sparse GP frameworks. The author decomposes the Kullback-Leibler (KL) divergence between the GP and LNP predictives into three distinct components: label contamination, information bottleneck, and amortization error. The analysis reveals that the bottleneck term decays exponentially with the representation dimension for squared-exponential kernels and polynomially for Matérn kernels, linking architectural choices to kernel smoothness. The label contamination term remains constant, indicating a persistent cost in uncertainty estimation. The paper also provides architectural recommendations to improve predictive variance estimation and suggests replacing mean aggregation with second-order pooling to mitigate the dominant amortization gap.
Methodology
The paper employs a theoretical analysis of the KL divergence between Gaussian processes and latent neural processes. It derives bounds on the divergence components, linking them to architectural choices and kernel properties. The analysis includes a detailed examination of the representation dimension's impact on predictive performance and uncertainty estimation.
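For intuition about what a KL divergence between two predictive distributions measures, here is a minimal sketch for univariate Gaussian predictives, split into a mean-mismatch part and a variance-mismatch part. This two-way split is the standard closed form and is not the paper's three-component decomposition; the function name is hypothetical.

```python
import math

def gaussian_kl_parts(mean_p, std_p, mean_q, std_q):
    """KL(N(mean_p, std_p^2) || N(mean_q, std_q^2)) as (mean_term, variance_term).

    The two terms sum to the full KL divergence.
    """
    var_ratio = (std_p / std_q) ** 2
    mean_term = (mean_p - mean_q) ** 2 / (2.0 * std_q ** 2)
    variance_term = 0.5 * (var_ratio - 1.0 - math.log(var_ratio))
    return mean_term, variance_term

if __name__ == "__main__":
    # Reference predictive (e.g. an exact GP) vs. an approximate predictive.
    mean_term, variance_term = gaussian_kl_parts(0.0, 1.0, 0.3, 2.0)
    print("mean mismatch:", mean_term)
    print("variance mismatch:", variance_term)
    print("total KL:", mean_term + variance_term)
```

The variance term is zero only when the two predictive variances agree, which is why miscalibrated uncertainty shows up as a cost even when the predictive means match.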
Results
The study provides a quantitative characterization of the approximation errors introduced by amortization in neural processes. It establishes that the bottleneck term decays exponentially for squared-exponential kernels and polynomially for Matérn kernels, while the label contamination term remains constant. The findings lead to architectural recommendations that enhance predictive variance estimation and suggest alternative aggregation methods to reduce errors.
Implications
The insights from this paper can guide the design of neural processes for various applications requiring efficient GP inference, such as robotics, real-time decision-making, and simulation-based inference. By understanding the costs of amortization, practitioners can make informed choices about model architecture and improve predictive performance.
Efficient Higher-order Subgraph Attribution via Message Passing
Graph Learning
Interpretability
Efficient ML
- Introduction of subgraph GNN-LRP (sGNN-LRP) for efficient subgraph attribution.
- Reduction of computational complexity from exponential to linear time with respect to network depth.
- Utilization of message passing techniques to derive the new propagation rule.
- Generalization of subgraph attribution to include neighboring graph features.
Summary
This paper addresses the challenge of explaining Graph Neural Networks (GNNs) through higher-order interpretation schemes, specifically focusing on GNN-LRP (layer-wise relevance propagation for GNN). Traditional methods, while effective, suffer from exponential complexity when attributing relevance to subgraphs, making them impractical for larger graphs and deeper networks. The authors propose a novel algorithm, subgraph GNN-LRP (sGNN-LRP), which leverages message passing techniques to compute subgraph attributions in linear time relative to the network depth. This approach significantly reduces computational overhead by directly calculating the relevance of subgraphs in a single backpropagation pass, thus enhancing scalability and efficiency. The paper also introduces a generalized version of subgraph attribution that incorporates neighboring graph features, further improving the interpretability of GNNs. Experimental results demonstrate the effectiveness and speed of the proposed methods, showcasing their potential to facilitate deeper insights into GNN predictions while maintaining computational feasibility.
Methodology
The authors developed the sGNN-LRP algorithm using message passing techniques, specifically a sum-product approach that allows for the direct computation of subgraph relevance. This method contrasts with traditional GNN-LRP, which aggregates relevance over exponentially many walks. The new algorithm is designed to operate in linear time concerning the depth of the GNN, making it more efficient for larger and deeper networks.
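The gap between enumerating walks and passing messages can be shown with a generic sum-product identity: the sum over all length-L walks inside a node subset of the product of edge weights equals the result of L rounds of message passing. The sketch below illustrates only that identity, not the actual LRP relevance rule; the function names and the toy graph are assumptions.

```python
from itertools import product

def walk_sum_bruteforce(weights, nodes, length):
    """Sum over every walk of `length` edges inside `nodes` of the edge-weight product.

    Enumerates len(nodes) ** (length + 1) candidate walks, so cost is
    exponential in the walk length.
    """
    total = 0.0
    for walk in product(nodes, repeat=length + 1):
        term = 1.0
        for a, b in zip(walk, walk[1:]):
            term *= weights[a][b]
        total += term
    return total

def walk_sum_message_passing(weights, nodes, length):
    """Same quantity computed with `length` rounds of message passing.

    Each round costs O(len(nodes) ** 2), so cost is linear in the walk length.
    """
    messages = {n: 1.0 for n in nodes}
    for _ in range(length):
        messages = {
            b: sum(messages[a] * weights[a][b] for a in nodes) for b in nodes
        }
    return sum(messages.values())

if __name__ == "__main__":
    # Path graph 0-1-2-3; restrict attention to the subgraph {0, 1, 2}.
    adjacency = [
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0, 1.0],
        [0.0, 0.0, 1.0, 0.0],
    ]
    subgraph = [0, 1, 2]
    print(walk_sum_bruteforce(adjacency, subgraph, 2))
    print(walk_sum_message_passing(adjacency, subgraph, 2))
```

Both functions return the same value; only the second stays tractable as the number of layers, and hence the walk length, grows.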
Results
The proposed sGNN-LRP algorithm significantly accelerates the computation of subgraph attributions, demonstrating linear time complexity in relation to network depth. Experimental results indicate that the method is not only faster but also maintains high accuracy in relevance attribution, thereby enhancing the interpretability of GNNs.
Implications
The findings suggest that the sGNN-LRP method can be applied to various domains where GNNs are utilized, such as social network analysis and molecular property prediction. By improving the interpretability of GNNs, this work could facilitate their adoption in critical applications requiring transparency and trustworthiness.
Implicit Regularization of Mini-Batch Training in Graph Neural Networks
Graph Learning
Optimization
Efficient ML
- RNS can match or outperform full-graph training on 8 out of 10 datasets.
- Backward error analysis shows that mini-batch SGD implicitly minimizes a modified objective.
- RNS provides lower variance in per-batch gradients compared to structure-aware samplers.
- RNS is computationally efficient, achieving 2× to 12× speedups and up to 3× lower peak GPU memory usage.
Summary
This paper investigates the implications of mini-batch training in Graph Neural Networks (GNNs), particularly focusing on Random Node Sampling (RNS). Unlike traditional i.i.d. data training, mini-batch training on graphs alters the topology and introduces boundary effects, which can complicate the optimization process. The authors demonstrate that RNS, which involves training on subgraphs induced by uniformly sampled nodes, can match or outperform full-graph training across multiple datasets while significantly reducing computational time and memory usage. Through backward error analysis, the paper reveals that mini-batch Stochastic Gradient Descent (SGD) implicitly minimizes a modified objective that combines the sampled loss with a regularization term proportional to the variance of mini-batch gradients. This finding positions RNS as a theoretically grounded and effective method for scalable GNN training, providing a new perspective on the role of graph samplers in optimization dynamics.
Methodology
The authors applied backward error analysis to the mini-batch training of GNNs, specifically focusing on the effects of Random Node Sampling (RNS). They compared RNS with other sampling strategies across various datasets and GNN architectures to evaluate performance and computational efficiency.
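Random Node Sampling as described in the summary (train on the subgraph induced by uniformly sampled nodes) can be sketched in a few lines. The edge-list representation and the function name are assumptions; a real pipeline would use a graph library's tensors instead.

```python
import random

def random_node_sample(num_nodes, edges, batch_size, rng):
    """Sample nodes uniformly without replacement and return the induced subgraph.

    An edge is kept only if both endpoints were sampled, which is what
    drops boundary edges and changes the topology seen by each mini-batch.
    """
    sampled = set(rng.sample(range(num_nodes), batch_size))
    induced_edges = [(u, v) for u, v in edges if u in sampled and v in sampled]
    return sorted(sampled), induced_edges

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy graph: a ring over 10 nodes.
    ring = [(i, (i + 1) % 10) for i in range(10)]
    nodes, sub_edges = random_node_sample(10, ring, 5, rng)
    print(nodes, sub_edges)
```

Each training step would draw a fresh sample, so the set of edges that survive varies from batch to batch.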
Results
The results indicate that RNS not only matches but often exceeds the performance of full-graph training in terms of predictive accuracy across diverse datasets. Additionally, RNS significantly reduces the computational resources required, demonstrating its effectiveness as a scalable training method for GNNs.
Implications
The findings suggest that RNS can serve as a robust default method for training GNNs, particularly in scenarios where computational efficiency is critical. This work also opens avenues for further research into the optimization dynamics of graph-based learning and the role of sampling strategies in model performance.
Can Transformers Learn to Verify During Backtracking Search?
Theory
Large Language Models
Optimization
- Transformers struggle with state-local decision-making due to history entanglement and scattered retrieval of state features.
- Selective State Attention (SSA) is introduced as a structural fix to enforce state-based decision-making.
- SSA allows transformers to make consistent decisions based solely on the current search state, improving performance in backtracking search tasks.
- The study emphasizes the need for structural modifications in transformer models to enhance their reasoning capabilities.
Summary
This paper investigates the ability of transformer models to learn verification processes during backtracking search, a fundamental algorithmic technique used in constraint solvers and theorem provers. The authors highlight that traditional training methods for transformers, which utilize cumulative traces of decisions, lead to two significant issues: scattered retrieval of state features and history entanglement, where the model's predictions depend on the trajectory rather than the current state. To address these challenges, the authors propose Selective State Attention (SSA), a structural modification that ensures the model's decisions are based solely on the current state and not influenced by previous decisions. The study focuses on reactive verification scenarios and evaluates SSA on various problems, including 3-SAT and graph coloring. The findings reveal that SSA significantly improves the model's performance in making consistent decisions based on the current state, demonstrating the importance of structural adjustments in transformer architectures for effective reasoning tasks.
Methodology
The authors implemented Selective State Attention (SSA) as a fixed attention mask in decoder-only transformers, allowing the model to focus on the current decision block while ignoring prior decisions. They conducted experiments on multiple reasoning tasks, including 3-SAT, graph coloring, and backtracking parsing, to evaluate the effectiveness of SSA compared to traditional training methods that utilize cumulative traces.
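A fixed mask of this kind can be sketched as follows, assuming each token carries the id of the decision block it belongs to. The rule used here (causal attention restricted to the same block) is an assumption based on the description above; the paper's mask may, for example, also keep a shared problem prefix visible. `state_local_mask` is a hypothetical name.

```python
def state_local_mask(block_ids):
    """Boolean attention mask: mask[i][j] is True if token i may attend to token j.

    Token i sees token j only when j is not in the future (j <= i) and both
    tokens belong to the same decision block.
    """
    n = len(block_ids)
    return [
        [(j <= i) and (block_ids[j] == block_ids[i]) for j in range(n)]
        for i in range(n)
    ]

if __name__ == "__main__":
    # Two tokens from an earlier decision, three from the current one.
    for row in state_local_mask([0, 0, 1, 1, 1]):
        print(["x" if allowed else "." for allowed in row])
```

Because tokens in the current block never see earlier blocks, two traces that reach the same state through different histories present identical inputs to the attention layers for that block.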
Results
The introduction of SSA led to improved decision-making in transformer models, particularly in scenarios where the model's prior history could mislead its predictions. The experiments demonstrated that SSA-enabled models produced identical decisions for same-state pairs that differed only in prior history, while traditional models failed to do so. This indicates that SSA effectively isolates state features, leading to more reliable verification during backtracking search.
Implications
The findings suggest that transformer models can be enhanced for reasoning tasks through structural modifications, potentially improving their application in various domains such as automated theorem proving, constraint satisfaction problems, and planning. The analysis also opens avenues for inference-time context clearing as a method to apply similar isolation techniques without retraining.
Stabilising Explainability Fragility in Cybersecurity AI: The Impact and Mitigation of Multicollinearity in Public Benchmark Datasets
Interpretability
- Introduces a formal theorem linking multicollinearity to explainability fragility in AI models for intrusion detection.
- Proposes the Explainability Fragility Score to measure instability in feature attributions.
- Presents two novel methods (CAA-Filtering and SHARP) to mitigate explainability fragility.
- Demonstrates the impact of multicollinearity on feature importance and explanation stability using the UNSW-NB15 dataset.
Summary
This paper addresses a critical vulnerability in AI explainability within intrusion detection systems (IDS) caused by multicollinearity among features in public benchmark datasets. The authors introduce a formal theorem that establishes how multicollinearity inflates attribution variance, leading to unreliable explanations and feature importances. They validate this theorem through comprehensive experiments on the UNSW-NB15 dataset, evaluating four model families (linear, tree-based, kernel, and neural) under varying feature sets. The paper proposes a novel Explainability Fragility Score to quantify instability in explanations and introduces two mitigation methods: CAA-Filtering, which stabilizes explanations by grouping attributions, and SHARP, a training-time regularization framework that penalizes attribution instability. The findings demonstrate that addressing multicollinearity can enhance the stability of predictive performance, with implications for the trustworthiness and reproducibility of explainable AI in security-critical contexts. The authors provide guidelines for incorporating multicollinearity mitigations into IDS pipelines, advancing methodological rigor in IDS research.
Methodology
The authors conducted a comprehensive analysis of multicollinearity effects on explainability in AI models by introducing an axiomatic framework. They performed experiments on the UNSW-NB15 dataset, applying various models and evaluating their performance under conditions of multicollinearity. They developed the Explainability Fragility Score and implemented two mitigation strategies: CAA-Filtering for post-hoc stabilization of attributions and SHARP for regularization during training to control attribution stability.
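A common way to detect the multicollinearity discussed here is the variance inflation factor (VIF). The sketch below covers only the two-feature case, where VIF reduces to 1 / (1 - r^2) with r the Pearson correlation. It is a standard diagnostic, not the paper's fragility score, and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def vif_two_features(xs, ys):
    """Variance inflation factor when the model has exactly two features.

    Undefined (division by zero) for perfectly collinear features.
    """
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r * r)

if __name__ == "__main__":
    feature_a = [1.0, 2.0, 3.0, 4.0]
    nearly_same = [1.1, 1.9, 3.2, 3.9]    # almost a copy of feature_a
    unrelated = [1.0, -1.0, -1.0, 1.0]    # uncorrelated with feature_a
    print(vif_two_features(feature_a, nearly_same))
    print(vif_two_features(feature_a, unrelated))
```

A large VIF means the coefficient (and hence a linear attribution) for that feature has inflated variance, which is the mechanism behind unstable explanations.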
Results
The experiments confirmed that multicollinearity significantly affects the stability of feature attributions and explanations. The proposed SHARP method allowed for controllable improvements in explainability stability without sacrificing predictive performance. The authors provided empirical evidence showing that their methods could effectively reduce attribution variance, leading to more reliable explanations in AI-driven intrusion detection systems.
Implications
The findings have significant implications for the trustworthiness and reproducibility of explainable AI in cybersecurity, particularly in intrusion detection systems. By addressing multicollinearity, practitioners can enhance the reliability of feature importance claims and improve the overall robustness of AI models in security-critical applications. The guidelines provided can help researchers and practitioners incorporate these considerations into their IDS benchmarking processes.
The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution
Optimization
Multimodal
Theory
- Identifies the limitations of linear scalarization in multi-task RRG through gradient dynamics analysis.
- Introduces CAME-Grad, a new optimizer that enhances multi-task learning without modifying existing architectures.
- Demonstrates significant performance improvements in clinical efficacy for radiology report generation.
- Highlights the importance of balancing clinical supervision with report generation smoothness.
Summary
This paper addresses the challenges in multi-task learning for automatic radiology report generation (RRG), particularly the limitations of linear scalarization strategies that fail to balance clinical supervision with report generation smoothness. The authors analyze the failure mechanisms of these strategies through gradient dynamics, identifying a 'Double Dilemma' characterized by drift term deviation and diffusion term decay. To overcome these issues, they propose a novel optimizer called Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad), which enhances optimization dynamics without altering the core architecture of existing models. CAME-Grad employs three main strategies: Conflict-Averse Direction Rectification to mitigate destructive interference, Magnitude-Enhanced Energy Injection to restore gradient magnitude, and Adaptive Gradient Fusion to balance optimal directions with task-specific biases. Experimental results demonstrate that CAME-Grad significantly improves clinical efficacy across multiple RRG methods, achieving an average performance increase of 2.3% on MIMIC-CXR and 1.9% on IU X-Ray datasets. The findings suggest that addressing gradient dynamics can lead to more effective multi-task learning in clinical applications.
Methodology
The authors utilized the stochastic differential equation (SDE) framework to analyze the gradient dynamics of multi-task optimization in RRG. They proposed the CAME-Grad optimizer, which includes Conflict-Averse Direction Rectification, Magnitude-Enhanced Energy Injection, and Adaptive Gradient Fusion to improve optimization outcomes.
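The summary does not give CAME-Grad's update rules, so the sketch below substitutes a well-known stand-in for the first two ideas: a PCGrad-style projection that removes the conflicting component of one task gradient, followed by rescaling to the original norm. Both functions are assumptions for illustration and should not be read as the paper's optimizer.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def remove_conflict(g_task, g_other):
    """PCGrad-style projection: if the gradients conflict (negative inner
    product), drop the component of g_task that points against g_other."""
    inner = dot(g_task, g_other)
    if inner >= 0.0:
        return list(g_task)
    scale = inner / dot(g_other, g_other)
    return [a - scale * b for a, b in zip(g_task, g_other)]

def restore_magnitude(g_projected, g_original):
    """Rescale the projected gradient to the norm of the original one
    (an illustrative way to counter the shrinkage caused by projection)."""
    norm_proj = math.sqrt(dot(g_projected, g_projected))
    norm_orig = math.sqrt(dot(g_original, g_original))
    if norm_proj == 0.0:
        return list(g_projected)
    return [c * norm_orig / norm_proj for c in g_projected]

if __name__ == "__main__":
    g_report = [1.0, 0.0]      # hypothetical report-generation gradient
    g_clinical = [-1.0, 1.0]   # hypothetical clinical-supervision gradient
    projected = remove_conflict(g_report, g_clinical)
    print(projected)
    print(restore_magnitude(projected, g_report))
```

After projection the two gradients are orthogonal, so a step along the rectified direction no longer increases the other task's loss to first order.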
Results
CAME-Grad was tested across eight different RRG methods, resulting in an average performance improvement of 2.3% on the MIMIC-CXR dataset and 1.9% on the IU X-Ray dataset, demonstrating its effectiveness as a universal optimizer in enhancing clinical efficacy.
Implications
The findings suggest that optimizing gradient dynamics can significantly enhance the performance of multi-task learning models in clinical settings, potentially leading to more accurate and reliable automated radiology report generation systems.
Winner-Take-All bottlenecks enforce disentangled symbolic representations in multi-task learning
Theory
- WTA bottlenecks can enforce the extraction of categorical latent factors in multi-task learning.
- The representation from WTA bottlenecks is a structured permutation of the original latent factors.
- Symbolic representations allow individual neurons to encode specific abstract features.
- Empirical results confirm the theoretical findings across different architectures.
Summary
This paper investigates the role of Winner-Take-All (WTA) bottlenecks in deep neural networks, particularly in the context of multi-task learning. The authors demonstrate that WTA bottlenecks can enforce the extraction of categorical latent factors from highly non-linearly entangled data. They prove that under certain conditions, the representation emerging from a WTA bottleneck is a structured permutation of the original latent factors, leading to highly symbolic representations where individual neurons or populations of neurons encode specific abstract features. Empirical results on two datasets support the theoretical findings, showing that even when the network architectures deviate from the assumptions, similar disentangled representations arise. The symbolic representations acquired through this approach are shown to enhance generalization capabilities, suggesting a potential link between symbolic and subsymbolic AI systems.
Methodology
The authors employ a theoretical framework to analyze WTA bottlenecks in deep neural networks, proving that under specific conditions, these bottlenecks facilitate the extraction of categorical latent factors. They also conduct empirical experiments on two datasets to validate their theoretical claims and demonstrate the emergence of symbolic representations.
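A hard Winner-Take-All bottleneck can be sketched as a per-group one-hot operation: within each group of units, only the most active unit survives. The grouping and the one-hot output are assumptions for illustration, since the summary does not state the exact form used in the paper.

```python
def winner_take_all(activations, group_size):
    """Hard WTA: within each consecutive group, output 1.0 for the largest
    activation and 0.0 for the rest (ties go to the first maximum)."""
    output = []
    for start in range(0, len(activations), group_size):
        group = activations[start:start + group_size]
        winner = max(range(len(group)), key=group.__getitem__)
        output.extend(1.0 if i == winner else 0.0 for i in range(len(group)))
    return output

if __name__ == "__main__":
    # Two groups of three units: each group acts like one categorical variable.
    print(winner_take_all([0.2, 0.9, 0.1, 0.5, 0.4, 0.45], 3))
```

Each group can then be read as a categorical code, which matches the intuition that such a bottleneck pushes the network toward discrete, symbol-like factors.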
Results
The study finds that WTA bottlenecks lead to the emergence of highly symbolic and disentangled representations, where individual neurons or groups of neurons represent distinct abstract features. The empirical results corroborate the theoretical assertions, indicating that such representations enhance the generalization performance of the models.
Implications
The findings suggest that integrating WTA-like components in neural networks could bridge the gap between symbolic and subsymbolic AI, potentially leading to more interpretable and generalizable AI systems. This could have applications in various fields requiring robust multi-task learning and representation learning.
MoSA: Motion-constrained Stress Adaptation for Mitigating Real-to-Sim Gap in Continuum Dynamics via Learning Residual Anisotropy
Robotics
Computer Vision
Optimization
- MoSA targets residual anisotropy and heterogeneity in real-world dynamics by augmenting an isotropic model with a physics-informed residual stress adaptation module.
- The framework employs motion-constrained optimization to provide more direct supervision, improving data efficiency and reducing overfitting.
- Experimental results indicate that MoSA significantly enhances accuracy, generalization, and robustness in learning dynamics from visual data.
- The approach preserves physical inductive bias and interpretability while effectively capturing subtle residual effects.
Summary
The paper introduces MoSA, a novel framework designed to address the real-to-sim gap in continuum dynamics by learning residual anisotropy. Traditional methods often rely on isotropic models that fail to capture the mild anisotropy and heterogeneity present in real-world materials. MoSA builds upon a calibrated isotropic backbone and incorporates a structured residual stress adaptation operator to progressively correct stress responses, thereby enhancing simulation fidelity while maintaining physical interpretability. Additionally, the framework employs motion-constrained optimization, leveraging dynamic 3D reconstructions to provide higher-order supervision for the temporal and spatial derivatives of deformation fields. This approach improves data efficiency and reduces overfitting in video-based dynamics learning. Experimental results demonstrate that MoSA achieves superior accuracy, generalization, and robustness compared to existing methods, effectively learning physically meaningful residual anisotropy. The practical implications of this improved modeling are validated in a robot manipulation setting, showcasing enhanced sim-to-real transfer capabilities.
Methodology
MoSA utilizes a physics-informed residual stress adaptation operator to correct an isotropic constitutive model, capturing mild anisotropy and heterogeneity. It incorporates motion constraints derived from dynamic 3D reconstructions to supervise the temporal and spatial derivatives of the deformation field, enhancing the learning process.
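The idea of correcting an isotropic constitutive model with a residual stress term can be written schematically as follows. This is an illustrative sketch only: the summary does not give MoSA's actual operator, so the linear-elastic backbone, the fiber direction $a$, and the stiffness $k$ are assumptions chosen because they are the simplest familiar example of an anisotropic correction.

```latex
\begin{aligned}
\sigma_{\text{iso}}(\varepsilon) &= \lambda\,\operatorname{tr}(\varepsilon)\,I + 2\mu\,\varepsilon
  && \text{(isotropic backbone, Lam\'e parameters } \lambda, \mu\text{)} \\
\sigma(\varepsilon) &= \sigma_{\text{iso}}(\varepsilon) + \Delta\sigma_{\theta}(\varepsilon)
  && \text{(backbone plus learned residual)} \\
\Delta\sigma_{\theta}(\varepsilon) &= k\,(a^{\top}\varepsilon\,a)\,a a^{\top},
  \quad \theta = (k, a),\ \|a\| = 1
  && \text{(hypothetical single-fiber residual)}
\end{aligned}
```

Setting $k = 0$ recovers the isotropic model exactly, which is one way a residual formulation of this kind can keep the physical inductive bias of the backbone while adding only a small, interpretable correction.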
Results
The proposed framework outperforms existing methods in terms of accuracy, generalization, and robustness across both synthetic and real datasets. The model successfully learns physically meaningful residual anisotropy, leading to improved performance in sim-to-real transfer tasks.
Implications
MoSA's advancements in modeling real-world dynamics have significant implications for applications in robotics, graphics, and the development of interactive digital twins, enabling more reliable simulations and manipulations of physical objects.
EntmaxKV: Support-Aware Decoding for Entmax Attention
NLP
Large Language Models
Efficient ML
- EntmaxKV enables efficient sparse decoding by exploiting the sparsity of α-entmax attention before loading KV pages.
- The framework achieves exact support recovery, avoiding the probability mass loss associated with softmax truncation.
- Empirical results show significant speedups and reduced output errors compared to traditional softmax-based sparse decoding methods.
- The introduction of a Gaussian-aware selector enhances the adaptability of the candidate selection process.
Summary
The paper introduces EntmaxKV, a novel framework for sparse decoding in long-context autoregressive models that use α-entmax attention. KV-cache memory traffic during decoding is a significant bottleneck, especially as context length increases. Existing sparse decoding methods are primarily designed for softmax attention, which assigns nonzero probability to every token, so truncation inevitably discards probability mass. In contrast, α-entmax produces exact zeros, allowing for precise support recovery in sparse decoding. The authors propose a method that leverages query-aware page scoring and support-aware candidate selection to minimize memory traffic before loading KV pages. The framework is designed to retain more support tokens while dropping less probability mass, thus achieving lower output error compared to softmax-based methods. The paper also introduces a Gaussian-aware selector that dynamically adjusts the candidate budget based on page statistics. Empirical results demonstrate that EntmaxKV closely matches the performance of full-cache entmax while significantly reducing the required KV cache size, achieving speedups of up to 5.43x over full attention baselines at a context length of 1 million tokens.
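The contrast between softmax and entmax truncation can be made concrete with a small sketch. It assumes the standard α-entmax form p_i = [(α-1)·z_i - τ]_+^(1/(α-1)) and finds the threshold τ by bisection; the function names and the toy scores are illustrative and not taken from the paper.

```python
import math

def softmax(z):
    """Dense softmax: every entry is strictly positive."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entmax_bisect(z, alpha=1.5, iters=60):
    """alpha-entmax (alpha > 1) via bisection on the threshold tau."""
    x = [(alpha - 1.0) * v for v in z]
    power = 1.0 / (alpha - 1.0)
    lo = max(x) - 1.0                      # total mass >= 1 at this tau
    hi = max(x) - len(x) ** (1.0 - alpha)  # total mass <= 1 at this tau
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        mass = sum(max(v - tau, 0.0) ** power for v in x)
        if mass >= 1.0:
            lo = tau
        else:
            hi = tau
    p = [max(v - lo, 0.0) ** power for v in x]
    total = sum(p)
    return [v / total for v in p]  # renormalize away bisection error

# Toy attention scores for four tokens (illustrative values).
scores = [2.0, 1.0, 0.1, -2.0]
p_entmax = entmax_bisect(scores, alpha=1.5)
p_softmax = softmax(scores)

# Tokens with nonzero entmax probability form the support.
support = [i for i, v in enumerate(p_entmax) if v > 0.0]
outside = [i for i in range(len(scores)) if i not in support]

# Mass lost if only the support tokens are kept.
dropped_entmax = sum(p_entmax[i] for i in outside)
dropped_softmax = sum(p_softmax[i] for i in outside)
```

Keeping only the support tokens loses nothing under entmax, because the discarded entries are exactly zero, while the same truncation under softmax throws away a nonzero share of the distribution.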
Methodology
The authors developed EntmaxKV as a paged selective decoding method that utilizes query-aware metadata for scoring KV pages. The framework employs support-aware candidate selection and sparse entmax attention to optimize memory usage and decoding efficiency. A Gaussian-aware selector estimates the entmax threshold based on lightweight page statistics, allowing for dynamic adjustment of the candidate budget.
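A minimal sketch of query-aware page scoring follows. The summary does not say which metadata EntmaxKV stores, so this assumes per-channel minimum and maximum key values per page, a common choice in paged sparse attention; `score_threshold` is a hypothetical stand-in for the estimated entmax threshold expressed in score units, and all function names are illustrative.

```python
def page_metadata(keys):
    """Per-channel min and max over the keys stored in one page."""
    dims = range(len(keys[0]))
    k_min = [min(k[d] for k in keys) for d in dims]
    k_max = [max(k[d] for k in keys) for d in dims]
    return k_min, k_max

def page_upper_bound(query, k_min, k_max):
    """Upper bound on q . k for any key inside the page's bounding box."""
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, k_min, k_max))

def select_pages(query, pages, score_threshold):
    """Keep only pages that could contain a score above the threshold."""
    kept = []
    for idx, keys in enumerate(pages):
        k_min, k_max = page_metadata(keys)
        if page_upper_bound(query, k_min, k_max) > score_threshold:
            kept.append(idx)
    return kept

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Three toy pages with two 2-dimensional keys each (illustrative values).
pages = [
    [[1.0, 0.5], [0.8, 0.7]],
    [[-1.0, 0.2], [-0.5, -0.3]],
    [[0.9, -0.1], [0.2, 0.4]],
]
query = [2.0, 1.0]

bounds = [page_upper_bound(query, *page_metadata(keys)) for keys in pages]
kept = select_pages(query, pages, score_threshold=1.0)
```

Because the bound never underestimates a page's best score, a page whose bound falls at or below the threshold cannot contain a token in the entmax support, so skipping it costs no probability mass. In a real system the metadata would be precomputed when the page is written rather than rebuilt per query.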
Results
EntmaxKV demonstrated superior performance in retaining support tokens and minimizing dropped probability mass compared to softmax-based methods. It achieved lower output error and matched the performance of full-cache entmax while using a fraction of the KV cache, resulting in speedups of up to 3.36x (softmax) and 5.43x (entmax) at a context length of 1 million tokens.
Implications
The findings suggest that EntmaxKV can significantly enhance the efficiency of long-context language models, making them more viable for applications requiring extensive context processing, such as document analysis and conversational AI. This approach could lead to advancements in the deployment of large language models in resource-constrained environments.
Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)
Large Language Models
NLP
Interpretability
- SVD of the lm_head weight matrix reveals interpretable semantic subspaces without model inference.
- Different LLMs exhibit systematic differences in vocabulary clustering and training data composition.
- Ethically concerning vocabulary subspaces are rooted in pretraining data and persist through post-training alignment.
- The study introduces VCS and WPS as new metrics for evaluating vocabulary coherence and detecting glitch tokens.
Summary
This paper introduces a novel method for analyzing the weight matrix of the output projection layer (lm_head) of transformer-based large language models (LLMs) using singular value decomposition (SVD). The approach requires only five lines of PyTorch code and no model inference, allowing researchers to uncover interpretable semantic subspaces directly from the model weights. The study examines three models: GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, revealing systematic differences in their singular value spectra and vocabulary cluster structures. Key findings include the identification of ethically concerning vocabulary subspaces that persist despite post-training alignment, suggesting that these issues originate in pretraining data. The paper introduces two new metrics: the Vocabulary Cluster Score (VCS) for quantifying subspace coherence and the Weighted Projection Score (WPS) for detecting glitch tokens. The authors advocate for the adoption of lm_head SVD analysis as a standard pre-release safety auditing step, proposing that it can guide tokenizer optimization and improve LLM design.
Methodology
The methodology involves applying singular value decomposition (SVD) to the lm_head weight matrix of transformer-based LLMs. This analysis is performed using five lines of PyTorch code, allowing for the extraction of left singular vectors that indicate the most likely vocabulary tokens based on the model's hidden state alignment.
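The analysis can be sketched on a toy matrix. This is not the paper's five-line PyTorch snippet: it uses NumPy as a stand-in, a hand-built 6x4 matrix in place of real weights, and a hypothetical helper name. With a real model, the matrix would be the output projection weights, typically of shape (vocab_size, hidden_dim).

```python
import numpy as np

def top_tokens_per_component(weight, vocab, n_components=2, top_k=3):
    """SVD of a (vocab_size, hidden_dim) matrix; list dominant tokens per component."""
    # Columns of u are left singular vectors, one entry per vocabulary token.
    u, s, _ = np.linalg.svd(weight, full_matrices=False)
    clusters = []
    for c in range(n_components):
        # Rank tokens by the magnitude of their loading on this component.
        order = np.argsort(-np.abs(u[:, c]))[:top_k]
        clusters.append([vocab[i] for i in order])
    return s, clusters

# Toy stand-in for lm_head weights: two planted groups of tokens.
vocab = ["cat", "dog", "bird", "one", "two", "three"]
W = np.array([
    [3.0, 0.0, 0.0, 0.0],
    [2.9, 0.1, 0.0, 0.0],
    [3.1, 0.0, 0.1, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1.1, 0.0, 0.1],
    [0.1, 0.9, 0.0, 0.0],
])

singular_values, clusters = top_tokens_per_component(W, vocab)
```

In this toy case the first component groups the animal tokens and the second groups the number tokens, which is the kind of vocabulary cluster the summary describes, recovered from weights alone with no forward pass.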
Results
The analysis of the three models revealed distinct vocabulary cluster structures: GPT-OSS-120B showed a graduated hierarchy of functional subspaces, Gemma-2-2B was dominated by historical English orthography, and Qwen2.5-1.5B exhibited broad multilingual coverage with ethically inappropriate vocabulary. The study found that post-training alignment does not resolve pretraining-level issues, and the proposed metrics (VCS and WPS) successfully identified problematic vocabulary and glitch tokens.
Implications
The findings suggest that SVD analysis of the lm_head can serve as a crucial tool for pre-release safety auditing of LLMs, helping to identify and mitigate ethically concerning content. Additionally, the proposed metrics can guide improvements in tokenizer design and enhance the controllability of LLM outputs.
Algebraic Machine Learning for Small-to-Medium Datasets Is Competitive against Strong Standard Baselines
Theory
Efficient ML
- AML outperforms standard baselines like CNNs in small to medium image datasets.
- In tabular data, AML is competitive with methods like LightGBM and random forests, though XGBoost remains the top performer.
- AML does not require cross-validation or hyperparameter tuning, making it advantageous in low-data regimes.
- The study provides empirical evidence that symbolic learning can be competitive in supervised tasks.
Summary
This paper evaluates Algebraic Machine Learning (AML), a framework that utilizes subdirect decomposition of algebraic structures instead of numerical optimization, against standard machine learning baselines on image and tabular classification tasks. The authors demonstrate that AML can outperform convolutional neural networks (CNNs) on image data and is competitive with LightGBM and random forests on tabular data, particularly in small to medium-sized datasets (50-2000 training examples), although XGBoost remains the strongest tabular method overall. AML's performance is notable as it does not require cross-validation or hyperparameter tuning, which are common in conventional methods. The study shows that AML achieves competitive results across diverse datasets, suggesting that symbolic learning can be effective in realistic supervised tasks when the symbolic structure is learned from data rather than manually specified. The findings indicate that AML is particularly strong in low-data scenarios, challenging the assumption that symbolic methods are inferior to modern numerical learners.
Methodology
The authors conducted systematic empirical evaluations of AML on supervised classification tasks using both image and tabular datasets. They compared AML's performance against strong cross-validated baselines, including CNNs for images and XGBoost, LightGBM, and random forests for tabular data. The evaluation focused on training sets ranging from 50 to 2000 examples, assessing the effectiveness of AML's algebraic inductive bias without task-specific hyperparameter tuning.
Results
Across twelve standard image datasets, AML with a logistic regression readout was the best-performing method overall and statistically distinguishable from each baseline. In the tabular datasets, while XGBoost was the best overall, AML showed comparable performance to LightGBM and random forests, achieving the best results on several datasets with 1000 training examples. These results indicate that AML can effectively compete with traditional methods in low- and medium-data scenarios.
Implications
The findings suggest that symbolic learning frameworks like AML can be viable alternatives to traditional machine learning methods, particularly in situations where data is limited. This could lead to new approaches in machine learning that leverage algebraic structures, potentially expanding the applicability of symbolic methods in various domains.
HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine
Reinforcement Learning
Large Language Models
NLP
- HealthCraft is the first public RL environment specifically designed for emergency medicine.
- The environment incorporates a FHIR R4 world state and a dual-layer safety rubric.
- A benchmark of 195 tasks with 2,255 criteria, including 515 safety-critical criteria, is established.
- Testing reveals significant safety-failure rates in frontier LLMs under clinical pressure.
Summary
HealthCraft introduces a novel reinforcement learning (RL) environment tailored for emergency medicine, addressing the inadequacies of existing medical-QA benchmarks in evaluating large language models (LLMs) under clinical pressure. The environment is built on a FHIR R4 world state, featuring 14 entity types and 3,987 seed entities, and includes 24 MCP tools for interaction. A dual-layer rubric ensures that any violation of safety-critical criteria results in a zero reward, emphasizing trajectory-level safety. The paper presents a benchmark of 195 tasks across six categories, graded against 2,255 binary criteria, of which 515 are safety-critical. Results from testing two frontier models, Claude Opus 4.6 and GPT-5.4, reveal significant safety-failure rates and performance collapse in multi-step workflows, highlighting the need for robust evaluation infrastructure. The findings underscore the importance of infrastructure fidelity in model assessment and the necessity for a dedicated RL environment in safety-critical clinical settings. HealthCraft is released under Apache 2.0, providing a comprehensive evaluation framework for future research.
Methodology
The methodology involves adapting the Corecraft architecture to create a reinforcement learning environment that simulates emergency medicine scenarios. It utilizes a FHIR R4 world state and implements a dual-layer rubric with a hard safety gate that nullifies rewards for any safety-critical violations. The environment is Docker-bundled and includes a task engine, rubric grader, and audit logging for tool interactions.
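The hard safety gate can be sketched as follows. The gate itself comes from the description above; the `Criterion` type, the function name, and the choice to score a safe trajectory as the fraction of satisfied criteria are assumptions, since the summary does not say how the non-safety layer is aggregated.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    """One binary rubric item (hypothetical structure)."""
    name: str
    satisfied: bool
    safety_critical: bool = False

def trajectory_reward(criteria):
    """Zero reward on any safety-critical failure; otherwise fraction satisfied."""
    # Layer 1: hard gate. A single safety-critical violation nullifies the reward.
    if any(c.safety_critical and not c.satisfied for c in criteria):
        return 0.0
    if not criteria:
        return 0.0
    # Layer 2 (assumed aggregation): share of binary criteria that were met.
    return sum(c.satisfied for c in criteria) / len(criteria)

safe_run = [
    Criterion("checked allergies before ordering", True, safety_critical=True),
    Criterion("documented triage level", True),
    Criterion("ordered follow-up labs", False),
]
unsafe_run = [
    Criterion("checked allergies before ordering", False, safety_critical=True),
    Criterion("documented triage level", True),
    Criterion("ordered follow-up labs", True),
]

safe_reward = trajectory_reward(safe_run)
unsafe_reward = trajectory_reward(unsafe_run)
```

The unsafe run satisfies two of three criteria yet receives no reward, which illustrates why a gated rubric measures trajectory-level safety rather than average task completion.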
Results
The evaluation of two frontier models, Claude Opus 4.6 and GPT-5.4, showed Pass@1 rates of 24.8% and 12.6%, respectively, with safety-failure rates of 27.5% and 34.0%. In multi-step workflows, performance dropped to near zero, indicating a critical failure in maintaining safety under sustained clinical pressure. The study also identified six infrastructure bugs that affected model performance assessments.
Implications
HealthCraft has significant implications for the deployment of AI in clinical settings, particularly in emergency medicine. It provides a framework for evaluating LLMs in a safety-critical context, ensuring that models can withstand the pressures of real-world clinical decision-making. This environment can guide future research in AI safety and clinical interoperability.
The Distillation Game: Adaptive Attacks & Efficient Defenses
Theory
Efficient ML
Large Language Models
- Introduces a minimax game framework for analyzing distillation attacks and defenses.
- Demonstrates a significant performance gap between adaptive and passive evaluation methods.
- Develops the Product-of-Experts (PoE) defense, which is computationally efficient and effective.
- Empirical results indicate that adaptive evaluation can increase student accuracy by approximately 50%.
Summary
This paper addresses the trade-off faced by model providers between the utility of rich outputs and the vulnerability to distillation attacks, which can exploit these outputs for imitation. The authors propose a game-theoretic framework that models the interaction between a utility-constrained teacher (model provider) and an adaptive student (attacker). The framework leads to the development of adaptive evaluation rules and a teacher-side defense strategy. A key contribution is the introduction of the Product-of-Experts (PoE) defense, which combines the teacher model with a proxy student model during output generation. Empirical results demonstrate that adaptive evaluation significantly enhances the effectiveness of attacks, revealing a substantial gap in performance compared to passive evaluation. The PoE defense is shown to be cost-effective, maintaining high-quality reasoning while reducing the computational overhead compared to existing state-of-the-art defenses. Overall, the findings suggest that defenses against distillation should be evaluated against adaptive students to better reflect real-world scenarios.
Methodology
The authors formulate a minimax game where the teacher selects a model output while the student adapts their strategy to focus on high-value examples. They derive adaptive evaluation rules and a defense template based on this framework, leading to the creation of the PoE defense strategy.
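A generic product-of-experts combination of two next-token distributions can be sketched as below. The summary does not give the paper's combination rule, so the log-linear form, the `weight` parameter, and its sign are assumptions; the example uses a negative weight only to show one possible effect.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def product_of_experts(teacher_logits, proxy_logits, weight):
    """p(y) proportional to p_teacher(y) * p_proxy(y) ** weight (hypothetical rule)."""
    t = log_softmax(teacher_logits)
    s = log_softmax(proxy_logits)
    combined = [a + weight * b for a, b in zip(t, s)]
    return [math.exp(v) for v in log_softmax(combined)]

# Toy logits over a three-token vocabulary (illustrative values).
teacher_logits = [2.0, 1.0, 0.0]
proxy_logits = [3.0, 0.0, 0.0]

p_teacher = [math.exp(v) for v in log_softmax(teacher_logits)]
p_unchanged = product_of_experts(teacher_logits, proxy_logits, weight=0.0)
p_defended = product_of_experts(teacher_logits, proxy_logits, weight=-0.5)
```

With `weight=0.0` the teacher's distribution is returned unchanged. With a negative weight, the token the proxy student already favors loses probability mass, which is one way a teacher-side defense could make outputs less useful for imitation. Each decoding step needs only one extra forward pass through the proxy model.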
Results
The study finds that adaptive evaluation significantly improves the performance of attacks, with students achieving much higher accuracy under adaptive evaluation than under passive evaluation on datasets like GSM8K and MATH. The PoE defense shows a smaller robustness gap than more expensive defenses while costing less to run.
Implications
The findings suggest that model providers need to consider the adaptive capabilities of potential attackers when designing defenses against distillation. The PoE defense offers a practical solution for maintaining model integrity while providing rich outputs, which could be crucial for applications in sensitive domains.