AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 72
Update frequency: 8h
Days of history: 7
Discovering Latent Groups for Robust Classification
Interpretability
Optimization
Theory
- The Neural Classification Trees (NCT) framework encodes subgroup structure in a tree architecture for robust classification.
- The model routes samples based on prediction correctness, preserving this structure for interpretability.
- NCT achieves competitive performance on benchmark datasets while isolating minority subgroups.
- The approach does not require subgroup annotations, making it more accessible for practical applications.
Summary
This paper addresses the challenge of spurious correlations in machine learning models, which often lead to high average accuracy but poor performance on underrepresented subgroups. The authors propose a novel framework called Neural Classification Trees (NCT) that incorporates subgroup structure into its architecture, allowing for robust classification without requiring subgroup supervision. NCT routes samples through a tree structure based on prediction correctness, effectively disentangling conflicting subgroups. The framework preserves the routing paths as part of the inference-time architecture, providing interpretability by mapping model decisions to latent group structures. The authors evaluate NCT on five benchmark datasets, demonstrating that it isolates minority subgroups effectively while achieving competitive robustness compared to state-of-the-art methods. The paper highlights the importance of interpretability in understanding model behavior and the potential of architectural approaches to improve robustness against spurious correlations.
Methodology
The authors introduce Neural Classification Trees (NCT), which utilize a tree-shaped architecture to classify samples based on their routing through 'easy' or 'hard' nodes determined by prediction correctness. This iterative routing creates a partition that is preserved for inference, allowing the model to maintain interpretability and robustness without requiring subgroup annotations. The depth of the tree is determined using a pseudo worst-group accuracy criterion.
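The routing idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the predictions are given as plain lists, and the group labels passed to `worst_group_accuracy` stand in for whatever pseudo-groups (for example, the easy/hard branch a sample landed in) are used when no subgroup annotations exist.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def route_by_correctness(preds: Sequence[int], labels: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Split sample indices into an 'easy' branch (predicted correctly)
    and a 'hard' branch (predicted incorrectly)."""
    easy = [i for i, (p, y) in enumerate(zip(preds, labels)) if p == y]
    hard = [i for i, (p, y) in enumerate(zip(preds, labels)) if p != y]
    return easy, hard


def worst_group_accuracy(preds: Sequence[int], labels: Sequence[int],
                         groups: Sequence[Hashable]) -> float:
    """Minimum per-group accuracy. `groups` may hold pseudo-group ids
    (an assumption here) rather than annotated subgroups."""
    correct: Dict[Hashable, int] = {}
    total: Dict[Hashable, int] = {}
    for p, y, g in zip(preds, labels, groups):
        total[g] = total.get(g, 0) + 1
        correct[g] = correct.get(g, 0) + int(p == y)
    return min(correct[g] / total[g] for g in total)


# Toy data: six samples, two illustrative groups.
preds = [1, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "b", "b", "b"]

easy, hard = route_by_correctness(preds, labels)
print(easy, hard)
print(worst_group_accuracy(preds, labels, groups))
```

In a tree, the same split would be applied again to the hard branch with a newly trained node classifier, and a criterion such as the worst-group accuracy above could decide when to stop growing.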
Results
NCT was evaluated on five benchmarks, showing that it effectively concentrates minority subgroups in hard branches (e.g., 82% for landbird-on-water, 73% for blond-male). The model matched or approached state-of-the-art worst-group accuracy against eight baseline methods, demonstrating its robustness and interpretability.
Implications
The findings suggest that architectural approaches like NCT can significantly enhance the interpretability and robustness of machine learning models, particularly in applications where subgroup performance is critical. This could have implications for fields such as healthcare, finance, and any domain where fairness and transparency in AI are essential.
Causal Gaussian Processes for Robust Treatment Effect Evaluation with Unobserved Confounding
Theory
- Introduces Causal Gaussian Processes (CGP) for evaluating treatment effects with unobserved confounding.
- Develops a universal discretization method for approximating causal models in continuous domains.
- Demonstrates the effectiveness of CGP in mitigating confounding bias in observational data.
- Provides a framework that requires only basic temporal ordering between treatment and outcome.
Summary
This paper addresses the challenge of evaluating causal treatment effects in the presence of unobserved confounding bias, which complicates the identification of causal effects from observational data. Traditional methods often require detailed prior knowledge or are limited to discrete treatments and outcomes. The authors propose a novel approach using Causal Gaussian Processes (CGP) that allows for robust treatment effect evaluation in continuous domains. They introduce a universal discretization method that approximates the observational and interventional distributions of any causal model with arbitrary accuracy using a finite number of latent states. This approach leverages the flexibility of Gaussian processes to model complex relationships while mitigating the impact of confounding bias. The paper demonstrates that the CGP models can effectively capture the underlying causal relationships even when faced with biased observational data, thus providing a more reliable framework for causal inference in practical applications.
Methodology
The authors propose a universal discretization technique that approximates the causal model's observational and interventional distributions using a finite number of latent states. They then develop Causal Gaussian Process models that utilize this discretization to learn from confounded observations, allowing for robust causal effect estimation.
Results
The results indicate that the proposed CGP models can accurately approximate the causal effects even in the presence of unobserved confounding, outperforming traditional Gaussian process regression methods that amplify biases. The paper provides empirical evidence demonstrating the effectiveness of the CGP approach in various scenarios.
Implications
This work has significant implications for policy evaluation and causal inference in fields where unobserved confounding is prevalent. The CGP framework can be applied in economics, healthcare, and social sciences, enabling more accurate assessments of treatment effects and informing better decision-making.
A Gated Graph Neural Network Approach to Fast-Convergent Dynamic Average Estimation
Graph Learning
Robotics
Optimization
- Introduction of a GGNN-based learning model for dynamic average estimation.
- Formal analysis of stability properties and incorporation of a regularization term.
- Development of an encoding-decoding mechanism to minimize communication overhead.
- Demonstration of superior performance compared to traditional model-based estimators.
Summary
This paper addresses the challenge of dynamic average estimation in multi-agent systems, where agents collaboratively estimate time-varying signals through local information exchange. Traditional model-based approaches often struggle with convergence speed and sensitivity to network topology changes. The authors propose a novel solution using Gated Graph Neural Networks (GGNNs) to achieve fast convergence in a fully distributed manner. The method models the estimation process as a distributed autoregressor, ensuring stability and rapid convergence. A regularization term is introduced during training to enforce convergence guarantees, and an encoding-decoding mechanism is implemented to reduce communication overhead while maintaining accuracy. Extensive numerical experiments demonstrate that the proposed GGNN-based approach significantly outperforms conventional model-based estimators in terms of both convergence speed and precision, making it a promising alternative for applications requiring dynamic average estimation.
Methodology
The authors utilize Gated Graph Neural Networks (GGNNs) to model the dynamic average estimation process as a distributed autoregressor. They introduce a regularization term during training to ensure convergence and stability, and an encoding-decoding mechanism to reduce communication overhead.
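For context, the kind of conventional model-based estimator that the learned approach is compared against can be sketched as the classical discrete-time dynamic average consensus update. This is the textbook baseline, not the GGNN method from the paper; the ring topology, step size `eps`, and signal values are illustrative assumptions.

```python
from typing import Dict, List


def consensus_step(x: List[float], r_prev: List[float], r_next: List[float],
                   neighbors: Dict[int, List[int]], eps: float) -> List[float]:
    """One round of classical dynamic average consensus.

    x[i]      : agent i's current estimate of the network-wide average
    r_prev[i] : agent i's local signal at the previous step
    r_next[i] : agent i's local signal at the current step
    Each agent only uses its neighbours' estimates and its own signal change.
    """
    new_x = []
    for i in range(len(x)):
        diffusion = sum(x[j] - x[i] for j in neighbors[i])
        new_x.append(x[i] + eps * diffusion + (r_next[i] - r_prev[i]))
    return new_x


# Ring of four agents (assumed topology); eps must stay below 1 / max degree.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
signals = [1.0, 2.0, 3.0, 6.0]  # held constant here, so the true average is 3.0

x = list(signals)  # initialise every estimate to the local signal
for _ in range(200):
    x = consensus_step(x, signals, signals, ring, eps=0.25)
print(x)
```

With constant signals the estimates settle on the average; with time-varying signals the update tracks it with a lag that depends on the topology and step size, which is the convergence-speed issue the summary refers to.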
Results
The proposed GGNN approach shows significant improvements in convergence speed and estimation precision compared to traditional model-based estimators. Numerical experiments validate the effectiveness of the method in various scenarios, demonstrating its robustness to network topology changes.
Implications
This research has potential applications in multi-agent systems such as autonomous drones, smart grids, and other distributed environments where real-time dynamic average estimation is critical for optimizing performance and efficiency.
Unsupervised Disentanglement Without Compromises: How Functional Orthogonality Enforces Identifiability
Theory
Generative Models
- Introduces functional orthogonality as a key property for unsupervised disentanglement.
- Proves that orthogonality leads to identifiability in nonlinear generative models without statistical independence.
- Empirical results confirm the effectiveness of orthogonality-regularized normalizing flows in recovering latent factors.
- Challenges existing impossibility claims regarding unsupervised disentanglement.
Summary
This paper investigates unsupervised disentangled representation learning by introducing a functional perspective on latent concepts. The authors define latent concepts as factors influencing observations through locally orthogonal directions, formalized as an orthogonality constraint on the Jacobian of the generative mapping. They demonstrate that this orthogonality condition ensures identifiability of nonlinear generative models without requiring statistical independence or causal assumptions, provided the latent domain encompasses all combinations of factor values. The authors conduct experiments using orthogonality-regularized normalizing flows, which empirically validate their theoretical claims by reliably recovering ground-truth factors. This work challenges the prevailing belief that unsupervised disentanglement is fundamentally impossible, offering a principled alternative foundation that emphasizes functional orthogonality over statistical independence. The findings also provide insights into the success of Variational Autoencoders (VAEs) in promoting disentangled representations, suggesting a unified framework for understanding disentanglement and causality.
Methodology
The authors formalize the concept of functional orthogonality by imposing an orthogonality constraint on the Jacobian of the generative mapping. They utilize orthogonality-regularized normalizing flows to empirically validate their theoretical framework, demonstrating the recovery of ground-truth factors in a fully unsupervised setting.
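One way to picture an orthogonality constraint on a Jacobian is as a penalty on the inner products between its columns. The sketch below is an assumption-laden illustration rather than the paper's regularizer: it uses a finite-difference Jacobian, a hypothetical `orthogonality_penalty` function, and two toy mappings (polar coordinates, whose Jacobian columns are orthogonal, and a shear, whose columns are not).

```python
import math
from typing import Callable, List, Sequence

Mapping = Callable[[Sequence[float]], List[float]]


def numerical_jacobian(f: Mapping, z: Sequence[float], h: float = 1e-5) -> List[List[float]]:
    """Central-difference Jacobian with J[i][j] = d f_i / d z_j."""
    m, n = len(f(z)), len(z)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += h
        z_minus[j] -= h
        f_plus, f_minus = f(z_plus), f(z_minus)
        for i in range(m):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * h)
    return jac


def orthogonality_penalty(f: Mapping, z: Sequence[float]) -> float:
    """Sum of squared inner products between distinct Jacobian columns.
    Zero exactly when the latent directions act orthogonally at z."""
    jac = numerical_jacobian(f, z)
    n = len(z)
    penalty = 0.0
    for a in range(n):
        for b in range(a + 1, n):
            dot = sum(row[a] * row[b] for row in jac)
            penalty += dot * dot
    return penalty


def polar(z: Sequence[float]) -> List[float]:
    """(radius, angle) -> (x, y); the two factors move the output orthogonally."""
    r, theta = z
    return [r * math.cos(theta), r * math.sin(theta)]


def shear(z: Sequence[float]) -> List[float]:
    """A linear map whose two input directions overlap in output space."""
    return [z[0] + z[1], z[1]]


print(orthogonality_penalty(polar, [2.0, 0.7]))  # close to zero
print(orthogonality_penalty(shear, [2.0, 0.7]))  # close to one
```

In a trained model the same quantity would typically be computed with automatic differentiation and added to the training loss with a weight.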
Results
The experiments show that the proposed orthogonality constraint allows for reliable recovery of latent factors, supporting the theoretical claims that meaningful disentanglement can occur without the need for statistical independence or supervision. The findings also elucidate the mechanisms behind the effectiveness of VAEs in generating disentangled representations.
Implications
This work has significant implications for the field of representation learning, suggesting that models can achieve meaningful disentanglement without relying on strict independence assumptions. It opens avenues for further research into functional approaches to disentanglement and may influence the design of future generative models.
PaAno+: Multiscale Encoding and Cross-Variable Attention for Time Series Anomaly Detection
Time Series
- Introduces PaAno+, a lightweight model for time series anomaly detection.
- Utilizes multiscale feature extraction and cross-variable attention to improve anomaly detection accuracy.
- Implements a novel self-supervised learning task for better feature representation.
- Demonstrates state-of-the-art performance on the TSB-AD benchmark.
Summary
The paper presents PaAno+, a novel lightweight model designed for time series anomaly detection, addressing the limitations of existing methods that either incur high computational costs or fail to adequately capture multivariate dependencies. The model employs a patch-oriented representation learning approach, incorporating a multiscale feature extraction backbone that utilizes convolutional kernels with varying receptive fields to effectively capture hierarchical temporal characteristics. Additionally, it integrates cross-scale adaptive attention aggregation and a cross-variable fusion attention module to enhance the model's ability to identify anomalies in complex operational conditions. A unique self-supervised learning task based on temporal patch-window sorting is introduced to reveal the intrinsic structural properties of time series data, while triplet loss is used to optimize the patch embedding space for improved feature discrimination. Experimental results on the TSB-AD benchmark demonstrate that PaAno+ achieves state-of-the-art detection accuracy for both univariate and multivariate tasks, significantly outperforming previous models while maintaining computational efficiency suitable for real-time applications.
Methodology
The methodology involves a patch-oriented representation learning framework with a multiscale feature extraction backbone using convolutional kernels of varying sizes. It incorporates cross-scale adaptive attention aggregation and a cross-variable fusion attention module to capture inter-variable correlations. A self-supervised learning task based on temporal patch-window sorting is employed, along with triplet loss for optimizing the feature embedding space.
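The triplet loss mentioned above has a standard form: pull an anchor embedding towards a positive and push it away from a negative by at least a margin. The sketch below shows that standard form only; the Euclidean distance, the margin of 1.0, and the two-dimensional toy embeddings are assumptions, and how PaAno+ selects its patches is not shown.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]

# A negative that is already far away contributes no loss.
print(triplet_loss(anchor, positive, [3.0, 4.0]))
# A negative that sits almost as close as the positive is penalised.
print(triplet_loss(anchor, positive, [0.0, 1.5]))
```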
Results
PaAno+ achieved state-of-the-art detection accuracy on the TSB-AD benchmark for both univariate and multivariate anomaly detection tasks, showing significant performance improvements across various evaluation metrics compared to previous models.
Implications
The proposed model's efficiency and accuracy make it suitable for real-time anomaly detection in critical applications such as industrial monitoring and medical diagnostics, particularly in environments with limited computational resources.
Learning by Shifting: Temporal View Construction for Time Series Contrastive Learning
Time Series
- ShiFT introduces a deterministic view construction method that encodes temporal shift invariance.
- The approach outperforms complex augmentation-based methods while reducing training time.
- Empirical analysis reveals the impact of batch size and negative samples on representation quality.
- ShiFT achieves state-of-the-art results on multiple large-scale time series datasets.
Summary
This paper addresses the challenges of supervised learning in time series analysis, which often requires extensive labeled data that is costly and time-consuming to obtain. The authors propose a self-supervised learning approach called Shift Invariant Feature Training (ShiFT), which focuses on contrastive learning for time series data. Unlike existing methods that rely on complex augmentations and domain-specific heuristics, ShiFT utilizes a simple, deterministic view construction method that encodes temporal shift invariance. By generating positive pairs through temporally shifted windows of time series data, ShiFT effectively captures the underlying phenomena without introducing spurious correlations. The authors validate their approach on six large-scale time series datasets, achieving state-of-the-art performance while also reducing training time. Additionally, they provide insights into the dynamics of contrastive learning in time series, exploring the effects of batch size and the number of negative samples on performance. The findings suggest that a straightforward view construction can yield high-quality representations, making ShiFT a promising framework for time series representation learning.
Methodology
The authors developed ShiFT by generating positive pairs from time series data through a deterministic process of splitting sequences into temporally shifted windows. This method preserves the structural properties of the signals and encourages the model to learn representations that are invariant to temporal translations. The authors conducted experiments on various datasets to evaluate the performance of ShiFT compared to existing methods.
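A deterministic view construction of this kind can be sketched as slicing each sequence into a window and a copy of that window offset by a fixed shift. The function name, the non-overlapping stride, and the parameter values below are illustrative assumptions; the paper's exact windowing scheme may differ.

```python
from typing import List, Sequence, Tuple


def shifted_window_pairs(series: Sequence[float], window: int,
                         shift: int) -> List[Tuple[List[float], List[float]]]:
    """Build positive pairs (view, shifted view) with no randomness.

    Each pair holds a window starting at `start` and a second window of the
    same length starting `shift` steps later. Starts advance by `window`.
    """
    pairs = []
    start = 0
    while start + shift + window <= len(series):
        first = list(series[start:start + window])
        second = list(series[start + shift:start + shift + window])
        pairs.append((first, second))
        start += window
    return pairs


for view, shifted in shifted_window_pairs(list(range(10)), window=4, shift=2):
    print(view, shifted)
```

In a contrastive setup, the two windows of a pair would be treated as positives and windows from other sequences in the batch as negatives.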
Results
ShiFT demonstrated state-of-the-art performance across six large-scale time series benchmark datasets and the UCR and UEA archives. It achieved the fastest training times while maintaining the best average ranks on downstream tasks, indicating that the simplicity of the view construction does not compromise the quality of the learned representations.
Implications
The findings from this research suggest that self-supervised learning can be effectively applied to time series data without the need for complex augmentations. This has implications for various domains where time series data is prevalent, such as healthcare, robotics, and industrial monitoring, enabling more efficient and scalable representation learning.
Understanding Key Features of Time Series Foundation Models from Epidemic Forecasting
Time Series
- A systematic evaluation of forecasting models for influenza using ILI and hospitalization data.
- Mixture-of-experts models outperform other architectures, indicating the benefit of diverse pretrained representations.
- Numerical transformer-based models are reliable, especially with appropriate pretraining.
- LLM-based forecasting methods are less effective compared to traditional numerical approaches.
Summary
This paper addresses the critical need for accurate short-term forecasting of seasonal influenza, which affects millions and poses significant public health challenges in the U.S. The authors conduct a systematic evaluation of various forecasting models using influenza-like illness (ILI) and hospitalization time series data. They compare classical neural networks, transformer-based models, pretrained time series foundation models, and large language model (LLM)-based approaches under both temporal and spatial generalization settings for 1-4 week ahead predictions. The study finds that a mixture-of-experts model, which integrates multiple pretrained forecasters, yields the best performance, highlighting the value of heterogeneous pretrained representations. Additionally, numerical transformer-based models demonstrate reliability, particularly when pretrained on data aligned with influenza dynamics. The study also reveals that LLM-based methods underperform compared to numerical forecasters. The incorporation of hospitalization data as an auxiliary covariate enhances forecasting robustness in certain scenarios. Overall, the findings provide actionable insights for model selection, pretraining strategies, and the use of auxiliary signals in influenza surveillance and preparedness.
Methodology
The authors compiled and standardized weekly ILI and hospitalization time series data at the U.S. HHS-region level. They evaluated 17 deep forecasting models under two generalization regimes: temporal (within-region) and spatial (across-region), focusing on 1-4 week ahead predictions. The evaluation employed consistent preprocessing and training pipelines, reporting metrics such as Mean Squared Error (MSE) and Normalized Root Mean Squared Error (NNSE).
Results
The study demonstrated that the mixture-of-experts model achieved the highest performance across forecasting tasks. Numerical transformer-based models provided reliable forecasts, particularly benefiting from pretraining aligned with influenza dynamics. In contrast, LLM-based methods showed inferior performance. The use of hospitalization data as an auxiliary covariate led to improvements in specific forecasting scenarios.
Implications
The findings suggest that public health agencies can enhance their influenza forecasting efforts by selecting appropriate models and leveraging auxiliary data. The insights gained from this study can inform vaccination strategies, hospital resource allocation, and overall epidemic preparedness.
MedTS-TTT: Test-Time Training for Medical Time Series Classification
Time Series
- MedTS-TTT enables online adaptation from unlabeled test samples, making it suitable for real-world clinical applications.
- The framework utilizes CLSA-TTT for efficient single-step fast-weight updates, avoiding the computational overhead of iterative optimization.
- MedTS-TTT achieved 11 top-1 rankings out of 12 evaluations across multiple metrics, showcasing its effectiveness.
- The Gated Convolutional Backbone enhances the model's ability to manage local dynamics and information flow in medical time series data.
Summary
The paper introduces MedTS-TTT, a novel test-time training framework specifically designed for medical time series classification, addressing the challenges posed by subject-level heterogeneity and distribution shifts in clinical data. Traditional methods often struggle with generalization to unseen individuals due to fixed parameter sets, while domain adaptation techniques can be cumbersome and require additional components. MedTS-TTT leverages Closed-Loop Self-Alignment Test-Time Training (CLSA-TTT) to facilitate rapid sample-wise adaptation without the need for iterative inner-loop optimization. The framework employs a Gated Convolutional Backbone (GCB) that integrates fast adaptation and token-level fusion, effectively balancing local dynamic modeling and information flow control. The proposed method was evaluated on four public datasets (two EEG and two ECG) with subject-independent splits, demonstrating superior performance and robustness against subject-level distribution shifts compared to nine baseline methods.
Methodology
MedTS-TTT is built on CLSA-TTT, which constructs a token-level self-supervised target and performs a single-step fast-weight update for intra-layer closed-loop alignment. A spatiotemporal tokenizer converts multi-channel medical signals into a token sequence, which is processed by the GCB that combines fast adaptation and token-level fusion.
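To make "single-step fast-weight update" concrete, the sketch below applies one gradient step to a small linear map so that its output moves towards a self-supervised target. It is a generic illustration of the idea, not CLSA-TTT itself: the squared-error objective, the learning rate, and the toy token and target are all assumptions.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matvec(weights: Matrix, x: Sequence[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def squared_error(weights: Matrix, x: Sequence[float], target: Sequence[float]) -> float:
    """Loss 0.5 * ||W x - target||^2 for one token."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(matvec(weights, x), target))


def fast_weight_step(weights: Matrix, x: Sequence[float],
                     target: Sequence[float], lr: float) -> Matrix:
    """One gradient step on the loss above; the gradient is (W x - target) x^T.
    No inner optimisation loop is run, only this single update."""
    residual = [p - t for p, t in zip(matvec(weights, x), target)]
    return [[w - lr * r * v for w, v in zip(row, x)]
            for row, r in zip(weights, residual)]


slow_weights = [[1.0, 0.0], [0.0, 1.0]]  # weights learned during training
token = [1.0, 2.0]                       # one unlabeled test-time token
target = [2.0, 1.0]                      # hypothetical self-supervised target

adapted = fast_weight_step(slow_weights, token, target, lr=0.1)
print(squared_error(slow_weights, token, target))
print(squared_error(adapted, token, target))
```

The appeal of a single step is cost: adaptation adds roughly one extra forward and backward pass per sample instead of an iterative optimisation.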
Results
MedTS-TTT ranked first in 11 of 12 evaluations (three metrics on each of four public datasets, two EEG and two ECG) against nine baselines, indicating improved robustness and practicality under subject-level distribution shifts.
Implications
The proposed framework has the potential to enhance the accuracy and reliability of medical time series classification in clinical settings, facilitating better decision-making and patient monitoring without the need for extensive labeled data.
DevoTG: Temporal Graph Neural Networks for Modeling C. elegans Developmental Connectomics
Graph Learning
Time Series
- Introduction of DevoTG, a framework for analyzing C. elegans neural development using temporal graph methods.
- Significant improvement in lineage prediction accuracy using TGNs compared to static GNNs.
- Identification of three classes of synaptic connection stability, enhancing understanding of neural connectivity dynamics.
- Provision of interactive visualizations to aid in biological hypothesis generation.
Summary
The paper presents DevoTG, a novel framework utilizing Temporal Graph Neural Networks (TGNs) to model the developmental connectomics of the nematode C. elegans. This framework integrates two representations of neural development: a Continuous-Time Dynamic Graph (CTDG) capturing cell division events and a Discrete-Time Dynamic Graph (DTDG) representing the evolving synaptic connectome across developmental stages. The authors demonstrate that their TGN outperforms a static Graph Neural Network (GNN) in lineage prediction tasks, achieving a mean test AUC of 0.839, significantly higher than the static GNN's 0.577. Additionally, DevoTG identifies three classes of synaptic connection stability (stable, developmental, and variable) across 225 neurons, providing insights into the dynamic nature of neural connectivity. The framework also includes interactive visualizations to facilitate biological hypothesis generation, making it a valuable tool for studying developmental neuroscience. The open-source nature of DevoTG allows for its extension to other developing nervous systems.
Methodology
The methodology involves the application of Temporal Graph Neural Networks to two dynamic graph representations: a Continuous-Time Dynamic Graph (CTDG) for modeling cell lineage and a Discrete-Time Dynamic Graph (DTDG) for analyzing synaptic connectivity. The TGN learns from timestamped events to predict lineage outcomes and analyze connection stability across developmental stages.
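The two graph representations differ mainly in how time is stored: a continuous-time graph keeps a list of timestamped events, while a discrete-time graph keeps one edge set per stage. The sketch below illustrates both data structures and a simple binning between them. The node names, timestamps, and bin width are made up for illustration and do not come from the C. elegans data.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Event = Tuple[float, str, str]  # (timestamp, source node, target node)

# Continuous-time view: every interaction is a separate timestamped event.
events: List[Event] = [
    (1.0, "A", "B"),
    (2.5, "A", "C"),
    (2.7, "B", "C"),
    (4.1, "C", "D"),
]


def to_snapshots(stream: List[Event], bin_width: float) -> Dict[int, Set[Tuple[str, str]]]:
    """Discrete-time view: group events into fixed-width time bins and
    keep the set of edges observed inside each bin."""
    snapshots: Dict[int, Set[Tuple[str, str]]] = defaultdict(set)
    for timestamp, source, target in stream:
        snapshots[int(timestamp // bin_width)].add((source, target))
    return dict(snapshots)


for stage, edges in sorted(to_snapshots(events, bin_width=2.0).items()):
    print(stage, sorted(edges))
```

Comparing which edges persist across snapshots is the kind of analysis that a stability classification could build on.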
Results
The TGN achieved a mean test AUC of 0.839 ± 0.007 for lineage prediction, outperforming a static GNN by 26 AUC points. The analysis of the DTDG revealed three classes of synaptic connections, providing a temporal perspective on neural development. Interactive visualizations were created to illustrate the developmental dynamics.
Implications
DevoTG has significant implications for developmental neuroscience, offering a robust framework for understanding neural circuit formation and connectivity dynamics. Its open-source nature allows for broader applications in studying other developing nervous systems, potentially leading to new insights in neurodevelopmental research.
Short-Term Electricity Demand Forecasting for New England Using a Hybrid Transformer-XGBoost Framework with Weather, Calendar, and COVID-19 Indicators
Time Series
- The hybrid Transformer-XGBoost framework outperforms a tabular-only XGBoost model in short-term electricity demand forecasting (test RMSE of 8,876 MWh versus 9,304 MWh).
- COVID-19 indicators initially improved model accuracy but became less relevant as behavioral adaptations occurred post-pandemic.
- Hyperparameter optimization using Optuna enhanced the model's performance through efficient search strategies.
- The study emphasizes the importance of considering temporal validity decay in forecasting models affected by structural changes in demand patterns.
Summary
This paper addresses the challenge of accurate short-term electricity demand forecasting in New England, which is crucial for effective power system operation and market planning. The authors propose a hybrid framework that combines a Transformer encoder for temporal feature extraction with XGBoost, a gradient-boosted decision tree model, to forecast daily electricity demand. The model incorporates a variety of features, including meteorological data from six cities across New England, calendar effects, autoregressive demand lags, and COVID-19 indicators. Hyperparameter optimization is performed using Optuna, resulting in a robust model that achieves a test RMSE of 8,876 MWh, MAPE of 2.05%, and R-squared of 0.906. The study also conducts an ablation analysis to assess the impact of COVID-19 features, revealing that while these features improved training accuracy, their predictive value diminished over time as behavioral adaptations occurred post-pandemic. The findings highlight the importance of temporal validity decay in forecasting models, suggesting that reliance on outdated pandemic patterns can lead to overfitting and reduced accuracy in predictions.
Methodology
The authors developed a hybrid forecasting model that utilizes a Transformer encoder for extracting temporal features from historical demand data, which is then combined with a rich set of engineered features (including weather, calendar, and COVID-19 indicators) and fed into an XGBoost model. Hyperparameter optimization was conducted using Optuna, employing a Bayesian optimization approach to fine-tune model parameters across 500 trials.
Results
The hybrid model achieved a test RMSE of 8,876 MWh, MAPE of 2.05%, and R-squared of 0.906, outperforming a baseline XGBoost model with RMSE of 9,304 MWh and MAPE of 2.21%. An ablation study indicated that removing COVID-19 features decreased the hybrid model's RMSE by 3.2%, while marginally improving the XGBoost-only model by 1.2%. The Diebold-Mariano test confirmed that the performance difference was statistically insignificant.
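For readers who want to reproduce the reported metrics on their own forecasts, the standard definitions of RMSE, MAPE, and R-squared can be written as below. The toy demand values are invented for illustration and are unrelated to the New England data.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, in the units of the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent. Requires non-zero targets."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


actual = [100.0, 200.0, 300.0, 400.0]    # illustrative values only
forecast = [110.0, 190.0, 310.0, 390.0]

print(rmse(actual, forecast))
print(mape(actual, forecast))
print(r_squared(actual, forecast))
```

Applied to the figures reported above, the gap between the two models is (9,304 - 8,876) / 9,304, or roughly a 4.6% lower RMSE for the hybrid model.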
Implications
The findings suggest that while hybrid models can enhance forecasting accuracy, it is crucial to continuously evaluate the relevance of features, especially those influenced by external events like the COVID-19 pandemic. This research can inform energy market operators and policymakers about the dynamics of electricity demand and the importance of adapting forecasting models to changing behavioral patterns.
Distribution-Aware Diffusion-LLM for Robust Ultra-Long-Term Time Series Forecasting
Time Series
Large Language Models
Generative Models
- Introduction of Diffusion-LLM framework that combines LLMs with conditional diffusion models for time series forecasting.
- Improvement in multimodal alignment and probabilistic modeling through a shared latent space.
- Significant performance gains in ultra-long-term and few-shot forecasting across multiple benchmarks.
- Demonstration of DDPMs as effective regularizers for enhancing LLM robustness.
Summary
This paper presents a novel framework called Diffusion-LLM, which integrates a conditional diffusion model into a Large Language Model (LLM) pipeline for time series forecasting. The authors address the limitations of existing LLMs in handling multimodal data and probabilistic modeling, particularly for ultra-long-term forecasting tasks. By embedding both inputs and targets into a shared latent space, the Diffusion-LLM framework enhances the model's ability to learn the conditional distribution of future data while improving semantic alignment. The methodology involves a Denoising Diffusion Probabilistic Model (DDPM) that acts as an implicit regularizer, allowing for better multimodal alignment and robust forecasting. The authors evaluate their approach on six long-term forecasting benchmarks, demonstrating significant improvements over existing LLM-based methods, particularly in ultra-long-term and few-shot forecasting scenarios. The results indicate that the integration of distribution-aware regularization enhances the robustness and generalization capabilities of LLMs in time series forecasting.
Methodology
The Diffusion-LLM framework employs a Denoising Diffusion Probabilistic Model (DDPM) integrated with a Large Language Model (LLM). It utilizes a reprogramming strategy to embed time series data into a shared token space, allowing for joint training of the LLM and DDPM. The DDPM estimates the conditional distribution of forecast embeddings based on the input lookback window, providing a distribution-aware signal that regularizes the LLM.
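The diffusion component builds on the standard DDPM forward (noising) process, in which a clean vector is mixed with Gaussian noise according to a cumulative schedule. The sketch below shows only that generic forward step under a linear beta schedule; the schedule endpoints, the number of steps, and the toy embedding are assumptions, and the conditioning on the lookback window and the LLM coupling are not modelled.

```python
import math
import random
from typing import List, Sequence


def linear_betas(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Linearly spaced noise variances, one per diffusion step."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas: Sequence[float]) -> List[float]:
    """Cumulative products of (1 - beta_t); the fraction of signal variance kept."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0: Sequence[float], t: int, abar: Sequence[float],
             noise: Sequence[float]) -> List[float]:
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    keep = math.sqrt(abar[t])
    spread = math.sqrt(1.0 - abar[t])
    return [keep * x + spread * e for x, e in zip(x0, noise)]


abar = alpha_bars(linear_betas(1000))
embedding = [1.0, -2.0, 0.5]  # stand-in for a forecast embedding
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in embedding]

print(q_sample(embedding, 10, abar, noise))   # still close to the clean embedding
print(q_sample(embedding, 999, abar, noise))  # almost pure noise
```

A denoising network is then trained to predict the added noise from the noisy vector and the step index, which is where a conditioning signal would enter.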
Results
The proposed Diffusion-LLM framework outperformed existing LLM-based baselines across six long-term forecasting benchmarks, including ETT, Weather, and ECL. The results showed notable improvements in ultra-long-term forecasting capabilities and enhanced performance in few-shot learning scenarios, demonstrating the effectiveness of distribution-aware regularization.
Implications
The findings suggest that integrating diffusion models with LLMs can significantly enhance the robustness and generalization of time series forecasting models. This approach could be applied in various domains requiring long-term predictions, such as energy systems, healthcare, and climate science, where accurate forecasting from limited historical data is critical.
UltraQuant: 4-bit KV Caching for Context-Heavy Agents
Large Language Models
Efficient ML
NLP
- UltraQuant improves 4-bit KV caching for context-heavy agents, addressing memory pressure and GPU utilization.
- The method incorporates practical design choices, including asymmetric key/value treatment and optimized decode-attention kernels.
- UltraQuant achieves a 3.47× reduction in time-to-first-token during late rounds and a 1.63× increase in output throughput over FP8 KV caching.
- The approach emphasizes the importance of serving efficiency metrics in evaluating KV-cache performance.
UltraQuant: 4-bit KV Caching for Context-Heavy Agents
Summary
The paper presents UltraQuant, a novel approach to 4-bit key-value (KV) caching designed for context-heavy agents that require efficient memory management during multi-turn interactions. The authors identify the challenges posed by long context lengths and high concurrency in serving systems, which can lead to inefficient GPU utilization. They propose a framework for 4-bit KV caching that balances task quality, cache residency, and serving throughput. Key contributions include the introduction of practical design choices for robust 4-bit caching, optimizations for AMD GPUs, and a focus on serving efficiency metrics. The UltraQuant method employs TurboQuant-style rotation and codebook quantization while introducing FP4 approximations to enhance performance. Experimental results demonstrate that UltraQuant significantly reduces time-to-first-token (TTFT) in cache-pressured scenarios and improves output throughput compared to the FP8 KV baseline, showcasing its effectiveness in managing long-context workloads.
Methodology
The authors utilize TurboQuant-style rotation and codebook quantization as a foundation for their 4-bit KV caching approach. They implement practical design choices such as asymmetric treatment of keys and values, Walsh-Hadamard rotation, and block-scale variants. The UltraQuant method leverages optimized decode-attention kernels and FP4 approximations on AMD GPUs to enhance performance and reduce latency.
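The rotate-then-quantize idea can be sketched as follows. This is a minimal example that assumes a plain symmetric integer 4-bit quantizer with one scale per block of 32 values; it stands in for, and is not, the codebook quantization or FP4 approximation the paper describes. The function names and the block size are illustrative.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis (power-of-two length)."""
    x = np.array(x, dtype=np.float64)
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        y = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a = y[..., 0, :]
        b = y[..., 1, :]
        x = np.stack([a + b, a - b], axis=-2).reshape(*x.shape[:-1], n)
        h *= 2
    return x / np.sqrt(n)

def quantize_int4(x, block=32):
    """Symmetric 4-bit quantization with one scale per block of `block` values."""
    blocks = x.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero for all-zero blocks
    codes = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize_int4(codes, scale, shape):
    return (codes * scale).reshape(shape)

rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 64))       # hypothetical key vectors, head dimension 64
rotated = fwht(keys)                      # rotation spreads outliers across coordinates
codes, scale = quantize_int4(rotated)
recovered = fwht(dequantize_int4(codes, scale, rotated.shape))  # the transform is its own inverse
```

Because the normalized transform is orthogonal, the quantization error measured after rotating back has the same norm as the error in the rotated space.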
Results
UltraQuant demonstrates a 3.47× improvement in time-to-first-token (TTFT) during late rounds of multi-turn interactions and a 2.3× improvement across all rounds compared to the FP8 KV baseline. Additionally, it increases output throughput by 1.63×, particularly benefiting from better cache residency under high concurrency.
Implications
The findings suggest that UltraQuant can significantly enhance the performance of context-heavy agents in applications requiring long-running memory and high concurrency, such as conversational AI and complex task execution. This approach could lead to more efficient use of GPU resources and improved user experiences in interactive systems.
Quantum-classical physics-informed Kolmogorov-Arnold networks for PDEs
Theory
Efficient ML
- Introduction of QCPIKAN, the first quantum-classical physics-informed Kolmogorov-Arnold network for PDEs.
- Theoretical proof of accelerated convergence rates and reduced numerical dispersion.
- Validation across three seepage scenarios in porous media demonstrates superior performance.
- Outperforms existing models in accuracy, error control, and dynamic tracking.
Quantum-classical physics-informed Kolmogorov-Arnold networks for PDEs
Summary
This paper introduces QCPIKAN, a novel quantum-classical physics-informed Kolmogorov-Arnold network designed to effectively solve partial differential equations (PDEs). The framework integrates Chebyshev-polynomial KAN layers with parameterized quantum circuits, embedding physical constraints directly into the training loss to ensure physical consistency. The authors provide theoretical foundations based on approximation theory, demonstrating that this architecture significantly accelerates the convergence of high-frequency errors to an exponential rate while reducing numerical dispersion. The performance of QCPIKAN is validated through three typical seepage scenarios in porous media: single-phase flow, component transport, and two-phase flow. Results indicate that QCPIKAN outperforms existing quantum-classical physics-informed neural networks in terms of global prediction accuracy, local error control, dynamic evolution tracking, and displacement front localization. This work presents a robust and efficient alternative for addressing complex PDEs, showcasing the potential of combining quantum computing principles with classical neural network architectures.
Methodology
The QCPIKAN framework utilizes Chebyshev-polynomial KAN layers combined with parameterized quantum circuits. Physical constraints are embedded in the training loss function, allowing the network to adhere to physical laws while learning from data. The theoretical analysis is grounded in approximation theory to establish convergence properties.
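A Chebyshev-polynomial KAN layer of the general kind mentioned here can be sketched as below. The tanh squashing of inputs into [-1, 1], the coefficient layout, and the function names are assumptions for illustration; the quantum circuit and the physics-informed loss are not modeled.

```python
import numpy as np

def chebyshev_basis(x, degree):
    """Evaluate T_0..T_degree at x in [-1, 1]; output shape is x.shape + (degree + 1,)."""
    T = [np.ones_like(x), x]
    for _ in range(2, degree + 1):
        T.append(2.0 * x * T[-1] - T[-2])   # recurrence T_k = 2 x T_{k-1} - T_{k-2}
    return np.stack(T[: degree + 1], axis=-1)

def cheb_kan_layer(x, coeffs):
    """x: (batch, in_dim); coeffs: (in_dim, out_dim, degree + 1) learnable coefficients."""
    degree = coeffs.shape[-1] - 1
    basis = chebyshev_basis(np.tanh(x), degree)   # squash inputs into [-1, 1]
    # Each edge (i, o) applies its own univariate polynomial; outputs sum over inputs.
    return np.einsum("bid,iod->bo", basis, coeffs)
```

Each input-output edge carries its own learnable univariate function, here a Chebyshev expansion, which is what distinguishes a KAN layer from a linear layer followed by a fixed activation.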
Results
QCPIKAN demonstrated improved global prediction accuracy and local error control across various seepage scenarios compared to existing models. The framework effectively tracked dynamic evolutions and localized displacement fronts, showcasing its robustness and efficiency in solving complex PDEs.
Implications
The development of QCPIKAN has significant implications for scientific computing, particularly in fields requiring the solution of complex PDEs, such as fluid mechanics, bioengineering, and subsurface flow simulations. It opens avenues for further research into hybrid quantum-classical approaches in machine learning.
Reward-free Pretraining for Reinforcement Learning via Occupancy Coverage Maximization
Reinforcement Learning
- ROVER maximizes state-space coverage for effective exploration in sparse-reward environments.
- The method employs a learned resolvent world model to estimate occupancy, addressing common estimation challenges.
- Introduction of a virtual 'sink' state stabilizes learning by managing unsupported state-action regions.
- Empirical results show ROVER achieves superior coverage and initialization compared to traditional reward-free methods.
Reward-free Pretraining for Reinforcement Learning via Occupancy Coverage Maximization
Summary
This paper addresses the challenge of sparse rewards in reinforcement learning (RL) by proposing a novel method for pretraining exploration policies without the need for reward signals. The authors introduce ROVER (Reward-free pretraining via Occupancy coVERage maximization), which aims to maximize state-space coverage through an occupancy measure. This method is framed as an entropy maximization problem and utilizes a learned resolvent world model to estimate occupancy, thus avoiding common issues in density and entropy estimation. A key innovation of ROVER is the introduction of a virtual 'sink' state that helps balance the exploration of known states with the expansion into unexplored regions, preventing cyclic behaviors during learning. The authors demonstrate that ROVER outperforms standard reward-free baselines in both tabular and pixel-based sparse navigation tasks, leading to more uniform coverage and better initializations for downstream tasks. Overall, this work provides a robust framework for reward-free policy pretraining that is particularly beneficial in multi-task, meta-, and continual learning scenarios where rewards are sparse or unavailable.
Methodology
The authors formalize occupancy coverage as a target-free objective in Reproducing Kernel Hilbert Space (RKHS) and implement it in ROVER. The algorithm learns a representation, estimates the kernel mean embedding of the occupancy measure, and computes policy gradients for its squared norm. The policy improvement is achieved through policy mirror descent, with the addition of a virtual sink state to manage unsupported regions.
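To make the squared-norm quantity concrete: for an empirical kernel mean embedding, the squared RKHS norm equals the average of the kernel matrix over all pairs of visited states. The sketch below assumes an RBF kernel on raw 2-D states and compares clustered with spread-out visitation; the learned representation, the resolvent world model, and the sink state are not modeled, and the data is synthetic.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def embedding_sq_norm(states, bandwidth=1.0):
    """||mu||^2 for the empirical kernel mean embedding mu = mean_i k(s_i, .)."""
    return float(rbf_kernel(states, states, bandwidth).mean())

rng = np.random.default_rng(0)
clustered = 0.05 * rng.standard_normal((200, 2))   # agent stays near the start state
spread = rng.uniform(-1.0, 1.0, size=(200, 2))     # agent visits the whole square

norm_clustered = embedding_sq_norm(clustered)
norm_spread = embedding_sq_norm(spread)
```

Under this kernel, spread-out visitation gives a smaller squared norm than clustered visitation, which is presumably the direction in which a coverage objective pushes the policy.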
Results
Experiments conducted in both tabular and pixel-based sparse navigation tasks demonstrate that ROVER produces more uniform aggregate coverage and provides stronger initializations for downstream tasks compared to standard reward-free baselines.
Implications
The findings suggest that ROVER could significantly enhance the training of reinforcement learning agents in environments with sparse rewards, making it particularly useful for applications in multi-task and continual learning scenarios. This method could lead to more efficient exploration strategies and improved performance in complex RL tasks.
A Reward-Petri-Net Interpretation of Temporal Behavior Trees
Reinforcement Learning
Robotics
Theory
- Introduces a method to interpret Temporal Behavior Trees as Reward-Petri-Nets for reinforcement learning.
- Demonstrates how TBTs can improve reward function design for complex robotic tasks with temporal constraints.
- Shows that TBT-based rewards enhance sample efficiency and learning in challenging environments.
- Provides a systematic way to assign rewards based on user-defined task importance and structure.
A Reward-Petri-Net Interpretation of Temporal Behavior Trees
Summary
This paper presents a novel approach to reinforcement learning (RL) by interpreting Temporal Behavior Trees (TBTs) as Reward-Petri-Nets (RPNs). TBTs enhance conventional behavior trees by integrating temporal properties into their structure, allowing for the representation of complex robotic tasks with hierarchical and temporal constraints. The authors propose a method to automatically derive reward functions from TBT specifications, which are expressed using Linear Temporal Logic. By translating TBTs into RPNs, the paper demonstrates how rewards can be systematically assigned based on the TBT's structure, thereby improving the learning process in RL. The experimental results show that TBT-based rewards significantly enhance learning efficiency in challenging environments where traditional RL methods struggle, providing intuitive control over the learning process and enabling better exploration of state spaces. The findings suggest that this approach can effectively address the challenges of long-horizon sparse reward environments in RL.
Methodology
The authors translate Temporal Behavior Trees into Reward-Petri-Nets, allowing for the automatic assignment of rewards based on the structure of the TBTs. They utilize Linear Temporal Logic to specify temporal constraints in the leaf nodes of TBTs, which are then encoded in the RPNs. The methodology includes a series of experiments in various environments to evaluate the effectiveness of TBT-based rewards compared to traditional RL approaches.
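The summary does not spell out the translation rules, so the following is only a sketch of what a Petri net with rewards attached to transitions might look like. The Transition class, the fire function, the two-step task, and the reward split of 0.3 and 0.7 are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transition:
    name: str
    inputs: tuple    # places that must each hold a token
    outputs: tuple   # places that receive a token when the transition fires
    reward: float    # reward emitted on firing

def fire(marking, transition):
    """Fire `transition` if enabled; return (new_marking, reward), else (marking, 0.0)."""
    if any(marking.get(p, 0) < 1 for p in transition.inputs):
        return marking, 0.0
    new = dict(marking)
    for p in transition.inputs:
        new[p] -= 1
    for p in transition.outputs:
        new[p] = new.get(p, 0) + 1
    return new, transition.reward

# Hypothetical two-step sequence: reach the object, then grasp it.
reach = Transition("reach", ("start",), ("reached",), 0.3)
grasp = Transition("grasp", ("reached",), ("done",), 0.7)
marking = {"start": 1}
```

In this toy net, attempting grasp before reach yields no reward because the transition is not enabled, which is one way an ordering constraint could be reflected in the reward signal.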
Results
The experimental results indicate that the integration of TBTs into the RL framework leads to improved learning outcomes, particularly in environments characterized by sparse rewards. The TBT-based rewards facilitate better exploration and learning efficiency, enabling agents to successfully complete tasks that conventional RL methods fail to learn. The study also demonstrates that different reward distribution schemes and TBT structures can significantly impact the learning process.
Implications
This research has significant implications for the design of reward functions in reinforcement learning, particularly in robotics and other applications requiring complex task execution with temporal constraints. The ability to derive rewards from formal specifications like TBTs can enhance the performance of RL agents in real-world scenarios, making it easier to implement safe and effective learning strategies.
Parameterized Representations via Implicit Stochastic Modulation for High-Dimensional and High-Order Neural PDE Solvers
Theory
Optimization
Efficient ML
- PRISM decouples parameter encoding from the spatial AD graph, addressing memory growth issues.
- The architecture enables zero-shot extrapolation for parameterized PDEs without retraining.
- PRISM supports efficient scaling of high-dimensional PDEs, achieving up to 100,000 dimensions on a single GPU.
- Variance-aware Lipschitz damping is incorporated to enhance optimization stability.
Parameterized Representations via Implicit Stochastic Modulation for High-Dimensional and High-Order Neural PDE Solvers
Summary
This paper addresses the challenges of solving high-dimensional and high-order partial differential equations (PDEs) using neural networks. Traditional methods struggle with the curse of dimensionality, leading to excessive memory and computational costs. Recent advancements in stochastic derivative estimators have improved scalability but are limited to fixed parameter environments, necessitating retraining for each new parameter configuration. The authors introduce Parameterized Representations via Implicit Stochastic Modulation (PRISM), a novel architecture that decouples parameter encoding from the spatial automatic differentiation (AD) graph. This decoupling mitigates memory growth and variance issues associated with high-order stochastic solvers. PRISM employs a hyper-generator to process physical parameters, producing modulators that scale and shift a spatial latent manifold. The architecture achieves zero-overhead AD decoupling and provides variance-aware Lipschitz damping, enabling efficient training of parameterized PDEs up to 100,000 dimensions on a single GPU. The proposed method supports zero-shot extrapolation and physical inversion while avoiding the instabilities of conventional architectures.
Methodology
The authors propose the PRISM architecture, which utilizes a hyper-generator to create affine modulators that adjust a continuous latent manifold. This design allows for the separation of physical parameters from the spatial computation graph, thus preventing the entanglement that leads to increased memory usage and optimization instability. The architecture leverages stochastic derivative estimators to efficiently compute gradients without the computational overhead of traditional methods.
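The affine modulation described above can be sketched as a scale-and-shift applied to a spatial latent. The single linear hyper-generator, the tanh spatial branch, and all names and shapes are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def hyper_generator(params, W, b):
    """Map physical parameters to a (scale, shift) pair for the latent features."""
    out = params @ W + b
    scale, shift = np.split(out, 2, axis=-1)
    return 1.0 + scale, shift   # scale is centred on 1, so a zero output means identity

def modulated_latent(x, params, Wx, W, b):
    """x: (batch, d); params: (batch, p); Wx: (d, h); W: (p, 2h); b: (2h,)."""
    latent = np.tanh(x @ Wx)    # spatial branch: the only part that depends on x
    scale, shift = hyper_generator(params, W, b)
    return scale * latent + shift
```

Since the scale and shift do not depend on x, a spatial derivative of the output is just the scale times the derivative of the latent, so the parameter branch never enters the spatial differentiation graph in this sketch.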
Results
Extensive experiments demonstrate that PRISM can effectively scale stochastic solvers for highly non-linear parameterized PDEs, achieving performance on par with or exceeding existing methods while maintaining memory efficiency. The architecture successfully enables zero-shot extrapolation and physical inversion, showcasing its versatility and robustness in handling extreme physical conditions.
Implications
The PRISM framework has significant implications for various scientific fields that rely on solving high-dimensional PDEs, such as finance, control theory, and physics. Its ability to handle parameterized PDEs without retraining opens new avenues for real-time simulations and digital twin applications, enhancing the efficiency of modeling complex systems.
Efficient Network Inference via Hardware-Aware Architecture Search, Model Pruning & Quantization
Efficient ML
- Investigates efficient network inference for GNSS interference characterization under strict resource constraints.
- Utilizes a deployment-oriented compression pipeline combining pruning and quantization with MCUNet as a baseline.
- Applies hardware-aware zero-shot NAS to optimize network architecture and pruning configurations.
- Demonstrates trade-offs between predictive performance and deployment efficiency through experimental evaluations.
Efficient Network Inference via Hardware-Aware Architecture Search, Model Pruning & Quantization
Summary
This paper addresses the challenge of efficient network inference for embedded global navigation satellite system (GNSS) interference monitoring, which requires rapid and memory-efficient processing of large volumes of in-phase and quadrature (IQ) samples. The authors propose a framework that combines iterative structured pruning, post-training static quantization, and hardware-aware zero-shot neural architecture search (NAS) to optimize deep neural networks (DNNs) for resource-constrained environments. Starting from the MCUNet architecture, the study evaluates how model compression and architecture optimization impact model size, computational complexity, and memory usage while preserving performance. Experiments conducted on a GNSS interference dataset demonstrate the effectiveness of the proposed methods, revealing that the combination of compression techniques and hardware-aware design significantly enhances the deployability of ML models on embedded platforms such as the iMXRT1062 MCU and Raspberry Pi devices. The findings offer practical insights for developing compact ML models suitable for real-time GNSS interference monitoring.
Methodology
The methodology involves a combination of iterative structured pruning to reduce model complexity, post-training static quantization to minimize memory usage, and hardware-aware zero-shot neural architecture search (NAS) to optimize network design without full training. The approach is evaluated using a GNSS interference dataset, focusing on both classification and characterization tasks.
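Two of the compression steps can be sketched in a few lines. The example assumes L1-norm filter ranking for the structured pruning step and unsigned 8-bit affine quantization with a range fixed ahead of time for the static quantization step; the paper's actual criteria and bit widths are not given in the summary, and the function names are illustrative.

```python
import numpy as np

def prune_filters(weights, keep_ratio):
    """Keep the conv filters with the largest L1 norm. weights: (out_ch, in_ch, k)."""
    norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(round(keep_ratio * weights.shape[0])))
    keep = np.sort(np.argsort(norms)[-n_keep:])   # indices of surviving filters, in order
    return weights[keep], keep

def quantize_uint8(x, x_min, x_max):
    """Affine 8-bit quantization with a calibration range fixed in advance."""
    scale = (x_max - x_min) / 255.0
    zero_point = int(round(-x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize_uint8(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale
```

In an iterative scheme, prune_filters would be applied with a gradually decreasing keep_ratio and fine-tuning in between, with quantization applied once at the end using ranges measured on calibration data.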
Results
The results indicate that the proposed framework effectively reduces model size and computational requirements while maintaining high predictive performance. The experiments reveal that the optimized models can operate efficiently on embedded platforms, providing practical configurations that are competitive with uncompressed baselines.
Implications
The findings have significant implications for the deployment of machine learning models in resource-constrained environments, particularly for real-time applications in GNSS interference monitoring. The methodologies developed can be applied to other domains requiring efficient model inference on embedded systems.
Computational Identifiability
Theory
- Introduction of computational identifiability as a practical, computation-bound notion of identifiability.
- Formalization of the relationship between causal effect estimation and meta-learning.
- Empirical demonstration of computational identifiability in complex scenarios.
- Provision of a framework for identifying causal effects with finite samples and error tolerances.
Computational Identifiability
Summary
This paper introduces the concept of 'computational identifiability,' which contrasts with traditional theoretical identifiability in causal inference. The authors argue that while theoretical identifiability relies on idealized conditions such as infinite data, computational identifiability focuses on the practical aspects of identifying causal effects through finite computational procedures. The framework defines successful identification as the existence of an estimator that meets specified error tolerances and confidence bounds, given a prior distribution over parameters. The paper empirically demonstrates this framework across various complex scenarios, including small sample sizes, ambiguous graphical criteria, and mixed observational-interventional data. By formalizing the connection between causal effect estimation and meta-learning, the authors provide a comprehensive approach to tackling identification questions that are often challenging under traditional methods.
Methodology
The authors propose a framework for computational identifiability that involves defining a meta-prior over parameters and a hypothesis space of estimators. They conduct empirical experiments to validate their framework across various scenarios, including small sample sizes and mixed data types.
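One way to read this definition is as a Monte Carlo check: draw tasks from the prior, run a candidate estimator on a finite sample, and count how often the error stays within tolerance. The sketch below assumes a standard-normal prior over the effect, a randomized binary treatment, and a difference-in-means estimator; all of these are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def success_rate(estimator, n_samples, eps, n_tasks=2000, seed=0):
    """Fraction of prior draws on which |estimate - true effect| <= eps."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_tasks):
        effect = rng.normal()                        # draw a task from the assumed prior
        treat = rng.integers(0, 2, size=n_samples)   # randomized binary treatment
        outcome = effect * treat + rng.normal(size=n_samples)
        hits += int(abs(estimator(treat, outcome) - effect) <= eps)
    return hits / n_tasks

def diff_in_means(treat, outcome):
    if treat.min() == treat.max():                   # one arm is empty: fall back to zero
        return 0.0
    return outcome[treat == 1].mean() - outcome[treat == 0].mean()

rate_large = success_rate(diff_in_means, n_samples=400, eps=0.3)
rate_small = success_rate(diff_in_means, n_samples=10, eps=0.3)
```

Under this reading, the effect would count as identified at tolerance eps and confidence 1 - delta if the success rate is at least 1 - delta, so the same effect can be identified with 400 samples and not with 10.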
Results
The experiments show that computational identifiability can effectively address identification questions in settings where traditional theoretical methods fall short. The framework successfully identifies causal effects even with limited data and ambiguous criteria, demonstrating its practical applicability.
Implications
The concept of computational identifiability has significant implications for causal inference in real-world applications, particularly in fields where data is limited or complex. It provides a new lens through which researchers can approach identification problems, potentially leading to more robust causal analyses.
Learning universal approximations for partial differential equations with Physics-Informed Broad Learning System
Efficient ML
Theory
Optimization
- PIBLS is the first application of Broad Learning System (BLS) to solving PDEs, offering a backpropagation-free computational framework.
- The framework reformulates PDE solving as a direct least-squares optimization, enhancing computational efficiency.
- Rigorous mathematical proof establishes PIBLS's universal approximation property for PDE solutions.
- Experimental results show PIBLS is significantly faster and more accurate than traditional PINNs.
Learning universal approximations for partial differential equations with Physics-Informed Broad Learning System
Summary
This paper introduces the Physics-Informed Broad Learning System (PIBLS), a novel framework designed to solve partial differential equations (PDEs) more efficiently than traditional numerical methods and existing Physics-Informed Neural Networks (PINNs). Traditional numerical solvers, while robust, are often limited by high computational costs due to mesh dependencies. In contrast, PINNs provide a mesh-free alternative but struggle with slow convergence and optimization instability. PIBLS addresses these issues by reformulating PDE solving as a direct least-squares optimization problem, allowing for faster and more stable solutions. The authors present a unique solver strategy that includes an analytical solution for linear PDEs and an enhanced nonlinear least-squares perturbation algorithm for nonlinear PDEs. They also provide a rigorous mathematical proof of PIBLS's universal approximation property, ensuring its capability to approximate solutions to PDEs. Experimental results demonstrate that PIBLS is one to three orders of magnitude faster than conventional PINNs while achieving significantly higher accuracy, establishing it as a promising alternative for real-time simulation and design optimization tasks in scientific machine learning.
Methodology
The PIBLS framework utilizes a Broad Learning System architecture, where input coordinates are projected into a system of randomly generated feature and enhancement nodes. The output is computed as a linear combination of these nodes, with the weights optimized through least-squares methods. The methodology includes deriving analytical derivatives for the network output and formulating the PDE solving task as a least-squares optimization problem, with specific strategies for both linear and nonlinear PDEs.
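The least-squares formulation for a linear PDE can be illustrated on a one-dimensional Poisson problem. This sketch uses a single layer of fixed random tanh features rather than the feature and enhancement nodes of a full Broad Learning System, and the node count, weight ranges, and test problem are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes = 80
a = rng.uniform(-4.0, 4.0, n_nodes)   # random, fixed input weights
b = rng.uniform(-4.0, 4.0, n_nodes)   # random, fixed biases

def features(x):
    return np.tanh(np.outer(x, a) + b)

def features_xx(x):
    """Analytical second derivative of each tanh feature with respect to x."""
    t = np.tanh(np.outer(x, a) + b)
    return (a ** 2) * (-2.0 * t * (1.0 - t ** 2))

# Problem: u''(x) = -pi^2 sin(pi x) on [0, 1], u(0) = u(1) = 0; exact solution sin(pi x).
x_col = np.linspace(0.0, 1.0, 200)    # collocation points for the PDE residual
x_bc = np.array([0.0, 1.0])           # boundary points
A = np.vstack([features_xx(x_col), features(x_bc)])
rhs = np.concatenate([-np.pi ** 2 * np.sin(np.pi * x_col), np.zeros(2)])
w, *_ = np.linalg.lstsq(A, rhs, rcond=None)   # output weights from one linear solve

x_test = np.linspace(0.0, 1.0, 101)
max_error = np.abs(features(x_test) @ w - np.sin(np.pi * x_test)).max()
```

Only the output weights are solved for, so there is no backpropagation; the PDE and boundary conditions enter as rows of one linear system.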
Results
The experiments conducted demonstrate that PIBLS outperforms conventional PINNs by being one to three orders of magnitude faster while achieving significantly higher solution accuracy across various linear and nonlinear PDEs.
Implications
The PIBLS framework has the potential to revolutionize the computational efficiency of scientific machine learning applications, particularly in real-time simulations and design optimization tasks across physical, biological, and engineering systems.
VIMPO: Value-Implicit Policy Optimization for LLMs
Reinforcement Learning
Large Language Models
Optimization
- VIMPO is a critic-free policy optimization method that improves reasoning in LLMs.
- It derives a policy-implied value function using KL-regularized reinforcement learning principles.
- The method allows for token-level credit assignment without the instability of a learned critic.
- Empirical results show VIMPO outperforms existing methods like GRPO, especially under noisy rewards.
VIMPO: Value-Implicit Policy Optimization for LLMs
Summary
The paper introduces VIMPO (Value-Implicit Policy Optimization), a novel method for reinforcement learning that enhances the reasoning capabilities of large language models (LLMs) without the need for a critic. Traditional reinforcement learning approaches face a trade-off between simplicity and effective credit assignment. Actor-critic methods provide dense learning signals but require a learned value function, which can lead to training instability. Conversely, group-relative methods like GRPO simplify training by using trajectory-level advantages but lack fine-grained credit assignment. VIMPO addresses these challenges by deriving a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. This allows for a critic-free value optimization objective that incorporates outcome-level verifiable rewards. The method also provides a closed-form one-step temporal-difference advantage, enabling token-level credit assignment without a learned critic. Experimental results demonstrate that VIMPO outperforms GRPO on various benchmarks, particularly in competition-style evaluations, and shows resilience against noisy rewards, suggesting it can provide finer credit assignment while maintaining practical simplicity.
Methodology
VIMPO models autoregressive generation as a deterministic-transition Markov Decision Process (MDP) and derives a closed-form representation of the optimal value function. It uses a terminal boundary condition to create a critic-free value optimization objective, which trains a policy-implied value function. The method integrates a closed-form one-step temporal-difference advantage into a PPO-style actor update, enabling effective token-level credit assignment.
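A standard identity from KL-regularized RL gives a feel for how a value function can be read off a policy. With deterministic transitions and reward only at the end of the sequence, the optimal policy satisfies beta * (log pi(a_t | s_t) - log pi_ref(a_t | s_t)) = V(s_{t+1}) - V(s_t), so the scaled log-ratio of a token is a one-step temporal-difference advantage. The sketch below applies that identity to a sampled sequence; whether VIMPO uses exactly this form is not stated in the summary, and the function name and arguments are illustrative.

```python
import numpy as np

def implied_values_and_advantages(logp, logp_ref, terminal_reward, beta):
    """logp, logp_ref: log-probs of the sampled tokens under policy and reference, shape (T,)."""
    advantages = beta * (np.asarray(logp) - np.asarray(logp_ref))   # A_t = V(s_{t+1}) - V(s_t)
    # Telescoping with V(s_T) = terminal_reward gives V(s_t) = R - sum_{k >= t} A_k.
    values = terminal_reward - np.cumsum(advantages[::-1])[::-1]
    return values, advantages
```

The identity holds exactly only at the optimum of the KL-regularized objective, so for a policy still in training these values are an implied estimate rather than the true value function.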
Results
VIMPO demonstrated significant improvements over the GRPO baseline across various mathematical RLVR benchmarks, including MATH-500, AIME 2024, AIME 2025, and OlympiadBench. It achieved faster training and higher validation accuracy, particularly excelling in competition-style evaluations. Under conditions of noisy rewards, VIMPO maintained a consistent advantage, indicating its robustness and effectiveness in credit assignment.
Implications
The development of VIMPO has potential applications in enhancing the performance of LLMs in complex reasoning tasks, such as mathematical problem solving and code generation. Its critic-free approach may simplify the training process while improving the model's ability to assign credit accurately to individual tokens, which is crucial for tasks requiring multi-step reasoning.
When to Trust, How to Distill: Multi-Foundation Model Guidance for Lightweight, Robust Scientific Time Series Forecasting
Time Series
Efficient ML
Theory
- Introduction of Guard framework for dynamic multi-teacher knowledge distillation.
- Adaptive mechanisms for selecting teacher models based on input statistics and uncertainty.
- Significant RMSE reduction compared to traditional distillation methods.
- Demonstrated effectiveness in four scientific domains despite distributional misalignment.
When to Trust, How to Distill: Multi-Foundation Model Guidance for Lightweight, Robust Scientific Time Series Forecasting
Summary
This paper addresses the challenges of deploying Time-Series Foundation Models (TSFMs) in scientific domains, particularly due to distributional misalignment and high computational costs. The authors propose a novel framework called Gated Uncertainty-Aware Routing for Distillation (Guard), which aims to extract latent structural knowledge from multiple misaligned foundation models to train lightweight, specialized forecasters. Guard employs two key mechanisms: a Contextual Router that selects the most relevant teacher model based on local input statistics, and an Uncertainty-Gated Temperature mechanism that adjusts the distillation strength based on the confidence of the teacher models. The framework is evaluated across four climate-critical domains: meteorology, ecosystem carbon flux, soil moisture, and energy grids. The results demonstrate that Guard significantly reduces RMSE compared to a fixed-weight multi-teacher distillation baseline, effectively distilling knowledge from pretrained models even in cases of suboptimal zero-shot accuracy. The findings indicate that domain-misaligned teachers can still provide valuable corrections, outperforming globally superior models in challenging instances, thus enabling high-precision forecasting suitable for resource-constrained environments.
Methodology
The Guard framework utilizes a two-pronged approach: a Contextual Router for dynamic teacher selection based on local input characteristics, and an Uncertainty-Gated Temperature mechanism to modulate the strength of knowledge distillation according to the reliability of the teacher models. This allows for instance-wise decision-making during the training process, enhancing the model's ability to adapt to diverse temporal dynamics.
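The two mechanisms can be sketched together in a toy distillation loss. Everything specific here is hypothetical: the three window statistics, the linear router, the hard argmax selection, and the exponential gate with temperature tau are stand-ins chosen for illustration, since the summary does not describe the actual parameterization.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def window_stats(x):
    """Local statistics of each lookback window: mean, std, mean absolute first difference."""
    return np.stack([x.mean(-1), x.std(-1), np.abs(np.diff(x, axis=-1)).mean(-1)], axis=-1)

def gated_distillation_loss(student, teachers, teacher_std, x, router_W, tau=1.0):
    """student: (B, H); teachers, teacher_std: (K, B, H); x: (B, L); router_W: (3, K)."""
    route = softmax(window_stats(x) @ router_W)     # (B, K) per-instance teacher weights
    chosen = route.argmax(axis=-1)                  # hard selection of one teacher per instance
    idx = np.arange(x.shape[0])
    target = teachers[chosen, idx]                  # (B, H) forecasts of the chosen teachers
    gate = np.exp(-teacher_std[chosen, idx].mean(-1) / tau)   # weaker pull when the teacher is unsure
    per_instance = ((student - target) ** 2).mean(-1)
    return float((gate * per_instance).mean())
```

The gate lies in (0, 1], so an uncertain teacher contributes less to the student's loss on that instance while a confident one contributes almost fully.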
Results
The evaluation of Guard showed a significant reduction in RMSE across various forecasting tasks compared to a fixed-weight multi-teacher distillation baseline. The framework successfully distilled knowledge from pretrained foundation models, even when they exhibited poor zero-shot performance due to distribution shifts. Notably, Guard outperformed globally superior models on 28.5% of the most challenging instances.
Implications
The findings suggest that Guard can facilitate the deployment of robust time-series forecasting models in scientific applications, particularly in resource-constrained settings such as edge-computing environments. This could enhance the monitoring and prediction capabilities in critical areas like meteorology and ecosystem management.
Leveraging AutoML for Sustainable Deep Learning: A Multi-Objective HPO Approach on Deep Shift Neural Networks
Efficient ML
Optimization
Computer Vision
- Introduces the first configuration space tailored for Deep Shift Neural Networks (DSNNs).
- Combines multi-objective and multi-fidelity optimization techniques for efficient AutoML.
- Demonstrates significant improvements in accuracy and reductions in emissions for optimized DSNNs.
- Reveals model-specific trade-offs in quantization strategies that enhance energy efficiency.
Leveraging AutoML for Sustainable Deep Learning: A Multi-Objective HPO Approach on Deep Shift Neural Networks
Summary
This paper addresses the environmental and resource challenges posed by deep learning (DL) models, particularly in low-resource environments. The authors focus on Deep Shift Neural Networks (DSNNs), which utilize shift operations to reduce computational complexity during inference. By employing AutoML techniques, the study aims to optimize DSNN configurations for image classification tasks, balancing accuracy and energy consumption. The authors introduce a multi-objective hyperparameter optimization (HPO) approach that combines multi-fidelity and multi-objective optimization to explore trade-offs in DSNN design. The results demonstrate that optimized DSNN configurations can achieve a performance increase of approximately 20% while reducing emissions by over 60%. The findings reveal that quantizing smaller portions of the network can lead to significant energy savings without compromising performance, highlighting the importance of tailored quantization strategies. This research contributes to the field of Green AutoML by providing insights into efficient DSNN design and offering a repository for further exploration.
Methodology
The authors employed a multi-objective hyperparameter optimization approach that integrates multi-fidelity and multi-objective optimization techniques. They extended the SMAC3 framework to balance predictive accuracy and energy consumption, utilizing tools like CodeCarbon to measure energy usage and emissions during model training and evaluation.
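The outcome of a two-objective search is a Pareto front rather than a single best configuration. The sketch below filters a list of evaluated configurations down to the non-dominated ones; it does not use SMAC3 or CodeCarbon, and the configuration names and numbers are invented for illustration only.

```python
def pareto_front(configs):
    """configs: list of (name, accuracy, emissions). Higher accuracy and lower emissions are better."""
    front = []
    for name, acc, em in configs:
        dominated = any(
            a >= acc and e <= em and (a > acc or e < em)
            for _, a, e in configs
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical results, invented for illustration only.
results = [("A", 0.90, 5.0), ("B", 0.85, 2.0), ("C", 0.84, 3.0), ("D", 0.92, 9.0)]
front = pareto_front(results)
```

Here configuration C drops out because B is both more accurate and lower in emissions, while A, B, and D each represent a different trade-off.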
Results
The optimized DSNN configurations achieved a performance increase of about 20% while reducing emissions by more than 60%. The study also found that quantizing smaller portions of the network with low precision could yield optimal energy consumption without sacrificing performance, with these findings corroborated across various backbone architectures.
Implications
The research has significant implications for developing energy-efficient deep learning applications, particularly in resource-constrained environments such as edge computing and IoT. It provides a framework for optimizing neural networks to minimize environmental impact while maintaining high performance, contributing to the broader goals of sustainable AI.
EvoRubrics: Dynamic Rubrics as Rewards via Adversarial Co-Evolution for LLM Reinforcement Learning
Reinforcement Learning
Large Language Models
NLP
- EvoRubrics enables real-time co-evolution of rubrics and policies, enhancing the effectiveness of reinforcement learning.
- The framework uses adversarial interactions to ensure that evaluation standards adapt to the evolving capabilities of the model.
- EvoRubrics consistently outperforms static and dynamic rubric baselines across multiple benchmarks.
- A self-supervised variant of EvoRubrics achieves meaningful performance gains, highlighting the potential for unsupervised learning.
EvoRubrics: Dynamic Rubrics as Rewards via Adversarial Co-Evolution for LLM Reinforcement Learning
Summary
The paper introduces EvoRubrics, a novel co-evolutionary reinforcement learning (RL) framework designed to enhance the training of large language models (LLMs) by dynamically adapting evaluation rubrics. Traditional rubric-based rewards often become static and lose effectiveness as the model improves, leading to reward saturation and suboptimal learning. EvoRubrics addresses this issue by allowing a Policy LLM and a Rubric Generator to evolve together through adversarial interactions at each training step. This dynamic adaptation ensures that the rubric remains discriminative and informative, providing real-time feedback that aligns with the model's capabilities. The framework not only improves the quality of the generated responses but also facilitates an automatic curriculum where the evaluation criteria become progressively more challenging. The authors demonstrate that EvoRubrics outperforms both static and existing dynamic rubric methods across various benchmarks, and even a self-supervised variant of the approach yields significant performance improvements. This indicates that the adversarial relationship between generation and evaluation can generate rich learning signals without the need for external supervision.
Methodology
EvoRubrics employs a co-evolutionary framework where a Policy LLM and a Rubric Generator are jointly optimized using dual LoRA adapters. The training process involves adversarial interactions that allow the Rubric Generator to adapt its evaluation criteria based on the performance of the Policy LLM, ensuring that the rubrics remain relevant and challenging throughout the training process.
Results
EvoRubrics demonstrated superior performance compared to static and dynamic rubric baselines across various benchmarks. The framework's ability to adaptively refine evaluation criteria led to improved learning outcomes for the Policy LLM. Additionally, the self-supervised variant of EvoRubrics achieved significant performance improvements, indicating the effectiveness of the co-evolutionary approach.
Implications
The findings suggest that dynamic rubric generation can significantly enhance the training of LLMs in open-ended tasks, where traditional evaluation methods may fall short. The ability to adaptively refine evaluation criteria could lead to more robust and capable language models, with applications in creative writing, open-ended question answering, and other domains requiring nuanced evaluation.
On the Position Bias of On-Policy Distillation
Reinforcement Learning
Optimization
Efficient ML
- Identifies the position bias phenomenon in OPD, where early tokens provide more valuable supervision than later ones.
- Proposes IW-OPD, which adjusts token weights based on the accumulated discrepancy between student and teacher distributions.
- Demonstrates that IW-OPD converges faster and achieves better performance than standard OPD.
- Shows that the advantages of IW-OPD increase with the mismatch between teacher and student models.
On the Position Bias of On-Policy Distillation
Summary
This paper addresses the inefficiencies in On-Policy Distillation (OPD) in reinforcement learning, particularly the position bias that arises from uniformly averaging token-level losses. The authors identify that as student rollouts extend, they diverge from the teacher's distribution, leading to diminished supervision quality for later tokens. They propose Importance-Weighted On-Policy Distillation (IW-OPD), which assigns weights to tokens based on the discrepancy between the student and teacher distributions, effectively upweighting earlier tokens and downweighting later ones. Through a constrained optimization perspective, the authors demonstrate that IW-OPD converges faster than standard OPD and achieves superior final performance, particularly in scenarios with significant teacher-student mismatches. Their experiments show that IW-OPD improves learning efficiency and final performance metrics, achieving a notable increase of 6.9 points on the AIME-2025 benchmark compared to standard OPD.
Methodology
The authors analyze the position bias in OPD through constrained optimization, leading to the development of IW-OPD. This method uses a closed-form optimal policy to assign weights to tokens based on the likelihood ratio of teacher-to-student distributions, allowing for more effective supervision by emphasizing earlier tokens in the rollouts.
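The weighting idea can be sketched as follows. The summary does not give the exact closed-form weights, so this sketch assumes a simple stand-in: each token's weight decays exponentially with the accumulated absolute log-likelihood gap between teacher and student. The function names and the `beta` parameter are hypothetical.

```python
import numpy as np

def position_weights(teacher_logp, student_logp, beta=1.0):
    """Assumed weighting: exp(-beta * accumulated |log p_T - log p_S|)."""
    gap = np.abs(np.asarray(teacher_logp) - np.asarray(student_logp))
    accumulated = np.cumsum(gap)          # discrepancy accumulated along the rollout
    w = np.exp(-beta * accumulated)       # later tokens receive smaller weights
    return w / w.sum()                    # normalize so the weights sum to 1

def weighted_loss(token_losses, weights):
    """Replace the uniform average of token losses with a weighted sum."""
    return float(np.sum(np.asarray(token_losses) * weights))

# Made-up per-token log-probabilities for a four-token student rollout.
teacher_logp = [-0.1, -0.5, -2.0, -3.0]
student_logp = [-0.2, -0.4, -0.5, -0.6]
w = position_weights(teacher_logp, student_logp)
loss = weighted_loss([1.0, 1.0, 1.0, 1.0], w)
```

Because the accumulated gap can only grow along the rollout, the weights are non-increasing in position, which matches the described effect of emphasizing earlier tokens.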
Results
IW-OPD significantly outperforms standard OPD in terms of convergence speed and final performance, with improvements of up to 6.9 points on the AIME-2025 benchmark. The method shows enhanced sample efficiency, particularly for smaller student models, and demonstrates that the performance gains scale with the degree of mismatch between teacher and student models.
Implications
The findings suggest that optimizing token supervision in OPD can lead to more efficient learning in reinforcement learning applications. This has potential implications for training smaller models with stronger teachers, improving the overall efficiency of model distillation processes in various applications.
Adversarial Bandit Optimization with Globally Bounded Perturbations to Convex Losses
Optimization
Theory
- Introduces a model for bandit optimization with C-approximately convex and β-smooth function sequences.
- Establishes expected regret guarantees that account for adversarial perturbations under a global budget.
- Demonstrates that sublinear expected regret is achievable even with non-convex losses.
- Modifies existing bandit algorithms to accommodate the new perturbation model.
Adversarial Bandit Optimization with Globally Bounded Perturbations to Convex Losses
Summary
This paper addresses the problem of adversarial bandit optimization where the loss functions are allowed to be non-convex and non-smooth. The authors propose a framework where the learner selects actions and incurs losses that consist of a convex, β-smooth component and an adversarial perturbation, which is subject to a global budget constraint on its cumulative magnitude over time. This model extends previous work by allowing for general convex and β-smooth losses instead of just linear losses. The authors establish expected regret guarantees that account for the perturbation budget, demonstrating that sublinear expected regret can still be achieved even when the observed losses deviate from convexity, provided the cumulative deviation is controlled. The analysis involves modifying a standard bandit optimization algorithm and separating the contributions of the convex components from the perturbations, leading to a clearer understanding of how perturbations affect regret. The results indicate that the proposed method can effectively handle adversarial perturbations while maintaining performance in bandit optimization settings.
Methodology
The authors modify a standard bandit optimization algorithm to accommodate a model where losses are composed of a convex component and an adversarial perturbation. They develop a regret analysis that disentangles the contributions of the convex components from the perturbations, allowing for a clearer understanding of the regret incurred due to the perturbations. The analysis employs a bandit smoothing argument to control the expected regret.
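For readers unfamiliar with bandit smoothing, the sketch below shows the classic one-point gradient estimator that such arguments are usually built on. It is a generic illustration and not the paper's modified algorithm: it omits the adversarial perturbation, and the linear loss and all names are assumptions chosen so the estimate is easy to check.

```python
import numpy as np

def one_point_gradient(loss, x, delta, rng):
    """Estimate a gradient from a single loss evaluation (bandit feedback).

    Samples u uniformly on the unit sphere and returns (d / delta) * loss(x + delta * u) * u,
    an unbiased gradient estimate of a smoothed version of the loss.
    """
    d = x.shape[0]
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)  # normalized Gaussian is uniform on the sphere
    return (d / delta) * loss(x + delta * u) * u

rng = np.random.default_rng(0)
a = np.array([1.0, 2.0])
linear_loss = lambda x: float(a @ x)  # toy loss whose true gradient is a
x0 = np.zeros(2)

# Averaging many single-query estimates should approach the true gradient.
samples = [one_point_gradient(linear_loss, x0, 0.5, rng) for _ in range(20000)]
estimate = np.mean(samples, axis=0)
```

In the perturbed setting described above, the observed value would be the convex loss plus a perturbation, so the estimator picks up an extra error term that the regret analysis has to bound through the global budget.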
Results
The paper establishes that under the global perturbation budget assumption, the expected regret can be controlled and remains sublinear, even when the observed losses are not strictly convex. The results provide explicit regret bounds that depend on the perturbation budget, demonstrating the effectiveness of the proposed approach in handling adversarial perturbations.
Implications
The findings have significant implications for online decision-making scenarios where loss functions may be subject to adversarial influences, such as in online pricing, resource allocation, and other applications where querying the system is costly. The ability to maintain performance despite non-convex perturbations broadens the applicability of bandit optimization techniques.
An Empirical Study of OpenPangu Quantization on Ascend NPUs
NLP
Large Language Models
Efficient ML
- 8-bit weight-only quantization is effectively lossless for OpenPangu models.
- 4-bit quantization is practical for the 7B model but harmful for the 1B model.
- Ultra-low precision quantization (2-bit and binary) often results in poor performance.
- The study provides a comprehensive evaluation of various quantization methods on Ascend NPUs.
An Empirical Study of OpenPangu Quantization on Ascend NPUs
Summary
This paper investigates the robustness of OpenPangu models under aggressive post-training quantization (PTQ) on Huawei Ascend NPUs. The authors conduct a systematic empirical study of the OpenPangu 1B and 7B models, evaluating various quantization methods including RTN, GPTQ, AWQ, SmoothQuant, GPTAQ, BiLLM, and SliM-LLM across 18 evaluation tasks. The study reveals that 8-bit weight-only quantization is effectively lossless for both models, while 4-bit quantization is practical for the 7B model but detrimental for the 1B model, particularly in reasoning, math, and code tasks. The results highlight challenges in ultra-low precision quantization, with most 2-bit and binary settings leading to near-random behavior. This research provides an NPU-oriented accuracy map for selecting quantization settings for OpenPangu models and emphasizes the difficulties associated with extreme low-bit compression.
Methodology
The authors systematically evaluated OpenPangu 1B and 7B models using a range of post-training quantization methods. They maintained a unified calibration and evaluation protocol across various quantization settings, testing both weight-only and weight-activation methods. The evaluation included perplexity measurements on language modeling tasks and accuracy assessments on commonsense reasoning and knowledge benchmarks.
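As a point of reference for the simplest method in the list, round-to-nearest (RTN) weight-only quantization can be sketched in a few lines. The per-row symmetric scaling used here is an assumption for illustration. The summary does not state the granularity or group size used in the study, and nothing in this sketch is specific to OpenPangu or Ascend NPUs.

```python
import numpy as np

def rtn_quantize(weights, bits):
    """Symmetric round-to-nearest quantization with one scale per output row.

    Returns the dequantized weights so the error can be measured directly.
    """
    qmax = 2 ** (bits - 1) - 1                       # e.g. 127 for 8 bits, 7 for 4 bits
    scale = np.abs(weights).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # avoid division by zero on all-zero rows
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64))                         # stand-in weight matrix
err8 = np.abs(w - rtn_quantize(w, 8)).mean()
err4 = np.abs(w - rtn_quantize(w, 4)).mean()
```

The mean reconstruction error grows as the bit-width drops, which is the basic effect behind the reported sensitivity of smaller models to 4-bit and lower settings.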
Results
The results indicate that while 8-bit quantization maintains model performance, 4-bit quantization is less effective for smaller models. The study found that ultra-low precision quantization methods (2-bit and binary) generally led to significant performance degradation, with some configurations resulting in non-finite perplexity values. The findings provide a detailed accuracy map for selecting quantization settings tailored for Ascend NPUs.
Implications
The findings have significant implications for deploying large language models in resource-constrained environments, particularly in private and domain-specific applications. The study aids in understanding the trade-offs involved in quantization, guiding practitioners in selecting appropriate methods and bit-widths for effective model deployment.
Right Knowledge, Wrong Answer: Test-Time Steering for Temporal Fact Conflicts in Open-Weight Language Models
NLP
Large Language Models
- Formalizes the concept of Parametric Temporal Conflict (PTC) in language models.
- Introduces Temporal Attractor Steering (TAS) as a retrieval-free, inference-time intervention.
- Demonstrates that TAS can effectively recover newer facts while preserving accuracy on non-conflict queries.
- Evaluates TAS across multiple models and a comprehensive benchmark dataset.
Right Knowledge, Wrong Answer: Test-Time Steering for Temporal Fact Conflicts in Open-Weight Language Models
Summary
This paper addresses the issue of Parametric Temporal Conflict (PTC) in large language models (LLMs), where models retain both outdated and updated facts, leading to incorrect responses during inference. The authors introduce a novel framework called Temporal Attractor Steering (TAS), which operates in three stages: detecting likely conflicts, localizing the conflict-critical layer, and steering the hidden states towards the newer-fact representations without the need for retraining or external retrieval. The study constructs a benchmark dataset with 8,746 records across five Wikidata relations and evaluates TAS on four open-weight LMs. The results indicate that TAS can recover 29-57% of PTC cases while maintaining high accuracy (85-99%) on non-conflict queries, outperforming a matched baseline on three out of four models. This work highlights a mechanism for selectively overriding outdated knowledge at inference time, providing insights into the management of temporal knowledge in LLMs.
Methodology
The methodology involves a three-stage process: (1) detecting PTC cases using a knowledge-recovery filter, (2) localizing conflict-critical layers through activation patching, and (3) steering the model's hidden states towards the updated fact representation. This is achieved without retraining the model or utilizing external retrieval mechanisms.
Results
TAS recovers 29-57% of PTC cases while maintaining 85-99% accuracy on non-conflict queries. The activation patching technique achieves answer-flip rates between 0.72 and 0.85 across all evaluated models. Overall, the PTC rates increase from 0.041 to 0.103, indicating that temporal conflicts in LLMs are both measurable and resolvable.
Implications
The findings suggest that LLMs can be made more reliable in providing up-to-date information by implementing inference-time interventions like TAS. This has potential applications in real-time information retrieval systems, knowledge management, and improving the accuracy of AI-driven responses in dynamic environments.
On the Curse of Dimensionality in Private Sparse Covariance Estimation and PCA
Theory
- Demonstrates a significant curse of dimensionality in DP covariance estimation and PCA.
- Establishes poly(k, log d) sample complexity for DP PCA under additional sparsity assumptions.
- Provides poly(d) lower bounds for both sparse covariance estimation and PCA under DP.
- First to show an exponential gap between private and non-private sample complexities in sparse estimation.
On the Curse of Dimensionality in Private Sparse Covariance Estimation and PCA
Summary
This paper investigates the challenges of high-dimensional differentially private (DP) covariance estimation and principal component analysis (PCA) under k-row-column sparsity (k-RCS) of the covariance matrix. It highlights a significant gap in sample complexity between private and non-private settings, where non-private methods require poly(k, log d) samples, while existing DP methods necessitate Ω(d) samples. The authors demonstrate that under certain conditions, specifically when the leading eigenvector is sparse, it is possible to achieve poly(k, log d) sample complexity for DP PCA. They also establish lower bounds showing an exponential gap between private and non-private variants when k is polylogarithmic in d. This work is notable for being the first to demonstrate such a separation in the context of sparse estimation problems in private high-dimensional statistics, and it provides insights into the inherent challenges of achieving differential privacy in these settings.
Methodology
The authors develop new algorithms for differentially private sparse covariance estimation and PCA, leveraging structural assumptions about the covariance matrix and the leading eigenvector. They analyze the privacy and utility of these algorithms, providing both upper and lower bounds for sample complexity under different models of sparsity.
Results
The paper presents upper bounds showing that under certain sparsity conditions, poly(k, log d) samples are sufficient for DP PCA, while establishing lower bounds indicating that Ω(d) samples are necessary for general DP covariance estimation and PCA. This reveals a stark contrast in sample requirements between private and non-private settings, particularly when k is polylogarithmic in d.
Implications
The findings suggest that achieving differential privacy in high-dimensional statistics, particularly for sparse covariance estimation and PCA, is fundamentally more challenging than in non-private settings. This has implications for the design of algorithms in sensitive data applications, where privacy guarantees are essential.
Geometric and Information Compression of Representations in Deep Learning
Theory
- Low mutual information (MI) does not reliably indicate geometric compression in latent representations.
- The relationship between MI and geometric compression is negative and nonlinear, influenced by training conditions.
- Generalization may confound the connection between MI and geometric compression.
- The study employs CEB networks and continuous dropout networks for robust MI estimation.
Geometric and Information Compression of Representations in Deep Learning
Summary
This paper investigates the relationship between geometric and information compression in the latent representations generated by deep neural networks (DNNs). The authors explore whether low mutual information (MI) between inputs and representations correlates with geometrically compressed latent spaces. They utilize class-wise clustering as a measure of geometric compression and employ conditional entropy bottleneck (CEB) networks and continuous dropout networks for MI estimation. Through controlled noise injection experiments, the study reveals that low MI does not consistently indicate geometric compression, suggesting a more complex relationship. The findings indicate a negative and nonlinear correlation between MI and geometric compression, which can change based on the training setup. The authors propose that generalization may act as a confounder in this relationship rather than a direct outcome. The paper contributes to the understanding of representation learning by providing empirical evidence and theoretical insights into the interplay between information-theoretic and geometric dimensions of latent representations.
Methodology
The authors conducted empirical evaluations using DNNs trained with conditional entropy bottleneck (CEB) and continuous dropout techniques. They estimated mutual information (MI) between inputs and latent representations and assessed geometric compression through class-wise clustering measures. The study involved large-scale analyses across multiple architectures and datasets, focusing on end-of-training representations to capture stable properties.
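One common way to quantify class-wise clustering is the ratio of within-class to between-class scatter of the latent vectors. The sketch below assumes that form. The summary does not specify the paper's exact measure, and the function name and synthetic data are illustrative only.

```python
import numpy as np

def within_between_ratio(z, labels):
    """Ratio of within-class to between-class scatter (lower = tighter clusters).

    Assumes at least two classes with distinct means, so the denominator is nonzero.
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    global_mean = z.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        zc = z[labels == c]
        mu = zc.mean(axis=0)
        within += ((zc - mu) ** 2).sum()                      # spread around the class mean
        between += len(zc) * ((mu - global_mean) ** 2).sum()  # spread of class means
    return within / between

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
centers = np.array([[0.0, 0.0], [5.0, 5.0]])[labels]
tight = within_between_ratio(centers + rng.normal(0, 0.1, centers.shape), labels)
loose = within_between_ratio(centers + rng.normal(0, 2.0, centers.shape), labels)
```

A geometric measure like this depends only on where the encodings sit, whereas mutual information also reacts to injected noise, which is one way to see why the two notions of compression can disagree.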
Results
The results indicate that low MI can arise from either strong noise or tightly clustered encodings, but geometric compression, as measured by neural collapse, only occurs in the latter case. The experiments demonstrated that the connection between MI and geometric compression is not linear and can vary significantly based on the training setup, challenging previous assumptions in the literature.
Implications
These findings have implications for the design and training of deep learning models, particularly in understanding how to achieve effective representation learning that balances information retention and geometric properties. The insights could inform future research on improving generalization in DNNs and optimizing their latent spaces for various applications.
Pseudo-Feature Padding: A Lightweight Defense Against False Data Injection in Power Grids
Theory
Efficient ML
- Introduces a lightweight defense mechanism against FDIA in DNNs used in CPS.
- Utilizes pseudo-feature padding to increase input dimensionality and complexity.
- Model-agnostic approach requiring no modifications to existing DNN architectures.
- Demonstrates significant improvements in robustness with minimal impact on performance.
Pseudo-Feature Padding: A Lightweight Defense Against False Data Injection in Power Grids
Summary
This paper addresses the vulnerability of Deep Neural Networks (DNNs) in Cyber-Physical Systems (CPS), particularly in the context of False Data Injection Attacks (FDIA) that can disrupt critical operations like state estimation in power grids. The authors propose a novel defense framework called Pseudo-Feature Padding, which introduces an additional input layer that pads input samples with pseudo-feature values derived from the statistical distribution of the input data. This method increases the input dimensionality in a randomized and data-aware manner, making it significantly harder for adversaries to generate effective attacks. The approach is lightweight, model-agnostic, and does not require changes to the core architecture of existing DNNs, facilitating easy deployment in real-world settings. The framework was evaluated using various IEEE test systems (14-bus, 30-bus, 118-bus, and 300-bus) for state estimation, demonstrating that it enhances model robustness against FDIA while maintaining performance integrity with negligible accuracy drop compared to baseline models.
Methodology
The proposed framework integrates an additional input layer that pads input samples with pseudo-feature values. These values are dynamically generated based on the statistical distribution of the input data, identified through tree-based models. The padding is randomized during inference, increasing data diversity and adversarial uncertainty, which complicates the generation of effective FDIA samples.
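The padding step can be sketched roughly as below. This is a simplified illustration under stated assumptions: per-feature Gaussian statistics stand in for the tree-based distribution modelling the paper describes, and the function names, shapes, and data are hypothetical.

```python
import numpy as np

def fit_feature_stats(train_x):
    """Summarize the training inputs with a per-feature mean and standard deviation."""
    return train_x.mean(axis=0), train_x.std(axis=0)

def pad_with_pseudo_features(x, stats, n_pad, rng):
    """Append n_pad randomized pseudo-features to every sample in the batch."""
    mean, std = stats
    # Each pseudo-feature mimics the statistics of a randomly chosen real feature.
    src = rng.integers(0, x.shape[1], size=n_pad)
    pseudo = rng.normal(mean[src], std[src], size=(x.shape[0], n_pad))
    return np.concatenate([x, pseudo], axis=1)

rng = np.random.default_rng(0)
train_x = rng.normal(size=(200, 6))   # stand-in for grid measurement vectors
stats = fit_feature_stats(train_x)
batch = train_x[:4]
padded = pad_with_pseudo_features(batch, stats, n_pad=3, rng=rng)
```

The real measurements pass through unchanged, while the appended columns are redrawn on every call, so an attacker crafting an injection cannot know the full input the network will see.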
Results
The evaluation of the proposed framework on the IEEE test systems showed that the pseudo-feature padding significantly improved the robustness of DNNs against FDIA. The method maintained the performance integrity of the models with a negligible drop in accuracy, outperforming conventional defense techniques that failed to mitigate sophisticated FDIA samples.
Implications
The lightweight and model-agnostic nature of the proposed defense framework makes it highly applicable in real-world CPS environments, where securing all sensors is impractical. This approach can enhance the security of critical infrastructure systems against adversarial attacks, ensuring more reliable operation.
Meta-Reinforcement Learning via Evolution for Multi-Objective Combinatorial Supply Chain Optimisation
Reinforcement Learning
Optimization
- MERLION combines population-based evolutionary search with gradient-based meta-learning for enhanced solution diversity.
- The framework maintains multiple meta-policies, allowing for better exploration of the Pareto front in complex supply chain scenarios.
- Empirical results show significant improvements in hypervolume and Pareto front approximation compared to traditional methods.
Meta-Reinforcement Learning via Evolution for Multi-Objective Combinatorial Supply Chain Optimisation
Summary
This paper introduces MERLION, a novel population-based Meta Multi-Objective Reinforcement Learning (Meta-MORL) framework designed for multi-objective combinatorial supply chain optimization. Traditional Meta-MORL approaches typically rely on a single shared meta-policy, which can limit solution diversity and exploration of the Pareto front in complex environments. MERLION addresses this limitation by maintaining a population of weight vectors, each associated with a distinct meta-policy trained through gradient-based meta-learning. The framework employs evolutionary strategies, including elitist selection, crossover, and mutation, to refine the population of weight vectors based on hypervolume and entropy contributions. The authors evaluate MERLION in a multi-objective supply chain context, focusing on conflicting economic, environmental, and social goals, and also test its generality on standard reinforcement learning problems. The results demonstrate that MERLION achieves more diverse and well-distributed Pareto front approximations, enhances cross-task adaptation, and increases hypervolume by up to 32% compared to existing Meta-MORL methods, while also attaining the lowest average Hausdorff distance among the compared approaches.
Methodology
MERLION employs a population-based approach where multiple weight vectors define scalarisation in the multi-objective space. Each weight vector is associated with a distinct meta-policy, which is trained using gradient adaptation. The population is refined through evolutionary operations guided by fitness scores that capture diversity and optimality, allowing for decentralized learning across different objective regions.
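One generation of the evolutionary refinement might look like the sketch below. It is an assumption-laden illustration: the fitness values are random placeholders for the hypervolume and entropy contributions, the blend crossover and Gaussian mutation are generic choices, and `evolve_weights` is a hypothetical name. Meta-policy training is omitted.

```python
import numpy as np

def evolve_weights(population, fitness, n_elite, rng, sigma=0.05):
    """Elitist selection, crossover, and mutation over scalarisation weight vectors.

    Requires n_elite >= 2 so that two distinct parents can be drawn.
    """
    order = np.argsort(fitness)[::-1]                 # best fitness first
    elites = population[order[:n_elite]]              # elites survive unchanged
    children = []
    while len(children) < len(population) - n_elite:
        i, j = rng.choice(n_elite, size=2, replace=False)
        alpha = rng.uniform()
        child = alpha * elites[i] + (1 - alpha) * elites[j]         # blend crossover
        child = np.abs(child + rng.normal(0, sigma, child.shape))   # Gaussian mutation
        children.append(child / child.sum())          # project back onto the simplex
    return np.vstack([elites] + children)

rng = np.random.default_rng(0)
population = rng.dirichlet(np.ones(3), size=6)        # six weight vectors, three objectives
fitness = rng.uniform(size=6)                         # placeholder fitness scores
new_population = evolve_weights(population, fitness, n_elite=2, rng=rng)
```

Keeping every weight vector non-negative and summing to one means each individual remains a valid scalarisation of the objectives after mutation.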
Results
The empirical evaluation of MERLION on three supply chain problems showed that it outperformed five benchmark methods, including traditional Meta-MORL and metaheuristic baselines. The proposed method achieved up to a 32% increase in hypervolume and the lowest average Hausdorff distance, indicating superior performance in approximating the Pareto front.
Implications
The findings suggest that MERLION can significantly enhance decision-making in complex supply chain environments, enabling more effective trade-offs among competing objectives. This approach could be applied in various industries facing multi-objective optimization challenges, particularly where rapid adaptation to changing conditions is critical.
FLFL: Federated Latent Factor Learning for Private Recovery of Spatio-Temporal Signals
Federated Learning
Time Series
Optimization
- FLFL enables accurate recovery of missing data in WSNs while preserving privacy.
- The model utilizes a federated learning framework that minimizes the need for raw data sharing.
- Incorporation of spatio-temporal correlations improves recovery accuracy.
- Extensive experiments show FLFL outperforms existing models in recovery tasks.
FLFL: Federated Latent Factor Learning for Private Recovery of Spatio-Temporal Signals
Summary
The paper presents FLFL, a novel federated latent factor learning model designed for the privacy-preserving recovery of spatio-temporal signals in wireless sensor networks (WSNs). Traditional latent factor learning approaches require centralized data storage, which raises privacy concerns among data owners. FLFL addresses this by implementing a sensor-level federated learning framework that allows individual sensors to upload only gradient information instead of raw data. This method not only maintains data privacy but also incorporates spatio-temporal correlations as a regularization constraint to enhance recovery accuracy. The authors conducted extensive experiments on four real-world WSN datasets, demonstrating that FLFL significantly outperforms eight state-of-the-art models in terms of recovery accuracy while ensuring privacy. The paper also includes theoretical analyses and proofs to confirm that the gradient information shared does not compromise data privacy.
Methodology
The FLFL model employs a federated learning framework where each sensor node uploads only gradient information for training, thus preserving privacy. It integrates spatio-temporal correlations as a regularization constraint to enhance the accuracy of missing data recovery. The methodology includes theoretical analyses and algorithm design to support the proposed model.
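The gradient-only exchange can be illustrated with a small matrix-factorization sketch. Everything here is an assumption for illustration: each sensor keeps its raw readings and a private latent factor, the server holds shared temporal factors, and the spatio-temporal regularizer and any privacy analysis from the paper are omitted. Function names are hypothetical.

```python
import numpy as np

def sensor_step(x_row, mask_row, s_i, T, lr):
    """Runs on one sensor: update its private factor and return a gradient for T."""
    err = mask_row * (T @ s_i - x_row)   # residual on observed entries only
    grad_s = T.T @ err                   # gradient w.r.t. the private sensor factor
    grad_T = np.outer(err, s_i)          # gradient w.r.t. the shared temporal factors
    return s_i - lr * grad_s, grad_T     # only grad_T leaves the sensor

def federated_round(X, mask, S, T, lr):
    """Server side: sum the uploaded gradients and update the shared factors."""
    grads = []
    for i in range(X.shape[0]):
        S[i], g = sensor_step(X[i], mask[i], S[i], T, lr)
        grads.append(g)
    return T - lr * np.sum(grads, axis=0)

def observed_loss(X, mask, S, T):
    return 0.5 * float(np.sum(mask * (S @ T.T - X) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2)) @ rng.normal(size=(20, 2)).T   # 5 sensors, 20 time steps
mask = (rng.uniform(size=X.shape) > 0.3).astype(float)     # 1 = observed, 0 = missing
S, T = 0.1 * rng.normal(size=(5, 2)), 0.1 * rng.normal(size=(20, 2))
loss_before = observed_loss(X, mask, S, T)
for _ in range(200):
    T = federated_round(X, mask, S, T, lr=0.01)
loss_after = observed_loss(X, mask, S, T)
```

After training, the missing entries would be filled in from `S @ T.T`. The loop simulates all sensors in one process, and in a real deployment `sensor_step` would run on each device.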
Results
The experimental results indicate that FLFL achieves significantly higher recovery accuracy compared to eight existing federated and non-federated signal recovery models across four real-world WSN datasets. The model effectively balances privacy preservation with data recovery performance.
Implications
FLFL has potential applications in various fields that utilize WSNs, such as smart cities, industrial monitoring, and environmental sensing, where data privacy is crucial. The model can facilitate more secure data sharing and analysis in privacy-sensitive environments.
DCD-PFN: A Decoupling-Aware Foundation Model for Causal Discovery
Graph Learning
Theory
Efficient ML
- DCD-PFN is specifically designed for explicit structural causal discovery, enabling efficient causal graph reconstruction.
- The model employs a decoupling-based local-to-global approach, grounded in theoretical frameworks without restrictive assumptions.
- DCD-PFN demonstrates strong robustness and zero-shot generalization capabilities across various datasets.
- The model addresses computational bottlenecks associated with traditional causal discovery methods.
DCD-PFN: A Decoupling-Aware Foundation Model for Causal Discovery
Summary
The paper introduces DCD-PFN, a novel foundation model designed for causal discovery, addressing the limitations of traditional algorithms that struggle with complex, non-linear, and noisy data. Traditional causal discovery methods often face computational challenges and are limited by their assumptions about data structures. DCD-PFN leverages a decoupling-based paradigm to focus on local causal discovery, learning sample-wise decoupling weights through pre-training on diverse synthetic Structural Causal Models (SCMs). This approach enables efficient identification of Markov boundaries (MB) and facilitates the reconstruction of global causal graphs. The model's architecture allows for robust zero-shot generalization, making it applicable to both synthetic and real-world datasets. The experiments conducted demonstrate that DCD-PFN achieves strong performance in causal discovery tasks, showcasing its potential as a scalable and efficient solution for understanding complex data-generating mechanisms.
Methodology
DCD-PFN utilizes a decoupling-based paradigm to focus on local causal discovery rather than global graph reconstruction. It learns sample-wise decoupling weights through pre-training on synthetic Structural Causal Models (SCMs) and identifies Markov boundaries efficiently. The model employs parallelized local discovery to reconstruct global causal graphs while adhering to theoretical principles of decoupling-based causal discovery.
Results
Experiments reveal that DCD-PFN achieves robust performance in causal discovery tasks, demonstrating strong zero-shot generalization capabilities on both synthetic and real-world datasets. The model outperforms traditional methods, particularly in scenarios characterized by complex relationships and noise.
Implications
DCD-PFN has significant implications for fields requiring causal inference, such as epidemiology, economics, and social sciences, where understanding complex data-generating processes is crucial. Its efficiency and robustness make it a valuable tool for researchers and practitioners in causal discovery.
What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis
NLP
Large Language Models
Theory
- Latent Chain-of-Thought models face challenges due to weak learning signals from outcome supervision.
- The dual collapse phenomenon involves gradient attenuation and representational drift in latent spaces.
- Process supervision can be effectively decomposed into Trajectory and Space Supervision.
- Generative reconstruction is more effective than geometric compression for preserving information capacity.
What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis
Summary
This paper investigates the challenges of robust latent reasoning in Latent Chain-of-Thought (CoT) models, which represent reasoning through continuous hidden states instead of verbose discrete sequences. The authors identify a dual collapse phenomenon where gradient attenuation and representational drift hinder effective learning. They propose a framework that decomposes process supervision into two dimensions: Trajectory Supervision, which provides dense stepwise reasoning signals, and Space Supervision, which maintains the semantic structure of the latent space. The authors introduce the Unified Latent Probe (ULP) to measure the mutual information between latent trajectories and reasoning steps. Their empirical findings reveal a strong correlation between reasoning accuracy and information fidelity in the latent chain, suggesting a shift from geometric imitation to mutual information maximization for improved latent reasoning supervision.
Methodology
The authors conducted an information-theoretic analysis of Latent CoT, examining the effects of different supervision strategies. They introduced the Unified Latent Probe (ULP) to quantify mutual information and performed empirical experiments to assess the impact of Trajectory and Space Supervision on training stability and reasoning accuracy.
Results
The experiments demonstrated that process supervision significantly stabilizes training, with increased gradient magnitudes indicating effective adaptation. The study found that while geometric compression can collapse the reasoning manifold, generative reconstruction better preserves the information capacity, leading to improved reasoning performance.
Implications
The findings suggest a new framework for supervising latent reasoning in machine learning models, emphasizing the importance of mutual information maximization. This could lead to more effective training strategies for models that rely on latent reasoning, potentially enhancing their performance on complex reasoning tasks.
Breaking chains with trees: Deep learning with $O(\log N)$ parallel time complexity
Efficient ML
Computer Vision
NLP
- HBLL allows training of deep neural networks without full backpropagation, improving scalability and parallelism.
- The framework achieves O(log N) parallel time complexity, significantly enhancing computational efficiency.
- HBLL demonstrates competitive performance on challenging benchmarks in vision classification and language modeling.
- The method supports flexible inference by defining subnetworks based on hierarchical paths.
Summary
This paper introduces Hierarchical Block-Local Learning (HBLL), a novel framework designed to train deep neural networks without the need for full end-to-end backpropagation. Traditional backpropagation suffers from limitations such as locking, which restricts parallel updates across layers, and the weight transport problem, which necessitates symmetric pathways for gradient computation. HBLL addresses these issues by decomposing networks into hierarchically organized blocks that utilize local learning objectives derived from variational principles. This approach allows for effective information propagation while eliminating global gradient dependencies, achieving a parallel time complexity of O(log N), where N is the number of layers. The authors demonstrate the efficacy of HBLL on various vision and language modeling tasks, achieving competitive results on benchmarks like CIFAR-10, CIFAR-100, and WikiText-103. Additionally, they extend HBLL to recurrent neural networks, showcasing its versatility in sequential model training. The framework not only enhances computational efficiency but also enables flexible inference by defining a family of subnetworks corresponding to different hierarchical paths.
Methodology
The authors propose a framework called Hierarchical Block-Local Learning (HBLL) that decomposes deep neural networks into hierarchically linked blocks. Each block is trained using local learning objectives derived from a variational formulation, allowing for distributed training without global error propagation. The framework employs invertible transformations to facilitate information flow across the hierarchy, thus avoiding the interdependencies typical of traditional backpropagation.
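As a rough illustration of where a logarithmic bound of this kind can come from, the sketch below counts the sequential stages needed when blocks are combined pairwise in a balanced binary hierarchy. The balanced binary layout and the name `sequential_stages` are assumptions made for this example; the summary does not say how HBLL arranges its hierarchy.

```python
import math


def sequential_stages(num_blocks: int) -> int:
    """Count the levels of a balanced binary hierarchy over num_blocks leaves.

    Hypothetical helper: assumes blocks are merged pairwise, level by level,
    and that all merges within one level can run in parallel.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    stages = 0
    width = num_blocks
    while width > 1:
        width = math.ceil(width / 2)  # each level halves the number of groups
        stages += 1
    return stages


for n in (8, 64, 1024):
    # A purely sequential chain needs n steps; the hierarchy needs about log2(n).
    print(n, sequential_stages(n))
```

Under these assumptions, 1024 blocks need 10 sequential stages instead of 1024 sequential steps.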
Results
HBLL was evaluated on several tasks, including vision classification on CIFAR-10 and CIFAR-100, as well as autoregressive language modeling on WikiText-103. The results indicate that HBLL achieves competitive performance compared to traditional backpropagation methods while significantly reducing computational overhead and enabling parallel training.
Implications
By removing the need for full backpropagation, HBLL could make training large-scale neural networks more efficient, lowering computational cost and energy consumption and making it a candidate for resource-constrained environments. The flexible inference over hierarchical subnetworks could also improve model adaptability across applications.
Bypassing Minimization Bias: A Shift-Invariant Variance Estimator for Off-Equilibrium Local Learning Coefficients
Theory
Optimization
- Introduction of the Shift-Invariant Variance Estimator (SIVE) to bypass minimization bias in LLC estimation.
- SIVE structurally eliminates the need for the local minimum by using variance and a noise-debiasing correction.
- Controlled experiments validate SIVE's effectiveness in recovering geometric signals in off-equilibrium settings.
- SIVE is scalable to deep neural networks, enabling real-time tracking of structural phase transitions during training.
Summary
This paper addresses a critical limitation in the application of Singular Learning Theory (SLT) for estimating Local Learning Coefficients (LLC) during off-equilibrium training phases of neural networks. Traditional LLC estimators rely on knowing the local minimum of the loss landscape, which is often inaccessible during training. The authors introduce the Shift-Invariant Variance Estimator (SIVE), which circumvents the need for this local minimum by utilizing a variance-based approach. SIVE employs the variance operator to eliminate the unknown additive baseline that typically introduces minimization bias. The authors derive a correction based on the Law of Total Variance to separate true geometric fluctuations from noise in mini-batch evaluations. Through controlled experiments on toy models, SIVE demonstrates its ability to recover expected geometric signals where conventional mean estimators fail. When applied to deep neural networks, SIVE serves as a robust diagnostic tool for tracking structural phase transitions throughout the training process, providing insights into the dynamics of the loss landscape.
Methodology
The methodology involves formulating LLC estimation as an off-equilibrium local probe using the variance operator. The authors derive an explicit correction for mini-batch noise using the Law of Total Variance, allowing SIVE to decouple geometric fluctuations from stochastic evaluation noise. The approach is validated through controlled experiments on toy models and applied to deep neural networks.
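The two ingredients described above can be checked numerically on synthetic data. The sketch below is not the SIVE estimator itself: it only shows that variance ignores an unknown additive baseline, and that the Law of Total Variance lets independent evaluation noise be subtracted out. The Gaussian losses, the baseline value, and the known noise level are all assumptions made for the example; in practice the noise variance would have to be estimated.

```python
import random
import statistics

random.seed(0)

# Hypothetical "true" losses at sampled parameters: fluctuations on top of
# an unknown baseline (standing in for the inaccessible local minimum).
unknown_baseline = 3.7
fluctuations = [random.gauss(0.0, 0.5) for _ in range(20000)]
true_losses = [unknown_baseline + f for f in fluctuations]

# Shift invariance: Var(L + c) = Var(L), so the baseline drops out.
var_with_baseline = statistics.pvariance(true_losses)
var_without_baseline = statistics.pvariance(fluctuations)

# Mini-batch evaluation adds independent noise (variance assumed known here).
noise_std = 0.3
noisy_losses = [loss + random.gauss(0.0, noise_std) for loss in true_losses]

# Law of Total Variance: Var(noisy) = Var(true) + E[noise variance],
# so subtracting the noise variance recovers the fluctuation term.
debiased_variance = statistics.pvariance(noisy_losses) - noise_std ** 2

print(var_with_baseline, var_without_baseline, debiased_variance)
```

A mean-based estimator would need the baseline value explicitly, which is the dependence the variance-based formulation avoids.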
Results
SIVE successfully recovers the expected finite-temperature geometric signals in scenarios where traditional mean estimators fail. It also demonstrates the capability to track structural phase transitions in deep neural networks throughout the training process.
Implications
The development of SIVE provides a new tool for researchers to analyze the geometry of loss landscapes in neural networks, particularly during off-equilibrium training phases. This can enhance understanding of training dynamics and improve diagnostic capabilities in machine learning.
PG-MAP: Joint MAP Optimization for Inference-Time Alignment of Diffusion and Flow-Matching Models
Generative Models
Optimization
Multimodal
- PG-MAP is the first framework to jointly optimize conditioning and latent variables during inference-time alignment.
- The framework employs a forward-consistency coupling, allowing coordinated updates across modalities.
- PG-MAP shows consistent improvements in alignment metrics across different diffusion models.
- Human evaluations indicate a strong preference for outputs generated using PG-MAP compared to existing baselines.
Summary
The paper introduces PG-MAP, a novel framework for inference-time alignment of pretrained text-to-image models that addresses the limitations of existing methods which typically optimize along a single control axis. PG-MAP formulates the alignment problem as a trajectory-level Gibbs-MAP optimization, allowing for simultaneous updates of conditioning and latent variables during the denoising process. This joint optimization is guided by a frozen preference reward and is compatible with both diffusion and flow-matching models. The framework enhances alignment metrics such as PickScore and Aesthetic across various diffusion backbones and demonstrates significant performance improvements when combined with tuned classifier-free guidance. For flow-matching models, PG-MAP achieves high PickScore and human preference rates, confirming its effectiveness. The analysis reveals that the importance of conditioning and latent optimization varies with prompt types, suggesting further optimization opportunities.
Methodology
PG-MAP utilizes a training-free approach that recasts the denoising process as a proximal MAP problem, optimizing both conditioning and latent variables at each denoising step. It employs a schedule-adaptive trust region and a step-dependent active set to refine the variables dynamically, ensuring that updates are coordinated rather than additive. The framework is adaptable to both diffusion and flow-matching models, with specific adaptations for each transport type.
Results
PG-MAP achieves significant improvements in alignment metrics, with reported PickScore of 91.9% and HPS win rates of 75.7% against static baselines for flow-matching models. In human evaluations, PG-MAP garnered 60-67% preference over strong baselines, indicating its effectiveness in generating higher-quality images.
Implications
The PG-MAP framework has the potential to enhance the performance of text-to-image generation models, making them more effective in producing coherent and aesthetically pleasing images. Its ability to adaptively optimize during inference could lead to advancements in various applications involving generative models, particularly in creative fields such as art and design.
Comparing Linear Probes with Mahalanobis Cosine Similarity
Interpretability
Theory
Large Language Models
- MCS provides a near-perfect linear prediction of OOD AUROC across multiple models and datasets.
- Theoretical proof establishes the linear relationship between MCS and OOD AUROC under specific conditions.
- MCS outperforms ECS significantly in terms of correlation with probe performance.
- The study identifies failure modes for the linearity of MCS and OOD AUROC, enhancing understanding of probe generalization.
Summary
This paper investigates the relationship between linear probes and Mahalanobis cosine similarity (MCS), proposing MCS as a more effective measure for comparing linear probes in interpretability research. The authors extend previous findings that MCS correlates strongly with out-of-distribution (OOD) area under the receiver operating characteristic curve (AUROC), demonstrating this relationship across various models, layers, and datasets. They provide a theoretical framework explaining the linear relationship between MCS and OOD AUROC, showing that both metrics are sigmoid-shaped functions of the probe's signal-to-noise ratio (SNR). The study also identifies conditions under which this linearity may fail, verified through empirical tests. The findings suggest that MCS is a theoretically grounded and empirically robust alternative to traditional Euclidean cosine similarity (ECS) for evaluating linear probes.
Methodology
The authors employed logistic regression probes trained on in-distribution (ID) and out-of-distribution (OOD) datasets across various models (Llama-70B, Llama-8B, Qwen-7B) and layers. They calculated MCS using the covariance of the OOD data and compared it to the traditional ECS. The relationship between MCS and OOD AUROC was analyzed through empirical validation and theoretical derivation.
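A minimal sketch of the comparison between the two similarity measures follows. It assumes MCS is the cosine taken under the inner product induced by the data covariance, that is, u^T Σ v / sqrt(u^T Σ u · v^T Σ v); the paper's exact definition may differ. The function names and the toy covariance are hypothetical.

```python
import math


def quadratic_form(u, cov, v):
    """Compute u^T cov v for plain Python lists."""
    return sum(u[i] * cov[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))


def euclidean_cosine(u, v):
    """Standard cosine similarity (ECS)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))


def mahalanobis_cosine(u, v, cov):
    """Covariance-weighted cosine (assumed form of MCS)."""
    return quadratic_form(u, cov, v) / math.sqrt(
        quadratic_form(u, cov, u) * quadratic_form(v, cov, v))


# Toy data covariance: almost all variance lies along the first axis.
cov = [[1.0, 0.0],
       [0.0, 0.01]]
probe_a = [1.0, 1.0]
probe_b = [1.0, -1.0]

print(euclidean_cosine(probe_a, probe_b))         # orthogonal under ECS
print(mahalanobis_cosine(probe_a, probe_b, cov))  # nearly aligned under MCS
```

The two probes disagree only along a direction in which the data barely varies, so they produce almost the same projections on this data even though their weight vectors are orthogonal.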
Results
The study found that MCS consistently predicts OOD AUROC with R² values exceeding 0.93 across different models, layers, and datasets, while ECS showed significantly lower R² values (as low as 0.06). The theoretical framework confirmed that under balanced classes and Gaussian projections, the relationship between MCS and OOD AUROC is linear, with empirical verification of conditions leading to deviations from this linearity.
Implications
The findings suggest that MCS can enhance the interpretability of machine learning models by providing a more reliable metric for comparing linear probes. This could lead to better understanding of model generalization and performance, particularly in applications requiring robust interpretability across varying datasets.
Alzheimer's Disease Diagnosis using a Multimodal Approach with 3D MRI and PET
Multimodal
- Introduces a novel multimodal approach combining 3D MRI and PET for AD diagnosis.
- Utilizes three fusion strategies and a sparsely gated Mixture-of-Experts classifier.
- Achieves high classification accuracies across multiple diagnostic tasks.
- Implements Grad-CAM for enhanced model interpretability.
Summary
This paper addresses the critical need for early diagnosis of Alzheimer's Disease (AD) by leveraging multimodal neuroimaging data, specifically 3D MRI and PET scans. The authors highlight the limitations of existing models that typically use static concatenation of MRI and PET data, which can hinder robustness and computational efficiency. To overcome these challenges, the study introduces a novel approach that combines 3D convolutional feature extractors with three fusion strategies: concatenation, Gated Multimodal Unit (GMU), and gated self-attention. Additionally, a sparsely gated Mixture-of-Experts (MoE) classifier is employed to dynamically route inputs to the most relevant experts, enhancing model adaptability to patient heterogeneity. The model's interpretability is further improved through the use of Grad-CAM for visualizing disease-related brain regions. The methodology is tested across three binary classification tasks: Normal Cognition (NC) vs. Mild Cognitive Impairment (MCI), MCI vs. AD, and NC vs. AD. The results demonstrate that the GMU achieves accuracies of 80.46% for NC vs. MCI and 95.47% for NC vs. AD, while gated self-attention reaches 82.08% for MCI vs. AD. Ablation studies confirm the importance of the MoE in maintaining high accuracy across all tasks, underscoring the effectiveness of input-adaptive multimodal modeling in AD diagnosis.
Methodology
The study employs a series of preprocessing steps on MRI and PET images, followed by feature extraction using a 3D convolutional neural network (CNN). Three fusion techniques (concatenation, GMU, and gated self-attention) are applied to capture inter- and intra-modal interactions. A Mixture-of-Experts model is integrated for dynamic routing of inputs, and Grad-CAM is used for visualizing decision-making processes.
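For readers unfamiliar with the Gated Multimodal Unit, the sketch below shows its usual two-modality form: each modality is passed through a tanh projection, and a sigmoid gate computed from both inputs decides, per dimension, how much of each to keep. Bias terms are omitted, and the feature vectors and weight matrices are toy values chosen for the example, not the paper's configuration.

```python
import math


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gmu_fuse(x_mri, x_pet, w_mri, w_pet, w_gate):
    """Fuse two feature vectors with a gated multimodal unit (no biases)."""
    h_mri = [math.tanh(a) for a in matvec(w_mri, x_mri)]
    h_pet = [math.tanh(a) for a in matvec(w_pet, x_pet)]
    # The gate sees both modalities concatenated.
    z = [sigmoid(a) for a in matvec(w_gate, x_mri + x_pet)]
    return [zi * hm + (1.0 - zi) * hp for zi, hm, hp in zip(z, h_mri, h_pet)]


# Toy inputs: 2-dimensional features per modality, identity projections,
# and a zero gate matrix so every gate value is exactly 0.5.
x_mri, x_pet = [1.0, 0.0], [0.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_gate = [[0.0] * 4, [0.0] * 4]

fused = gmu_fuse(x_mri, x_pet, identity, identity, zero_gate)
print(fused)
```

Unlike static concatenation, the gate values depend on the input, so the mix between MRI and PET features can change from one patient to the next.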
Results
The proposed model achieves accuracies of 80.46% for NC vs. MCI, 95.47% for NC vs. AD, and 82.08% for MCI vs. AD. Ablation studies indicate that removing the MoE consistently degrades performance across all tasks, highlighting its significance in the model's architecture.
Implications
The findings suggest that integrating multimodal neuroimaging data with adaptive modeling techniques can significantly enhance the early diagnosis of Alzheimer's Disease, potentially leading to better patient outcomes through timely interventions.
DiT-Reward: Generative Representations for Text-to-Image Reward Modeling
Generative Models
Reinforcement Learning
Multimodal
- DiT-Reward effectively repurposes a pretrained text-to-image DiT as a reward model.
- The method outperforms existing models like HPSv3 on multiple preference benchmarks.
- A lightweight head can extract meaningful preference predictions even when the generative backbone is frozen.
- Reward performance benefits from representations in the middle-to-late layers of the transformer.
Summary
The paper introduces DiT-Reward, a novel approach that repurposes a pretrained text-to-image Diffusion Transformer (DiT) as a reward model for evaluating generated images. The authors explore whether the representations learned during image generation can also be utilized for reward prediction, thereby enhancing the evaluation of generated outputs. DiT-Reward processes near-clean image latents and aggregates text-conditioned image representations across transformer layers. The method demonstrates superior performance compared to existing models, specifically HPSv3, on multiple preference benchmarks, achieving notable scores of 85.6% on HPDv2 and 77.6% on HPDv3. The findings indicate that even with a frozen generative backbone, a lightweight learned head can effectively predict preferences. The study also reveals that the most effective reward performance is derived from the middle-to-late layers of the transformer, and that increasing the generative backbone's capacity consistently improves results. Additionally, when applied to optimize Stable Diffusion 3.5 Large, DiT-Reward shows significant improvements in realism and achieves a 1.65× speedup in inference over HPSv3, while maintaining comparable memory usage. Overall, the research highlights the potential of pretrained generative models in reward modeling and policy optimization.
Methodology
DiT-Reward converts a pretrained text-to-image Diffusion Transformer into a reward model by encoding input images into a latent space, applying near-clean perturbations, and extracting text-conditioned image token representations across transformer layers. A lightweight MLP is used to map pooled features to scalar rewards. The model is evaluated on preference benchmarks and compared against existing reward models.
Results
DiT-Reward outperforms HPSv3 on all evaluated benchmarks, achieving 85.6% on HPDv2 and 77.6% on HPDv3. The method shows that even with a frozen backbone, it can still provide meaningful preference predictions. Additionally, it demonstrates a 1.65× speedup in inference time compared to HPSv3 while maintaining similar peak memory usage.
Implications
The findings suggest that pretrained generative models can be effectively utilized for reward modeling, potentially improving the evaluation and optimization of generative models in text-to-image tasks. This approach may lead to more efficient and effective reinforcement learning strategies in multimodal contexts.
When Calibration Fails the Vulnerable Hospital: Federated Conformal Risk Control via Risk-Curve Shrinkage
Federated Learning
Computer Vision
Theory
- Quantifies the marginal-conditional coverage gap in federated CRC for medical segmentation.
- Proposes a shrinkage-based federated CRC protocol that enhances prediction set efficiency.
- Demonstrates that naive pooling of calibration scores can lead to critical failures in individual site coverage.
- Identifies the necessity of finite-sample correction terms to avoid excessive violations.
Summary
This paper addresses the challenges of deploying conformal risk control (CRC) in federated learning settings, particularly in medical segmentation tasks across multiple hospitals. The author highlights that traditional methods of pooling calibration scores can lead to significant violations of coverage guarantees at individual institutions, despite appearing well-calibrated on average. Using a dataset of 1,251 brain tumor volumes from 20 institutions, the study quantifies the marginal-conditional coverage gap, revealing that 40% of hospitals exceed the target false-negative rate. The paper proposes a novel shrinkage-based federated CRC protocol that allows each site to transmit only its empirical risk curve to a central server, which then computes a shrinkage-regularized threshold for each site. This approach effectively balances coverage and prediction set efficiency, achieving a significant reduction in violations while maintaining the integrity of the CRC guarantees. The findings underscore the importance of considering site-specific characteristics in federated learning applications in healthcare.
Methodology
The study employs a shrinkage-based federated CRC protocol where each site computes its local empirical risk curve and transmits it to a central server. The server then computes a shrinkage-regularized threshold for each site, allowing for a trade-off between worst-case coverage and prediction set efficiency. The method includes sensitivity analysis to identify optimal hyperparameters.
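The protocol described above can be sketched in a few lines. The threshold rule below is the standard conformal risk control selection with its finite-sample correction, (n / (n + 1)) * risk + B / (n + 1) <= alpha. The linear blend toward a pooled curve is only one plausible form of shrinkage and is an assumption here, as are the function names, the fixed weight, and the toy risk curves.

```python
def shrink_curve(site_curve, pooled_curve, weight):
    """Blend a site's empirical risk curve toward the pooled curve.

    Hypothetical shrinkage rule: weight = 0 keeps the site curve,
    weight = 1 uses the pooled curve.
    """
    return [(1.0 - weight) * s + weight * p
            for s, p in zip(site_curve, pooled_curve)]


def crc_threshold(lambdas, risk_curve, n, alpha, loss_bound=1.0):
    """Smallest lambda whose finite-sample-corrected risk is at most alpha.

    Assumes risk is non-increasing in lambda and lambdas are sorted ascending.
    """
    for lam, risk in zip(lambdas, risk_curve):
        if (n / (n + 1)) * risk + loss_bound / (n + 1) <= alpha:
            return lam
    return lambdas[-1]  # fall back to the most conservative threshold


# Toy example: a small site (n = 30 calibration cases) with a noisy curve.
lambdas = [0.0, 0.25, 0.5, 0.75, 1.0]
site_curve = [0.40, 0.25, 0.15, 0.08, 0.0]
pooled_curve = [0.30, 0.15, 0.08, 0.03, 0.0]

site_only = crc_threshold(lambdas, site_curve, n=30, alpha=0.10)
shrunk = crc_threshold(lambdas, shrink_curve(site_curve, pooled_curve, 0.5),
                       n=30, alpha=0.10)
print(site_only, shrunk)
```

Only the risk curves cross the network in this sketch, which matches the communication pattern the summary describes. How the shrinkage weight is chosen per site is left open.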
Results
The proposed method significantly reduces the number of coverage violations from 8 out of 20 institutions to only 2.7 violations at a 2.0× stretch, demonstrating improved efficiency in prediction sets while maintaining coverage guarantees. The study also shows that direct Lagrangian optimization fails to protect vulnerable hospitals, emphasizing the importance of the finite-sample correction term.
Implications
The findings have significant implications for the deployment of machine learning models in healthcare, particularly in ensuring reliable segmentation outputs across diverse hospital settings. The proposed method can enhance the safety and effectiveness of clinical decision-making by providing better-calibrated predictions.
One-Step Flow Matching for Generative Modeling of Path-Dependent Physical Fields
Generative Models
- Introduction of a novel flow matching model for generating path-dependent stress fields.
- Utilization of a transformer backbone for improved long-range dependency modeling.
- Significant computational efficiency improvements over traditional finite element methods.
- Ability to generate high-resolution fields in a single step without extensive sampling.
Summary
This paper addresses the computational challenges associated with simulating path-dependent physical fields, particularly in the context of plastic stress fields. Traditional methods, such as finite element analysis (FEM), are often computationally expensive and inefficient for complex geometries and constitutive models. The authors propose a novel flow matching (FM) model based on a transformer architecture, which operates within the latent space of a variational autoencoder (VAE). This model formulates the generation of plastic stress fields as a video synthesis task, allowing for the direct generation of stress fields across all time steps in a single step. The authors introduce a non-Gaussian source distribution to minimize crossings among conditional transport paths during training, enhancing the model's performance. Additionally, token-level loading embeddings and auxiliary networks are incorporated to further improve simulation accuracy. The results indicate that the proposed model can generate high-resolution path-dependent fields efficiently, achieving a speedup of 6 to 7 times over traditional FEM on CPUs and approximately two orders of magnitude on consumer-grade GPUs, even with limited training data.
Methodology
The authors developed a flow matching model using a transformer architecture, operating in the latent space of a VAE. They formulated the generation of plastic stress fields as a video synthesis task, employing a non-Gaussian source distribution to optimize training and reduce path crossings. Token-level loading embeddings and auxiliary networks were also introduced to improve performance.
Results
The proposed model demonstrated the capability to accurately generate high-resolution path-dependent stress fields with a significant reduction in computational cost, achieving speedups of 6 to 7 times on CPUs and nearly two orders of magnitude on consumer-grade GPUs compared to traditional FEM, even with a limited dataset.
Implications
The findings suggest that the proposed flow matching model could offer a faster alternative to traditional methods for simulating path-dependent physical fields. This has potential applications in mechanical, aerospace, and civil engineering, where accurate and rapid simulations are critical.
Fisher-Geometric Sharpness and the Implicit Bias of SGD toward Flat Minima
Theory
Optimization
- Introduces Riemannian sharpness as an invariant measure of flatness under reparametrization.
- Establishes a connection between SGD's implicit bias and Riemannian flat minima through a derived SDE.
- Demonstrates a PAC-Bayes generalization bound explicitly controlled by Riemannian sharpness.
- Empirical validation shows Riemannian sharpness better predicts generalization than Euclidean sharpness.
Summary
This paper addresses the widely accepted notion that stochastic gradient descent (SGD) favors flat minima, which are believed to generalize better in deep learning contexts. The authors critique existing measures of flatness, such as the trace or maximum eigenvalue of the loss Hessian, for lacking invariance under reparametrizations that preserve the network function. To resolve this, they introduce a framework based on Riemannian geometry using the Fisher Information Matrix (FIM) to define Riemannian sharpness, which is invariant under such reparametrizations. The study formalizes the gradient noise of mini-batch SGD as having a covariance structure proportional to the FIM, deriving the stationary distribution of the stochastic differential equation (SDE) and demonstrating that probability mass is concentrated at Riemannian-flat minima. A PAC-Bayes generalization bound is established, linking Riemannian sharpness to test performance. Empirical results on MNIST and CIFAR-10 validate that Riemannian sharpness correlates with generalization more reliably than traditional Euclidean sharpness. This work unifies invariant sharpness with SGD's implicit bias and generalization bounds, providing a robust theoretical foundation for understanding why flat minima generalize well.
Methodology
The authors define Riemannian sharpness using the Fisher Information Matrix (FIM) to ensure invariance under reparametrization. They formalize the gradient noise of mini-batch SGD and derive the stationary distribution of the resulting stochastic differential equation (SDE). They also establish a PAC-Bayes generalization bound linked to Riemannian sharpness.
Results
The study proves that mini-batch SGD assigns exponentially greater probability mass to Riemannian-flat minima. The empirical results on MNIST and CIFAR-10 confirm that Riemannian sharpness correlates with generalization performance, outperforming traditional Euclidean sharpness metrics.
Implications
This work provides a more robust theoretical framework for understanding the behavior of SGD in deep learning, potentially guiding the design of optimization algorithms that favor flat minima for improved generalization. It also opens avenues for further research into invariant measures of model performance.
VLA-FAIL: Efficient Task Failure Detection for Finetuned Vision-Language-Action Models
Robotics
Efficient ML
Multimodal
- VLA-FAIL is a lightweight framework for detecting task failures in Vision-Language-Action models.
- It combines two novel detection methods: LLMD for out-of-distribution state detection and ACC for action consistency monitoring.
- The framework requires no failure data and incurs minimal computational overhead.
- AUCPDT is introduced as a new metric to evaluate detection accuracy and latency.
Summary
The paper introduces VLA-FAIL, a novel framework designed for efficient task failure detection in finetuned Vision-Language-Action (VLA) models, which are known for their state-of-the-art performance in robotic manipulation tasks. Despite their capabilities, VLAs can exhibit unpredictable behavior in out-of-distribution scenarios, making runtime failure detection crucial for safe deployment. Existing methods often rely on computationally expensive action sampling or require failure data, which can be impractical. VLA-FAIL addresses these challenges by combining two lightweight failure detection techniques: Last-Layer Mahalanobis Distance (LLMD) and Action Chunk Consistency (ACC). LLMD measures deviations in last-layer features from the training data to detect out-of-distribution states, while ACC assesses the consistency of action chunks over time to identify erratic behavior. The framework is designed to operate with minimal computational overhead and does not require access to failure data. The authors also introduce a new evaluation metric, AUCPDT, which captures the trade-off between detection accuracy and latency. Through extensive experiments in both real-world and simulated environments, VLA-FAIL demonstrates robust performance, often surpassing more complex baseline methods in early and reliable failure detection across various tasks.
Methodology
The methodology involves two main components: Last-Layer Mahalanobis Distance (LLMD), which detects out-of-distribution states by analyzing token-wise deviations in last-layer features, and Action Chunk Consistency (ACC), which evaluates the consistency of overlapping action chunks in a receding-horizon control framework. The combination of these two methods allows for effective monitoring of task execution without the need for failure data.
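The idea behind Action Chunk Consistency can be sketched as follows. In receding-horizon control the policy predicts a chunk of future actions, executes the first few, and then predicts again, so consecutive chunks overlap in time. The score below is the mean Euclidean distance over that overlap. The scoring rule, the function name, and the toy 2-D actions are assumptions made for the example; the paper's exact formulation may differ.

```python
import math


def chunk_inconsistency(prev_chunk, new_chunk, executed):
    """Mean distance between the overlapping parts of two action chunks.

    prev_chunk, new_chunk: lists of action vectors of equal dimension.
    executed: how many actions of prev_chunk ran before re-planning.
    """
    overlap_prev = prev_chunk[executed:]
    overlap_new = new_chunk[:len(overlap_prev)]
    if not overlap_prev or len(overlap_new) != len(overlap_prev):
        raise ValueError("chunks do not overlap")
    distances = [math.dist(a, b) for a, b in zip(overlap_prev, overlap_new)]
    return sum(distances) / len(distances)


# Previous plan: move steadily along the x axis.
previous = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]

# After executing 2 actions, a consistent policy continues the same plan,
# while an erratic one proposes something different for the same time steps.
consistent = [[2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
erratic = [[2.0, 3.0], [3.0, 4.0], [4.0, 0.0], [5.0, 0.0]]

print(chunk_inconsistency(previous, consistent, executed=2))
print(chunk_inconsistency(previous, erratic, executed=2))
```

A monitor built this way needs no extra forward passes, since it reuses chunks the policy already produces. That fits the minimal-overhead claim in the summary.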
Results
The results indicate that VLA-FAIL effectively captures complementary failure modes, leading to reliable and early detection of task failures across six diverse manipulation tasks. The framework frequently outperforms significantly more expensive baseline methods, demonstrating its efficiency and robustness.
Implications
The implications of this work extend to the safe deployment of VLA models in real-world robotic applications, where early detection of failures can facilitate human intervention and improve overall system reliability. The lightweight nature of VLA-FAIL makes it suitable for real-time applications in robotics.
The Cost Geometry of Belief: finite-resource inference under noisy observation
Theory
- Introduces a cost geometry for beliefs based on optimal transport and Fisher information.
- Establishes that certainty is an unattainable boundary in finite-resource inference.
- Identifies three key results: a wall of certainty, an honesty condition, and a rigidity in belief geometries.
- Demonstrates that the Gaussian distribution is the most hyperbolic belief in this framework.
Summary
This paper introduces a novel framework for understanding the geometry of beliefs in the context of finite-resource inference under noisy observations. The author proposes a cost geometry that quantifies the cost of transitioning between beliefs, utilizing optimal transport in Wasserstein space, adjusted by Fisher information to reflect the precision of beliefs. The study highlights that certainty, represented as a perfect twin of a system, is unattainable due to both observational and physical constraints, which are encapsulated by Fisher information. The author presents three main results: (1) a 'wall' indicating that well-posed inference requires certainty to be infinitely distant when costs dominate Fisher information; (2) an 'honesty' condition where uniform cost leads to geometries proportional to Fisher information; and (3) a 'rigidity' result showing that these geometries are hyperbolic, with the Gaussian distribution being the most hyperbolic belief. The implications of this work extend to algorithmic applications, such as Kalman filters, which maintain uncertainty and revise beliefs at finite costs, contrasting with systems that operate at the boundary of certainty.
Methodology
The paper employs a theoretical approach, utilizing concepts from optimal transport, information geometry, and Bayesian inference to characterize the geometry of beliefs. It formalizes the relationship between belief transitions and costs, integrating Fisher information to define a cost metric that governs the movement within the belief space.
Results
The main results include the identification of a wall that prevents certainty from being reached, an honesty condition that aligns cost with Fisher information, and the rigidity of hyperbolic geometries in belief space. The findings suggest that the cost of achieving a certain level of precision diverges as one approaches certainty, establishing a geometric floor for belief transitions.
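As a point of reference for the "wall" and the hyperbolic geometry, the standard Fisher information metric on univariate Gaussians already shows both features. This is a textbook fact, not the paper's cost geometry, whose exact form is not given in this summary; the symbols μ, σ, and K are introduced here for illustration.

```latex
% Fisher-Rao metric on the family N(\mu, \sigma^2)
ds^2 = \frac{d\mu^2 + 2\, d\sigma^2}{\sigma^2}

% This is a scaled hyperbolic half-plane with constant curvature
K = -\tfrac{1}{2}

% Distance between two precisions at fixed mean: it diverges as \sigma_1 \to 0,
% so the zero-variance (certain) belief lies infinitely far away.
d(\sigma_0, \sigma_1) = \sqrt{2}\,\left|\ln\frac{\sigma_0}{\sigma_1}\right|
\;\longrightarrow\; \infty \quad (\sigma_1 \to 0)
```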
Implications
This framework has significant implications for fields that rely on inference under uncertainty, such as robotics, data assimilation, and machine learning. It provides a geometric perspective on belief dynamics that can enhance algorithms like Kalman filters, improving their efficiency and robustness in uncertain environments.
Physiology-Aware CNN and Zero-Shot Multimodal LLMs for ECG Image Classification: A Comparative Study
Computer Vision
Large Language Models
Multimodal
- Physiology-aware CNN models outperform zero-shot multimodal LLMs in ECG image classification.
- LeadGroupECG model effectively captures anatomical relationships among ECG leads.
- CNN models achieved high ROC-AUC scores, indicating strong classification performance.
- Zero-shot LLMs showed near-chance performance, highlighting limitations in ECG interpretation.
Summary
This study investigates the effectiveness of zero-shot multimodal large language models (LLMs) and physiology-aware convolutional neural networks (CNNs) in classifying 12-lead ECG images into normal and abnormal categories. The authors highlight the unique challenges of ECG image interpretation, which relies on precise waveform morphology and lead relationships, distinguishing it from general image classification tasks. They developed a novel CNN model, LeadGroupECG, designed to aggregate features from anatomical lead groups, and compared its performance against established CNN architectures (ResNet18, DenseNet121, VGG16) and three prominent LLMs (GPT-5.2, GPT-4.1, Gemini-2.5 Pro) under zero-shot conditions. The study found that while CNN models achieved high classification accuracy (ROC-AUC of 0.92–0.94 internally and 0.85–0.86 externally), the LLMs performed poorly, with ROC-AUC scores around 0.5. The results suggest that despite the narrative generation capabilities of LLMs, their diagnostic performance in ECG interpretation remains limited, emphasizing the need for domain-specific architectures in clinical applications.
Methodology
The study utilized a large-scale public dataset of 12-lead ECG recordings rendered as single-page images for binary classification. The proposed LeadGroupECG model was developed to aggregate features from anatomical lead groups and was compared against baseline CNN models. Three LLMs were evaluated under fixed zero-shot prompts across multiple runs. All models were trained and tested using identical protocols on both internal and external datasets.
Results
CNN-based models demonstrated stable performance with internal ROC-AUC scores ranging from 0.92 to 0.94 and external scores between 0.85 and 0.86. The LeadGroupECG model significantly improved upon its backbone without sacrificing external generalization. In contrast, the zero-shot LLMs achieved ROC-AUC scores around 0.5, indicating poor classification ability.
Implications
The findings suggest that while multimodal LLMs can generate contextual narratives for ECG images, they are not reliable for diagnostic tasks without task-specific training. This highlights the necessity for clinically grounded, domain-specific architectures in AI-based ECG interpretation, which could enhance diagnostic accuracy and support clinical decision-making.
Spectral Retrieval-Augmented Time-Series Forecasting
Time Series
- Introduction of SpecReTF, a novel retrieval-augmented forecasting architecture.
- Combines frequency-domain analysis with recency-weighted pattern retrieval.
- Unified similarity measure integrates Jensen–Shannon divergence and cosine similarity.
- SpecReTF achieves state-of-the-art forecasting accuracy on benchmark datasets.
Summary
This paper introduces SpecReTF, a novel retrieval-augmented time-series forecasting method that addresses the limitations of traditional forecasting approaches when dealing with complex, non-stationary patterns. Traditional methods often struggle to capture rare or complex patterns due to their reliance on learned representations, leading to issues such as spectral blindness and temporal recency. SpecReTF overcomes these challenges by converting time series into windowed frequency representations and employing a combined similarity metric that incorporates both amplitude and phase information. Additionally, it utilizes an exponential moving average weighting scheme to prioritize recent patterns over older data. Extensive experiments on benchmark datasets demonstrate that SpecReTF significantly outperforms existing time-domain retrieval methods, achieving superior forecasting accuracy across various non-stationary time series. The proposed method not only enhances the retrieval process by accurately capturing periodic behaviors but also maintains sensitivity to new patterns, thereby improving the overall forecasting performance.
Methodology
SpecReTF converts time series segments into the frequency domain using Short-time Fourier Transform (STFT). It computes a composite similarity score by combining Jensen–Shannon divergence for amplitude distributions with cosine similarity for phase alignment. An exponential moving average weighting scheme is applied to prioritize recent windows while retaining long-term patterns.
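The retrieval step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the mixing weight `alpha`, the window sizes, and the geometric `decay` used for recency weighting are all hypothetical choices.

```python
import numpy as np

def stft(x, win=32, hop=16):
    """Hann-windowed short-time Fourier transform; returns a (frames, freqs) complex array."""
    w = np.hanning(win)
    starts = range(0, len(x) - win + 1, hop)
    return np.array([np.fft.rfft(w * x[s:s + win]) for s in starts])

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence in bits, so the value lies in [0, 1]."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def spectral_similarity(a, b, alpha=0.5):
    """Blend amplitude similarity (1 - JS) with phase cosine similarity (assumed blend)."""
    A, B = stft(a), stft(b)
    amp_sim = 1.0 - js_divergence(np.abs(A).mean(axis=0), np.abs(B).mean(axis=0))
    # Cosine similarity between unit phasors reduces to the mean cosine of the phase gap.
    phase_sim = float(np.mean(np.cos(np.angle(A) - np.angle(B))))
    return alpha * amp_sim + (1.0 - alpha) * phase_sim

def retrieve(query, history, win_len, k=1, decay=0.99, alpha=0.5):
    """Score each past window against the query, discounting older windows geometrically."""
    starts = np.arange(0, len(history) - win_len + 1, win_len)
    ages = (starts.max() - starts) / win_len  # 0 for the most recent window
    sims = np.array([spectral_similarity(query, history[s:s + win_len], alpha) for s in starts])
    scores = sims * decay ** ages
    top = np.argsort(scores)[::-1][:k]
    return starts[top], scores[top]
```

Calling `retrieve(query, history, win_len=128)` returns the start indices and scores of the best-matching past windows; a forecaster would then condition on what followed those windows.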
Results
The experiments conducted on multiple benchmark datasets show that SpecReTF consistently achieves superior forecasting accuracy compared to leading retrieval-based and purely model-based methods, establishing new state-of-the-art results in time-series forecasting.
Implications
The findings suggest that SpecReTF can be effectively applied in various domains requiring time-series forecasting, such as finance, energy consumption, and healthcare monitoring, where capturing non-stationary patterns is crucial for accurate predictions.
Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers
NLP
Large Language Models
Computer Vision
- Introduces a scalable framework for linear mode connectivity in billion-parameter pretrained transformers.
- Utilizes parameterized weight transformations and a dual learning procedure for effective model merging.
- Achieves near-zero loss barriers on WikiText and maintains high accuracy on ImageNet during interpolation.
- Demonstrates the importance of resolving parameter symmetries in enhancing model connectivity.
Summary
This paper addresses the challenge of merging independently trained neural networks, particularly billion-parameter pretrained transformers, by enhancing the concept of linear mode connectivity (LMC). The authors propose a novel framework that utilizes parameterized functionality-preserving weight transformations to align functionally equivalent solutions across models. A dual learning procedure is introduced, allowing both models to jointly learn transformations towards a shared linear interpolation path. This bidirectional optimization significantly reduces interpolation barriers, enabling reliable merging of large-scale architectures. Empirical results demonstrate that the proposed method achieves near-zero loss barriers on WikiText for medium-sized language models and maintains over 69% ImageNet top-1 accuracy for ViT-L throughout the interpolation path. The findings suggest that resolving parameter symmetries allows for effective connectivity and merging of large pretrained transformers through linear paths, improving interpolation performance.
Methodology
The authors developed a framework that includes a broad family of functionality-preserving weight transformations for transformers. These transformations are parameterized under structural constraints and optimized directly with respect to the loss along the interpolation path. The dual learning procedure allows both endpoint models to learn their respective symmetry transformations, facilitating a more effective alignment and reducing interpolation barriers.
Results
The proposed method achieved near-zero loss barriers on WikiText for medium-sized language models and maintained over 69% top-1 accuracy on ImageNet for ViT-L across the interpolation path. This indicates a significant improvement in interpolation performance and model merging reliability for large pretrained transformers.
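To make "loss barrier" concrete, the toy sketch below measures it for a two-unit tanh network and shows how undoing a hidden-unit permutation removes the barrier. It assumes the common definition of the barrier (largest excess of the interpolated loss over the linear blend of endpoint losses). The permutation here is known by construction, whereas the paper learns its transformations; all names and values are hypothetical.

```python
import numpy as np

def mlp(params, x):
    """Tiny network f(x) = sum_j v_j * tanh(w_j * x)."""
    w, v = params
    return np.tanh(np.outer(x, w)) @ v

def loss(params, x, y):
    return float(np.mean((mlp(params, x) - y) ** 2))

def interpolate(pa, pb, lam):
    return tuple((1 - lam) * a + lam * b for a, b in zip(pa, pb))

def loss_barrier(pa, pb, x, y, steps=21):
    """Max over the path of interpolated loss minus the linear blend of endpoint losses."""
    la, lb = loss(pa, x, y), loss(pb, x, y)
    lams = np.linspace(0.0, 1.0, steps)
    return max(loss(interpolate(pa, pb, l), x, y) - ((1 - l) * la + l * lb) for l in lams)

def permute_hidden(params, perm):
    """Reordering hidden units leaves the network's function unchanged."""
    w, v = params
    return w[perm], v[perm]

x = np.linspace(-2, 2, 41)
theta_a = (np.array([1.0, -2.0]), np.array([1.0, 0.5]))
theta_b = permute_hidden(theta_a, [1, 0])   # same function, different weights
y = mlp(theta_a, x)

naive = loss_barrier(theta_a, theta_b, x, y)                              # clearly positive
aligned = loss_barrier(theta_a, permute_hidden(theta_b, [1, 0]), x, y)    # zero after alignment
```

Both endpoints have zero loss, yet the naive midpoint averages mismatched units and does not; resolving the symmetry first makes the whole path lossless.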
Implications
The findings suggest that the proposed framework can enhance the scalability and effectiveness of model merging in large pretrained transformers, potentially leading to more efficient model reuse and composition in various applications such as natural language processing and computer vision.
SamatNext v0.2-B: An Exploratory Study of RMS-Normalized Hybrid Decoders for Curriculum Retention in Small Code Models
NLP
Large Language Models
- SamatNext v0.2-B demonstrates improved retention of prior capabilities compared to a standard Transformer baseline.
- The hybrid architecture effectively balances retention and plasticity in curriculum learning settings.
- Despite improvements, both models face challenges with catastrophic forgetting, particularly in early-stage syntax tasks.
- The study emphasizes the importance of structured curriculum learning in training adaptive models.
Summary
This paper presents SamatNext v0.2-B, a 356-million-parameter hybrid sequence decoder designed to mitigate catastrophic forgetting in autoregressive Transformer models trained on sequential curriculum distributions. The model integrates Differential-Attention-style layers and DeltaNet-inspired simplified linear-state mixer layers, enhanced with RMS Normalization and Output Scale Calibration. The study evaluates the model's performance on a structured Python code curriculum, comparing it to a parameter-matched Transformer baseline. Results indicate that SamatNext v0.2-B achieves a 100% pass rate on a controlled Stage 5 holdout while retaining 98.8% of its capabilities from an adjacent Stage 3. In contrast, the baseline Transformer achieves a 49.4% pass rate on Stage 5 but retains only 3.8% of Stage 3 performance. Even with an increased learning rate, the baseline struggles with retention. Both models exhibit performance degradation on early-stage adversarial syntax tasks, highlighting the ongoing challenge of catastrophic forgetting in long-horizon training scenarios. The authors provide code, model specifications, and evaluation scripts for independent verification, framing this work as an exploratory study rather than a definitive solution to the problem.
Methodology
The methodology involves training the SamatNext v0.2-B model on a multi-stage Python curriculum, alternating between Differential-Attention and linear-state mixer layers. The model's performance is evaluated against a parameter-matched Transformer baseline using controlled pass/fail metrics.
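For readers unfamiliar with the normalization named in the title, a minimal sketch of standard RMS normalization follows. It shows the generic operation only; how SamatNext places it within its hybrid layers, and its Output Scale Calibration, are not specified in this summary. The `eps` value is an assumption.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Scale each feature vector to unit root-mean-square, then apply a learned gain."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain
```

Unlike layer normalization, this rescales without subtracting the mean, so it has no bias or centering step.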
Results
SamatNext v0.2-B achieved a 100% pass rate on Stage 5 and retained 98.8% of its Stage 3 capabilities, while the baseline Transformer achieved only a 49.4% pass rate on Stage 5 with a mere 3.8% retention of Stage 3 performance. Both models struggled with early-stage syntax tasks, indicating limitations in addressing catastrophic forgetting.
Implications
The findings suggest that hybrid architectures may offer a pathway to improve retention in sequential learning tasks, particularly in code generation. This could have implications for developing more adaptive and continually learning systems in programming and other domains.
Learning a Normal World Model for Few-Shot Boundary-Calibrated Abnormality Detection
Time Series
- Introduces a normal world modeling framework for few-shot abnormality detection.
- Develops an entropy-aware normal-world energy for quantitative evaluation of abnormality.
- Demonstrates strong performance on the NASA C-MAPSS turbofan degradation benchmark.
- Mechanistic validation tests confirm the model captures the structure of normal behavior.
Summary
This paper addresses the challenges of abnormality detection in complex systems, particularly the scarcity of abnormal labels and the inadequacy of binary labels to quantify deviations from normal behavior. The authors propose a novel approach called the Hypergraph Entropic Normal-World Model, which learns a representation of the normal world from abundant normal events while using a few abnormal examples solely to calibrate the boundary of normality. The model constructs context-conditioned hypergraphs to capture high-order relationships among multivariate sensor data and defines abnormality through an entropy-aware normal-world energy that integrates temporal prediction surprise, hypergraph consistency surprise, and latent normal-manifold departure. The proposed method is evaluated on the NASA C-MAPSS turbofan degradation benchmark, demonstrating strong performance in zero-shot and few-shot scenarios, particularly achieving an AUROC of 0.9983 on the most complex subset. Mechanistic validation tests indicate that the learned energy effectively encodes the structure of the normal world, providing a robust anomaly score and a graded risk measure under conditions of severe abnormal-label scarcity.
Methodology
The methodology involves constructing a Hypergraph Entropic Normal-World Model that represents multivariate sensor data as context-conditioned hypergraphs. The model learns three aspects of normality: temporal dynamics, hypergraph consistency, and latent manifold representation. It combines these into an entropy-aware normal-world energy, which serves as a normality score. During few-shot calibration, the model parameters remain fixed, adjusting only the decision threshold based on a few abnormal examples.
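The few-shot calibration step, in which the model stays frozen and only the decision threshold moves, might look like the sketch below. The midpoint rule, the quantile `q`, and the logistic risk mapping are illustrative assumptions rather than the paper's procedure.

```python
import numpy as np

def calibrate_threshold(normal_energy, abnormal_energy, q=0.99):
    """Place the boundary between the bulk of normal energies and the few abnormal ones."""
    upper_normal = np.quantile(normal_energy, q)
    lower_abnormal = np.min(abnormal_energy)
    if lower_abnormal <= upper_normal:
        # The two groups overlap: fall back to the normal-only quantile.
        return float(upper_normal)
    return float(0.5 * (upper_normal + lower_abnormal))

def graded_risk(energy, threshold, scale=1.0):
    """Map energy to a risk in (0, 1); equals 0.5 exactly at the threshold."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(energy, dtype=float) - threshold) / scale))
```

Because only `threshold` depends on the abnormal examples, the learned energy itself never sees an abnormal label.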
Results
The proposed model achieved an AUROC of 0.9983 on the FD004 subset of the NASA C-MAPSS dataset, indicating exceptional performance in detecting abnormalities. The results also showed strong zero-shot and few-shot performance across all subsets, with mechanistic tests validating that the learned energy captures the underlying normal-world structure.
Implications
The findings suggest that the normal-world energy can be utilized as an effective anomaly score and a graded risk measure, making it applicable in various fields such as industrial fault diagnosis, clinical monitoring, and cyber-physical systems, especially in scenarios with limited abnormal data.
ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation
Large Language Models
Reinforcement Learning
NLP
- Incorrect student-generated outputs can provide more valuable training signals than correct ones in OPD.
- ReNIO introduces a prefix-computable reweighting method that emphasizes negative trajectories without needing final-answer labels.
- The method leverages student-to-teacher probability ratios to identify and weight pivotal tokens leading to incorrect reasoning.
- ReNIO shows substantial performance improvements in mathematical reasoning and code generation tasks.
Summary
The paper introduces ReNIO, a novel method aimed at enhancing on-policy distillation (OPD) for large language models (LLMs) by reweighting the importance of negative trajectories. Traditional OPD treats all student-generated outputs (SGOs) equally, which overlooks the potential value of incorrect SGOs. The authors conducted experiments revealing that training solely on incorrect SGOs often yields better performance than training on correct ones. This is attributed to the exploratory reasoning preserved in incorrect SGOs, which can provide valuable insights into the model's failure modes. ReNIO addresses the challenge of emphasizing informative negative trajectories without requiring full answer rollouts by employing a sample-level reweighting strategy based on the student-to-teacher probability ratio. This method identifies pivotal tokens that lead to incorrect reasoning and assigns them higher weights, thus enhancing the learning signal from negative trajectories while maintaining the efficiency of OPD. The results demonstrate that ReNIO significantly improves both OPD and on-policy self-distillation (OPSD) across mathematical reasoning and code generation tasks, showcasing its effectiveness in optimizing LLM training.
Methodology
ReNIO employs a sample-level reweighting approach that uses the ratio of student-to-teacher probabilities to identify pivotal tokens associated with incorrect reasoning. It aggregates these ratios to assign normalized weights to SGOs, emphasizing those that contain strong corrective signals while preserving the efficiency of OPD.
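A rough sketch of sample-level reweighting from student-to-teacher probability ratios is shown below. The aggregation rule (mean of the `top_k` largest per-token log-ratios) and the softmax normalization are assumptions made for illustration; the summary does not state the exact formulas ReNIO uses.

```python
import numpy as np

def sample_weights(student_logp, teacher_logp, top_k=2, temperature=1.0):
    """Weight each student-generated output by how strongly the teacher disagrees with it.

    student_logp, teacher_logp: lists of per-token log-probabilities, one list per output.
    """
    scores = []
    for s, t in zip(student_logp, teacher_logp):
        log_ratio = np.asarray(s, dtype=float) - np.asarray(t, dtype=float)
        k = min(top_k, len(log_ratio))
        # Pivotal tokens: the student is confident where the teacher is not.
        scores.append(np.sort(log_ratio)[-k:].mean())
    scores = np.array(scores) / temperature
    w = np.exp(scores - scores.max())   # numerically stable softmax numerator
    return w * len(w) / w.sum()         # normalize so the mean weight is 1
```

The weights depend only on token probabilities of the generated prefix, so no final-answer label or correctness check is needed to compute them.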
Results
ReNIO achieved relative performance gains of up to 8.90% for the Qwen3-1.7B model and 10.00% for the R1-Distill-Qwen-7B model on mathematical reasoning benchmarks, demonstrating its effectiveness in improving OPD and OPSD.
Implications
The findings suggest that incorporating negative trajectory information can enhance the training of LLMs, leading to more robust reasoning capabilities. This approach could be applied to various reasoning tasks and potentially improve the efficiency of training processes in LLMs.
Deep Learning for Soil Moisture Estimation: Fusing Satellite Data with Optimally-Lagged Meteorological Features
Time Series
- Optimal meteorological and inter-depth lags were identified using Cross-Correlation Function (CCF).
- A per-pixel CNN model showed significant improvement in soil moisture prediction when combined with depth features.
- The CNN-LSTM hybrid model achieved the best overall performance in held-out data evaluation.
- Incorporating subsurface depth information was crucial for enhancing prediction accuracy.
Summary
This paper addresses the challenge of accurately estimating soil moisture in semi-arid agricultural regions by integrating satellite data and meteorological information while accounting for the temporal delays in soil moisture response to atmospheric conditions. The authors introduce a Cross-Correlation Function (CCF) methodology to identify optimal lags (0–30 days) for meteorological variables and inter-depth lags (0–15 days) for vertical moisture propagation. The study evaluates three deep learning architectures: a per-pixel CNN for detailed estimation, an LSTM for daily plot-mean predictions, and a CNN-LSTM hybrid for pooled multi-patch training. The models were validated across seven agricultural plots in southeastern Spain, demonstrating that incorporating meteorological variables and subsurface depth information significantly enhances prediction accuracy. The results indicate that the CNN-LSTM hybrid achieved the highest performance (R² = 0.930, CVRMSE = 8%), showcasing the importance of modeling atmospheric and vertical delays for effective soil moisture estimation.
Methodology
The study employed a Cross-Correlation Function (CCF) to determine optimal temporal lags for meteorological variables and inter-depth lags for soil moisture. Three deep learning architectures were evaluated: a CNN for per-pixel estimation, an LSTM for daily plot-mean predictions, and a CNN-LSTM hybrid for pooled multi-patch training. The models were validated using a date-grouped split to prevent data leakage.
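The lag-selection step can be illustrated with a small cross-correlation scan. This is a minimal sketch assuming daily series of equal length and selection by largest absolute Pearson correlation; the function name and the selection rule are hypothetical, not taken from the paper.

```python
import numpy as np

def best_lag(driver, response, max_lag=30):
    """Find the lag in 0..max_lag at which driver[t] best correlates with response[t + lag]."""
    driver = np.asarray(driver, dtype=float)
    response = np.asarray(response, dtype=float)
    corrs = []
    for lag in range(max_lag + 1):
        d = driver[: len(driver) - lag]   # driver values that have a lagged partner
        r = response[lag:]                # response shifted forward by `lag` steps
        corrs.append(np.corrcoef(d, r)[0, 1])
    corrs = np.array(corrs)
    return int(np.argmax(np.abs(corrs))), corrs
```

The selected lag would then be used to shift each meteorological feature before it is fed to the network, for example pairing rainfall from five days earlier with today's soil moisture.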
Results
The per-pixel CNN achieved a strong single-patch result (R² = 0.877, RMSE = 2.28), while the average R² across seven patches improved to 0.535 with depth features. The CNN-LSTM hybrid model outperformed all others with an R² of 0.930 and a CVRMSE of 8%, indicating substantial improvements over the satellite-only baseline.
Implications
The findings suggest that integrating satellite and meteorological data with a focus on temporal and vertical delays can significantly enhance soil moisture estimation, which is crucial for precision agriculture. This approach may lead to better water management practices and improved crop yields in semi-arid regions.
Superhuman AI for Generals.io Using Self-Play Reinforcement Learning
Reinforcement Learning
- Development of a high-speed JAX-native simulator for GENERALS.IO, enabling rapid training.
- Creation of a superhuman AI agent that dominates the public leaderboard and defeats top human players.
- Utilization of self-play reinforcement learning with a focus on sparse rewards and sample efficiency.
- Identification of key training components, such as parameter EMA and top-advantage filtering, that enhance performance.
Summary
This paper presents a superhuman AI agent for the real-time strategy game GENERALS.IO, which demands both long-term planning and short-term tactical decisions in an environment of imperfect information. The AI was trained over four days using 4 NVIDIA H200 GPUs and achieved the top position on the public 1v1 leaderboard, outperforming over 5,000 human players. The agent's success is attributed to a new JAX-native simulator capable of processing tens of millions of frames per second, significantly speeding up the training process. The authors employed a vision transformer policy trained through self-play using a policy-gradient approach, with a focus on sparse win/loss rewards, top-advantage sample filtering, and an exponential moving average of policy parameters. The findings emphasize the importance of a fast simulator in alleviating data bottlenecks and highlight the effectiveness of specific training strategies in achieving superhuman performance in complex strategic games.
Methodology
The authors utilized a JAX-native simulator to train a vision transformer policy through self-play reinforcement learning. The training involved a policy-gradient loop with sparse win/loss rewards, top-advantage sample filtering, and an exponential moving average of policy parameters to optimize performance.
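Two of the training components named above, parameter EMA and top-advantage filtering, are simple enough to sketch. The version below is written in plain NumPy for readability and rests on assumptions: the `decay` and `keep_fraction` values are hypothetical, and filtering is assumed to keep the largest signed advantages, which the summary does not confirm.

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """Move the averaged parameters a small step toward the current ones."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}

def top_advantage_mask(advantages, keep_fraction=0.5):
    """Boolean mask selecting the samples with the largest advantages."""
    adv = np.asarray(advantages, dtype=float)
    k = max(1, int(round(keep_fraction * adv.size)))
    threshold = np.sort(adv)[-k]
    return adv >= threshold
```

In a training loop, the mask would select which samples contribute to the policy-gradient loss, while the EMA copy of the parameters is the one used for evaluation or as the self-play opponent.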
Results
The AI agent achieved the #1 ranking on the public 1v1 leaderboard, surpassing thousands of human players. It recorded a head-to-head victory against the top two human players with a combined score of 199 wins to 70 losses across 269 matches. The study also demonstrated that the use of parameter EMA and selective sample filtering significantly improved training efficiency and effectiveness.
Implications
The findings suggest that with the right training environment and methodologies, it is possible to develop AI agents that can outperform human players in complex strategic games. This research could influence future developments in AI for gaming, multi-agent systems, and other applications requiring strategic decision-making under uncertainty.
Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates
NLP
Large Language Models
Efficient ML
- Formalizes adapter mergeability for LoRA, separating single-task utility from post-merge retention.
- Introduces MergeProbe, a lightweight predictor that estimates mergeability based on early training signals.
- Demonstrates improved retention in merging adapters across multiple domains compared to existing methods.
- Shifts the merging process from a post-hoc evaluation to an anticipatory measurement problem.
Summary
This paper addresses the challenge of merging low-rank adaptation (LoRA) updates for parameter-efficient fine-tuning of language models. The authors highlight that the mergeability of adapters, which are trained for specific tasks, is often only assessed after full training, leading to costly failures when adapters that perform well independently interfere with each other post-merge. To mitigate this, they formalize the concept of adapter mergeability, defining it as the ability of an adapter to maintain its utility after being merged with others. The authors propose a lightweight predictor called MergeProbe, which utilizes early training signals to forecast mergeability, allowing for proactive decisions on whether to merge, reweight, prune, or route adapters. They validate their approach using the MERGE-PEFT benchmark across five domains, demonstrating that MergeProbe outperforms existing interference-aware merge baselines while requiring significantly less overhead. This work transforms the merging process from a reactive to an anticipatory workflow, potentially enhancing the efficiency of deploying specialized language model adapters.
Methodology
The authors define mergeability in terms of single-task utility and post-merge retention, evaluating it at pairwise, adapter, and set levels. They identify measurable signals during early training that indicate potential conflicts when merging adapters. MergeProbe aggregates these signals to inform decisions on merging, reweighting, pruning, or routing adapters.
Results
MergeProbe achieved the best average and worst-case retention rates on the MERGE-PEFT benchmark, outperforming strong interference-aware merge baselines while incurring less deployment overhead. This indicates that the early training signals effectively predict mergeability, allowing for better management of adapter updates.
Implications
The findings suggest that proactive management of adapter merging can lead to more efficient deployment of language models in various applications, reducing the risk of performance degradation when combining specialized adapters. This approach could be particularly beneficial in environments where multiple task-specific models are maintained.
Dynamic estimation of slowly varying sequences
Theory
Efficient ML
Optimization
- Introduces a general framework for dynamic estimation of slowly varying sequences.
- Develops an algorithm that adapts the estimation budget based on local changes, improving efficiency.
- Achieves sharper estimation bounds compared to previous methods.
- Demonstrates applicability to various mathematical problems, including matrix powers and PDEs.
Summary
This paper addresses the challenge of sequentially approximating functions of elements in slowly varying sequences, where the differences between consecutive elements are small. The authors build upon recent advancements in implicit trace estimation, proposing a generalized framework applicable to various linear and nonlinear functions across different vector spaces. They introduce a novel algorithm that dynamically adjusts the estimation budget based on the local changes in the sequence, leading to improved efficiency in query complexity. The framework allows for sharper bounds on estimation costs, specifically O(∑_{i=1}^{m} α_i), compared to previous methods that relied on the worst-case scenario, O(m · max_i α_i). Additionally, the authors present a mechanism for estimating changes on-the-fly, enhancing adaptability. The framework demonstrates broad applicability through theoretical results in areas such as matrix powers, spectral densities, and Monte Carlo integration, and is empirically validated through experiments on synthetic data and neural network optimization trajectories.
Methodology
The authors propose a flexible meta-algorithm for sequential stochastic approximation that utilizes a well-concentrated static estimator. The algorithm dynamically adjusts the number of queries based on local changes in the sequence, allowing for efficient estimation without prior knowledge of global bounds on changes. The framework requires a linear estimator with sub-exponential concentration properties, applicable to both linear and nonlinear mappings.
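For the trace-estimation case, the idea of spending queries in proportion to local change can be sketched with Hutchinson's estimator applied to successive differences. This is an illustrative simplification, not the paper's meta-algorithm: it reads the Frobenius norm of each difference directly instead of estimating the change on the fly, and the budget constants are hypothetical.

```python
import numpy as np

def hutchinson(matvec, dim, num_queries, rng):
    """Estimate a trace as the average of g^T A g over random sign vectors g."""
    g = rng.choice([-1.0, 1.0], size=(dim, num_queries))
    return float(np.mean(np.sum(g * matvec(g), axis=0)))

def dynamic_trace(matrices, base_queries=200, queries_per_unit_change=50, seed=0):
    """Track tr(A_i) by estimating tr(A_i - A_{i-1}) with a change-dependent budget."""
    rng = np.random.default_rng(seed)
    dim = matrices[0].shape[0]
    est = hutchinson(lambda g: matrices[0] @ g, dim, base_queries, rng)
    estimates, budgets = [est], [base_queries]
    for prev, cur in zip(matrices, matrices[1:]):
        delta = cur - prev
        alpha = np.linalg.norm(delta)  # Frobenius norm as the local change (assumed known)
        q = max(1, int(np.ceil(queries_per_unit_change * alpha)))
        est += hutchinson(lambda g: delta @ g, dim, q, rng)
        estimates.append(est)
        budgets.append(q)
    return estimates, budgets
```

Steps where the matrix barely moves cost only a handful of queries, so the total budget follows the sum of the local changes rather than the number of steps times the largest change. A practical version would also need to control the error that accumulates across steps.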
Results
The proposed framework results in a query complexity that scales with the sum of local changes, O(∑_{i=1}^{m} α_i), leading to more efficient estimation than previous methods. The empirical results show that the algorithm requires significantly fewer samples for dynamic trace estimation, validating its effectiveness across various applications.
Implications
This work has potential implications in fields requiring real-time estimation of evolving systems, such as machine learning, data streaming, and optimization of neural networks. The adaptability of the framework can enhance performance in applications like tracking metrics in social networks and estimating properties of dynamic data distributions.
Post-Training Speech Enhancement Language Models with Perceptual Rewards
Audio & Speech
Reinforcement Learning
Optimization
- Introduction of a post-training stage for autoregressive speech enhancement models using GSPO.
- Development of a composite reward system that combines multiple perceptual metrics to avoid reward hacking.
- Achieved state-of-the-art performance on DNS2020 and DNS5 benchmarks.
- Human evaluations indicate a preference for multi-metric rewards over single-metric approaches.
Summary
This paper addresses the limitations of current speech enhancement (SE) language models, which are primarily trained using token-level cross-entropy loss, failing to align with perceptual quality metrics used for evaluation. The authors propose a post-training stage utilizing Group Sequence Policy Optimization (GSPO) with multi-metric perceptual rewards to optimize models directly based on non-differentiable quality metrics such as DNSMOS, WER, and UTMOS. This approach avoids the pitfalls of single-metric optimization, which can lead to reward hacking. The authors apply their method to two autoregressive models, UniSE and GenSE, achieving state-of-the-art results on the DNS2020 benchmark. A human evaluation confirms that the composite multi-metric reward is preferred over single-metric variants, demonstrating the effectiveness of their approach in enhancing speech quality while maintaining robustness across different evaluation metrics.
Methodology
The authors implemented a post-training optimization stage using Group Sequence Policy Optimization (GSPO), which samples multiple outputs per input, scores them with a composite reward function, and applies policy gradient updates at the sequence level. This method directly utilizes perceptual quality metrics as reward signals without relying on learned surrogates or offline data construction.
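The pieces of this loop (composite reward, group-relative advantages, and a sequence-level clipped update) can be sketched as follows. The metric scaling, reward weights, and `clip` value are assumptions for illustration; the sequence ratio follows the length-normalized form commonly associated with GSPO, and nothing here reproduces the authors' exact configuration.

```python
import numpy as np

def composite_reward(metrics, weights):
    """Weighted sum of metrics, each pre-scaled so that higher is better (e.g. negated WER)."""
    return sum(weights[name] * np.asarray(vals, dtype=float) for name, vals in metrics.items())

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within the group of outputs sampled for one input."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def sequence_ratio(new_logp, old_logp):
    """Length-normalized sequence-level importance ratio from per-token log-probs."""
    diff = np.asarray(new_logp, dtype=float) - np.asarray(old_logp, dtype=float)
    return float(np.exp(np.mean(diff)))

def gspo_objective(ratios, advantages, clip=0.2):
    """Clipped surrogate applied once per sequence rather than once per token."""
    ratios = np.asarray(ratios, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    clipped = np.clip(ratios, 1.0 - clip, 1.0 + clip)
    return float(np.mean(np.minimum(ratios * advantages, clipped * advantages)))
```

Combining several metrics in `composite_reward` is what discourages reward hacking: an output that inflates one score while degrading intelligibility loses reward through the WER term.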
Results
The proposed GSPO post-training approach led to significant improvements in speech enhancement performance, achieving state-of-the-art results on the DNS2020 benchmark. The composite reward system was shown to be more effective than single-metric optimization strategies, as confirmed by human evaluations.
Implications
This work has implications for the development of more effective speech enhancement systems that can better align with human perceptual quality assessments. The methodology could be applied to other areas of machine learning where evaluation metrics differ from training objectives, enhancing model performance across various applications.
Solve for the Hyperparameter, Skip the Search: Kolmogorov-Optimal Scaling Laws for Spline Regression
Theory
Efficient ML
Optimization
- Introduces KORE, a method for directly solving for the optimal hyperparameter in spline regression.
- Establishes a closed-form relationship between resolution, bias, and variance based on classical approximation theory.
- Demonstrates that KORE matches the accuracy of exhaustive search methods while significantly reducing computational costs.
- Applies the method across multiple input dimensions and various datasets, showcasing its effectiveness in real-world scenarios.
Read more
Solve for the Hyperparameter, Skip the Search: Kolmogorov-Optimal Scaling Laws for Spline Regression
Summary
This paper presents a novel approach to hyperparameter tuning in spline regression, proposing a method that eliminates the need for exhaustive search through hyperparameter space. The authors derive a closed-form solution for the optimal resolution of spline regression, leveraging classical approximation theory to establish a relationship between bias, variance, and the resolution parameter. The key insight is that the squared bias scales as G^(-2β), where G is the resolution and β is the smoothness exponent of the target function. The proposed method, KORE (Kolmogorov-optimal Order-aware Resolution Estimation), requires only a small number of model fits to determine the optimal resolution, significantly reducing computational costs compared to traditional grid search methods. The paper demonstrates that KORE achieves comparable accuracy to exhaustive cross-validation while fitting approximately eight times fewer models across various datasets, making it a highly efficient alternative for hyperparameter selection in spline regression.
Methodology
The authors utilize classical approximation theory to derive a closed-form expression for the optimal resolution in spline regression. They analyze the bias-variance tradeoff and develop KORE, which fits two pilot resolutions to calibrate the model and solve for the optimal resolution using a small number of fits instead of an exhaustive grid search.
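To make the "solve instead of search" idea concrete, the sketch below calibrates a power-law bias term from two pilot resolutions and then minimizes an assumed risk in closed form. The bias law C * G^(-2β) comes from the summary; the variance term noise_var * G / n (one input dimension) and the function names are assumptions made for illustration, not KORE's actual estimator.

```python
import math

def fit_bias_law(g1, bias_sq1, g2, bias_sq2):
    """Recover C and beta in bias^2(G) = C * G**(-2*beta) from two pilot fits."""
    beta = math.log(bias_sq1 / bias_sq2) / (2.0 * math.log(g2 / g1))
    c = bias_sq1 * g1 ** (2.0 * beta)
    return c, beta

def optimal_resolution(c, beta, noise_var, n):
    """Minimize C * G**(-2*beta) + noise_var * G / n over G (assumed risk model).

    Setting the derivative to zero gives
    G* = (2 * beta * C * n / noise_var) ** (1 / (2 * beta + 1)).
    """
    return (2.0 * beta * c * n / noise_var) ** (1.0 / (2.0 * beta + 1.0))
```

Under this model the optimal resolution grows like n^(1/(2β+1)), so smoother targets (larger β) call for coarser grids at a given sample size.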
Results
KORE was tested on 36 real tabular datasets and outperformed 21 other methods in terms of accuracy per unit of compute. It matched the performance of exhaustive 3-fold cross-validation and various model selection criteria while fitting approximately eight times fewer models, demonstrating its efficiency and effectiveness.
Implications
The findings suggest that KORE can streamline the hyperparameter tuning process in spline regression, making it more accessible and less resource-intensive. This approach could be extended to other model classes where similar analytical relationships exist, potentially transforming hyperparameter selection in machine learning.
How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural
NLP
Large Language Models
Efficient ML
- Introduces a novel measure of linear recoverability for Transformer FFN blocks using closed-form least squares.
- Demonstrates that linear recoverability varies significantly across different FFN blocks and is a learned property rather than an architectural one.
- Finds that residual nonlinearity is not well captured by low-order multiplicative models.
- Highlights the potential for targeted compression of FFN blocks based on their linear recoverability profiles.
Read more
How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural
Summary
This paper investigates the linearity of Transformer feed-forward networks (FFNs), which are often assumed to be nonlinear computational units. The author introduces a method to measure the linear recoverability of each FFN block by decomposing its input-output mapping into a linear approximation and a residual. The linear recoverability, quantified by the R²_lin metric, reveals significant variability across different FFN blocks in models like GPT-2, Pythia-160m, and llama-160m, indicating that linearity is not a fixed architectural trait but rather a learned characteristic of each block. The study finds that adjacent blocks can exhibit vastly different levels of linearity, and that the residual nonlinearity cannot be adequately explained by low-order multiplicative interactions. Furthermore, the findings suggest practical implications for model compression, as high linear recoverability allows for the replacement of FFN blocks with smaller linear layers without significant performance loss, while low-recoverability blocks indicate areas where such replacements may be detrimental. The paper emphasizes the importance of using exact closed-form measurements to assess linear recoverability, as naive trained linear baselines may misrepresent the true linearity of FFN blocks.
Methodology
The author treats each FFN block as a position-wise mapping and computes its best affine approximation using closed-form least squares. The residual is analyzed using a low-rank bilinear probe to assess the nature of the unrecovered computation. The study involves a depth survey of multiple pretrained models to evaluate the linear recoverability across different blocks.
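A closed-form affine fit of this kind can be sketched with ordinary least squares. The exact definition of R²_lin in the paper is not given in this summary; the version below pools the residual and total variance over all output dimensions, which is an assumption, and `linear_recoverability` is a name chosen here for illustration.

```python
import numpy as np

def linear_recoverability(x, y):
    """Best affine map from FFN inputs x (n, d_in) to outputs y (n, d_out).

    Returns (r2, coef), where coef stacks the weight matrix and a bias row.
    """
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])   # append a bias column
    coef, *_ = np.linalg.lstsq(x_aug, y, rcond=None)   # closed-form least squares
    residual = y - x_aug @ coef
    centered = y - y.mean(axis=0, keepdims=True)
    r2 = 1.0 - (residual ** 2).sum() / (centered ** 2).sum()
    return float(r2), coef
```

Here `x` and `y` would be activations collected at the input and output of one FFN block over many token positions; a value near 1 means an affine layer reproduces the block almost exactly on that data.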
Results
The analysis of twelve FFN blocks from three different models reveals that linear recoverability (R²_lin) varies widely, with some blocks being nearly linear (R²_lin > 0.99) while others are strongly nonlinear (R²_lin < 0.3). The residual analysis shows that unrecovered computations are not simply low-order multiplicative interactions, and the findings suggest that high recoverability can be leveraged for effective model compression.
Implications
The findings have significant implications for model compression strategies in Transformer architectures, suggesting that FFN blocks with high linear recoverability can be replaced with smaller, more efficient layers without sacrificing performance. This could lead to more efficient models that maintain competitive performance while reducing computational costs.
FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning
Robotics
Computer Vision
Efficient ML
- Identification of a bottleneck trade-off in fixed-capacity LAMs affecting action alignment.
- Introduction of retained-prefix training for variable-length latent actions.
- FlexLAM outperforms traditional fixed-capacity LAMs across all evaluated token budgets.
- Supports inference-time token-budget adjustments without retraining.
Read more
FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning
Summary
The paper introduces FlexLAM, a novel approach to Latent Action Models (LAMs) that addresses the limitations of fixed-capacity bottlenecks in latent action learning. Traditional LAMs use a fixed capacity for encoding transitions, which can lead to a trade-off where overly tight codes discard crucial transition cues, while overly loose codes introduce unnecessary variation that complicates action alignment, especially when labeled data is scarce. FlexLAM innovates by employing variable-length latent actions through a technique called retained-prefix training, which allows the model to generate valid latent actions of varying lengths based on the complexity of the transition. This method enables the model to adaptively capture essential transition structures while maintaining alignment with executable actions. The authors demonstrate that FlexLAM outperforms fixed-capacity LAMs across various token budgets in standard evaluations, indicating that it not only provides flexibility during inference but also enhances the learning of latent-action interfaces. The findings suggest that FlexLAM can serve as an architecture-free upgrade to existing latent action models, improving performance in tasks such as Ego4D transition reconstruction and facilitating better alignment in environments with limited labels.
Methodology
The authors propose FlexLAM, which modifies the training of latent actions by using retained-prefix training. This approach allows multiple prefixes of a transition code to be valid latent actions, enabling the model to adaptively adjust to varying complexities in transitions. The performance of FlexLAM is compared against separately trained fixed-capacity LAMs across different token budgets in a standard evaluation setup.
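One way to picture retained-prefix training is as random truncation of the latent code during training, so that every prefix length is exercised. The sketch below is a guess at the mechanism under that reading; the zero-masking of dropped tokens, the uniform sampling of the prefix length, and both function names are assumptions, not details from the paper.

```python
import numpy as np

def retain_prefix(latent_tokens, keep):
    """Keep the first `keep` latent tokens and zero out the rest.

    latent_tokens has shape (num_tokens, dim).
    """
    mask = (np.arange(latent_tokens.shape[0]) < keep)[:, None]
    return latent_tokens * mask

def sample_prefix_length(max_tokens, rng):
    """Draw a prefix length uniformly from 1..max_tokens (assumed schedule)."""
    return int(rng.integers(1, max_tokens + 1))
```

At inference time the same `retain_prefix` call with a fixed `keep` would correspond to choosing a token budget without retraining.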
Results
FlexLAM consistently outperformed fixed-capacity LAMs at every evaluated token budget in DMLab tests. It demonstrated improved performance in transition reconstruction tasks and maintained effective alignment under conditions of scarce labeled data. The model also allowed for flexible adjustments in token budgets during inference without the need for retraining.
Implications
The findings suggest that FlexLAM can enhance the efficiency and effectiveness of latent action learning in various applications, particularly in environments where labeled data is limited. Its architecture-free nature allows for easy integration into existing systems, potentially improving performance in robotics, video analysis, and other domains reliant on action recognition from video data.
Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems
Large Language Models
NLP
Theory
- Introduces Contagion Networks to measure evaluator bias propagation in multi-agent LLM systems.
- Establishes a Cross-Agent Contagion Matrix (ฮN) for quantifying bias spread across agents.
- Identifies three propagation regimes and demonstrates that homogeneous agents have weaker contagion effects.
- Finds that increasing evaluator committee size can reduce effective contagion by 72.4%.
Read more
Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems
Summary
This paper introduces the concept of Contagion Networks to analyze how biases from evaluators in multi-agent large language model (LLM) systems propagate through an agent network. The study highlights the potential for systematic evaluation biases to influence the outputs of multiple agents, thereby affecting the overall performance and diversity of the system. Through a controlled experiment involving three agents with distinct evaluator bias profiles, the author constructs a Cross-Agent Contagion Matrix (ฮ3) to quantify bias propagation. The findings reveal that biases consistently spread among agents, with coefficients ranging from 0.157 to 0.352. The paper identifies three distinct propagation regimes based on the spectral radius of the contagion matrix, demonstrating that homogeneous-model agents exhibit weaker contagion compared to cross-model agents. Additionally, the research shows that increasing the size of the evaluator committee significantly reduces effective contagion, providing a practical strategy for mitigating bias propagation. The paper concludes by releasing an open-source framework for further exploration of bias dynamics in multi-agent systems.
Methodology
The study employs a controlled experiment with three LLM agents, each exhibiting different evaluator bias profiles. It constructs a Cross-Agent Contagion Matrix to quantify bias propagation and utilizes Test-Time Reinforcement Learning (TTRL) for agents to adapt their strategy weights based on evaluations from other agents. The contagion coefficients are calculated to measure the influence of one agent's evaluation on another's strategy distribution.
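The role of the spectral radius can be illustrated with a toy linear model in which each agent's bias at the next round is a weighted sum of the other agents' biases. The matrix entries below are made-up illustrative values, not the paper's measurements, and the linear update rule is an assumption used only to show why a spectral radius below 1 implies decay.

```python
import numpy as np

def spectral_radius(matrix):
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(matrix))))

def propagate(matrix, bias0, steps):
    """Apply the assumed linear update bias <- matrix @ bias for `steps` rounds."""
    bias = np.asarray(bias0, dtype=float)
    for _ in range(steps):
        bias = matrix @ bias
    return bias

# Hypothetical 3-agent contagion matrix (entry [i, j]: influence of agent j on agent i).
contagion = np.array([
    [0.00, 0.20, 0.30],
    [0.25, 0.00, 0.20],
    [0.30, 0.35, 0.00],
])
rho = spectral_radius(contagion)
```

In this toy setting a radius below 1 means an injected bias fades over repeated rounds, while a radius above 1 would mean it amplifies.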
Results
The results indicate that evaluator biases propagate consistently across agents, with contagion coefficients ranging from 0.157 to 0.352. The study confirms the existence of a suppression regime for homogeneous-model agents, where bias propagation is significantly weaker than in cross-model scenarios. Furthermore, increasing the evaluator committee size from one to three reduces effective contagion by 72.4%, demonstrating a viable mitigation strategy.
Implications
The findings suggest that careful consideration of evaluator diversity in multi-agent systems can help maintain cognitive diversity and prevent systemic bias amplification. The open-source framework allows for further research into bias dynamics, potentially leading to improved designs for collaborative AI systems.
Generative Robust Optimisation
Optimization
Generative Models
Theory
- Introduces a framework for robust optimisation that uses deep generative models to define uncertainty sets.
- Establishes a five-point evaluation framework for assessing neural network-based uncertainty sets.
- Demonstrates the application of a Wasserstein Adversarial Autoencoder for generating uncertainty sets.
- Shows that the proposed method can effectively handle complex data distributions in optimisation problems.
Read more
Generative Robust Optimisation
Summary
The paper introduces Generative Robust Optimisation (GRO), a novel framework that enhances traditional robust optimisation by utilizing deep generative models to define uncertainty sets. Unlike classical methods that rely on fixed geometric shapes, GRO employs a neural network decoder to create uncertainty sets that can capture complex dependencies, such as nonlinear correlations and multimodal distributions, inherent in real-world data. The authors establish a five-point evaluation framework (comprising reconstruction fidelity, distribution matching, latent regularity, robust relevance, and computational tractability) to systematically assess the performance of any neural network-based uncertainty set. The framework is instantiated using a Wasserstein Adversarial Autoencoder, which incorporates Gaussian mixture model-guided training and constraint-consistency regularisation. The use of ReLU activations in the decoder allows for exact worst-case verification through mixed-integer programming. Extensive experiments demonstrate the effectiveness of GRO across various uncertainty distributions and generative architectures, showing that careful consideration of the five evaluation criteria leads to uncertainty sets that are expressive, well-calibrated, and computationally tractable.
Methodology
The methodology involves defining the uncertainty set as the image of a trained neural network decoder, which is calibrated over a latent space. The framework employs a Wasserstein Adversarial Autoencoder with Gaussian mixture model-guided training and constraint-consistency regularisation to ensure robust relevance. The decoder's ReLU activations facilitate exact worst-case verification using mixed-integer programming.
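The reason ReLU decoders admit exact mixed-integer verification is that each unit y = max(0, a) can be written with one binary variable. The standard big-M encoding is shown below as background; the bounds l and u on the pre-activation are assumed to be known, and nothing here is specific to the paper's formulation.

```latex
% Exact encoding of y = \max(0, a) given bounds l \le a \le u with l < 0 < u.
\begin{aligned}
  y &\ge a, & y &\ge 0, \\
  y &\le a - l\,(1 - z), & y &\le u\,z, \\
  z &\in \{0, 1\}.
\end{aligned}
% z = 1 forces y = a (active unit); z = 0 forces y = 0 (inactive unit).
```

Stacking these constraints for every decoder unit turns the worst case over the latent set into a mixed-integer program whose optimum is exact rather than a relaxation.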
Results
The experiments conducted on a production planning problem and a multi-period facility location study validate the GRO framework, demonstrating that it produces uncertainty sets that are expressive and well-calibrated while maintaining computational tractability. The systematic attention to the five evaluation criteria resulted in improved performance across various uncertainty distributions and generative architectures.
Implications
The GRO framework has significant implications for robust optimisation in various fields, particularly in scenarios where uncertainty is complex and cannot be adequately captured by traditional geometric shapes. It can be applied to production planning, resource allocation, and other decision-making processes that require robust solutions under uncertainty.
Protein contacts are already in the attention: a single-forward-pass alternative to the Categorical Jacobian
NLP
Large Language Models
Theory
- The proposed method allows for protein contact prediction in a single forward pass, significantly reducing computational cost compared to the Categorical Jacobian.
- Averaging a small subset of attention heads captures the relevant contact signal effectively, outperforming the CJ method on leakage-clean data.
- The optimal number of attention heads to average varies by architecture and reflects how models distribute contact information.
- The study introduces representation-CJ, extending the applicability of contact prediction methods to architectures without a masked-LM head.
Read more
Protein contacts are already in the attention: a single-forward-pass alternative to the Categorical Jacobian
Summary
This paper presents a novel approach to predicting protein contacts using attention mechanisms in protein language models (PLMs). The author critiques the Categorical Jacobian (CJ) method proposed by Zhang et al. (2024), which requires approximately 19L forward passes for a protein of length L, since it reads contacts by perturbing each residue with each of the 19 alternative amino acids. The study demonstrates that the relevant contact signal is already embedded within a small subset of attention heads, allowing for a more efficient single-forward-pass method. By averaging the top-K attention heads, selected from as few as 10 labeled proteins, the proposed method outperforms the CJ on leakage-clean data across various bidirectional models. The findings indicate that the selection of relevant heads, rather than the averaging process itself, is crucial for performance. Additionally, the paper introduces a representation-CJ (repr-CJ) for architectures lacking a masked-LM head and shows that both methods struggle with causal language models, suggesting that bidirectional pretraining is essential for capturing attention-encoded pair structures. The results highlight the potential for more efficient protein contact prediction without extensive supervised training.
Methodology
The study employs a comparative analysis between the Categorical Jacobian and a new method based on averaging the top-K attention heads from protein language models. The evaluation is conducted on leakage-clean datasets to ensure that results are not influenced by pretraining overlap. The optimal K value is determined based on a ranking derived from a small set of labeled proteins.
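The head-selection and averaging steps can be sketched as below. The scoring rule used to rank heads (mean attention on true contacts minus mean attention elsewhere) is an assumption chosen for simplicity, as are the function names; the paper's actual ranking criterion is not described in this summary.

```python
import numpy as np

def symmetrize(attn):
    """Make each head's (L, L) attention map symmetric. attn has shape (H, L, L)."""
    return 0.5 * (attn + attn.transpose(0, 2, 1))

def score_heads(attn, contacts):
    """Assumed score: attention mass on contact pairs minus non-contact pairs."""
    sym = symmetrize(attn)
    mask = contacts.astype(bool)
    return sym[:, mask].mean(axis=1) - sym[:, ~mask].mean(axis=1)

def select_top_k(scores, k):
    """Indices of the k highest-scoring heads, best first."""
    return np.argsort(scores)[::-1][:k]

def contact_map(attn, head_ids):
    """Predicted contact scores: average of the selected symmetrized heads."""
    return symmetrize(attn)[head_ids].mean(axis=0)
```

In use, `score_heads` would be averaged over the small labeled set to pick the heads once, after which `contact_map` needs only the attention tensors from a single forward pass per protein.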
Results
The proposed method outperformed the Categorical Jacobian by +9 percentage points on the ESM-2-650M model in leakage-clean evaluations. The results were consistent across multiple architectures, with the top-K head averaging method showing superior performance in predicting protein contacts without the need for extensive supervised training.
Implications
This research suggests a more efficient approach to protein contact prediction, which could facilitate advancements in protein structure prediction and related biological applications. The findings may also influence the design of future protein language models and their training methodologies.
From Handcrafted Features to Functional Edge Learning: Evolution of EEG Seizure Detection Frameworks
Time Series
Interpretability
Efficient ML
- Deep Learning models for EEG analysis face significant challenges in clinical deployment due to their black-box nature and high data requirements.
- Kolmogorov-Arnold Networks (KANs) offer a new paradigm by using learnable activation functions, enhancing interpretability and efficiency.
- KANs are more robust to data scarcity and can facilitate cross-patient personalization without extensive retraining.
- The paper provides a structured analysis of EEG seizure detection methodologies, highlighting the need for transparent and efficient models.
Read more
From Handcrafted Features to Functional Edge Learning: Evolution of EEG Seizure Detection Frameworks
Summary
This paper reviews the evolution of EEG seizure detection frameworks, emphasizing the transition from traditional handcrafted feature-based methods to advanced deep learning (DL) architectures. While DL has improved automated EEG interpretation, its clinical application is hindered by issues such as lack of interpretability, high data requirements, and computational costs. The authors introduce Kolmogorov-Arnold Networks (KANs) as a promising alternative, which utilize flexible, learnable activation functions to enhance interpretability and efficiency. KANs address the limitations of conventional DL models by providing better parameter efficiency, interpretability, and robustness in data-scarce environments. The review systematically analyzes existing methodologies, identifies barriers to clinical deployment, and highlights the advantages of KANs, proposing them as a fundamental shift necessary for the development of patient-specific EEG monitoring systems.
Methodology
The paper conducts a comprehensive review of existing EEG seizure detection frameworks, comparing traditional machine learning approaches with deep learning models and introducing KANs. It systematically analyzes the limitations of current models and discusses the theoretical foundations and advantages of KANs in addressing these challenges.
Results
The review establishes that KANs not only improve predictive accuracy but also enhance interpretability and efficiency, making them suitable for clinical applications. The authors argue that KANs represent a paradigm shift necessary for the future of EEG monitoring systems.
Implications
The findings suggest that adopting KANs could lead to more reliable and interpretable EEG seizure detection systems, ultimately improving patient care and safety in clinical settings. This shift may also facilitate the integration of AI in routine medical practice, enhancing the capabilities of wearable and implantable devices.
Asymptotic Signal Subspace Recovery in Softmax Attention Models
Theory
- Establishes a theoretical framework for understanding attention mechanisms in noisy environments.
- Demonstrates that learned query vectors converge to the latent signal subspace under specific conditions.
- Connects stochastic learning dynamics with deterministic limits using dynamical systems theory.
- Provides insights into the positive-feedback mechanism of attention in identifying informative tokens.
Read more
Asymptotic Signal Subspace Recovery in Softmax Attention Models
Summary
This paper investigates the theoretical underpinnings of attention mechanisms, particularly in softmax attention models, which have shown empirical success in extracting relevant information from noisy data. The author studies a stylized softmax-attention model where a query vector is learned through stochastic gradient ascent from a mix of informative and nuisance tokens. By leveraging the model's symmetry, a population objective is derived, and the learning dynamics are characterized using tools from stochastic approximation and dynamical systems theory. The main contribution is the establishment of a rigorous connection between the stochastic learning algorithm and its deterministic limit, demonstrating that under certain high-dimensional scaling and step-size conditions, the learned query converges almost surely to the one-dimensional signal subspace defined by the latent informative direction. This convergence implies that the attention mechanism can effectively recover the latent signal amidst noise, providing a theoretical foundation for understanding attention as a signal extraction process in high-dimensional environments. The findings suggest a new perspective on how attention mechanisms can discover relevant information, reinforcing the positive-feedback loop between alignment and attention weight assignment.
Methodology
The author employs a stylized softmax-attention model and analyzes the learning dynamics through stochastic gradient ascent. The study utilizes concepts from stochastic approximation and dynamical systems to derive a population objective and characterize the optimization landscape, focusing on the alignment parameter governing attention behavior.
Results
The main result indicates that the learned query vector converges almost surely to the signal subspace spanned by the latent informative direction, effectively recovering the underlying signal amidst noise. This convergence occurs under high-dimensional scaling assumptions and standard step-size conditions, highlighting the robustness of the attention mechanism in signal extraction.
Implications
The findings provide a rigorous theoretical foundation for the design and analysis of attention mechanisms in machine learning, particularly in applications involving noisy data. This could enhance the understanding and development of models in various domains such as natural language processing and computer vision, where attention mechanisms are prevalent.
Robustness Cannot be Reduced to Regularization: Studying Adversarial Training Beyond the Linear Case
Theory
Optimization
Efficient ML
- Adversarial training is effective but computationally expensive.
- No equivalence between adversarial risk and regularized risk exists for two-layer networks.
- The impossibility of reformulating adversarial risk extends to deeper architectures.
- The study emphasizes the need for new methodologies in adversarial training beyond linear models.
Read more
Robustness Cannot be Reduced to Regularization: Studying Adversarial Training Beyond the Linear Case
Summary
This paper addresses the significant challenge of adversarial vulnerability in machine learning models, particularly focusing on the limitations of adversarial training beyond linear models. While adversarial training has proven effective, its high computational cost poses a barrier to practical implementation. Previous research has shown that for linear models, adversarial risk can be reformulated as a regularized risk, allowing for more efficient training. However, this paper demonstrates that such an equivalence does not hold for two-layer networks, a class of models that is more expressive than linear models. The authors provide formal proofs indicating that the adversarial risk cannot be simplified into a regularized form that exhibits weak data dependence. They further extend their analysis to deeper architectures like Wide-ResNets, providing empirical evidence that the impossibility of such reformulations persists. This work highlights the fundamental differences between adversarial risk and regularized risk, suggesting that new approaches are needed for robust training in complex models.
Methodology
The authors conducted a theoretical analysis of adversarial risk in two-layer networks, employing formal proofs to demonstrate the lack of equivalence with regularized risk. They also provided empirical evaluations on Wide-ResNets to support their theoretical findings.
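For context, the linear-case equivalence that the paper contrasts against can be written out explicitly. The identity below is the standard textbook result for a linear classifier with a non-increasing margin loss; the notation is chosen here and is not taken from the paper.

```latex
% Linear model f(x) = w^\top x, label y \in \{-1, +1\},
% margin loss \ell non-increasing, perturbation budget \|\delta\|_p \le \varepsilon,
% and dual exponent q with 1/p + 1/q = 1.
\max_{\|\delta\|_p \le \varepsilon} \ell\bigl(y\, w^\top (x + \delta)\bigr)
  = \ell\bigl(y\, w^\top x - \varepsilon \|w\|_q\bigr)
% since \min_{\|\delta\|_p \le \varepsilon} y\, w^\top \delta = -\varepsilon \|w\|_q .
```

The inner maximization collapses to a norm penalty on w that does not depend on x, which is the kind of weakly data-dependent regularizer the paper shows cannot exist for two-layer networks.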
Results
The main result is the formal proof that adversarial risk cannot be expressed as a simple regularized risk for two-layer networks. Empirical evidence suggests that this limitation persists in more complex architectures, indicating a fundamental challenge in adversarial training.
Implications
The findings suggest that current approaches to adversarial training may not scale effectively to more complex models, necessitating the development of new strategies for ensuring robustness in machine learning systems, particularly in safety-critical applications.
Comparative Study of Neural Surrogate Architectures for Autoregressive Prediction of Internal Battery States
Time Series
Efficient ML
Theory
- The study compares four neural network architectures for predicting internal battery states.
- U-Net architecture shows superior performance with a 3% mean final-step nRMSE.
- The proposed models significantly reduce inference latency, achieving a 5.38× speed-up over traditional numerical solvers.
- Spatial inductive bias is identified as a critical factor influencing surrogate model performance.
Read more
Comparative Study of Neural Surrogate Architectures for Autoregressive Prediction of Internal Battery States
Summary
This paper presents a systematic comparison of four neural network architectures (MLP, ResNet, U-Net, FNO) designed as autoregressive state-transition operators for predicting internal states of lithium-ion batteries based on the Doyle-Fuller-Newman (DFN) model. The DFN model provides high-fidelity estimations of internal electrochemical states but is computationally intensive, making real-time applications challenging. The authors address this limitation by developing machine learning surrogates that can predict these states more efficiently. The study employs a unified training framework to ensure a controlled comparison of the architectures, focusing on their ability to generalize across various operating conditions. The results indicate that the U-Net architecture outperforms the others, achieving a mean final-step normalized root mean square error (nRMSE) of 3% after 300-step autoregressive rollouts, while also providing a significant speed-up of 5.38 times compared to the numerical solver. This research highlights the importance of spatial inductive bias in enhancing surrogate model performance, paving the way for improved battery management systems and digital twins.
Methodology
The authors formulated the problem as a discrete-time state-transition system, where the electrochemical state of the battery is represented as a state vector. They trained four different neural network architectures (MLP, ResNet, U-Net, FNO) under a unified framework using multi-step unrolling and current-conditioning to isolate the impact of spatial inductive bias on predictive accuracy and computational efficiency.
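The autoregressive evaluation described above amounts to feeding the surrogate's own output back in at every step and scoring the last state. The sketch below shows that loop with a placeholder `step_fn`; normalizing the RMSE by the range of the reference values is an assumption, since the summary does not say which normalization the authors use.

```python
import numpy as np

def rollout(step_fn, state0, currents):
    """Unroll a state-transition surrogate, conditioning each step on the applied current."""
    states = [np.asarray(state0, dtype=float)]
    for current in currents:
        states.append(step_fn(states[-1], current))  # prediction becomes the next input
    return np.stack(states)

def nrmse(pred, true):
    """RMSE divided by the range of the reference values (assumed normalization)."""
    rmse = np.sqrt(np.mean((pred - true) ** 2))
    return float(rmse / (true.max() - true.min()))
```

A final-step score for a 300-step rollout would then be `nrmse(rollout(model, s0, currents)[-1], reference[-1])`, where `model` and `reference` stand in for a trained surrogate and the solver output.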
Results
The U-Net architecture achieved a mean final-step nRMSE of 3% across all internal state variables after 300-step autoregressive rollouts. It also provided a 5.38× speed-up over the numerical DFN solver, demonstrating its effectiveness in real-time applications.
Implications
The findings suggest that the U-Net architecture can be effectively utilized in next-generation battery management systems and digital twins, enhancing internal state observability and operational efficiency in lithium-ion battery applications.
Physics-Informed Discovery of Yield Functions in Plasticity via Convex Neural Representations
Theory
Interpretability
Optimization
- Introduces a physics-informed framework for yield function discovery from displacement and force data.
- Utilizes a convex neural network to represent yield functions, ensuring convexity and symmetry.
- Trains the neural yield function using force equilibrium residuals instead of direct stress supervision.
- Validated against benchmark yield functions using finite element simulations.
Read more
Physics-Informed Discovery of Yield Functions in Plasticity via Convex Neural Representations
Summary
This paper addresses the challenge of identifying anisotropic yield functions in plasticity, which is complicated by the lack of direct stress observations and the need for multiple loading directions. The authors propose a physics-informed framework that discovers yield functions from full-field displacement and reaction force data, without requiring stress measurements or predefined parametric forms. The yield function is modeled using a convex neural network that enforces properties such as convexity and positive homogeneity, while also incorporating tension-compression symmetry. The neural yield function is trained using a differentiable stress update and a physics-informed force equilibrium loss across various loading cases. The framework is validated through finite element simulations against established yield functions (von Mises, Hill 1948, Yld2000-2d), demonstrating its effectiveness in accurately identifying yield contours and assessing performance under displacement noise and uncertainty. This study provides a novel approach for discovering anisotropic yield functions while maintaining the mechanical integrity required for elastoplastic stress integration.
Methodology
The authors developed a physics-informed framework that represents yield functions as convex neural networks. The training process involves minimizing the force equilibrium residuals derived from elastoplastic stress updates, using full-field displacement and reaction force data from multiple loading cases. The framework ensures that the yield function adheres to mechanical constraints while optimizing its parameters.
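The structural constraints named above (convexity, positive homogeneity, tension-compression symmetry) can be satisfied by construction. The sketch below uses a deliberately simple function class, a sum of Euclidean norms of linear maps, rather than the authors' convex neural network; it only illustrates that such a class contains von Mises in plane stress as a special case. The names `yield_value` and `P_VON_MISES` are introduced here.

```python
import numpy as np

# Plane-stress von Mises in the basis (sigma_xx, sigma_yy, tau_xy):
# sigma_vm^2 = sigma_xx^2 - sigma_xx*sigma_yy + sigma_yy^2 + 3*tau_xy^2
P_VON_MISES = np.array([
    [1.0, -0.5, 0.0],
    [-0.5, 1.0, 0.0],
    [0.0, 0.0, 3.0],
])

def yield_value(stress, factors):
    """Sum of norms ||A_k @ stress||: convex, degree-1 homogeneous, and even."""
    return float(sum(np.linalg.norm(a @ stress) for a in factors))

# With one factor A satisfying A.T @ A = P_VON_MISES, yield_value is von Mises stress.
von_mises_factor = np.linalg.cholesky(P_VON_MISES).T
```

Making the matrices `A_k` trainable would give a small learnable family that stays admissible for any parameter values, which is the property a convex neural representation is meant to guarantee.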
Results
The proposed framework successfully identified yield contours that aligned well with established yield functions (von Mises, Hill 1948, Yld2000-2d). It demonstrated robustness against displacement noise and effectively managed epistemic uncertainty, showcasing its potential for accurate yield function identification in anisotropic plasticity.
Implications
This research has significant implications for materials science and engineering, particularly in the development of more accurate constitutive models for complex materials. The ability to discover yield functions from accessible data could enhance predictive modeling in various applications, including structural analysis and material design.
Sakana Fugu Technical Report
Large Language Models
NLP
Reinforcement Learning
- Sakana Fugu combines the strengths of multiple LLMs to create a collectively intelligent system.
- Two model variants are introduced: Fugu for speed and Fugu-Ultra for high-quality answers.
- The training methodology includes fine-tuning, evolutionary algorithms, and reinforcement learning.
- Fugu models often surpass other publicly accessible models on benchmarks including SWE-Bench Pro, Terminal Bench, and CharXiv Reasoning.
Summary
The Sakana Fugu Technical Report presents the development of Sakana Fugu, a family of orchestrator models designed to enhance the capabilities of frontier Large Language Models (LLMs) by combining their individual specializations into a collectively intelligent system. The Fugu models are trained to understand user queries and dynamically create agentic scaffolds to address them, leading to superior performance compared to individual LLM agents. Two variants are released: Fugu, optimized for speed and everyday use, and Fugu-Ultra, which prioritizes answer quality for complex tasks. The report details the training paradigm, which includes large-scale fine-tuning, evolutionary algorithms, and reinforcement learning, and emphasizes the importance of orchestrating multiple models to achieve higher efficiency and modularity in AI capabilities. The results demonstrate that Fugu models surpass publicly accessible frontier models across various challenging benchmarks, suggesting a new direction for AI development focused on model orchestration rather than solely on larger models.
Methodology
The methodology involves training Fugu models using large-scale fine-tuning, evolutionary algorithms, and reinforcement learning techniques. The models are designed to adaptively orchestrate a team of LLM agents, dynamically selecting the most suitable agent for each user query.
Results
Fugu and Fugu-Ultra models demonstrate superior performance on a range of benchmarks, including SWE-Bench Pro, Terminal Bench, and CharXiv Reasoning, often surpassing other publicly accessible models. The Fugu model balances performance and latency, while Fugu-Ultra focuses on high-quality answers for complex tasks.
Implications
The development of Sakana Fugu suggests that AI capabilities can be enhanced through the orchestration of existing models rather than solely relying on larger models. This approach could lead to more modular, efficient, and accessible AI systems, allowing for continuous integration of new models and improved user control over model selection.
Information Lattice Learning as Probabilistic Graphical Model Structure Learning
Theory
Interpretability
Graph Learning
- ILL provides a framework for learning interpretable rules from signals, emphasizing low complexity.
- The probabilistic rules learned through ILL can be interpreted as marginal constraints in PGMs.
- The information lattice structure aids in understanding the relationships between different abstractions.
- ILL distinguishes between general and special lifting, impacting the reconstruction of probability distributions.
Summary
This paper presents Information Lattice Learning (ILL) as a method for learning interpretable rules from signals, particularly when the signal is a probability mass function. The authors argue that the probabilistic rules derived from ILL can be interpreted within the framework of probabilistic graphical models (PGMs). They detail how ILL constructs a hierarchy of abstractions through partitioning the signal space, leading to the identification of quotient variables and marginal laws. The paper distinguishes between general lifting, which encompasses all joint distributions satisfying learned constraints, and special lifting, which focuses on maximum-ignorance reconstructions. The authors clarify that while the information lattice is structured as a directed acyclic graph, it does not represent a Bayesian network but rather serves as a hypothesis space for graphical models. This perspective enhances the understanding of ILL in relation to PGMs and suggests new avenues for inference and hybrid symbolic-probabilistic learning.
Methodology
The authors introduce ILL as a process that begins with a signal and identifies human-interpretable abstractions through partitioning. They define rules as marginal distributions over these partitions and establish a connection to PGMs by interpreting these rules as constraints. The methodology involves projecting the signal onto a partition lattice and lifting the selected rules back to the signal domain, utilizing both general and special lifting techniques.
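The project-then-lift step can be illustrated on a four-outcome probability mass function. This is a minimal sketch under stated assumptions, not the authors' implementation: it handles a single partition, it reads "maximum ignorance" as maximum entropy, and the function names `project` and `special_lift` are hypothetical. Under those assumptions, the special lifting of one rule spreads each cell's mass uniformly over that cell.

```python
import numpy as np

def project(pmf, partition):
    """Rule for one partition: total probability mass in each cell.

    This is the marginal law of the quotient variable that maps each
    outcome to the cell containing it.
    """
    return np.array([pmf[list(cell)].sum() for cell in partition])

def special_lift(rule, partition, n):
    """Maximum-entropy pmf on n outcomes whose cell masses match `rule`.

    For a single partition this is uniform within each cell.
    """
    out = np.zeros(n)
    for mass, cell in zip(rule, partition):
        out[list(cell)] = mass / len(cell)
    return out

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability outcomes."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

pmf = np.array([0.1, 0.3, 0.2, 0.4])      # the signal
partition = [(0, 1), (2, 3)]              # one abstraction of the outcome space

rule = project(pmf, partition)            # cell masses: 0.4 and 0.6
lifted = special_lift(rule, partition, len(pmf))
print(rule, lifted, entropy(pmf), entropy(lifted))
```

Both `pmf` and `lifted` satisfy the learned constraint, so both belong to the general lifting (the family of all joint distributions consistent with the rule). The special lifting picks the member of that family with the highest entropy.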
Results
The paper demonstrates that ILL can be effectively framed as a form of structure learning for PGMs, providing a clear interpretation of learned rules as marginal laws of quotient variables. The authors show that the learned rule sets correspond to a family of joint distributions, and the special lifting method yields a unique maximum-ignorance reconstruction of the probability distribution.
Implications
The findings suggest that ILL can be applied in various domains requiring interpretable knowledge discovery, such as scientific research, artistic endeavors, and enterprise applications. The connection to PGMs opens up possibilities for leveraging established graphical modeling techniques in downstream applications, enhancing inference and identifiability.
When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents
Multimodal
- High AUC scores in probing do not guarantee effective detection of malicious content.
- The authors propose a candidate control set to improve evaluation methodologies.
- Two post-hoc diagnostics are introduced to differentiate between genuine and spurious detections.
- The study highlights the risks of shortcut learning in model evaluations.
Summary
This paper addresses the limitations of using Area Under the Curve (AUC) as the metric for evaluating hidden-state probes that detect indirect prompt injection (IPI) in multimodal computer-use agents. The authors present a cautionary case study using the Qwen2.5-VL-7B model in a teacher-forced replay protocol on the Mind2Web dataset. They argue that a high AUC score does not necessarily indicate reliable detection of malicious content, because it may be driven by surface artifacts and shortcut learning. To improve evaluation, the authors propose a candidate control set that includes two post-hoc diagnostics: a paired-construction scalar baseline for text-side injections and visually-matched controls for overlay surfaces. These diagnostics help clarify what can and cannot be inferred from a high AUC score about the presence of malicious content. The paper emphasizes the need for robust evaluation protocols that account for potential biases in model performance assessments.
Methodology
The authors conducted a case study using a linear logistic probe on the hidden-state features of the Qwen2.5-VL-7B model. They implemented a teacher-forced replay protocol on the Mind2Web dataset and introduced two post-hoc diagnostics to assess the validity of the AUC scores obtained. The diagnostics included a paired-construction scalar baseline for text injections and nuisance-matched visual controls.
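The idea behind a scalar baseline can be shown on synthetic data. The sketch below is an illustration in the spirit of the diagnostic, not the paper's protocol: the data are simulated, sequence length is a hypothetical nuisance feature, and the helper `auc` is written here for self-containment. Each injected sample is built from a benign one by appending a payload, so length alone separates the classes without looking at any content.

```python
import numpy as np

def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

rng = np.random.default_rng(0)
n = 200

# Paired construction: every injected sample is a benign sample plus a payload.
benign_len = rng.integers(80, 120, size=n)      # token counts (synthetic)
payload_len = rng.integers(30, 60, size=n)      # appended injection text
injected_len = benign_len + payload_len

lengths = np.concatenate([benign_len, injected_len])
labels = np.concatenate([np.zeros(n), np.ones(n)])

# A one-number "detector" that never inspects the content.
scalar_auc = auc(lengths, labels)
print(f"AUC of length-only baseline: {scalar_auc:.3f}")
```

If a trivial scalar like this reaches an AUC close to the probe's, the probe's score cannot be credited to detecting malicious content, which is the kind of shortcut the diagnostics are meant to expose.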
Results
The study found that while the probing AUC reached near-perfect scores, it did not reliably indicate the model's ability to detect malicious content due to the influence of surface artifacts and shortcut learning. The proposed diagnostics revealed that the high AUC could mislead evaluators regarding the model's true capabilities.
Implications
The findings suggest that current evaluation methodologies for multimodal agents need to be revised to avoid over-reliance on AUC metrics. By incorporating the proposed control set and diagnostics, researchers can achieve a more accurate understanding of model performance in detecting IPI, ultimately leading to safer and more reliable multimodal applications.
New Smooth Loss functions for Robust Regression that Closely Approximate Absolute Error and Provide Improved Performance on Datasets With Significant Outliers
Optimization
Theory
- The paper introduces two new loss functions for robust regression: Square Root Loss (SRL) and Smooth Mean Absolute Error (SMAE).
- Both proposed loss functions are infinitely differentiable and closely approximate MAE.
- Extensive empirical comparisons show superior performance of SRL and SMAE over traditional losses like Huber and Log-Cosh.
- The paper presents new robust linear regression models utilizing the proposed loss functions.
Summary
This paper addresses the challenges posed by outliers in regression tasks, where traditional loss functions like Mean Squared Error (MSE) can lead to poor model performance due to their sensitivity to large errors. The authors propose two new loss functions: Square Root Loss (SRL) and Smooth Mean Absolute Error (SMAE), which are infinitely differentiable and closely approximate the Mean Absolute Error (MAE). These new loss functions aim to provide improved robustness against outliers while maintaining desirable mathematical properties such as strict convexity for SRL and quasi-convexity for SMAE. The paper presents an extensive empirical evaluation of these loss functions against established alternatives like Huber and Log-Cosh losses across various benchmarks. The results demonstrate that the proposed loss functions significantly enhance the performance of regression models, particularly in datasets with a high presence of outliers. Additionally, the authors introduce new robust linear regression models that leverage these loss functions, further contributing to the field of robust regression.
Methodology
The authors developed two new loss functions, SRL and SMAE, and conducted empirical evaluations across various datasets to compare their performance with existing loss functions. They also formulated robust linear regression models based on these new loss functions, ensuring that they maintain desirable properties such as convexity and differentiability.
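This summary does not give the formulas for SRL or SMAE, so the sketch below does not attempt to reproduce them. It implements the baselines named here (MSE, MAE, Huber, Log-Cosh) along with `smooth_abs`, a generic and well-known smooth approximation of the absolute error. `smooth_abs` and its parameter `eps` are stand-ins chosen for illustration and should not be read as the paper's definitions. The point is only to show what "infinitely differentiable and close to MAE" looks like in practice, and how such losses treat an outlier compared with squared error.

```python
import numpy as np

def mse(r):
    """Squared error: grows quadratically, so outliers dominate."""
    return np.square(r)

def mae(r):
    """Absolute error: robust, but not differentiable at r = 0."""
    return np.abs(r)

def huber(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

def log_cosh(r):
    """log(cosh(r)), written in a form that does not overflow for large |r|."""
    a = np.abs(r)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

def smooth_abs(r, eps=0.1):
    """Illustrative stand-in (NOT the paper's SRL or SMAE).

    sqrt(r^2 + eps^2) - eps is infinitely differentiable, equals 0 at r = 0,
    and stays within eps of |r| everywhere.
    """
    return np.sqrt(np.square(r) + eps**2) - eps

residuals = np.array([0.0, 0.5, 3.0, 100.0])   # the last entry plays the outlier
for name, fn in [("mse", mse), ("mae", mae), ("huber", huber),
                 ("log_cosh", log_cosh), ("smooth_abs", smooth_abs)]:
    print(name, fn(residuals))
```

On the outlier, the squared error is two orders of magnitude larger than every other loss in the list, which is why the linear-growth losses are less sensitive to it.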
Results
The proposed loss functions, SRL and SMAE, demonstrated superior performance in regression tasks with significant outliers, outperforming traditional loss functions like Huber and Log-Cosh in various benchmark datasets. The empirical results indicated improved accuracy and stability during training, validating the effectiveness of the new approaches.
Implications
The findings suggest that the new loss functions are well suited to machine learning applications where datasets are prone to outliers, potentially leading to more accurate predictive models in finance, healthcare, and other domains that rely on regression analysis.
Computational Methods and Challenges in Cell-Free DNA Analysis for Multi-Cancer Early Detection
Multimodal
- cfDNA is a promising biomarker for non-invasive multi-cancer early detection.
- The review categorizes computational methods into statistical, machine learning, and deep learning approaches.
- Multimodal ensemble approaches are identified as having the highest readiness for clinical integration.
- Standardization of evaluation protocols is crucial for future research and comparison.
Summary
This paper reviews computational methods developed between 2022 and 2025 for analyzing cell-free DNA (cfDNA) in the context of multi-cancer early detection (MCED). The authors discuss the biological basis of cfDNA signals and the significance of fragmentomics and epigenetic features in cancer detection. They evaluate classical statistical methods, machine learning approaches, and deep learning frameworks, including autoencoder-based models, with attention to biological interpretability and clinical readiness. The review identifies key technical, computational, and methodological challenges in the field and highlights the need for standardized evaluation protocols to support future research and comparisons. The findings suggest that multimodal ensemble approaches show the most promise for clinical integration, and that standardized reporting of results is needed to assess methods reliably.
Methodology
The authors conducted a comprehensive review of computational methods for cfDNA analysis, focusing on fragmentomics and epigenetic features. They analyzed classical statistical methods, machine learning techniques, and deep learning frameworks, assessing their biological interpretability and clinical validation strategies.
Results
The review indicates that while various computational methods exist for cfDNA analysis, multimodal ensemble approaches are the most promising for clinical application. It also highlights the need for standardized protocols to improve the reliability and comparability of results across studies.
Implications
The findings suggest that advancements in cfDNA analysis could lead to significant improvements in early cancer detection, potentially transforming cancer screening practices and reducing treatment costs. The emphasis on standardization may enhance the clinical adoption of these technologies.