AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 48
Update frequency: 8h
Days of history: 7
VIMPO: Value-Implicit Policy Optimization for LLMs
Reinforcement Learning
Large Language Models
Optimization
- VIMPO is a critic-free policy optimization method that improves reasoning in LLMs.
- It derives a policy-implied value function using KL-regularized reinforcement learning principles.
- The method allows for token-level credit assignment without the instability of a learned critic.
- Empirical results show VIMPO outperforms existing methods like GRPO, especially under noisy rewards.
Summary
The paper introduces VIMPO (Value-Implicit Policy Optimization), a novel method for reinforcement learning that enhances the reasoning capabilities of large language models (LLMs) without the need for a critic. Traditional reinforcement learning approaches face a trade-off between simplicity and effective credit assignment. Actor-critic methods provide dense learning signals but require a learned value function, which can lead to training instability. Conversely, group-relative methods like GRPO simplify training by using trajectory-level advantages but lack fine-grained credit assignment. VIMPO addresses these challenges by deriving a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. This allows for a critic-free value optimization objective that incorporates outcome-level verifiable rewards. The method also provides a closed-form one-step temporal-difference advantage, enabling token-level credit assignment without a learned critic. Experimental results demonstrate that VIMPO outperforms GRPO on various benchmarks, particularly in competition-style evaluations, and shows resilience against noisy rewards, suggesting it can provide finer credit assignment while maintaining practical simplicity.
Methodology
VIMPO models autoregressive generation as a deterministic-transition Markov Decision Process (MDP) and derives a closed-form representation of the optimal value function. It uses a terminal boundary condition to create a critic-free value optimization objective, which trains a policy-implied value function. The method integrates a closed-form one-step temporal-difference advantage into a PPO-style actor update, enabling effective token-level credit assignment.
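As a hedged sketch of where a policy-implied value can come from, the identities below are the standard optimality conditions of KL-regularized RL, specialized to deterministic transitions, no discounting, and a reward $r$ received only at the terminal state $s_T$. The symbols $\beta$ (KL temperature), $\pi_{\mathrm{ref}}$ (reference policy), and $r$ are assumptions introduced for illustration; the paper's exact objective and notation may differ.

```latex
\begin{align}
\pi^*(a_t \mid s_t) &= \pi_{\mathrm{ref}}(a_t \mid s_t)\,
  \exp\!\Big(\tfrac{1}{\beta}\big(Q^*(s_t, a_t) - V^*(s_t)\big)\Big), \\
Q^*(s_t, a_t) &= V^*(s_{t+1}) \qquad (t < T,\ \text{no intermediate reward}), \\
\beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
  &= V^*(s_{t+1}) - V^*(s_t), \\
\beta \sum_{t=0}^{T-1} \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
  &= r - V^*(s_0) \qquad \text{with } V^*(s_T) = r .
\end{align}
```

Under these assumptions the third line is a one-step temporal-difference quantity expressed purely through policy log-ratios, which is one way a token-level advantage could be obtained without a learned critic.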
Results
VIMPO demonstrated significant improvements over the GRPO baseline across various mathematical RLVR benchmarks, including MATH-500, AIME 2024, AIME 2025, and OlympiadBench. It achieved faster training and higher validation accuracy, particularly excelling in competition-style evaluations. Under conditions of noisy rewards, VIMPO maintained a consistent advantage, indicating its robustness and effectiveness in credit assignment.
Implications
VIMPO could improve LLM performance on complex reasoning tasks such as mathematical problem solving and code generation. Its critic-free approach may simplify training while assigning credit more accurately to individual tokens, which is crucial for multi-step reasoning.
Spectral Retrieval-Augmented Time-Series Forecasting
Time Series
- Introduction of SpecReTF, a novel retrieval-augmented forecasting architecture.
- Combines frequency-domain analysis with recency-weighted pattern retrieval.
- Unified similarity measure integrates Jensen–Shannon divergence and cosine similarity.
- SpecReTF achieves state-of-the-art forecasting accuracy on benchmark datasets.
Summary
This paper introduces SpecReTF, a retrieval-augmented time-series forecasting method that addresses the limitations of traditional forecasting approaches on complex, non-stationary patterns. Traditional methods often struggle to capture rare or complex patterns because they rely on learned representations, which leads to issues such as spectral blindness and insensitivity to temporal recency. SpecReTF overcomes these challenges by converting time series into windowed frequency representations and employing a combined similarity metric that incorporates both amplitude and phase information. It also uses an exponential moving average weighting scheme to prioritize recent patterns over older data. Extensive experiments on benchmark datasets demonstrate that SpecReTF significantly outperforms existing time-domain retrieval methods across various non-stationary time series. The method captures periodic behaviors accurately during retrieval while remaining sensitive to new patterns, which improves overall forecasting performance.
Methodology
SpecReTF converts time series segments into the frequency domain using Short-time Fourier Transform (STFT). It computes a composite similarity score by combining Jensen–Shannon divergence for amplitude distributions with cosine similarity for phase alignment. An exponential moving average weighting scheme is applied to prioritize recent windows while retaining long-term patterns.
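The retrieval score described above can be sketched roughly as follows. This is an illustrative approximation, not the authors' implementation: the window sizes, the mixing weight `alpha`, the `decay` rate, averaging amplitudes over frames, and the function names are all assumptions.

```python
import numpy as np

def stft(x, win=32, hop=16):
    """Hann-windowed short-time Fourier transform, shape (n_frames, win // 2 + 1)."""
    window = np.hanning(win)
    frames = np.stack([x[s:s + win] * window
                       for s in range(0, len(x) - win + 1, hop)])
    return np.fft.rfft(frames, axis=1)

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence with log base 2 (0 for identical inputs, at most about 1)."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def spectral_similarity(a, b, alpha=0.5):
    """Blend amplitude similarity (1 - JS divergence) with phase cosine similarity."""
    A, B = stft(a), stft(b)
    amp_sim = 1.0 - js_divergence(np.abs(A).mean(axis=0), np.abs(B).mean(axis=0))
    # Cosine similarity between unit-phasor vectors equals the mean of cos(phase difference).
    phase_sim = np.mean(np.cos(np.angle(A) - np.angle(B)))
    return alpha * amp_sim + (1.0 - alpha) * phase_sim

def retrieve(query, windows, decay=0.9):
    """Score candidate windows (ordered oldest to newest) with exponential recency decay."""
    ages = np.arange(len(windows))[::-1]          # newest window has age 0
    sims = np.array([spectral_similarity(query, w) for w in windows])
    scores = sims * decay ** ages
    return int(np.argmax(scores)), scores
```

A window identical to the query scores 1.0 before recency weighting; older candidates are discounted by `decay ** age`, so a slightly worse but more recent match can win.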
Results
The experiments conducted on multiple benchmark datasets show that SpecReTF consistently achieves superior forecasting accuracy compared to leading retrieval-based and purely model-based methods, establishing new state-of-the-art results in time-series forecasting.
Implications
The findings suggest that SpecReTF can be effectively applied in various domains requiring time-series forecasting, such as finance, energy consumption, and healthcare monitoring, where capturing non-stationary patterns is crucial for accurate predictions.
What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis
NLP
Large Language Models
Theory
- Latent Chain-of-Thought models face challenges due to weak learning signals from outcome supervision.
- The dual collapse phenomenon involves gradient attenuation and representational drift in latent spaces.
- Process supervision can be effectively decomposed into Trajectory and Space Supervision.
- Generative reconstruction is more effective than geometric compression for preserving information capacity.
Summary
This paper investigates the challenges of robust latent reasoning in Latent Chain-of-Thought (CoT) models, which represent reasoning through continuous hidden states instead of verbose discrete sequences. The authors identify a dual collapse phenomenon where gradient attenuation and representational drift hinder effective learning. They propose a framework that decomposes process supervision into two dimensions: Trajectory Supervision, which provides dense stepwise reasoning signals, and Space Supervision, which maintains the semantic structure of the latent space. The authors introduce the Unified Latent Probe (ULP) to measure the mutual information between latent trajectories and reasoning steps. Their empirical findings reveal a strong correlation between reasoning accuracy and information fidelity in the latent chain, suggesting a shift from geometric imitation to mutual information maximization for improved latent reasoning supervision.
Methodology
The authors conducted an information-theoretic analysis of Latent CoT, examining the effects of different supervision strategies. They introduced the Unified Latent Probe (ULP) to quantify mutual information and performed empirical experiments to assess the impact of Trajectory and Space Supervision on training stability and reasoning accuracy.
Results
The experiments demonstrated that process supervision significantly stabilizes training, with increased gradient magnitudes indicating effective adaptation. The study found that while geometric compression can collapse the reasoning manifold, generative reconstruction better preserves the information capacity, leading to improved reasoning performance.
Implications
The findings suggest a new framework for supervising latent reasoning in machine learning models, emphasizing the importance of mutual information maximization. This could lead to more effective training strategies for models that rely on latent reasoning, potentially enhancing their performance on complex reasoning tasks.
Deep Learning for Soil Moisture Estimation: Fusing Satellite Data with Optimally-Lagged Meteorological Features
Time Series
- Optimal meteorological and inter-depth lags were identified using Cross-Correlation Function (CCF).
- A per-pixel CNN model showed significant improvement in soil moisture prediction when combined with depth features.
- The CNN-LSTM hybrid model achieved the best overall performance in held-out data evaluation.
- Incorporating subsurface depth information was crucial for enhancing prediction accuracy.
Summary
This paper addresses the challenge of accurately estimating soil moisture in semi-arid agricultural regions by integrating satellite data and meteorological information while accounting for the temporal delays in soil moisture response to atmospheric conditions. The authors introduce a Cross-Correlation Function (CCF) methodology to identify optimal lags (0–30 days) for meteorological variables and inter-depth lags (0–15 days) for vertical moisture propagation. The study evaluates three deep learning architectures: a per-pixel CNN for detailed estimation, an LSTM for daily plot-mean predictions, and a CNN-LSTM hybrid for pooled multi-patch training. The models were validated across seven agricultural plots in southeastern Spain, demonstrating that incorporating meteorological variables and subsurface depth information significantly enhances prediction accuracy. The results indicate that the CNN-LSTM hybrid achieved the highest performance (R² = 0.930, CVRMSE = 8%), showcasing the importance of modeling atmospheric and vertical delays for effective soil moisture estimation.
Methodology
The study employed a Cross-Correlation Function (CCF) to determine optimal temporal lags for meteorological variables and inter-depth lags for soil moisture. Three deep learning architectures were evaluated: a CNN for per-pixel estimation, an LSTM for daily plot-mean predictions, and a CNN-LSTM hybrid for pooled multi-patch training. The models were validated using a date-grouped split to prevent data leakage.
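A minimal sketch of lag selection with a cross-correlation function, assuming the optimal lag is the one that maximizes the absolute Pearson correlation between the lagged driver and the response. The function name `best_lag` and the synthetic data are hypothetical; the paper may use a different selection criterion.

```python
import numpy as np

def best_lag(driver, response, max_lag=30):
    """Return (lag, r) maximizing |Pearson r| between driver[t - lag] and response[t]."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        x = driver[:len(driver) - lag]   # driver shifted back by `lag` steps
        y = response[lag:]               # response aligned to the shifted driver
        r = np.corrcoef(x, y)[0, 1]
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best

# Synthetic check: the response follows the driver with a 7-day delay plus noise.
rng = np.random.default_rng(0)
rain = rng.normal(size=400)
moisture = np.roll(rain, 7) + 0.1 * rng.normal(size=400)
lag, r = best_lag(rain, moisture, max_lag=30)
```

The same routine could be applied between soil depths (for example with `max_lag=15`) to estimate inter-depth lags.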
Results
The per-pixel CNN achieved a strong single-patch result (R² = 0.877, RMSE = 2.28), while the average R² across seven patches improved to 0.535 with depth features. The CNN-LSTM hybrid model outperformed all others with an R² of 0.930 and a CVRMSE of 8%, indicating substantial improvements over the satellite-only baseline.
Implications
The findings suggest that integrating satellite and meteorological data with a focus on temporal and vertical delays can significantly enhance soil moisture estimation, which is crucial for precision agriculture. This approach may lead to better water management practices and improved crop yields in semi-arid regions.
FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning
Robotics
Computer Vision
Efficient ML
- Identification of a bottleneck trade-off in fixed-capacity LAMs affecting action alignment.
- Introduction of retained-prefix training for variable-length latent actions.
- FlexLAM outperforms traditional fixed-capacity LAMs across all evaluated token budgets.
- Supports inference-time token-budget adjustments without retraining.
Summary
The paper introduces FlexLAM, a novel approach to Latent Action Models (LAMs) that addresses the limitations of fixed-capacity bottlenecks in latent action learning. Traditional LAMs use a fixed capacity for encoding transitions, which can lead to a trade-off where overly tight codes discard crucial transition cues, while overly loose codes introduce unnecessary variation that complicates action alignment, especially when labeled data is scarce. FlexLAM innovates by employing variable-length latent actions through a technique called retained-prefix training, which allows the model to generate valid latent actions of varying lengths based on the complexity of the transition. This method enables the model to adaptively capture essential transition structures while maintaining alignment with executable actions. The authors demonstrate that FlexLAM outperforms fixed-capacity LAMs across various token budgets in standard evaluations, indicating that it not only provides flexibility during inference but also enhances the learning of latent-action interfaces. The findings suggest that FlexLAM can serve as an architecture-free upgrade to existing latent action models, improving performance in tasks such as Ego4D transition reconstruction and facilitating better alignment in environments with limited labels.
Methodology
The authors propose FlexLAM, which modifies the training of latent actions by using retained-prefix training. This approach allows multiple prefixes of a transition code to be valid latent actions, enabling the model to adaptively adjust to varying complexities in transitions. The performance of FlexLAM is compared against separately trained fixed-capacity LAMs across different token budgets in a standard evaluation setup.
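One way to picture training in which every prefix of a latent code must be usable is to truncate the code to a randomly chosen length for each training example. The sketch below assumes zero-masking of the dropped tokens and uniform sampling of the budget; both choices, the array sizes, and the name `retain_prefix` are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def retain_prefix(code, keep):
    """Keep the first `keep` latent tokens and zero out the rest.

    code: array of shape (n_tokens, dim).
    """
    mask = (np.arange(code.shape[0]) < keep)[:, None]
    return code * mask

rng = np.random.default_rng(0)
code = rng.normal(size=(8, 4))                   # 8 latent tokens of dimension 4 (hypothetical)
keep = int(rng.integers(1, code.shape[0] + 1))   # token budget sampled for this example
truncated = retain_prefix(code, keep)
```

At inference time the same function could be called with any fixed `keep`, which mirrors the idea of adjusting the token budget without retraining.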
Results
FlexLAM consistently outperformed fixed-capacity LAMs at every evaluated token budget in DMLab tests. It demonstrated improved performance in transition reconstruction tasks and maintained effective alignment under conditions of scarce labeled data. The model also allowed for flexible adjustments in token budgets during inference without the need for retraining.
Implications
The findings suggest that FlexLAM can enhance the efficiency and effectiveness of latent action learning in various applications, particularly in environments where labeled data is limited. Its architecture-free nature allows for easy integration into existing systems, potentially improving performance in robotics, video analysis, and other domains reliant on action recognition from video data.
Physiology-Aware CNN and Zero-Shot Multimodal LLMs for ECG Image Classification: A Comparative Study
Computer Vision
Large Language Models
Multimodal
- Physiology-aware CNN models outperform zero-shot multimodal LLMs in ECG image classification.
- LeadGroupECG model effectively captures anatomical relationships among ECG leads.
- CNN models achieved high ROC-AUC scores, indicating strong classification performance.
- Zero-shot LLMs showed near-chance performance, highlighting limitations in ECG interpretation.
Summary
This study investigates the effectiveness of zero-shot multimodal large language models (LLMs) and physiology-aware convolutional neural networks (CNNs) in classifying 12-lead ECG images into normal and abnormal categories. The authors highlight the unique challenges of ECG image interpretation, which relies on precise waveform morphology and lead relationships, distinguishing it from general image classification tasks. They developed a novel CNN model, LeadGroupECG, designed to aggregate features from anatomical lead groups, and compared its performance against established CNN architectures (ResNet18, DenseNet121, VGG16) and three prominent LLMs (GPT-5.2, GPT-4.1, Gemini-2.5 Pro) under zero-shot conditions. The study found that while CNN models achieved high classification accuracy (ROC-AUC of 0.92–0.94 internally and 0.85–0.86 externally), the LLMs performed poorly, with ROC-AUC scores around 0.5. The results suggest that despite the narrative generation capabilities of LLMs, their diagnostic performance in ECG interpretation remains limited, emphasizing the need for domain-specific architectures in clinical applications.
Methodology
The study used a large-scale public dataset of 12-lead ECG recordings rendered as single-page images for binary classification. The proposed LeadGroupECG model was developed to aggregate features from anatomical lead groups and was compared against baseline CNN models. Three LLMs were evaluated under fixed zero-shot prompts across multiple runs. The CNNs were trained and tested under identical protocols on both internal and external datasets; the LLMs were only evaluated, since they were used zero-shot.
Results
CNN-based models demonstrated stable performance with internal ROC-AUC scores ranging from 0.92 to 0.94 and external scores between 0.85 and 0.86. The LeadGroupECG model significantly improved upon its backbone without sacrificing external generalization. In contrast, the zero-shot LLMs achieved ROC-AUC scores around 0.5, indicating poor classification ability.
Implications
The findings suggest that while multimodal LLMs can generate contextual narratives for ECG images, they are not reliable for diagnostic tasks without task-specific training. This highlights the necessity for clinically grounded, domain-specific architectures in AI-based ECG interpretation, which could enhance diagnostic accuracy and support clinical decision-making.
ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation
Large Language Models
Reinforcement Learning
NLP
- Incorrect student-generated outputs can provide more valuable training signals than correct ones in on-policy distillation (OPD).
- ReNIO introduces a prefix-computable reweighting method that emphasizes negative trajectories without needing final-answer labels.
- The method leverages student-to-teacher probability ratios to identify and weight pivotal tokens leading to incorrect reasoning.
- ReNIO shows substantial performance improvements in mathematical reasoning and code generation tasks.
Summary
The paper introduces ReNIO, a novel method aimed at enhancing on-policy distillation (OPD) for large language models (LLMs) by reweighting the importance of negative trajectories. Traditional OPD treats all student-generated outputs (SGOs) equally, which overlooks the potential value of incorrect SGOs. The authors conducted experiments revealing that training solely on incorrect SGOs often yields better performance than training on correct ones. This is attributed to the exploratory reasoning preserved in incorrect SGOs, which can provide valuable insights into the model's failure modes. ReNIO addresses the challenge of emphasizing informative negative trajectories without requiring full answer rollouts by employing a sample-level reweighting strategy based on the student-to-teacher probability ratio. This method identifies pivotal tokens that lead to incorrect reasoning and assigns them higher weights, thus enhancing the learning signal from negative trajectories while maintaining the efficiency of OPD. The results demonstrate that ReNIO significantly improves both OPD and on-policy self-distillation (OPSD) across mathematical reasoning and code generation tasks, showcasing its effectiveness in optimizing LLM training.
Methodology
ReNIO employs a sample-level reweighting approach that uses the ratio of student-to-teacher probabilities to identify pivotal tokens associated with incorrect reasoning. It aggregates these ratios to assign normalized weights to SGOs, emphasizing those that contain strong corrective signals while preserving the efficiency of OPD.
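The reweighting idea can be sketched as below. The summary does not specify how per-token ratios are aggregated or normalized, so taking the maximum log-ratio per output and applying a softmax across outputs are assumptions, as are the `temperature` parameter and the function name.

```python
import numpy as np

def sample_weights(student_logp, teacher_logp, temperature=1.0):
    """Normalized weights for a batch of student-generated outputs (SGOs).

    student_logp, teacher_logp: lists of 1-D arrays holding per-token
    log-probabilities of the same sampled tokens under each model.
    """
    scores = []
    for s, t in zip(student_logp, teacher_logp):
        log_ratio = np.asarray(s) - np.asarray(t)   # log(p_student / p_teacher) per token
        scores.append(log_ratio.max())              # assumption: the pivotal token dominates
    scores = np.array(scores) / temperature
    w = np.exp(scores - scores.max())               # numerically stable softmax
    return w / w.sum()

# Two SGOs: the teacher agrees with the first and strongly disagrees with one token of the second.
student = [np.log([0.9, 0.8, 0.7]), np.log([0.9, 0.8, 0.7])]
teacher = [np.log([0.9, 0.8, 0.7]), np.log([0.9, 0.05, 0.7])]
weights = sample_weights(student, teacher)
```

Because the ratios are computed from token probabilities on the generated prefix, no final-answer label is needed to produce the weights in this sketch.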
Results
ReNIO achieved relative performance gains of up to 8.90% for the Qwen3-1.7B model and 10.00% for the R1-Distill-Qwen-7B model on mathematical reasoning benchmarks, demonstrating its effectiveness in improving OPD and OPSD.
Implications
The findings suggest that incorporating negative trajectory information can enhance the training of LLMs, leading to more robust reasoning capabilities. This approach could be applied to various reasoning tasks and potentially improve the efficiency of training processes in LLMs.
Comparing Linear Probes with Mahalanobis Cosine Similarity
Interpretability
Theory
Large Language Models
- Mahalanobis cosine similarity (MCS) provides a near-perfect linear prediction of out-of-distribution (OOD) AUROC across multiple models and datasets.
- Theoretical proof establishes the linear relationship between MCS and OOD AUROC under specific conditions.
- MCS correlates with probe performance far better than Euclidean cosine similarity (ECS).
- The study identifies failure modes for the linearity of MCS and OOD AUROC, enhancing understanding of probe generalization.
Summary
This paper investigates the relationship between linear probes and Mahalanobis cosine similarity (MCS), proposing MCS as a more effective measure for comparing linear probes in interpretability research. The authors extend previous findings that MCS correlates strongly with out-of-distribution (OOD) area under the receiver operating characteristic curve (AUROC), demonstrating this relationship across various models, layers, and datasets. They provide a theoretical framework explaining the linear relationship between MCS and OOD AUROC, showing that both metrics are sigmoid-shaped functions of the probe's signal-to-noise ratio (SNR). The study also identifies conditions under which this linearity may fail, verified through empirical tests. The findings suggest that MCS is a theoretically grounded and empirically robust alternative to traditional Euclidean cosine similarity (ECS) for evaluating linear probes.
Methodology
The authors employed logistic regression probes trained on in-distribution (ID) and out-of-distribution (OOD) datasets across various models (Llama-70B, Llama-8B, Qwen-7B) and layers. They calculated MCS using the covariance of the OOD data and compared it to the traditional ECS. The relationship between MCS and OOD AUROC was analyzed through empirical validation and theoretical derivation.
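A minimal sketch of the two similarity measures, assuming MCS is the cosine similarity between probe directions under the inner product induced by the data covariance Σ (that is, uᵀΣv). The summary only says the covariance of the OOD data is used, so the exact form, including whether Σ or its inverse appears, is an assumption here.

```python
import numpy as np

def ecs(u, v):
    """Euclidean cosine similarity between two probe directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def mcs(u, v, cov):
    """Cosine similarity under the covariance-weighted inner product u^T cov v."""
    return float(u @ cov @ v / np.sqrt((u @ cov @ u) * (v @ cov @ v)))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # anisotropic toy activations
cov = np.cov(X, rowvar=False)
u, v = rng.normal(size=5), rng.normal(size=5)             # two hypothetical probe directions

print("ECS:", ecs(u, v))
print("MCS:", mcs(u, v, cov))
```

Under this definition MCS equals the Pearson correlation between the two probes' projections of the data, which is one intuition for why it could track downstream probe behavior more closely than ECS.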
Results
The study found that MCS consistently predicts OOD AUROC with R² values exceeding 0.93 across different models, layers, and datasets, while ECS showed significantly lower R² values (as low as 0.06). The theoretical framework confirmed that under balanced classes and Gaussian projections, the relationship between MCS and OOD AUROC is linear, with empirical verification of conditions leading to deviations from this linearity.
Implications
The findings suggest that MCS can enhance the interpretability of machine learning models by providing a more reliable metric for comparing linear probes. This could lead to better understanding of model generalization and performance, particularly in applications requiring robust interpretability across varying datasets.
Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems
Large Language Models
NLP
Theory
- Introduces Contagion Networks to measure evaluator bias propagation in multi-agent LLM systems.
- Establishes a Cross-Agent Contagion Matrix (ΓN) for quantifying bias spread across agents.
- Identifies three propagation regimes and demonstrates that homogeneous agents have weaker contagion effects.
- Finds that increasing evaluator committee size can reduce effective contagion by 72.4%.
Summary
This paper introduces the concept of Contagion Networks to analyze how biases from evaluators in multi-agent large language model (LLM) systems propagate through an agent network. The study highlights the potential for systematic evaluation biases to influence the outputs of multiple agents, thereby affecting the overall performance and diversity of the system. Through a controlled experiment involving three agents with distinct evaluator bias profiles, the author constructs a Cross-Agent Contagion Matrix (Γ3) to quantify bias propagation. The findings reveal that biases consistently spread among agents, with coefficients ranging from 0.157 to 0.352. The paper identifies three distinct propagation regimes based on the spectral radius of the contagion matrix, demonstrating that homogeneous-model agents exhibit weaker contagion compared to cross-model agents. Additionally, the research shows that increasing the size of the evaluator committee significantly reduces effective contagion, providing a practical strategy for mitigating bias propagation. The paper concludes by releasing an open-source framework for further exploration of bias dynamics in multi-agent systems.
Methodology
The study employs a controlled experiment with three LLM agents, each exhibiting different evaluator bias profiles. It constructs a Cross-Agent Contagion Matrix to quantify bias propagation and utilizes Test-Time Reinforcement Learning (TTRL) for agents to adapt their strategy weights based on evaluations from other agents. The contagion coefficients are calculated to measure the influence of one agent's evaluation on another's strategy distribution.
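Since the propagation regimes are described in terms of the spectral radius of the contagion matrix, the snippet below shows how that quantity is computed. The matrix entries are placeholders chosen inside the reported 0.157 to 0.352 range, not the paper's measurements, and the zero diagonal is an assumption. The regime thresholds are not given in the summary, so none are encoded here.

```python
import numpy as np

def spectral_radius(gamma):
    """Largest absolute eigenvalue of a square contagion matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(gamma))))

# Placeholder 3x3 matrix: entry (i, j) is the assumed influence of agent j's
# evaluations on agent i's strategy distribution. Values are illustrative only.
gamma3 = np.array([
    [0.000, 0.200, 0.300],
    [0.157, 0.000, 0.250],
    [0.352, 0.200, 0.000],
])
rho = spectral_radius(gamma3)
print("spectral radius:", rho)
```

In a simple linear propagation model of the form x_{k+1} = Γ x_k, a spectral radius below 1 means perturbations decay over rounds; the paper's regimes may be defined differently.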
Results
The results indicate that evaluator biases propagate consistently across agents, with contagion coefficients ranging from 0.157 to 0.352. The study confirms the existence of a suppression regime for homogeneous-model agents, where bias propagation is significantly weaker than in cross-model scenarios. Furthermore, increasing the evaluator committee size from one to three reduces effective contagion by 72.4%, demonstrating a viable mitigation strategy.
Implications
The findings suggest that careful consideration of evaluator diversity in multi-agent systems can help maintain cognitive diversity and prevent systemic bias amplification. The open-source framework allows for further research into bias dynamics, potentially leading to improved designs for collaborative AI systems.
Physics-Informed Discovery of Yield Functions in Plasticity via Convex Neural Representations
Theory
Interpretability
Optimization
- Introduces a physics-informed framework for yield function discovery from displacement and force data.
- Utilizes a convex neural network to represent yield functions, ensuring convexity and symmetry.
- Trains the neural yield function using force equilibrium residuals instead of direct stress supervision.
- Validated against benchmark yield functions using finite element simulations.
Summary
This paper addresses the challenge of identifying anisotropic yield functions in plasticity, which is complicated by the lack of direct stress observations and the need for multiple loading directions. The authors propose a physics-informed framework that discovers yield functions from full-field displacement and reaction force data, without requiring stress measurements or predefined parametric forms. The yield function is modeled using a convex neural network that enforces properties such as convexity and positive homogeneity, while also incorporating tension-compression symmetry. The neural yield function is trained using a differentiable stress update and a physics-informed force equilibrium loss across various loading cases. The framework is validated through finite element simulations against established yield functions (von Mises, Hill 1948, Yld2000-2d), demonstrating its effectiveness in accurately identifying yield contours and assessing performance under displacement noise and uncertainty. This study provides a novel approach for discovering anisotropic yield functions while maintaining the mechanical integrity required for elastoplastic stress integration.
Methodology
The authors developed a physics-informed framework that represents yield functions as convex neural networks. The training process involves minimizing the force equilibrium residuals derived from elastoplastic stress updates, using full-field displacement and reaction force data from multiple loading cases. The framework ensures that the yield function adheres to mechanical constraints while optimizing its parameters.
Results
The proposed framework successfully identified yield contours that aligned well with established yield functions (von Mises, Hill 1948, Yld2000-2d). It demonstrated robustness against displacement noise and effectively managed epistemic uncertainty, showcasing its potential for accurate yield function identification in anisotropic plasticity.
Implications
This research has significant implications for materials science and engineering, particularly in the development of more accurate constitutive models for complex materials. The ability to discover yield functions from accessible data could enhance predictive modeling in various applications, including structural analysis and material design.
When to Trust, How to Distill: Multi-Foundation Model Guidance for Lightweight, Robust Scientific Time Series Forecasting
Time Series
Efficient ML
Theory
- Introduction of Guard framework for dynamic multi-teacher knowledge distillation.
- Adaptive mechanisms for selecting teacher models based on input statistics and uncertainty.
- Significant RMSE reduction compared to traditional distillation methods.
- Demonstrated effectiveness in four scientific domains despite distributional misalignment.
Summary
This paper addresses the challenges of deploying Time-Series Foundation Models (TSFMs) in scientific domains, particularly due to distributional misalignment and high computational costs. The authors propose a novel framework called Gated Uncertainty-Aware Routing for Distillation (Guard), which aims to extract latent structural knowledge from multiple misaligned foundation models to train lightweight, specialized forecasters. Guard employs two key mechanisms: a Contextual Router that selects the most relevant teacher model based on local input statistics, and an Uncertainty-Gated Temperature mechanism that adjusts the distillation strength based on the confidence of the teacher models. The framework is evaluated across four climate-critical domains: meteorology, ecosystem carbon flux, soil moisture, and energy grids. The results demonstrate that Guard significantly reduces RMSE compared to a fixed-weight multi-teacher distillation baseline, effectively distilling knowledge from pretrained models even in cases of suboptimal zero-shot accuracy. The findings indicate that domain-misaligned teachers can still provide valuable corrections, outperforming globally superior models in challenging instances, thus enabling high-precision forecasting suitable for resource-constrained environments.
Methodology
The Guard framework utilizes a two-pronged approach: a Contextual Router for dynamic teacher selection based on local input characteristics, and an Uncertainty-Gated Temperature mechanism to modulate the strength of knowledge distillation according to the reliability of the teacher models. This allows for instance-wise decision-making during the training process, enhancing the model's ability to adapt to diverse temporal dynamics.
Results
The evaluation of Guard showed a significant reduction in RMSE across various forecasting tasks compared to a fixed-weight multi-teacher distillation baseline. The framework successfully distilled knowledge from pretrained foundation models, even when they exhibited poor zero-shot performance due to distribution shifts. Notably, Guard outperformed globally superior models on 28.5% of the most challenging instances.
Implications
The findings suggest that Guard can facilitate the deployment of robust time-series forecasting models in scientific applications, particularly in resource-constrained settings such as edge-computing environments. This could enhance the monitoring and prediction capabilities in critical areas like meteorology and ecosystem management.
How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural
NLP
Large Language Models
Efficient ML
- Introduces a novel measure of linear recoverability for Transformer FFN blocks using closed-form least squares.
- Demonstrates that linear recoverability varies significantly across different FFN blocks and is a learned property rather than an architectural one.
- Finds that residual nonlinearity is not well captured by low-order multiplicative models.
- Highlights the potential for targeted compression of FFN blocks based on their linear recoverability profiles.
Summary
This paper investigates the linearity of Transformer feed-forward networks (FFNs), which are often assumed to be nonlinear computational units. The author introduces a method to measure the linear recoverability of each FFN block by decomposing its input-output mapping into a linear approximation and a residual. The linear recoverability, quantified by the R²_lin metric, reveals significant variability across different FFN blocks in models like GPT-2, Pythia-160m, and llama-160m, indicating that linearity is not a fixed architectural trait but rather a learned characteristic of each block. The study finds that adjacent blocks can exhibit vastly different levels of linearity, and that the residual nonlinearity cannot be adequately explained by low-order multiplicative interactions. Furthermore, the findings suggest practical implications for model compression, as high linear recoverability allows for the replacement of FFN blocks with smaller linear layers without significant performance loss, while low-recoverability blocks indicate areas where such replacements may be detrimental. The paper emphasizes the importance of using exact closed-form measurements to assess linear recoverability, as naive trained linear baselines may misrepresent the true linearity of FFN blocks.
Methodology
The author treats each FFN block as a position-wise mapping and computes its best affine approximation using closed-form least squares. The residual is analyzed using a low-rank bilinear probe to assess the nature of the unrecovered computation. The study involves a depth survey of multiple pretrained models to evaluate the linear recoverability across different blocks.
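The closed-form measurement described above can be sketched in a few lines. This is a minimal illustration rather than the author's code: the function name `linear_recoverability`, the toy ReLU block, and the choice to pool R² over all output dimensions are assumptions made for the example.

```python
import numpy as np

def linear_recoverability(X, Y):
    """Fit Y ~ X @ W + b in closed form and return the pooled R^2 of the fit."""
    # A trailing column of ones turns the linear fit into an affine one.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
    residual = Y - X_aug @ coef
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))            # stand-in for FFN inputs
A = rng.normal(size=(16, 16))              # an exactly affine "block"
W_in = rng.normal(size=(16, 64)) / 4.0     # toy ReLU FFN weights (hypothetical)
W_out = rng.normal(size=(64, 16)) / 8.0

r2_affine = linear_recoverability(X, X @ A + 0.5)
r2_relu = linear_recoverability(X, np.maximum(X @ W_in, 0.0) @ W_out)
print(f"affine map: {r2_affine:.4f}, toy ReLU FFN: {r2_relu:.4f}")
```

On a real model, `X` and `Y` would be the recorded inputs and outputs of one FFN block at every token position; what remains after the affine fit is the residual that the paper probes further.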
Results
The analysis of twelve FFN blocks from three different models reveals that linear recoverability (R²_lin) varies widely, with some blocks being nearly linear (R²_lin > 0.99) while others are strongly nonlinear (R²_lin < 0.3). The residual analysis shows that unrecovered computations are not simply low-order multiplicative interactions, and the findings suggest that high recoverability can be leveraged for effective model compression.
Implications
The findings have significant implications for model compression strategies in Transformer architectures, suggesting that FFN blocks with high linear recoverability can be replaced with smaller, more efficient layers without sacrificing performance. This could lead to more efficient models that maintain competitive performance while reducing computational costs.
Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers
NLP
Large Language Models
Computer Vision
- Introduces a scalable framework for linear mode connectivity in billion-parameter pretrained transformers.
- Utilizes parameterized weight transformations and a dual learning procedure for effective model merging.
- Achieves near-zero loss barriers on WikiText and maintains high accuracy on ImageNet during interpolation.
- Demonstrates the importance of resolving parameter symmetries in enhancing model connectivity.
Summary
This paper addresses the challenge of merging independently trained neural networks, particularly billion-parameter pretrained transformers, by enhancing the concept of linear mode connectivity (LMC). The authors propose a novel framework that utilizes parameterized functionality-preserving weight transformations to align functionally equivalent solutions across models. A dual learning procedure is introduced, allowing both models to jointly learn transformations towards a shared linear interpolation path. This bidirectional optimization significantly reduces interpolation barriers, enabling reliable merging of large-scale architectures. Empirical results demonstrate that the proposed method achieves near-zero loss barriers on WikiText for medium-sized language models and maintains over 69% ImageNet top-1 accuracy for ViT-L throughout the interpolation path. The findings suggest that resolving parameter symmetries allows for effective connectivity and merging of large pretrained transformers through linear paths, improving interpolation performance.
Methodology
The authors developed a framework that includes a broad family of functionality-preserving weight transformations for transformers. These transformations are parameterized under structural constraints and optimized directly with respect to the loss along the interpolation path. The dual learning procedure allows both endpoint models to learn their respective symmetry transformations, facilitating a more effective alignment and reducing interpolation barriers.
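To make the two central notions concrete, the sketch below uses the simplest functionality-preserving transformation, a permutation of hidden units, on a toy two-layer network, and measures the loss barrier along the linear interpolation path. The paper's transformation family is broader and is learned jointly for both endpoints; here the alignment is known in advance, and the names `permute_hidden` and `loss_barrier` as well as the barrier definition (largest excess over the linear blend of endpoint losses) are assumptions for illustration.

```python
import numpy as np

def mlp(params, x):
    W1, W2 = params
    return np.tanh(x @ W1) @ W2

def permute_hidden(params, perm):
    """Reorder hidden units; the network function is unchanged."""
    W1, W2 = params
    return W1[:, perm], W2[perm, :]

def mse(params, x, y):
    return float(np.mean((mlp(params, x) - y) ** 2))

def loss_barrier(params_a, params_b, x, y, num_points=11):
    """Largest excess of the interpolated loss over the linear blend of endpoint losses."""
    loss_a, loss_b = mse(params_a, x, y), mse(params_b, x, y)
    worst = 0.0
    for alpha in np.linspace(0.0, 1.0, num_points):
        blended = tuple((1 - alpha) * pa + alpha * pb
                        for pa, pb in zip(params_a, params_b))
        excess = mse(blended, x, y) - ((1 - alpha) * loss_a + alpha * loss_b)
        worst = max(worst, excess)
    return worst

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))
model_a = (rng.normal(size=(4, 8)), rng.normal(size=(8, 1)))
y = mlp(model_a, x)                      # model_a fits the data exactly

perm = np.roll(np.arange(8), 1)          # a non-identity reordering
model_b = permute_hidden(model_a, perm)  # same function, different weights

barrier_raw = loss_barrier(model_a, model_b, x, y)
model_b_aligned = permute_hidden(model_b, np.argsort(perm))
barrier_aligned = loss_barrier(model_a, model_b_aligned, x, y)
print(f"barrier before alignment: {barrier_raw:.4f}, after: {barrier_aligned:.2e}")
```

The two endpoints compute the same function, yet naive interpolation between them passes through a high-loss region; undoing the symmetry removes the barrier.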
Results
The proposed method achieved near-zero loss barriers on WikiText for medium-sized language models and maintained over 69% top-1 accuracy on ImageNet for ViT-L across the interpolation path. This indicates a significant improvement in interpolation performance and model merging reliability for large pretrained transformers.
Implications
The findings suggest that the proposed framework can enhance the scalability and effectiveness of model merging in large pretrained transformers, potentially leading to more efficient model reuse and composition in various applications such as natural language processing and computer vision.
Causal Gaussian Processes for Robust Treatment Effect Evaluation with Unobserved Confounding
Theory
- Introduces Causal Gaussian Processes (CGP) for evaluating treatment effects with unobserved confounding.
- Develops a universal discretization method for approximating causal models in continuous domains.
- Demonstrates the effectiveness of CGP in mitigating confounding bias in observational data.
- Provides a framework that requires only basic temporal ordering between treatment and outcome.
Summary
This paper addresses the challenge of evaluating causal treatment effects in the presence of unobserved confounding bias, which complicates the identification of causal effects from observational data. Traditional methods often require detailed prior knowledge or are limited to discrete treatments and outcomes. The authors propose a novel approach using Causal Gaussian Processes (CGP) that allows for robust treatment effect evaluation in continuous domains. They introduce a universal discretization method that approximates the observational and interventional distributions of any causal model with arbitrary accuracy using a finite number of latent states. This approach leverages the flexibility of Gaussian processes to model complex relationships while mitigating the impact of confounding bias. The paper demonstrates that the CGP models can effectively capture the underlying causal relationships even when faced with biased observational data, thus providing a more reliable framework for causal inference in practical applications.
Methodology
The authors propose a universal discretization technique that approximates the causal model's observational and interventional distributions using a finite number of latent states. They then develop Causal Gaussian Process models that utilize this discretization to learn from confounded observations, allowing for robust causal effect estimation.
Results
The results indicate that the proposed CGP models can accurately approximate the causal effects even in the presence of unobserved confounding, outperforming traditional Gaussian process regression methods that amplify biases. The paper provides empirical evidence demonstrating the effectiveness of the CGP approach in various scenarios.
Implications
This work has significant implications for policy evaluation and causal inference in fields where unobserved confounding is prevalent. The CGP framework can be applied in economics, healthcare, and social sciences, enabling more accurate assessments of treatment effects and informing better decision-making.
Distribution-Aware Diffusion-LLM for Robust Ultra-Long-Term Time Series Forecasting
Time Series
Large Language Models
Generative Models
- Introduction of Diffusion-LLM framework that combines LLMs with conditional diffusion models for time series forecasting.
- Improvement in multimodal alignment and probabilistic modeling through a shared latent space.
- Significant performance gains in ultra-long-term and few-shot forecasting across multiple benchmarks.
- Demonstration of DDPMs as effective regularizers for enhancing LLM robustness.
Summary
This paper presents a novel framework called Diffusion-LLM, which integrates a conditional diffusion model into a Large Language Model (LLM) pipeline for time series forecasting. The authors address the limitations of existing LLMs in handling multimodal data and probabilistic modeling, particularly for ultra-long-term forecasting tasks. By embedding both inputs and targets into a shared latent space, the Diffusion-LLM framework enhances the model's ability to learn the conditional distribution of future data while improving semantic alignment. The methodology involves a Denoising Diffusion Probabilistic Model (DDPM) that acts as an implicit regularizer, allowing for better multimodal alignment and robust forecasting. The authors evaluate their approach on six long-term forecasting benchmarks, demonstrating significant improvements over existing LLM-based methods, particularly in ultra-long-term and few-shot forecasting scenarios. The results indicate that the integration of distribution-aware regularization enhances the robustness and generalization capabilities of LLMs in time series forecasting.
Methodology
The Diffusion-LLM framework employs a Denoising Diffusion Probabilistic Model (DDPM) integrated with a Large Language Model (LLM). It utilizes a reprogramming strategy to embed time series data into a shared token space, allowing for joint training of the LLM and DDPM. The DDPM estimates the conditional distribution of forecast embeddings based on the input lookback window, providing a distribution-aware signal that regularizes the LLM.
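For readers unfamiliar with DDPMs, the standard forward (noising) process and the noise-prediction loss can be sketched as follows. This shows only the generic DDPM ingredients applied to a stand-in forecast embedding; the linear beta schedule, the step count, and all names are assumptions, and the conditioning of the denoiser on the lookback window and the joint training with the LLM are not shown.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) and return it with the noise that was added."""
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between the denoiser's output and the injected noise."""
    return float(np.mean((predicted_noise - true_noise) ** 2))

rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)                 # cumulative signal retention

forecast_embedding = rng.normal(size=(8, 32))       # hypothetical target embedding
x_early, _ = q_sample(forecast_embedding, 0, alpha_bar, rng)
x_late, noise_late = q_sample(forecast_embedding, 999, alpha_bar, rng)
```

At early steps the sample stays close to the clean embedding, while at the final step it is almost pure noise; a denoiser trained with `noise_prediction_loss` learns to reverse this corruption.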
Results
The proposed Diffusion-LLM framework outperformed existing LLM-based baselines across six long-term forecasting benchmarks, including ETT, Weather, and ECL. The results showed notable improvements in ultra-long-term forecasting capabilities and enhanced performance in few-shot learning scenarios, demonstrating the effectiveness of distribution-aware regularization.
Implications
The findings suggest that integrating diffusion models with LLMs can significantly enhance the robustness and generalization of time series forecasting models. This approach could be applied in various domains requiring long-term predictions, such as energy systems, healthcare, and climate science, where accurate forecasting from limited historical data is critical.
Reward-free Pretraining for Reinforcement Learning via Occupancy Coverage Maximization
Reinforcement Learning
- ROVER maximizes state-space coverage for effective exploration in sparse-reward environments.
- The method employs a learned resolvent world model to estimate occupancy, addressing common estimation challenges.
- Introduction of a virtual 'sink' state stabilizes learning by managing unsupported state-action regions.
- Empirical results show ROVER achieves superior coverage and initialization compared to traditional reward-free methods.
Summary
This paper addresses the challenge of sparse rewards in reinforcement learning (RL) by proposing a novel method for pretraining exploration policies without the need for reward signals. The authors introduce ROVER (Reward-free pretraining via Occupancy coVERage maximization), which aims to maximize state-space coverage through an occupancy measure. This method is framed as an entropy maximization problem and utilizes a learned resolvent world model to estimate occupancy, thus avoiding common issues in density and entropy estimation. A key innovation of ROVER is the introduction of a virtual 'sink' state that helps balance the exploration of known states with the expansion into unexplored regions, preventing cyclic behaviors during learning. The authors demonstrate that ROVER outperforms standard reward-free baselines in both tabular and pixel-based sparse navigation tasks, leading to more uniform coverage and better initializations for downstream tasks. Overall, this work provides a robust framework for reward-free policy pretraining that is particularly beneficial in multi-task, meta-, and continual learning scenarios where rewards are sparse or unavailable.
Methodology
The authors formalize occupancy coverage as a target-free objective in Reproducing Kernel Hilbert Space (RKHS) and implement it in ROVER. The algorithm learns a representation, estimates the kernel mean embedding of the occupancy measure, and computes policy gradients for its squared norm. The policy improvement is achieved through policy mirror descent, with the addition of a virtual sink state to manage unsupported regions.
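The quantity at the center of this objective, the squared RKHS norm of the kernel mean embedding of visited states, has a simple empirical form: the average of the kernel matrix. The sketch below illustrates it under strong simplifications: a fixed Gaussian kernel on raw 1-D states stands in for the learned representation, the bandwidth is arbitrary, and the reading that a lower squared norm corresponds to more even coverage is an assumption of this example. The resolvent world model, sink state, and policy updates are not shown.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=0.1):
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def embedding_sq_norm(states, bandwidth=0.1):
    """||mu||^2 for the empirical mean embedding mu = mean_i k(s_i, .)."""
    return float(gaussian_kernel(states, states, bandwidth).mean())

clumped = np.linspace(0.0, 0.1, 50)[:, None]   # visits concentrated in one corner
spread = np.linspace(0.0, 1.0, 50)[:, None]    # visits spread over the whole interval

norm_clumped = embedding_sq_norm(clumped)
norm_spread = embedding_sq_norm(spread)
print(f"clumped: {norm_clumped:.3f}, spread: {norm_spread:.3f}")
```

Visits that pile up in one region give a larger squared norm than visits spread across the space, which is why this scalar can serve as a target-free coverage signal.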
Results
Experiments conducted in both tabular and pixel-based sparse navigation tasks demonstrate that ROVER produces more uniform aggregate coverage and provides stronger initializations for downstream tasks compared to standard reward-free baselines.
Implications
The findings suggest that ROVER could significantly enhance the training of reinforcement learning agents in environments with sparse rewards, making it particularly useful for applications in multi-task and continual learning scenarios. This method could lead to more efficient exploration strategies and improved performance in complex RL tasks.
Understanding Key Features of Time Series Foundation Models from Epidemic Forecasting
Time Series
- A systematic evaluation of forecasting models for influenza using ILI and hospitalization data.
- Mixture-of-experts models outperform other architectures, indicating the benefit of diverse pretrained representations.
- Numerical transformer-based models are reliable, especially with appropriate pretraining.
- LLM-based forecasting methods are less effective compared to traditional numerical approaches.
Summary
This paper addresses the critical need for accurate short-term forecasting of seasonal influenza, which affects millions and poses significant public health challenges in the U.S. The authors conduct a systematic evaluation of various forecasting models using influenza-like illness (ILI) and hospitalization time series data. They compare classical neural networks, transformer-based models, pretrained time series foundation models, and large language model (LLM)-based approaches under both temporal and spatial generalization settings for 1-4 week ahead predictions. The study finds that a mixture-of-experts model, which integrates multiple pretrained forecasters, yields the best performance, highlighting the value of heterogeneous pretrained representations. Additionally, numerical transformer-based models demonstrate reliability, particularly when pretrained on data aligned with influenza dynamics. The study also reveals that LLM-based methods underperform compared to numerical forecasters. The incorporation of hospitalization data as an auxiliary covariate enhances forecasting robustness in certain scenarios. Overall, the findings provide actionable insights for model selection, pretraining strategies, and the use of auxiliary signals in influenza surveillance and preparedness.
Methodology
The authors compiled and standardized weekly ILI and hospitalization time series data at the U.S. HHS-region level. They evaluated 17 deep forecasting models under two generalization regimes: temporal (within-region) and spatial (across-region), focusing on 1-4 week ahead predictions. The evaluation employed consistent preprocessing and training pipelines, reporting metrics such as Mean Squared Error (MSE) and Normalized Root Mean Squared Error (NNSE).
Results
The study demonstrated that the mixture-of-experts model achieved the highest performance across forecasting tasks. Numerical transformer-based models provided reliable forecasts, particularly benefiting from pretraining aligned with influenza dynamics. In contrast, LLM-based methods showed inferior performance. The use of hospitalization data as an auxiliary covariate led to improvements in specific forecasting scenarios.
Implications
The findings suggest that public health agencies can enhance their influenza forecasting efforts by selecting appropriate models and leveraging auxiliary data. The insights gained from this study can inform vaccination strategies, hospital resource allocation, and overall epidemic preparedness.
UltraQuant: 4-bit KV Caching for Context-Heavy Agents
Large Language Models
Efficient ML
NLP
- UltraQuant improves 4-bit KV caching for context-heavy agents, addressing memory pressure and GPU utilization.
- The method incorporates practical design choices, including asymmetric key/value treatment and optimized decode-attention kernels.
- UltraQuant achieves a 3.47× reduction in time-to-first-token during late rounds and a 1.63× increase in output throughput over FP8 KV caching.
- The approach emphasizes the importance of serving efficiency metrics in evaluating KV-cache performance.
Summary
The paper presents UltraQuant, a novel approach to 4-bit key-value (KV) caching designed for context-heavy agents that require efficient memory management during multi-turn interactions. The authors identify the challenges posed by long context lengths and high concurrency in serving systems, which can lead to inefficient GPU utilization. They propose a framework for 4-bit KV caching that balances task quality, cache residency, and serving throughput. Key contributions include the introduction of practical design choices for robust 4-bit caching, optimizations for AMD GPUs, and a focus on serving efficiency metrics. The UltraQuant method employs TurboQuant-style rotation and codebook quantization while introducing FP4 approximations to enhance performance. Experimental results demonstrate that UltraQuant significantly reduces time-to-first-token (TTFT) in cache-pressured scenarios and improves output throughput compared to the FP8 KV baseline, showcasing its effectiveness in managing long-context workloads.
Methodology
The authors utilize TurboQuant-style rotation and codebook quantization as a foundation for their 4-bit KV caching approach. They implement practical design choices such as asymmetric treatment of keys and values, Walsh-Hadamard rotation, and block-scale variants. The UltraQuant method leverages optimized decode-attention kernels and FP4 approximations on AMD GPUs to enhance performance and reduce latency.
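The effect of rotating before quantizing can be illustrated with a small NumPy experiment. This is a sketch under stated assumptions, not UltraQuant itself: it swaps the codebook and FP4 formats for plain symmetric 4-bit integer quantization with one scale per vector, ignores the asymmetric key/value treatment and the block-scale variants, and uses synthetic keys with a single outlier channel.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction); n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_4bit(x):
    """Symmetric 4-bit codes in [-7, 7] with one scale per vector (last axis)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)     # avoid division by zero
    codes = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes * scale

rng = np.random.default_rng(0)
keys = rng.normal(size=(256, 64))
keys[:, 3] *= 20.0                                 # synthetic outlier channel

H = hadamard(64)
plain = dequantize(*quantize_4bit(keys))
rotated = dequantize(*quantize_4bit(keys @ H)) @ H.T   # rotate, quantize, rotate back

err_plain = float(np.mean((plain - keys) ** 2))
err_rotated = float(np.mean((rotated - keys) ** 2))
print(f"MSE without rotation: {err_plain:.4f}, with rotation: {err_rotated:.4f}")
```

Without rotation the outlier channel inflates the scale and the remaining channels collapse onto a few codes; the rotation spreads the outlier's energy across all coordinates so the 4-bit grid is used more evenly.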
Results
UltraQuant demonstrates a 3.47× improvement in time-to-first-token (TTFT) during late rounds of multi-turn interactions and a 2.3× improvement across all rounds compared to the FP8 KV baseline. Additionally, it increases output throughput by 1.63×, particularly benefiting from better cache residency under high concurrency.
Implications
The findings suggest that UltraQuant can significantly enhance the performance of context-heavy agents in applications requiring long-running memory and high concurrency, such as conversational AI and complex task execution. This approach could lead to more efficient use of GPU resources and improved user experiences in interactive systems.
Pseudo-Feature Padding: A Lightweight Defense Against False Data Injection in Power Grids
Theory
Efficient ML
- Introduces a lightweight defense mechanism against FDIA in DNNs used in CPS.
- Utilizes pseudo-feature padding to increase input dimensionality and complexity.
- Model-agnostic approach requiring no modifications to existing DNN architectures.
- Demonstrates significant improvements in robustness with minimal impact on performance.
Summary
This paper addresses the vulnerability of Deep Neural Networks (DNNs) in Cyber-Physical Systems (CPS), particularly in the context of False Data Injection Attacks (FDIA) that can disrupt critical operations like state estimation in power grids. The authors propose a novel defense framework called Pseudo-Feature Padding, which introduces an additional input layer that pads input samples with pseudo-feature values derived from the statistical distribution of the input data. This method increases the input dimensionality in a randomized and data-aware manner, making it significantly harder for adversaries to generate effective attacks. The approach is lightweight, model-agnostic, and does not require changes to the core architecture of existing DNNs, facilitating easy deployment in real-world settings. The framework was evaluated using various IEEE test systems (14-bus, 30-bus, 118-bus, and 300-bus) for state estimation, demonstrating that it enhances model robustness against FDIA while maintaining performance integrity, with a negligible accuracy drop compared to baseline models.
Methodology
The proposed framework integrates an additional input layer that pads input samples with pseudo-feature values. These values are dynamically generated based on the statistical distribution of the input data, identified through tree-based models. The padding is randomized during inference, increasing data diversity and adversarial uncertainty, which complicates the generation of effective FDIA samples.
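A rough sketch of the padding step is given below. It is an illustration only: the class name `PseudoFeaturePadder` is hypothetical, and the per-feature Gaussian statistics used here are a simplification that replaces the tree-based identification of the input distribution described above. The downstream DNN, which would take the widened input, is not shown.

```python
import numpy as np

class PseudoFeaturePadder:
    """Appends randomly drawn pseudo-features to each input sample (hypothetical helper)."""

    def __init__(self, train_x, num_pad, seed=0):
        # Simplifying assumption: summarize each real feature by its mean and std.
        self.mean = train_x.mean(axis=0)
        self.std = train_x.std(axis=0)
        self.num_pad = num_pad
        self.rng = np.random.default_rng(seed)

    def __call__(self, x):
        # Each call picks new source features and new draws, so the padding
        # differs between inference passes.
        src = self.rng.integers(0, self.mean.size, size=self.num_pad)
        pad = self.rng.normal(self.mean[src], self.std[src],
                              size=(x.shape[0], self.num_pad))
        return np.hstack([x, pad])

rng = np.random.default_rng(1)
train_x = rng.normal(loc=1.0, scale=0.05, size=(500, 10))   # stand-in measurements
padder = PseudoFeaturePadder(train_x, num_pad=4)

batch = train_x[:5]
padded_first = padder(batch)
padded_second = padder(batch)
```

The real measurements pass through unchanged; only the appended columns vary, which is what denies an attacker a fixed input layout to optimize against.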
Results
The evaluation of the proposed framework on the IEEE test systems showed that the pseudo-feature padding significantly improved the robustness of DNNs against FDIA. The method maintained the performance integrity of the models with a negligible drop in accuracy, outperforming conventional defense techniques that failed to mitigate sophisticated FDIA samples.
Implications
The lightweight and model-agnostic nature of the proposed defense framework makes it highly applicable in real-world CPS environments, where securing all sensors is impractical. This approach can enhance the security of critical infrastructure systems against adversarial attacks, ensuring more reliable operation.
Information Lattice Learning as Probabilistic Graphical Model Structure Learning
Theory
Interpretability
Graph Learning
- ILL provides a framework for learning interpretable rules from signals, emphasizing low complexity.
- The probabilistic rules learned through ILL can be interpreted as marginal constraints in PGMs.
- The information lattice structure aids in understanding the relationships between different abstractions.
- ILL distinguishes between general and special lifting, impacting the reconstruction of probability distributions.
Summary
This paper presents Information Lattice Learning (ILL) as a method for learning interpretable rules from signals, particularly when the signal is a probability mass function. The authors argue that the probabilistic rules derived from ILL can be interpreted within the framework of probabilistic graphical models (PGMs). They detail how ILL constructs a hierarchy of abstractions through partitioning the signal space, leading to the identification of quotient variables and marginal laws. The paper distinguishes between general lifting, which encompasses all joint distributions satisfying learned constraints, and special lifting, which focuses on maximum-ignorance reconstructions. The authors clarify that while the information lattice is structured as a directed acyclic graph, it does not represent a Bayesian network but rather serves as a hypothesis space for graphical models. This perspective enhances the understanding of ILL in relation to PGMs and suggests new avenues for inference and hybrid symbolic-probabilistic learning.
Methodology
The authors introduce ILL as a process that begins with a signal and identifies human-interpretable abstractions through partitioning. They define rules as marginal distributions over these partitions and establish a connection to PGMs by interpreting these rules as constraints. The methodology involves projecting the signal onto a partition lattice and lifting the selected rules back to the signal domain, utilizing both general and special lifting techniques.
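The projection and lifting steps can be illustrated on a tiny probability mass function. The sketch assumes a single partition (one rule) and reads "maximum ignorance" as maximum entropy, in which case special lifting spreads each cell's mass uniformly over its members; the function names and the example numbers are made up for illustration.

```python
import numpy as np

def project(pmf, partition):
    """Rule: the marginal law of the quotient variable, one probability per cell."""
    return np.array([pmf[list(cell)].sum() for cell in partition])

def special_lift(rule, partition, num_outcomes):
    """Spread each cell's mass uniformly over the outcomes in that cell."""
    lifted = np.zeros(num_outcomes)
    for mass, cell in zip(rule, partition):
        lifted[list(cell)] = mass / len(cell)
    return lifted

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

pmf = np.array([0.10, 0.20, 0.05, 0.25, 0.30, 0.10])   # the signal
partition = [[0, 1], [2, 3], [4, 5]]                   # an abstraction of the outcome space

rule = project(pmf, partition)
lifted = special_lift(rule, partition, pmf.size)
print(rule, lifted)
```

Any distribution whose cell masses match `rule` belongs to the general lifting; `lifted` is the one member of that family with the highest entropy.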
Results
The paper demonstrates that ILL can be effectively framed as a form of structure learning for PGMs, providing a clear interpretation of learned rules as marginal laws of quotient variables. The authors show that the learned rule sets correspond to a family of joint distributions, and the special lifting method yields a unique maximum-ignorance reconstruction of the probability distribution.
Implications
The findings suggest that ILL can be applied in various domains requiring interpretable knowledge discovery, such as scientific research, artistic endeavors, and enterprise applications. The connection to PGMs opens up possibilities for leveraging established graphical modeling techniques in downstream applications, enhancing inference and identifiability.
Post-Training Speech Enhancement Language Models with Perceptual Rewards
Audio & Speech
Reinforcement Learning
Optimization
- Introduction of a post-training stage for autoregressive speech enhancement models using GSPO.
- Development of a composite reward system that combines multiple perceptual metrics to avoid reward hacking.
- Achieved state-of-the-art performance on DNS2020 and DNS5 benchmarks.
- Human evaluations indicate a preference for multi-metric rewards over single-metric approaches.
Summary
This paper addresses a limitation of current speech enhancement (SE) language models: they are trained primarily with a token-level cross-entropy loss, which does not align with the perceptual quality metrics used for evaluation. The authors propose a post-training stage utilizing Group Sequence Policy Optimization (GSPO) with multi-metric perceptual rewards to optimize models directly against non-differentiable quality metrics such as DNSMOS, WER, and UTMOS. This approach avoids the pitfalls of single-metric optimization, which can lead to reward hacking. The authors apply their method to two autoregressive models, UniSE and GenSE, achieving state-of-the-art results on the DNS2020 benchmark. A human evaluation confirms that the composite multi-metric reward is preferred over single-metric variants, demonstrating the effectiveness of their approach in enhancing speech quality while maintaining robustness across different evaluation metrics.
Methodology
The authors implemented a post-training optimization stage using Group Sequence Policy Optimization (GSPO), which samples multiple outputs per input, scores them with a composite reward function, and applies policy gradient updates at the sequence level. This method directly utilizes perceptual quality metrics as reward signals without relying on learned surrogates or offline data construction.
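The sample, score, and update loop described above can be sketched numerically. The code assumes the length-normalized sequence-level importance ratio and group-normalized advantages commonly associated with GSPO; the reward weights, metric values, and clip range are placeholders rather than the paper's settings, and the gradient step itself is left to an autodiff framework.

```python
import numpy as np

def composite_reward(metrics, weights):
    """Weighted sum of perceptual metrics for one sampled output."""
    return sum(weights[name] * metrics[name] for name in weights)

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within the group of outputs sampled for one input."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def gspo_objective(logp_new, logp_old, lengths, advantages, clip=0.2):
    """Clipped surrogate using one length-normalized ratio per sequence."""
    ratio = np.exp((logp_new - logp_old) / lengths)
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))

# Hypothetical weights; WER is lower-is-better, so it enters negatively.
weights = {"dnsmos": 1.0, "utmos": 1.0, "wer": -5.0}
group_metrics = [
    {"dnsmos": 3.0, "utmos": 4.0, "wer": 0.10},
    {"dnsmos": 3.4, "utmos": 3.6, "wer": 0.30},
    {"dnsmos": 2.8, "utmos": 3.9, "wer": 0.05},
    {"dnsmos": 3.1, "utmos": 3.2, "wer": 0.20},
]
rewards = [composite_reward(m, weights) for m in group_metrics]
advantages = group_advantages(rewards)

lengths = np.array([120.0, 118.0, 125.0, 121.0])       # tokens per sampled output
logp_old = np.array([-300.0, -310.0, -295.0, -305.0])  # sequence log-probabilities
objective_at_start = gspo_objective(logp_old, logp_old, lengths, advantages)
objective_improved = gspo_objective(logp_old + np.sign(advantages), logp_old,
                                    lengths, advantages)
```

Shifting probability toward above-average outputs and away from below-average ones raises the surrogate, and no single metric can dominate because the reward mixes several of them.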
Results
The proposed GSPO post-training approach led to significant improvements in speech enhancement performance, achieving state-of-the-art results on the DNS2020 benchmark. The composite reward system was shown to be more effective than single-metric optimization strategies, as confirmed by human evaluations.
Implications
This work has implications for the development of more effective speech enhancement systems that can better align with human perceptual quality assessments. The methodology could be applied to other areas of machine learning where evaluation metrics differ from training objectives, enhancing model performance across various applications.
Alzheimer's Disease Diagnosis using a Multimodal Approach with 3D MRI and PET
Multimodal
- Introduces a novel multimodal approach combining 3D MRI and PET for AD diagnosis.
- Utilizes three fusion strategies and a sparsely gated Mixture-of-Experts classifier.
- Achieves high classification accuracies across multiple diagnostic tasks.
- Implements Grad-CAM for enhanced model interpretability.
Summary
This paper addresses the critical need for early diagnosis of Alzheimer's Disease (AD) by leveraging multimodal neuroimaging data, specifically 3D MRI and PET scans. The authors highlight the limitations of existing models that typically use static concatenation of MRI and PET data, which can hinder robustness and computational efficiency. To overcome these challenges, the study introduces a novel approach that combines 3D convolutional feature extractors with three fusion strategies: concatenation, Gated Multimodal Unit (GMU), and gated self-attention. Additionally, a sparsely gated Mixture-of-Experts (MoE) classifier is employed to dynamically route inputs to the most relevant experts, enhancing model adaptability to patient heterogeneity. The model's interpretability is further improved through the use of Grad-CAM for visualizing disease-related brain regions. The methodology is tested across three binary classification tasks: Normal Cognition (NC) vs. Mild Cognitive Impairment (MCI), MCI vs. AD, and NC vs. AD. The results demonstrate that the GMU achieves accuracies of 80.46% for NC vs. MCI and 95.47% for NC vs. AD, while gated self-attention reaches 82.08% for MCI vs. AD. Ablation studies confirm the importance of the MoE in maintaining high accuracy across all tasks, underscoring the effectiveness of input-adaptive multimodal modeling in AD diagnosis.
Methodology
The study employs a series of preprocessing steps on MRI and PET images, followed by feature extraction using a 3D convolutional neural network (CNN). Three fusion techniques (concatenation, GMU, and gated self-attention) are applied to capture inter- and intra-modal interactions. A Mixture-of-Experts model is integrated for dynamic routing of inputs, and Grad-CAM is used for visualizing decision-making processes.
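Of the three fusion strategies, the Gated Multimodal Unit is the easiest to show compactly. The sketch follows a common two-modality GMU formulation (tanh projections mixed by a sigmoid gate computed from both inputs); the feature sizes and random weights are placeholders, and the 3D CNN extractors and the Mixture-of-Experts classifier are not shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gmu_fuse(x_mri, x_pet, W_mri, W_pet, W_gate):
    """Fuse two modality feature vectors with a learned, input-dependent gate."""
    h_mri = np.tanh(x_mri @ W_mri)                   # modality-specific projections
    h_pet = np.tanh(x_pet @ W_pet)
    gate = sigmoid(np.concatenate([x_mri, x_pet], axis=-1) @ W_gate)
    return gate * h_mri + (1.0 - gate) * h_pet       # per-dimension convex mix

rng = np.random.default_rng(0)
batch, d_mri, d_pet, d_hidden = 4, 32, 24, 16        # hypothetical sizes
x_mri = rng.normal(size=(batch, d_mri))              # stand-in CNN features
x_pet = rng.normal(size=(batch, d_pet))
W_mri = rng.normal(size=(d_mri, d_hidden)) * 0.1
W_pet = rng.normal(size=(d_pet, d_hidden)) * 0.1
W_gate = rng.normal(size=(d_mri + d_pet, d_hidden)) * 0.1

fused = gmu_fuse(x_mri, x_pet, W_mri, W_pet, W_gate)
```

Unlike static concatenation, the gate lets each hidden dimension lean on MRI or PET depending on the individual input.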
Results
The proposed model achieves accuracies of 80.46% for NC vs. MCI, 95.47% for NC vs. AD, and 82.08% for MCI vs. AD. Ablation studies indicate that removing the MoE consistently degrades performance across all tasks, highlighting its significance in the model's architecture.
Implications
The findings suggest that integrating multimodal neuroimaging data with adaptive modeling techniques can significantly enhance the early diagnosis of Alzheimer's Disease, potentially leading to better patient outcomes through timely interventions.
VLA-FAIL: Efficient Task Failure Detection for Finetuned Vision-Language-Action Models
Robotics
Efficient ML
Multimodal
- VLA-FAIL is a lightweight framework for detecting task failures in Vision-Language-Action models.
- It combines two novel detection methods: LLMD for out-of-distribution state detection and ACC for action consistency monitoring.
- The framework requires no failure data and incurs minimal computational overhead.
- AUCPDT is introduced as a new metric to evaluate detection accuracy and latency.
Summary
The paper introduces VLA-FAIL, a novel framework designed for efficient task failure detection in finetuned Vision-Language-Action (VLA) models, which are known for their state-of-the-art performance in robotic manipulation tasks. Despite their capabilities, VLAs can exhibit unpredictable behavior in out-of-distribution scenarios, making runtime failure detection crucial for safe deployment. Existing methods often rely on computationally expensive action sampling or require failure data, which can be impractical. VLA-FAIL addresses these challenges by combining two lightweight failure detection techniques: Last-Layer Mahalanobis Distance (LLMD) and Action Chunk Consistency (ACC). LLMD measures deviations in last-layer features from the training data to detect out-of-distribution states, while ACC assesses the consistency of action chunks over time to identify erratic behavior. The framework is designed to operate with minimal computational overhead and does not require access to failure data. The authors also introduce a new evaluation metric, AUCPDT, which captures the trade-off between detection accuracy and latency. Through extensive experiments in both real-world and simulated environments, VLA-FAIL demonstrates robust performance, often surpassing more complex baseline methods in early and reliable failure detection across various tasks.
Methodology
The methodology involves two main components: Last-Layer Mahalanobis Distance (LLMD), which detects out-of-distribution states by analyzing token-wise deviations in last-layer features, and Action Chunk Consistency (ACC), which evaluates the consistency of overlapping action chunks in a receding-horizon control framework. The combination of these two methods allows for effective monitoring of task execution without the need for failure data.
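The two signals described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a diagonal covariance in place of a full covariance matrix, treats each state as a single feature vector rather than a set of tokens, and uses hypothetical function names throughout.

```python
import math

def fit_diag_gaussian(features):
    """Per-dimension mean and variance of in-distribution feature vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(dim)]
    # Small constant keeps the variance strictly positive.
    var = [sum((f[j] - mean[j]) ** 2 for f in features) / n + 1e-6
           for j in range(dim)]
    return mean, var

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance (simplifying assumption)."""
    return math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))

def chunk_inconsistency(prev_chunk, new_chunk, stride=1):
    """Mean L2 gap between the overlapping steps of two consecutive action chunks.

    prev_chunk[stride:] and the start of new_chunk cover the same time steps
    when the controller replans every `stride` steps.
    """
    overlap_prev = prev_chunk[stride:]
    overlap_new = new_chunk[:len(overlap_prev)]
    dists = [math.dist(a, b) for a, b in zip(overlap_prev, overlap_new)]
    return sum(dists) / len(dists) if dists else 0.0

# Toy data: four training features and two replanned action chunks.
train_features = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mean, var = fit_diag_gaussian(train_features)
ood_score = mahalanobis_diag([4.0, 1.0], mean, var)

prev_chunk = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
agreeing_chunk = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
drifting_chunk = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
```

Neither score needs failure examples: the first is fit on features from successful training data, and the second compares the policy only against its own earlier prediction.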
Results
The results indicate that VLA-FAIL effectively captures complementary failure modes, leading to reliable and early detection of task failures across six diverse manipulation tasks. The framework frequently outperforms significantly more expensive baseline methods, demonstrating its efficiency and robustness.
Implications
The implications of this work extend to the safe deployment of VLA models in real-world robotic applications, where early detection of failures can facilitate human intervention and improve overall system reliability. The lightweight nature of VLA-FAIL makes it suitable for real-time applications in robotics.
DiT-Reward: Generative Representations for Text-to-Image Reward Modeling
Generative Models
Reinforcement Learning
Multimodal
- DiT-Reward effectively repurposes a pretrained text-to-image DiT as a reward model.
- The method outperforms existing models like HPSv3 on multiple preference benchmarks.
- A lightweight head can extract meaningful preference predictions even when the generative backbone is frozen.
- Reward performance benefits from representations in the middle-to-late layers of the transformer.
Summary
The paper introduces DiT-Reward, a novel approach that repurposes a pretrained text-to-image Diffusion Transformer (DiT) as a reward model for evaluating generated images. The authors explore whether the representations learned during image generation can also be utilized for reward prediction, thereby enhancing the evaluation of generated outputs. DiT-Reward processes near-clean image latents and aggregates text-conditioned image representations across transformer layers. The method demonstrates superior performance compared to existing models, specifically HPSv3, on multiple preference benchmarks, achieving notable scores of 85.6% on HPDv2 and 77.6% on HPDv3. The findings indicate that even with a frozen generative backbone, a lightweight learned head can effectively predict preferences. The study also reveals that the most effective reward performance is derived from the middle-to-late layers of the transformer, and that increasing the generative backbone's capacity consistently improves results. Additionally, when applied to optimize Stable Diffusion 3.5 Large, DiT-Reward shows significant improvements in realism and achieves a 1.65× speedup in inference over HPSv3, while maintaining comparable memory usage. Overall, the research highlights the potential of pretrained generative models in reward modeling and policy optimization.
Methodology
DiT-Reward converts a pretrained text-to-image Diffusion Transformer into a reward model by encoding input images into a latent space, applying near-clean perturbations, and extracting text-conditioned image token representations across transformer layers. A lightweight MLP is used to map pooled features to scalar rewards. The model is evaluated on preference benchmarks and compared against existing reward models.
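The final stage of this pipeline, pooling token representations and mapping them to a scalar, can be sketched as below. The summary does not say how layers are aggregated, so concatenating mean-pooled layer summaries is an assumption, as are the function names and the toy weights.

```python
def mean_pool(token_features):
    """Average a list of token vectors into a single vector."""
    n = len(token_features)
    dim = len(token_features[0])
    return [sum(tok[j] for tok in token_features) / n for j in range(dim)]

def pool_layers(per_layer_tokens):
    """Pool tokens within each layer, then concatenate the layer summaries.

    Concatenation is an illustrative choice; other aggregations are possible.
    """
    pooled = []
    for tokens in per_layer_tokens:
        pooled.extend(mean_pool(tokens))
    return pooled

def mlp_reward(features, W1, b1, w2, b2):
    """Two-layer MLP head: ReLU hidden layer followed by a scalar output."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

# Toy stand-ins for image-token features taken from two transformer layers.
layer_a = [[1.0, 2.0], [3.0, 4.0]]
layer_b = [[0.0, 0.0], [2.0, 2.0]]
pooled = pool_layers([layer_a, layer_b])

W1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, -1.0, 0.0]]
b1 = [0.0, 0.0]
w2 = [0.5, 1.0]
b2 = 0.1
reward = mlp_reward(pooled, W1, b1, w2, b2)
```

Only `W1`, `b1`, `w2`, and `b2` would be trained in the frozen-backbone setting; the token features come from the pretrained transformer unchanged.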
Results
DiT-Reward outperforms HPSv3 on all evaluated benchmarks, achieving 85.6% on HPDv2 and 77.6% on HPDv3. The method shows that even with a frozen backbone, it can still provide meaningful preference predictions. Additionally, it demonstrates a 1.65× speedup in inference time compared to HPSv3 while maintaining similar peak memory usage.
Implications
The findings suggest that pretrained generative models can be effectively utilized for reward modeling, potentially improving the evaluation and optimization of generative models in text-to-image tasks. This approach may lead to more efficient and effective reinforcement learning strategies in multimodal contexts.
Quantum-classical physics-informed Kolmogorov-Arnold networks for PDEs
Theory
Efficient ML
- Introduction of QCPIKAN, the first quantum-classical physics-informed Kolmogorov-Arnold network for PDEs.
- Theoretical proof of accelerated convergence rates and reduced numerical dispersion.
- Validation across three seepage scenarios in porous media demonstrates superior performance.
- Outperforms existing models in accuracy, error control, and dynamic tracking.
Summary
This paper introduces QCPIKAN, a quantum-classical physics-informed Kolmogorov-Arnold network for solving partial differential equations (PDEs). The framework integrates Chebyshev-polynomial KAN layers with parameterized quantum circuits and embeds physical constraints directly into the training loss to ensure physical consistency. Drawing on approximation theory, the authors show that this architecture accelerates the decay of high-frequency errors to an exponential rate while reducing numerical dispersion. The performance of QCPIKAN is validated on three typical seepage scenarios in porous media: single-phase flow, component transport, and two-phase flow. Results indicate that QCPIKAN outperforms existing quantum-classical physics-informed neural networks in global prediction accuracy, local error control, dynamic evolution tracking, and displacement front localization. This work presents a robust and efficient alternative for addressing complex PDEs, showcasing the potential of combining quantum computing principles with classical neural network architectures.
Methodology
The QCPIKAN framework utilizes Chebyshev-polynomial KAN layers combined with parameterized quantum circuits. Physical constraints are embedded in the training loss function, allowing the network to adhere to physical laws while learning from data. The theoretical analysis is grounded in approximation theory to establish convergence properties.
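For readers unfamiliar with Chebyshev-polynomial KAN layers, the classical half of such a layer can be sketched as follows. Each edge carries a learnable univariate function expressed as a Chebyshev series; squashing inputs with `tanh` to reach the polynomials' domain is a common convention assumed here, and the quantum circuit and physics loss are not shown. All names are hypothetical.

```python
import math

def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] via T_{k+1} = 2x T_k - T_{k-1}."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]

def chebyshev_kan_layer(inputs, coeffs):
    """Apply one KAN layer with Chebyshev edge functions.

    coeffs[j][i][k] is the coefficient of T_k on the edge from input i to
    output j; each output sums its incoming edge functions.
    """
    degree = len(coeffs[0][0]) - 1
    # Squash inputs into [-1, 1], where Chebyshev polynomials are defined.
    bases = [chebyshev_basis(math.tanh(v), degree) for v in inputs]
    outputs = []
    for edge_coeffs in coeffs:
        total = 0.0
        for basis, c in zip(bases, edge_coeffs):
            total += sum(ck * tk for ck, tk in zip(c, basis))
        outputs.append(total)
    return outputs

# One input, one output, edge function 1*T_0 + 2*T_1 + 3*T_2.
layer_out = chebyshev_kan_layer([0.0], [[[1.0, 2.0, 3.0]]])
```

Unlike a standard dense layer, the learnable parameters here are the series coefficients on each edge rather than a single scalar weight followed by a fixed activation.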
Results
QCPIKAN demonstrated improved global prediction accuracy and local error control across various seepage scenarios compared to existing models. The framework effectively tracked dynamic evolutions and localized displacement fronts, showcasing its robustness and efficiency in solving complex PDEs.
Implications
The development of QCPIKAN has significant implications for scientific computing, particularly in fields requiring the solution of complex PDEs, such as fluid mechanics, bioengineering, and subsurface flow simulations. It opens avenues for further research into hybrid quantum-classical approaches in machine learning.
Breaking chains with trees: Deep learning with $O(\log N)$ parallel time complexity
Efficient ML
Computer Vision
NLP
- HBLL allows training of deep neural networks without full backpropagation, improving scalability and parallelism.
- The framework achieves O(log N) parallel time complexity, significantly enhancing computational efficiency.
- HBLL demonstrates competitive performance on challenging benchmarks in vision classification and language modeling.
- The method supports flexible inference by defining subnetworks based on hierarchical paths.
Summary
This paper introduces Hierarchical Block-Local Learning (HBLL), a novel framework designed to train deep neural networks without the need for full end-to-end backpropagation. Traditional backpropagation suffers from limitations such as locking, which restricts parallel updates across layers, and the weight transport problem, which necessitates symmetric pathways for gradient computation. HBLL addresses these issues by decomposing networks into hierarchically organized blocks that utilize local learning objectives derived from variational principles. This approach allows for effective information propagation while eliminating global gradient dependencies, achieving a parallel time complexity of O(log N), where N is the number of layers. The authors demonstrate the efficacy of HBLL on various vision and language modeling tasks, achieving competitive results on benchmarks like CIFAR-10, CIFAR-100, and WikiText-103. Additionally, they extend HBLL to recurrent neural networks, showcasing its versatility in sequential model training. The framework not only enhances computational efficiency but also enables flexible inference by defining a family of subnetworks corresponding to different hierarchical paths.
Methodology
The authors propose a framework called Hierarchical Block-Local Learning (HBLL) that decomposes deep neural networks into hierarchically linked blocks. Each block is trained using local learning objectives derived from a variational formulation, allowing for distributed training without global error propagation. The framework employs invertible transformations to facilitate information flow across the hierarchy, thus avoiding the interdependencies typical of traditional backpropagation.
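The O(log N) figure follows from the depth of a balanced hierarchy rather than from any particular learning rule. The sketch below only illustrates that counting argument, contrasting a sequential chain with a binary tree of blocks; it does not implement HBLL's local objectives, and the function names are hypothetical.

```python
def sequential_steps(num_layers):
    """A chain handles one layer after another: N dependent steps."""
    return num_layers

def tree_rounds(num_layers):
    """Parallel rounds needed when neighbouring blocks are paired up each round.

    Every round roughly halves the number of remaining blocks, so the count
    grows like log2(N) instead of N.
    """
    rounds = 0
    blocks = num_layers
    while blocks > 1:
        blocks = (blocks + 1) // 2
        rounds += 1
    return rounds

# Compare the two schedules for a few network depths.
comparison = {n: (sequential_steps(n), tree_rounds(n)) for n in (8, 64, 1024)}
```

For a 1024-layer network the chain needs 1024 dependent steps while the tree needs 10 rounds, assuming enough parallel workers to process every block in a round at once.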
Results
HBLL was evaluated on several tasks, including vision classification on CIFAR-10 and CIFAR-100, as well as autoregressive language modeling on WikiText-103. The results indicate that HBLL achieves competitive performance compared to traditional backpropagation methods while significantly reducing computational overhead and enabling parallel training.
Implications
By removing the need for full backpropagation, HBLL could make the training of large-scale neural networks more efficient, reducing computational cost and energy consumption and making such training more practical in resource-constrained environments. The flexible inference over hierarchical subnetworks could also improve model adaptability across applications.
An Empirical Study of OpenPangu Quantization on Ascend NPUs
NLP
Large Language Models
Efficient ML
- 8-bit weight-only quantization is effectively lossless for OpenPangu models.
- 4-bit quantization is practical for the 7B model but harmful for the 1B model.
- Ultra-low precision quantization (2-bit and binary) often results in poor performance.
- The study provides a comprehensive evaluation of various quantization methods on Ascend NPUs.
Summary
This paper investigates the robustness of OpenPangu models under aggressive post-training quantization (PTQ) on Huawei Ascend NPUs. The authors conduct a systematic empirical study of the OpenPangu 1B and 7B models, evaluating various quantization methods including RTN, GPTQ, AWQ, SmoothQuant, GPTAQ, BiLLM, and SliM-LLM across 18 evaluation tasks. The study reveals that 8-bit weight-only quantization is effectively lossless for both models, while 4-bit quantization is practical for the 7B model but detrimental for the 1B model, particularly in reasoning, math, and code tasks. The results highlight challenges in ultra-low precision quantization, with most 2-bit and binary settings leading to near-random behavior. This research provides an NPU-oriented accuracy map for selecting quantization settings for OpenPangu models and emphasizes the difficulties associated with extreme low-bit compression.
Methodology
The authors systematically evaluated OpenPangu 1B and 7B models using a range of post-training quantization methods. They maintained a unified calibration and evaluation protocol across various quantization settings, testing both weight-only and weight-activation methods. The evaluation included perplexity measurements on language modeling tasks and accuracy assessments on commonsense reasoning and knowledge benchmarks.
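RTN (round-to-nearest) is the simplest of the listed methods and makes the bit-width trade-off easy to see. The following is a minimal sketch of symmetric per-tensor RTN on a toy weight list; real pipelines typically quantize per channel or per group, and the helper names and weights here are illustrative assumptions.

```python
def rtn_quantize(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization.

    Returns the integer codes and the scale used to map them back to floats.
    """
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

def max_abs_error(weights, bits):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    codes, scale = rtn_quantize(weights, bits)
    return max(abs(w - d) for w, d in zip(weights, dequantize(codes, scale)))

# Toy weights; the error grows as the bit-width shrinks.
weights = [-1.0, -0.3, 0.0, 0.42, 1.0]
errors = {bits: max_abs_error(weights, bits) for bits in (8, 4, 2)}
```

At 2 bits only the codes -1, 0, and 1 are available, so every small weight collapses to zero, which gives some intuition for why ultra-low-bit settings are so much harder than 8-bit.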
Results
The results indicate that 8-bit weight-only quantization preserves model performance, while 4-bit quantization remains practical for the 7B model but degrades the 1B model. Ultra-low precision settings (2-bit and binary) generally led to significant performance degradation, with some configurations resulting in non-finite perplexity values. The findings provide a detailed accuracy map for selecting quantization settings tailored for Ascend NPUs.
Implications
The findings have significant implications for deploying large language models in resource-constrained environments, particularly in private and domain-specific applications. The study aids in understanding the trade-offs involved in quantization, guiding practitioners in selecting appropriate methods and bit-widths for effective model deployment.
Adversarial Bandit Optimization with Globally Bounded Perturbations to Convex Losses
Optimization
Theory
- Introduces a model for bandit optimization with C-approximately convex and β-smooth function sequences.
- Establishes expected regret guarantees that account for adversarial perturbations under a global budget.
- Demonstrates that sublinear expected regret is achievable even with non-convex losses.
- Modifies existing bandit algorithms to accommodate the new perturbation model.
Summary
This paper addresses the problem of adversarial bandit optimization where the loss functions are allowed to be non-convex and non-smooth. The authors propose a framework where the learner selects actions and incurs losses that consist of a convex, β-smooth component and an adversarial perturbation, which is subject to a global budget constraint on its cumulative magnitude over time. This model extends previous work by allowing for general convex and β-smooth losses instead of just linear losses. The authors establish expected regret guarantees that account for the perturbation budget, demonstrating that sublinear expected regret can still be achieved even when the observed losses deviate from convexity, provided the cumulative deviation is controlled. The analysis involves modifying a standard bandit optimization algorithm and separating the contributions of the convex components from the perturbations, leading to a clearer understanding of how perturbations affect regret. The results indicate that the proposed method can effectively handle adversarial perturbations while maintaining performance in bandit optimization settings.
Methodology
The authors modify a standard bandit optimization algorithm to accommodate a model where losses are composed of a convex component and an adversarial perturbation. They develop a regret analysis that disentangles the contributions of the convex components from the perturbations, allowing for a clearer understanding of the regret incurred due to the perturbations. The analysis employs a bandit smoothing argument to control the expected regret.
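Bandit smoothing arguments usually rest on a single-query gradient estimate. The sketch below shows the classical one-point estimator on a loss made of a convex part plus a small non-convex perturbation; whether the paper uses exactly this estimator is an assumption, and the loss, step size, and function names are hypothetical.

```python
import math
import random

def sample_unit_sphere(dim, rng):
    """Uniform direction on the unit sphere via normalized Gaussians."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in g))
        if norm > 0.0:
            return [v / norm for v in g]

def one_point_gradient(loss, x, delta, rng):
    """Single-query estimate (d / delta) * loss(x + delta * u) * u."""
    d = len(x)
    u = sample_unit_sphere(d, rng)
    value = loss([xi + delta * ui for xi, ui in zip(x, u)])
    return [(d / delta) * value * ui for ui in u]

def perturbed_loss(x):
    """Convex, smooth component plus a small non-convex perturbation (toy)."""
    convex_part = sum(xi * xi for xi in x)
    perturbation = 0.01 * math.sin(50.0 * x[0])
    return convex_part + perturbation

# Average many single-query estimates around one point.
rng = random.Random(0)
point = [1.0, -0.5]
samples = [one_point_gradient(perturbed_loss, point, 0.5, rng) for _ in range(2000)]
avg_estimate = [sum(s[j] for s in samples) / len(samples) for j in range(2)]
```

The learner only ever observes `loss(x + delta * u)`, so the perturbation enters each estimate through that single value; this is why a bound on its cumulative magnitude translates into a bound on the extra regret.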
Results
The paper establishes that, under the global perturbation budget assumption, the expected regret remains sublinear even when the observed losses are not convex. The results provide explicit regret bounds that depend on the perturbation budget, demonstrating the effectiveness of the proposed approach in handling adversarial perturbations.
Implications
The findings have significant implications for online decision-making scenarios where loss functions may be subject to adversarial influences, such as in online pricing, resource allocation, and other applications where querying the system is costly. The ability to maintain performance despite non-convex perturbations broadens the applicability of bandit optimization techniques.
Parameterized Representations via Implicit Stochastic Modulation for High-Dimensional and High-Order Neural PDE Solvers
Theory
Optimization
Efficient ML
- PRISM decouples parameter encoding from the spatial AD graph, addressing memory growth issues.
- The architecture enables zero-shot extrapolation for parameterized PDEs without retraining.
- PRISM supports efficient scaling of high-dimensional PDEs, achieving up to 100,000 dimensions on a single GPU.
- Variance-aware Lipschitz damping is incorporated to enhance optimization stability.
Summary
This paper addresses the challenges of solving high-dimensional and high-order partial differential equations (PDEs) using neural networks. Traditional methods struggle with the curse of dimensionality, leading to excessive memory and computational costs. Recent advancements in stochastic derivative estimators have improved scalability but are limited to fixed parameter environments, necessitating retraining for each new parameter configuration. The authors introduce Parameterized Representations via Implicit Stochastic Modulation (PRISM), a novel architecture that decouples parameter encoding from the spatial automatic differentiation (AD) graph. This decoupling mitigates memory growth and variance issues associated with high-order stochastic solvers. PRISM employs a hyper-generator to process physical parameters, producing modulators that scale and shift a spatial latent manifold. The architecture achieves zero-overhead AD decoupling and provides variance-aware Lipschitz damping, enabling efficient training of parameterized PDEs up to 100,000 dimensions on a single GPU. The proposed method supports zero-shot extrapolation and physical inversion while avoiding the instabilities of conventional architectures.
Methodology
The authors propose the PRISM architecture, which utilizes a hyper-generator to create affine modulators that adjust a continuous latent manifold. This design allows for the separation of physical parameters from the spatial computation graph, thus preventing the entanglement that leads to increased memory usage and optimization instability. The architecture leverages stochastic derivative estimators to efficiently compute gradients without the computational overhead of traditional methods.
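The decoupling idea can be illustrated with a toy affine modulation. In this sketch the physical parameters pass through a separate hyper-generator that outputs a scale and a shift, which are then applied to features computed from the spatial coordinates alone. The linear hyper-generator, the layer sizes, and all names are illustrative assumptions, not the PRISM architecture itself.

```python
import math

def hyper_generator(params, W_scale, W_shift):
    """Map physical parameters to (scale, shift) modulators, one pair per latent unit.

    A linear map is used for brevity; a real hyper-generator would be a network.
    """
    scale = [1.0 + sum(w * p for w, p in zip(row, params)) for row in W_scale]
    shift = [sum(w * p for w, p in zip(row, params)) for row in W_shift]
    return scale, shift

def spatial_latent(x, W_in):
    """Latent features that depend only on the spatial coordinates."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_in]

def modulated_output(x, params, W_in, W_scale, W_shift, w_out):
    """Scale and shift the spatial latent, then read out a scalar solution value."""
    latent = spatial_latent(x, W_in)
    scale, shift = hyper_generator(params, W_scale, W_shift)
    modulated = [s * h + b for s, h, b in zip(scale, latent, shift)]
    return sum(w * m for w, m in zip(w_out, modulated))

# Toy setup: 1-D coordinate, one physical parameter, two latent units.
W_in = [[1.0], [-1.0]]
W_scale = [[1.0], [0.0]]
W_shift = [[0.5], [0.0]]
w_out = [2.0, 1.0]
u_base = modulated_output([0.0], [0.0], W_in, W_scale, W_shift, w_out)
u_new_param = modulated_output([0.0], [2.0], W_in, W_scale, W_shift, w_out)
```

Because `scale` and `shift` do not depend on `x`, spatial derivatives of the output involve only `spatial_latent`, which is the sense in which the parameter pathway stays out of the spatial differentiation graph.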
Results
Extensive experiments demonstrate that PRISM can effectively scale stochastic solvers for highly non-linear parameterized PDEs, achieving performance on par with or exceeding existing methods while maintaining memory efficiency. The architecture successfully enables zero-shot extrapolation and physical inversion, showcasing its versatility and robustness in handling extreme physical conditions.
Implications
The PRISM framework has significant implications for various scientific fields that rely on solving high-dimensional PDEs, such as finance, control theory, and physics. Its ability to handle parameterized PDEs without retraining opens new avenues for real-time simulations and digital twin applications, enhancing the efficiency of modeling complex systems.
Learning universal approximations for partial differential equations with Physics-Informed Broad Learning System
Efficient ML
Theory
Optimization
- PIBLS is the first application of Broad Learning System (BLS) to solving PDEs, offering a backpropagation-free computational framework.
- The framework reformulates PDE solving as a direct least-squares optimization, enhancing computational efficiency.
- Rigorous mathematical proof establishes PIBLS's universal approximation property for PDE solutions.
- Experimental results show PIBLS is significantly faster and more accurate than traditional PINNs.
Summary
This paper introduces the Physics-Informed Broad Learning System (PIBLS), a novel framework designed to solve partial differential equations (PDEs) more efficiently than traditional numerical methods and existing Physics-Informed Neural Networks (PINNs). Traditional numerical solvers, while robust, are often limited by high computational costs due to mesh dependencies. In contrast, PINNs provide a mesh-free alternative but struggle with slow convergence and optimization instability. PIBLS addresses these issues by reformulating PDE solving as a direct least-squares optimization problem, allowing for faster and more stable solutions. The authors present a unique solver strategy that includes an analytical solution for linear PDEs and an enhanced nonlinear least-squares perturbation algorithm for nonlinear PDEs. They also provide a rigorous mathematical proof of PIBLS's universal approximation property, ensuring its capability to approximate solutions to PDEs. Experimental results demonstrate that PIBLS is one to three orders of magnitude faster than conventional PINNs while achieving significantly higher accuracy, establishing it as a promising alternative for real-time simulation and design optimization tasks in scientific machine learning.
Methodology
The PIBLS framework utilizes a Broad Learning System architecture, where input coordinates are projected into a system of randomly generated feature and enhancement nodes. The output is computed as a linear combination of these nodes, with the weights optimized through least-squares methods. The methodology includes deriving analytical derivatives for the network output and formulating the PDE solving task as a least-squares optimization problem, with specific strategies for both linear and nonlinear PDEs.
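The least-squares reformulation is easiest to see on a tiny linear problem. The sketch below solves u'(x) = 2x with u(0) = 0 by stacking collocation and boundary rows into one linear system over fixed features. To keep the result checkable by hand, deterministic polynomial features stand in for the randomly generated feature and enhancement nodes, so this is an illustration of the least-squares idea under that assumption rather than PIBLS itself; all names are hypothetical.

```python
def solve_linear_system(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def least_squares(rows, targets):
    """Minimize ||A beta - y||^2 through the normal equations A^T A beta = A^T y."""
    n = len(rows[0])
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    Aty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(n)]
    return solve_linear_system(AtA, Aty)

def features(x):
    """Fixed features phi(x); stand-ins for random feature/enhancement nodes."""
    return [1.0, x, x * x]

def feature_derivs(x):
    """Analytical derivatives of the features with respect to x."""
    return [0.0, 1.0, 2.0 * x]

# ODE rows enforce u'(x_i) = 2 x_i; the last row enforces u(0) = 0.
collocation = [0.0, 0.25, 0.5, 0.75, 1.0]
rows = [feature_derivs(x) for x in collocation] + [features(0.0)]
targets = [2.0 * x for x in collocation] + [0.0]
beta = least_squares(rows, targets)  # output weights of u(x) = beta . phi(x)
```

No gradient descent is involved: the output weights come from a single linear solve, which is the source of the speed advantage claimed for linear PDEs.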
Results
The experiments conducted demonstrate that PIBLS outperforms conventional PINNs by being one to three orders of magnitude faster while achieving significantly higher solution accuracy across various linear and nonlinear PDEs.
Implications
The PIBLS framework could substantially improve the computational efficiency of scientific machine learning applications, particularly in real-time simulations and design optimization tasks across physical, biological, and engineering systems.
PaAno+: Multiscale Encoding and Cross-Variable Attention for Time Series Anomaly Detection
Time Series
- Introduces PaAno+, a lightweight model for time series anomaly detection.
- Utilizes multiscale feature extraction and cross-variable attention to improve anomaly detection accuracy.
- Implements a novel self-supervised learning task for better feature representation.
- Demonstrates state-of-the-art performance on the TSB-AD benchmark.
Summary
The paper presents PaAno+, a novel lightweight model designed for time series anomaly detection, addressing the limitations of existing methods that either incur high computational costs or fail to adequately capture multivariate dependencies. The model employs a patch-oriented representation learning approach, incorporating a multiscale feature extraction backbone that utilizes convolutional kernels with varying receptive fields to effectively capture hierarchical temporal characteristics. Additionally, it integrates cross-scale adaptive attention aggregation and a cross-variable fusion attention module to enhance the model's ability to identify anomalies in complex operational conditions. A unique self-supervised learning task based on temporal patch-window sorting is introduced to reveal the intrinsic structural properties of time series data, while triplet loss is used to optimize the patch embedding space for improved feature discrimination. Experimental results on the TSB-AD benchmark demonstrate that PaAno+ achieves state-of-the-art detection accuracy for both univariate and multivariate tasks, significantly outperforming previous models while maintaining computational efficiency suitable for real-time applications.
Methodology
The methodology involves a patch-oriented representation learning framework with a multiscale feature extraction backbone using convolutional kernels of varying sizes. It incorporates cross-scale adaptive attention aggregation and a cross-variable fusion attention module to capture inter-variable correlations. A self-supervised learning task based on temporal patch-window sorting is employed, along with triplet loss for optimizing the feature embedding space.
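The triplet loss mentioned above has a standard form that fits in a few lines. This sketch uses Euclidean distance and a margin of 1.0, both of which are assumptions; the summary does not state the distance or margin used, and the toy embeddings are hypothetical.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative distances.

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy patch embeddings: a far negative is already separated, a near one is not.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
far_negative = [3.0, 4.0]
near_negative = [0.0, 1.5]
loss_far = triplet_loss(anchor, positive, far_negative)
loss_near = triplet_loss(anchor, positive, near_negative)
```

Only the near negative contributes a gradient, so training effort concentrates on patches that are still hard to tell apart in the embedding space.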
Results
PaAno+ achieved state-of-the-art detection accuracy on the TSB-AD benchmark for both univariate and multivariate anomaly detection tasks, showing significant performance improvements across various evaluation metrics compared to previous models.
Implications
The proposed model's efficiency and accuracy make it suitable for real-time anomaly detection in critical applications such as industrial monitoring and medical diagnostics, particularly in environments with limited computational resources.
Computational Identifiability
Theory
- Introduction of computational identifiability as a practical, computation-bound notion of identifiability.
- Formalization of the relationship between causal effect estimation and meta-learning.
- Empirical demonstration of computational identifiability in complex scenarios.
- Provision of a framework for identifying causal effects with finite samples and error tolerances.
Summary
This paper introduces the concept of 'computational identifiability,' which contrasts with traditional theoretical identifiability in causal inference. The authors argue that while theoretical identifiability relies on idealized conditions such as infinite data, computational identifiability focuses on the practical aspects of identifying causal effects through finite computational procedures. The framework defines successful identification as the existence of an estimator that meets specified error tolerances and confidence bounds, given a prior distribution over parameters. The paper empirically demonstrates this framework across various complex scenarios, including small sample sizes, ambiguous graphical criteria, and mixed observational-interventional data. By formalizing the connection between causal effect estimation and meta-learning, the authors provide a comprehensive approach to tackling identification questions that are often challenging under traditional methods.
Methodology
The authors propose a framework for computational identifiability that involves defining a meta-prior over parameters and a hypothesis space of estimators. They conduct empirical experiments to validate their framework across various scenarios, including small sample sizes and mixed data types.
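One way to read this definition operationally is as a Monte Carlo check: draw a parameter from the prior, simulate a finite sample, run the estimator, and count how often the estimate lands within the error tolerance. The sketch below does this for a deliberately simple toy problem. The uniform prior, the noise model, the mean estimator, and every function name are assumptions for illustration, not the paper's formalization.

```python
import random

def sample_effect(rng):
    """Draw a true causal effect from a hypothetical uniform prior on [0, 1]."""
    return rng.uniform(0.0, 1.0)

def simulate_outcomes(effect, n, rng):
    """Toy randomized experiment: observed differences = effect + bounded noise."""
    return [effect + rng.uniform(-0.5, 0.5) for _ in range(n)]

def mean_estimator(samples):
    """Candidate estimator from the hypothesis space: the sample mean."""
    return sum(samples) / len(samples)

def success_rate(estimator, n, epsilon, trials, rng):
    """Fraction of prior draws for which |estimate - effect| <= epsilon."""
    hits = 0
    for _ in range(trials):
        effect = sample_effect(rng)
        estimate = estimator(simulate_outcomes(effect, n, rng))
        if abs(estimate - effect) <= epsilon:
            hits += 1
    return hits / trials

def computationally_identified(estimator, n, epsilon, delta, trials=500, seed=0):
    """True if the estimator meets tolerance epsilon with frequency >= 1 - delta."""
    rng = random.Random(seed)
    return success_rate(estimator, n, epsilon, trials, rng) >= 1.0 - delta

# Loose tolerance: the noise is bounded by 0.5, so every trial succeeds.
loose_ok = computationally_identified(mean_estimator, n=50, epsilon=0.6, delta=0.05)
```

Tightening `epsilon` or shrinking `n` makes the same check harder to pass, which reflects how identifiability here depends on sample size and tolerance rather than on an infinite-data limit.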
Results
The experiments show that computational identifiability can effectively address identification questions in settings where traditional theoretical methods fall short. The framework successfully identifies causal effects even with limited data and ambiguous criteria, demonstrating its practical applicability.
Implications
The concept of computational identifiability has significant implications for causal inference in real-world applications, particularly in fields where data is limited or complex. It provides a new lens through which researchers can approach identification problems, potentially leading to more robust causal analyses.
Learning a Normal World Model for Few-Shot Boundary-Calibrated Abnormality Detection
Time Series
- Introduces a normal world modeling framework for few-shot abnormality detection.
- Develops an entropy-aware normal-world energy for quantitative evaluation of abnormality.
- Demonstrates strong performance on the NASA C-MAPSS turbofan degradation benchmark.
- Mechanistic validation tests confirm the model captures the structure of normal behavior.
Summary
This paper addresses the challenges of abnormality detection in complex systems, particularly the scarcity of abnormal labels and the inadequacy of binary labels to quantify deviations from normal behavior. The authors propose a novel approach called the Hypergraph Entropic Normal-World Model, which learns a representation of the normal world from abundant normal events while using a few abnormal examples solely to calibrate the boundary of normality. The model constructs context-conditioned hypergraphs to capture high-order relationships among multivariate sensor data and defines abnormality through an entropy-aware normal-world energy that integrates temporal prediction surprise, hypergraph consistency surprise, and latent normal-manifold departure. The proposed method is evaluated on the NASA C-MAPSS turbofan degradation benchmark, demonstrating strong performance in zero-shot and few-shot scenarios, particularly achieving an AUROC of 0.9983 on the most complex subset. Mechanistic validation tests indicate that the learned energy effectively encodes the structure of the normal world, providing a robust anomaly score and a graded risk measure under conditions of severe abnormal-label scarcity.
Methodology
The methodology involves constructing a Hypergraph Entropic Normal-World Model that represents multivariate sensor data as context-conditioned hypergraphs. The model learns three aspects of normality: temporal dynamics, hypergraph consistency, and latent manifold representation. It combines these into an entropy-aware normal-world energy, which serves as the anomaly score. During few-shot calibration, the model parameters remain fixed and only the decision threshold is adjusted, based on a few abnormal examples.
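The calibration step can be sketched independently of the model. Below, the energy is a weighted sum of the three surprise terms and the few abnormal examples are used only to place the decision threshold. The equal weights, the midpoint rule, the assumption that higher energy means more abnormal, and the function names are all illustrative choices, not the paper's procedure.

```python
def normal_world_energy(pred_surprise, graph_surprise, manifold_departure,
                        weights=(1.0, 1.0, 1.0)):
    """Combine the three surprise terms into one score (weights are hypothetical)."""
    terms = (pred_surprise, graph_surprise, manifold_departure)
    return sum(w * t for w, t in zip(weights, terms))

def calibrate_threshold(normal_energies, abnormal_energies):
    """Place the boundary midway between the highest normal energy and the
    lowest abnormal energy; the model that produced the energies is untouched.
    """
    return 0.5 * (max(normal_energies) + min(abnormal_energies))

def is_abnormal(energy, threshold):
    """Flag an event whose energy exceeds the calibrated boundary."""
    return energy > threshold

# Many normal energies, only two labeled abnormal ones.
normal_energies = [0.2, 0.4, 0.3]
abnormal_energies = [1.0, 1.4]
threshold = calibrate_threshold(normal_energies, abnormal_energies)

quiet_event = normal_world_energy(0.1, 0.2, 0.3)
odd_event = normal_world_energy(0.5, 0.5, 0.5)
```

Because the energy is continuous, the distance of an event's energy from the threshold can also be read as a graded risk level rather than a binary label.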
Results
The proposed model achieved an AUROC of 0.9983 on the FD004 subset of the NASA C-MAPSS dataset, indicating exceptional performance in detecting abnormalities. The results also showed strong zero-shot and few-shot performance across all subsets, with mechanistic tests validating that the learned energy captures the underlying normal-world structure.
Implications
The findings suggest that the normal-world energy can be utilized as an effective anomaly score and a graded risk measure, making it applicable in various fields such as industrial fault diagnosis, clinical monitoring, and cyber-physical systems, especially in scenarios with limited abnormal data.
From Handcrafted Features to Functional Edge Learning: Evolution of EEG Seizure Detection Frameworks
Time Series
Interpretability
Efficient ML
- Deep Learning models for EEG analysis face significant challenges in clinical deployment due to their black-box nature and high data requirements.
- Kolmogorov-Arnold Networks (KANs) offer a new paradigm by using learnable activation functions, enhancing interpretability and efficiency.
- KANs are more robust to data scarcity and can facilitate cross-patient personalization without extensive retraining.
- The paper provides a structured analysis of EEG seizure detection methodologies, highlighting the need for transparent and efficient models.
Read more
From Handcrafted Features to Functional Edge Learning: Evolution of EEG Seizure Detection Frameworks
Summary
This paper reviews the evolution of EEG seizure detection frameworks, emphasizing the transition from traditional handcrafted feature-based methods to advanced deep learning (DL) architectures. While DL has improved automated EEG interpretation, its clinical application is hindered by issues such as lack of interpretability, high data requirements, and computational costs. The authors introduce Kolmogorov-Arnold Networks (KANs) as a promising alternative, which utilize flexible, learnable activation functions to enhance interpretability and efficiency. KANs address the limitations of conventional DL models by providing better parameter efficiency, interpretability, and robustness in data-scarce environments. The review systematically analyzes existing methodologies, identifies barriers to clinical deployment, and highlights the advantages of KANs, proposing them as a fundamental shift necessary for the development of patient-specific EEG monitoring systems.
Methodology
The paper conducts a comprehensive review of existing EEG seizure detection frameworks, comparing traditional machine learning approaches with deep learning models and introducing KANs. It systematically analyzes the limitations of current models and discusses the theoretical foundations and advantages of KANs in addressing these challenges.
Results
The review concludes that KANs can improve predictive accuracy while also enhancing interpretability and efficiency, making them suitable for clinical applications. The authors argue that KANs represent a paradigm shift necessary for the future of EEG monitoring systems.
Implications
The findings suggest that adopting KANs could lead to more reliable and interpretable EEG seizure detection systems, ultimately improving patient care and safety in clinical settings. This shift may also facilitate the integration of AI in routine medical practice, enhancing the capabilities of wearable and implantable devices.
On the Position Bias of On-Policy Distillation
Reinforcement Learning
Optimization
Efficient ML
- Identifies the position bias phenomenon in OPD, where early tokens provide more valuable supervision than later ones.
- Proposes IW-OPD, which adjusts token weights based on the accumulated discrepancy between student and teacher distributions.
- Demonstrates that IW-OPD converges faster and achieves better performance than standard OPD.
- Shows that the advantages of IW-OPD increase with the mismatch between teacher and student models.
Read more
On the Position Bias of On-Policy Distillation
Summary
This paper addresses the inefficiencies in On-Policy Distillation (OPD) in reinforcement learning, particularly the position bias that arises from uniformly averaging token-level losses. The authors identify that as student rollouts extend, they diverge from the teacher's distribution, leading to diminished supervision quality for later tokens. They propose Importance-Weighted On-Policy Distillation (IW-OPD), which assigns weights to tokens based on the discrepancy between the student and teacher distributions, effectively upweighting earlier tokens and downweighting later ones. Through a constrained optimization perspective, the authors demonstrate that IW-OPD converges faster than standard OPD and achieves superior final performance, particularly in scenarios with significant teacher-student mismatches. Their experiments report gains of up to 6.9 points on the AIME-2025 benchmark compared to standard OPD.
Methodology
The authors analyze the position bias in OPD through constrained optimization, leading to the development of IW-OPD. This method uses a closed-form optimal policy to assign weights to tokens based on the likelihood ratio of teacher-to-student distributions, allowing for more effective supervision by emphasizing earlier tokens in the rollouts.
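The weighting idea can be sketched as follows, under stated assumptions: each token's weight is the accumulated teacher-to-student likelihood ratio over the preceding tokens, clipped and normalized to mean 1. The summary does not specify the exact prefix, clipping, or normalization used in IW-OPD, so those choices and the function names are hypothetical.

```python
import math


def prefix_importance_weights(teacher_logps, student_logps, clip=10.0):
    """w_t = exp(sum_{s<t} [log p_T(y_s) - log p_S(y_s)]), clipped, mean-normalized."""
    weights, running = [], 0.0
    for lt, ls in zip(teacher_logps, student_logps):
        weights.append(min(math.exp(running), clip))
        running += lt - ls  # accumulate discrepancy for the next position
    mean_w = sum(weights) / len(weights)
    return [w / mean_w for w in weights]


def weighted_token_loss(token_losses, weights):
    """Weighted replacement for the uniform average of token-level losses."""
    return sum(w * l for w, l in zip(weights, token_losses)) / len(token_losses)


# Student rollout that drifts away from the teacher as it gets longer.
teacher = [-0.5, -1.0, -1.5, -2.0]
student = [-0.4, -0.6, -0.8, -1.0]
w = prefix_importance_weights(teacher, student)
loss = weighted_token_loss([1.0, 1.0, 1.0, 1.0], w)
```

When teacher and student agree on every token, all weights equal 1 and the loss reduces to the uniform average used by standard OPD.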
Results
IW-OPD significantly outperforms standard OPD in terms of convergence speed and final performance, with improvements of up to 6.9 points on the AIME-2025 benchmark. The method shows enhanced sample efficiency, particularly for smaller student models, and demonstrates that the performance gains scale with the degree of mismatch between teacher and student models.
Implications
The findings suggest that optimizing token supervision in OPD can lead to more efficient learning in reinforcement learning applications. This has potential implications for training smaller models with stronger teachers, improving the overall efficiency of model distillation processes in various applications.
Comparative Study of Neural Surrogate Architectures for Autoregressive Prediction of Internal Battery States
Time Series
Efficient ML
Theory
- The study compares four neural network architectures for predicting internal battery states.
- U-Net architecture shows superior performance with a 3% mean final-step nRMSE.
- The U-Net surrogate reduces inference latency, achieving a 5.38× speed-up over the numerical DFN solver.
- Spatial inductive bias is identified as a critical factor influencing surrogate model performance.
Read more
Comparative Study of Neural Surrogate Architectures for Autoregressive Prediction of Internal Battery States
Summary
This paper presents a systematic comparison of four neural network architectures (MLP, ResNet, U-Net, FNO) designed as autoregressive state-transition operators for predicting internal states of lithium-ion batteries based on the Doyle-Fuller-Newman (DFN) model. The DFN model provides high-fidelity estimations of internal electrochemical states but is computationally intensive, making real-time applications challenging. The authors address this limitation by developing machine learning surrogates that can predict these states more efficiently. The study employs a unified training framework to ensure a controlled comparison of the architectures, focusing on their ability to generalize across various operating conditions. The results indicate that the U-Net architecture outperforms the others, achieving a mean final-step normalized root mean square error (nRMSE) of 3% after 300-step autoregressive rollouts, while also providing a significant speed-up of 5.38 times compared to the numerical solver. This research highlights the importance of spatial inductive bias in enhancing surrogate model performance, paving the way for improved battery management systems and digital twins.
Methodology
The authors formulated the problem as a discrete-time state-transition system, where the electrochemical state of the battery is represented as a state vector. They trained four different neural network architectures (MLP, ResNet, U-Net, FNO) under a unified framework using multi-step unrolling and current-conditioning to isolate the impact of spatial inductive bias on predictive accuracy and computational efficiency.
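A minimal sketch of an autoregressive rollout and a final-step nRMSE, assuming a generic `step_fn(state, current)` transition. The toy dynamics stand in for the DFN solver and a surrogate and are purely hypothetical; normalizing RMSE by the reference range is one common convention and may differ from the paper's definition.

```python
import math


def rollout(step_fn, state0, currents):
    """Feed each prediction back in as the next input (autoregressive rollout)."""
    states, state = [], list(state0)
    for i_app in currents:
        state = step_fn(state, i_app)  # conditioned on the applied current
        states.append(state)
    return states


def nrmse(pred, ref):
    """RMSE normalized by the range of the reference values (assumed convention)."""
    rmse = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))
    return rmse / (max(ref) - min(ref))


# Toy stand-ins: a reference "solver" and a slightly biased "surrogate".
def solver_step(state, i_app):
    return [0.99 * x + 0.01 * i_app for x in state]


def surrogate_step(state, i_app):
    return [0.99 * x + 0.0101 * i_app for x in state]


currents = [1.0] * 300          # constant applied current over 300 steps
x0 = [0.0, 0.5, 1.0]            # toy spatial profile of one internal state
ref = rollout(solver_step, x0, currents)
pred = rollout(surrogate_step, x0, currents)
final_err = nrmse(pred[-1], ref[-1])
```

Evaluating only the last step, as here, exposes error accumulated over the whole rollout, which single-step accuracy would hide.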
Results
The U-Net architecture achieved a mean final-step nRMSE of 3% across all internal state variables after 300-step autoregressive rollouts. It also provided a 5.38× speed-up over the numerical DFN solver, demonstrating its effectiveness in real-time applications.
Implications
The findings suggest that the U-Net architecture can be effectively utilized in next-generation battery management systems and digital twins, enhancing internal state observability and operational efficiency in lithium-ion battery applications.
PG-MAP: Joint MAP Optimization for Inference-Time Alignment of Diffusion and Flow-Matching Models
Generative Models
Optimization
Multimodal
- PG-MAP is the first framework to jointly optimize conditioning and latent variables during inference-time alignment.
- The framework employs a forward-consistency coupling, allowing coordinated updates across modalities.
- PG-MAP shows consistent improvements in alignment metrics across different diffusion models.
- Human evaluations indicate a strong preference for outputs generated using PG-MAP compared to existing baselines.
Read more
PG-MAP: Joint MAP Optimization for Inference-Time Alignment of Diffusion and Flow-Matching Models
Summary
The paper introduces PG-MAP, a novel framework for inference-time alignment of pretrained text-to-image models that addresses the limitations of existing methods which typically optimize along a single control axis. PG-MAP formulates the alignment problem as a trajectory-level Gibbs-MAP optimization, allowing for simultaneous updates of conditioning and latent variables during the denoising process. This joint optimization is guided by a frozen preference reward and is compatible with both diffusion and flow-matching models. The framework enhances alignment metrics such as PickScore and Aesthetic across various diffusion backbones and demonstrates significant performance improvements when combined with tuned classifier-free guidance. For flow-matching models, PG-MAP achieves high PickScore and human preference rates, confirming its effectiveness. The analysis reveals that the importance of conditioning and latent optimization varies with prompt types, suggesting further optimization opportunities.
Methodology
PG-MAP utilizes a training-free approach that recasts the denoising process as a proximal MAP problem, optimizing both conditioning and latent variables at each denoising step. It employs a schedule-adaptive trust region and a step-dependent active set to refine the variables dynamically, ensuring that updates are coordinated rather than additive. The framework is adaptable to both diffusion and flow-matching models, with specific adaptations for each transport type.
Results
PG-MAP achieves significant improvements in alignment metrics, with reported PickScore and HPS win rates of 91.9% and 75.7%, respectively, against static baselines for flow-matching models. In human evaluations, PG-MAP was preferred 60-67% of the time over strong baselines, indicating its effectiveness in generating higher-quality images.
Implications
The PG-MAP framework has the potential to enhance the performance of text-to-image generation models, making them more effective in producing coherent and aesthetically pleasing images. Its ability to adaptively optimize during inference could lead to advancements in various applications involving generative models, particularly in creative fields such as art and design.
The Cost Geometry of Belief: finite-resource inference under noisy observation
Theory
- Introduces a cost geometry for beliefs based on optimal transport and Fisher information.
- Establishes that certainty is an unattainable boundary in finite-resource inference.
- Identifies three key results: a wall of certainty, an honesty condition, and a rigidity in belief geometries.
- Demonstrates that the Gaussian distribution is the most hyperbolic belief in this framework.
Read more
The Cost Geometry of Belief: finite-resource inference under noisy observation
Summary
This paper introduces a novel framework for understanding the geometry of beliefs in the context of finite-resource inference under noisy observations. The author proposes a cost geometry that quantifies the cost of transitioning between beliefs, utilizing optimal transport in Wasserstein space, adjusted by Fisher information to reflect the precision of beliefs. The study highlights that certainty, represented as a perfect twin of a system, is unattainable due to both observational and physical constraints, which are encapsulated by Fisher information. The author presents three main results: (1) a 'wall' indicating that well-posed inference requires certainty to be infinitely distant when costs dominate Fisher information; (2) an 'honesty' condition where uniform cost leads to geometries proportional to Fisher information; and (3) a 'rigidity' result showing that these geometries are hyperbolic, with the Gaussian distribution being the most hyperbolic belief. The implications of this work extend to algorithmic applications, such as Kalman filters, which maintain uncertainty and revise beliefs at finite costs, contrasting with systems that operate at the boundary of certainty.
Methodology
The paper employs a theoretical approach, utilizing concepts from optimal transport, information geometry, and Bayesian inference to characterize the geometry of beliefs. It formalizes the relationship between belief transitions and costs, integrating Fisher information to define a cost metric that governs the movement within the belief space.
Results
The main results include the identification of a wall that prevents certainty from being reached, an honesty condition that aligns cost with Fisher information, and the rigidity of hyperbolic geometries in belief space. The findings suggest that the cost of achieving a certain level of precision diverges as one approaches certainty, establishing a geometric floor for belief transitions.
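The divergence near certainty can be illustrated with a scalar Kalman-style update for a static state. This is a textbook example, not the paper's cost metric: it counts noisy observations as a stand-in for cost, and the function names are hypothetical.

```python
def posterior_variance(prior_var, obs_var, n_obs):
    """Variance of a static scalar state after n_obs independent noisy readings."""
    return 1.0 / (1.0 / prior_var + n_obs / obs_var)


def observations_needed(prior_var, obs_var, target_var):
    """Real-valued number of readings required to reach target_var."""
    return obs_var * (1.0 / target_var - 1.0 / prior_var)


# Each tenfold gain in precision costs roughly ten times more observations.
costs = [observations_needed(1.0, 1.0, t) for t in (1e-1, 1e-2, 1e-3)]

# Even a very large but finite number of readings leaves some uncertainty.
residual = posterior_variance(1.0, 1.0, 10 ** 6)
```

The required count grows like `1 / target_var`, so zero variance is reached only in the limit of unbounded observation effort.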
Implications
This framework has significant implications for fields that rely on inference under uncertainty, such as robotics, data assimilation, and machine learning. It provides a geometric perspective on belief dynamics that can enhance algorithms like Kalman filters, improving their efficiency and robustness in uncertain environments.
Efficient Network Inference via Hardware-Aware Architecture Search, Model Pruning & Quantization
Efficient ML
- Investigates efficient network inference for GNSS interference characterization under strict resource constraints.
- Utilizes a deployment-oriented compression pipeline combining pruning and quantization with MCUNet as a baseline.
- Applies hardware-aware zero-shot NAS to optimize network architecture and pruning configurations.
- Demonstrates trade-offs between predictive performance and deployment efficiency through experimental evaluations.
Read more
Efficient Network Inference via Hardware-Aware Architecture Search, Model Pruning & Quantization
Summary
This paper addresses the challenge of efficient network inference for embedded global navigation satellite system (GNSS) interference monitoring, which requires rapid and memory-efficient processing of large volumes of in-phase and quadrature (IQ) samples. The authors propose a framework that combines iterative structured pruning, post-training static quantization, and hardware-aware zero-shot neural architecture search (NAS) to optimize deep neural networks (DNNs) for resource-constrained environments. Starting from the MCUNet architecture, the study evaluates how model compression and architecture optimization impact model size, computational complexity, and memory usage while preserving performance. Experiments conducted on a GNSS interference dataset demonstrate the effectiveness of the proposed methods, revealing that the combination of compression techniques and hardware-aware design significantly enhances the deployability of ML models on embedded platforms such as the iMXRT1062 MCU and Raspberry Pi devices. The findings offer practical insights for developing compact ML models suitable for real-time GNSS interference monitoring.
Methodology
The methodology involves a combination of iterative structured pruning to reduce model complexity, post-training static quantization to minimize memory usage, and hardware-aware zero-shot neural architecture search (NAS) to optimize network design without full training. The approach is evaluated using a GNSS interference dataset, focusing on both classification and characterization tasks.
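Two of the compression steps named above can be sketched in plain Python, under stated assumptions: channels are ranked by L1 norm, and quantization is symmetric per-tensor int8. The summary does not say which pruning criterion or quantization scheme the authors use, and the iterative fine-tuning, activation calibration, and NAS stages are omitted here.

```python
def prune_channels(weight_rows, keep_ratio):
    """Structured pruning: drop whole output channels with the smallest L1 norm."""
    n_keep = max(1, round(keep_ratio * len(weight_rows)))
    ranked = sorted(range(len(weight_rows)),
                    key=lambda i: sum(abs(w) for w in weight_rows[i]),
                    reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve original channel order
    return [weight_rows[i] for i in kept], kept


def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: q = round(v / scale)."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return [qi * scale for qi in q]


# Four output channels with two weights each (toy layer).
weights = [[0.5, -0.5], [0.01, 0.02], [1.0, 0.2], [0.0, -0.03]]
pruned, kept = prune_channels(weights, keep_ratio=0.5)

flat = [v for row in pruned for v in row]
q, scale = quantize_int8(flat)
recon = dequantize(q, scale)
```

Removing whole channels shrinks the dense tensor itself, which is why structured pruning tends to translate into real memory and latency savings on microcontrollers.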
Results
The results indicate that the proposed framework effectively reduces model size and computational requirements while maintaining high predictive performance. The experiments reveal that the optimized models can operate efficiently on embedded platforms, providing practical configurations that are competitive with uncompressed baselines.
Implications
The findings have significant implications for the deployment of machine learning models in resource-constrained environments, particularly for real-time applications in GNSS interference monitoring. The methodologies developed can be applied to other domains requiring efficient model inference on embedded systems.
Robustness Cannot be Reduced to Regularization: Studying Adversarial Training Beyond the Linear Case
Theory
Optimization
Efficient ML
- Adversarial training is effective but computationally expensive.
- No equivalence between adversarial risk and regularized risk exists for two-layer networks.
- Empirical evidence on Wide-ResNets suggests that the impossibility of reformulating adversarial risk persists in deeper architectures.
- The study emphasizes the need for new methodologies in adversarial training beyond linear models.
Read more
Robustness Cannot be Reduced to Regularization: Studying Adversarial Training Beyond the Linear Case
Summary
This paper addresses the significant challenge of adversarial vulnerability in machine learning models, particularly focusing on the limitations of adversarial training beyond linear models. While adversarial training has proven effective, its high computational cost poses a barrier to practical implementation. Previous research has shown that for linear models, adversarial risk can be reformulated as a regularized risk, allowing for more efficient training. However, this paper demonstrates that such an equivalence does not hold for two-layer networks, a class of models that is more expressive than linear models. The authors provide formal proofs indicating that the adversarial risk cannot be simplified into a regularized form that exhibits weak data dependence. They further extend their analysis to deeper architectures like Wide-ResNets, providing empirical evidence that the impossibility of such reformulations persists. This work highlights the fundamental differences between adversarial risk and regularized risk, suggesting that new approaches are needed for robust training in complex models.
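The linear-case equivalence mentioned above can be checked numerically. For a linear score and perturbations bounded in the l-infinity norm (an assumed threat model, since the summary does not name one), the worst-case margin equals the clean margin minus `eps` times the l1 norm of the weights. The sketch compares that closed form with a brute-force search over the corners of the perturbation box.

```python
from itertools import product


def worst_case_margin_bruteforce(w, x, y, eps):
    """Minimize y * w.(x + delta) over the corners of the l-inf ball of radius eps.

    A linear function over a box attains its minimum at a corner.
    """
    best = float("inf")
    for signs in product((-1.0, 1.0), repeat=len(w)):
        margin = y * sum(wi * (xi + eps * s) for wi, xi, s in zip(w, x, signs))
        best = min(best, margin)
    return best


def worst_case_margin_closed_form(w, x, y, eps):
    """Clean margin minus an l1 penalty on the weights."""
    clean = y * sum(wi * xi for wi, xi in zip(w, x))
    return clean - eps * sum(abs(wi) for wi in w)


w, x, y, eps = [0.5, -1.0, 2.0], [1.0, 0.3, -0.2], 1.0, 0.1
brute = worst_case_margin_bruteforce(w, x, y, eps)
closed = worst_case_margin_closed_form(w, x, y, eps)
```

The penalty term depends on the weights alone, which is what makes the linear reformulation cheap; the paper's claim is that no comparably data-independent form exists once a hidden layer is added.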
Methodology
The authors conducted a theoretical analysis of adversarial risk in two-layer networks, employing formal proofs to demonstrate the lack of equivalence with regularized risk. They also provided empirical evaluations on Wide-ResNets to support their theoretical findings.
Results
The main result is the formal proof that adversarial risk cannot be expressed as a simple regularized risk for two-layer networks. Empirical evidence suggests that this limitation persists in more complex architectures, indicating a fundamental challenge in adversarial training.
Implications
The findings suggest that current approaches to adversarial training may not scale effectively to more complex models, necessitating the development of new strategies for ensuring robustness in machine learning systems, particularly in safety-critical applications.
On the Curse of Dimensionality in Private Sparse Covariance Estimation and PCA
Theory
- Demonstrates a significant curse of dimensionality in DP covariance estimation and PCA.
- Establishes poly(k, log d) sample complexity for DP PCA under additional sparsity assumptions.
- Provides poly(d) lower bounds for both sparse covariance estimation and PCA under DP.
- First to show an exponential gap between private and non-private sample complexities in sparse estimation.
Read more
On the Curse of Dimensionality in Private Sparse Covariance Estimation and PCA
Summary
This paper investigates the challenges of high-dimensional differentially private (DP) covariance estimation and principal component analysis (PCA) under k-row-column sparsity (k-RCS) of the covariance matrix. It highlights a significant gap in sample complexity between private and non-private settings, where non-private methods require poly(k, log d) samples, while existing DP methods necessitate Ω(d) samples. The authors demonstrate that under certain conditions, specifically when the leading eigenvector is sparse, it is possible to achieve poly(k, log d) sample complexity for DP PCA. They also establish lower bounds showing an exponential gap between private and non-private variants when k is polylogarithmic in d. This work is notable for being the first to demonstrate such a separation in the context of sparse estimation problems in private high-dimensional statistics, and it provides insights into the inherent challenges of achieving differential privacy in these settings.
Methodology
The authors develop new algorithms for differentially private sparse covariance estimation and PCA, leveraging structural assumptions about the covariance matrix and the leading eigenvector. They analyze the privacy and utility of these algorithms, providing both upper and lower bounds for sample complexity under different models of sparsity.
Results
The paper presents upper bounds showing that under certain sparsity conditions, poly(k, log d) samples are sufficient for DP PCA, while establishing lower bounds indicating that Ω(d) samples are necessary for general DP covariance estimation and PCA. This reveals a stark contrast in sample requirements between private and non-private settings, particularly when k is polylogarithmic in d.
Implications
The findings suggest that achieving differential privacy in high-dimensional statistics, particularly for sparse covariance estimation and PCA, is fundamentally more challenging than in non-private settings. This has implications for the design of algorithms in sensitive data applications, where privacy guarantees are essential.
SamatNext v0.2-B: An Exploratory Study of RMS-Normalized Hybrid Decoders for Curriculum Retention in Small Code Models
NLP
Large Language Models
- SamatNext v0.2-B demonstrates improved retention of prior capabilities compared to a standard Transformer baseline.
- The hybrid architecture effectively balances retention and plasticity in curriculum learning settings.
- Despite improvements, both models face challenges with catastrophic forgetting, particularly in early-stage syntax tasks.
- The study emphasizes the importance of structured curriculum learning in training adaptive models.
Read more
SamatNext v0.2-B: An Exploratory Study of RMS-Normalized Hybrid Decoders for Curriculum Retention in Small Code Models
Summary
This paper presents SamatNext v0.2-B, a 356-million-parameter hybrid sequence decoder designed to mitigate catastrophic forgetting in autoregressive Transformer models trained on sequential curriculum distributions. The model integrates Differential-Attention-style layers and DeltaNet-inspired simplified linear-state mixer layers, enhanced with RMS Normalization and Output Scale Calibration. The study evaluates the model's performance on a structured Python code curriculum, comparing it to a parameter-matched Transformer baseline. Results indicate that SamatNext v0.2-B achieves a 100% pass rate on a controlled Stage 5 holdout while retaining 98.8% of its capabilities from an adjacent Stage 3. In contrast, the baseline Transformer achieves a 49.4% pass rate on Stage 5 but retains only 3.8% of Stage 3 performance. Even with an increased learning rate, the baseline struggles with retention. Both models exhibit performance degradation on early-stage adversarial syntax tasks, highlighting the ongoing challenge of catastrophic forgetting in long-horizon training scenarios. The authors provide code, model specifications, and evaluation scripts for independent verification, framing this work as an exploratory study rather than a definitive solution to the problem.
Methodology
The methodology involves training the SamatNext v0.2-B model, whose decoder alternates between Differential-Attention and linear-state mixer layers, on a multi-stage Python curriculum. The model's performance is evaluated against a parameter-matched Transformer baseline using controlled pass/fail metrics.
Results
SamatNext v0.2-B achieved a 100% pass rate on Stage 5 and retained 98.8% of its Stage 3 capabilities, while the baseline Transformer achieved only a 49.4% pass rate on Stage 5 with a mere 3.8% retention of Stage 3 performance. Both models struggled with early-stage syntax tasks, indicating limitations in addressing catastrophic forgetting.
Implications
The findings suggest that hybrid architectures may offer a pathway to improve retention in sequential learning tasks, particularly in code generation. This could have implications for developing more adaptive and continually learning systems in programming and other domains.
When Calibration Fails the Vulnerable Hospital: Federated Conformal Risk Control via Risk-Curve Shrinkage
Federated Learning
Computer Vision
Theory
- Quantifies the marginal-conditional coverage gap in federated CRC for medical segmentation.
- Proposes a shrinkage-based federated CRC protocol that enhances prediction set efficiency.
- Demonstrates that naive pooling of calibration scores can lead to critical failures in individual site coverage.
- Identifies the necessity of finite-sample correction terms to avoid excessive violations.
Read more
When Calibration Fails the Vulnerable Hospital: Federated Conformal Risk Control via Risk-Curve Shrinkage
Summary
This paper addresses the challenges of deploying conformal risk control (CRC) in federated learning settings, particularly in medical segmentation tasks across multiple hospitals. The author highlights that traditional methods of pooling calibration scores can lead to significant violations of coverage guarantees at individual institutions, despite appearing well-calibrated on average. Using a dataset of 1,251 brain tumor volumes from 20 institutions, the study quantifies the marginal-conditional coverage gap, revealing that 40% of hospitals exceed the target false-negative rate. The paper proposes a novel shrinkage-based federated CRC protocol that allows each site to transmit only its empirical risk curve to a central server, which then computes a shrinkage-regularized threshold for each site. This approach effectively balances coverage and prediction set efficiency, achieving a significant reduction in violations while maintaining the integrity of the CRC guarantees. The findings underscore the importance of considering site-specific characteristics in federated learning applications in healthcare.
Methodology
The study employs a shrinkage-based federated CRC protocol where each site computes its local empirical risk curve and transmits it to a central server. The server then computes a shrinkage-regularized threshold for each site, allowing for a trade-off between worst-case coverage and prediction set efficiency. The method includes sensitivity analysis to identify optimal hyperparameters.
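A minimal sketch of the protocol's shape, under stated assumptions. The threshold rule uses the standard conformal risk control finite-sample correction, `n/(n+1) * R_hat + B/(n+1) <= alpha`. The linear shrinkage toward a size-weighted pooled curve, the direction of the threshold, and all names and numbers are hypothetical; the summary does not give the authors' exact shrinkage form.

```python
def crc_threshold(risk_curve, lambdas, n, alpha, loss_bound=1.0):
    """Smallest lambda whose finite-sample-corrected risk is <= alpha.

    Assumes risk is non-increasing in lambda; falls back to the largest
    lambda on the grid if no value qualifies.
    """
    for lam, risk in zip(lambdas, risk_curve):
        if n / (n + 1) * risk + loss_bound / (n + 1) <= alpha:
            return lam
    return lambdas[-1]


def pooled_curve(site_curves, site_sizes):
    """Size-weighted average of the per-site empirical risk curves."""
    total = sum(site_sizes)
    return [sum(n * c[j] for n, c in zip(site_sizes, site_curves)) / total
            for j in range(len(site_curves[0]))]


def shrink_curves(site_curves, site_sizes, gamma):
    """Shrink each site's curve toward the pooled curve (assumed linear form)."""
    pooled = pooled_curve(site_curves, site_sizes)
    return [[(1 - gamma) * c[j] + gamma * pooled[j] for j in range(len(pooled))]
            for c in site_curves]


lambdas = [0.1, 0.3, 0.5, 0.7, 0.9]
curves = [[0.30, 0.15, 0.08, 0.04, 0.01],   # large site, low risk
          [0.50, 0.35, 0.20, 0.12, 0.05]]   # small, harder site
sizes = [200, 20]
alpha = 0.1

# Naive pooling: one threshold for everyone.
pooled_lambda = crc_threshold(pooled_curve(curves, sizes), lambdas, sum(sizes), alpha)

# Shrinkage: one threshold per site, each with its own sample-size correction.
site_lambdas = [crc_threshold(c, lambdas, n, alpha)
                for c, n in zip(shrink_curves(curves, sizes, gamma=0.5), sizes)]
```

In this toy setup the pooled threshold leaves the small site with an empirical risk of 0.20 against a target of 0.10, while the per-site rule assigns it a more conservative threshold.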
Results
The proposed method significantly reduces the number of coverage violations from 8 out of 20 institutions to only 2.7 violations at a 2.0× stretch, demonstrating improved efficiency in prediction sets while maintaining coverage guarantees. The study also shows that direct Lagrangian optimization fails to protect vulnerable hospitals, emphasizing the importance of the finite-sample correction term.
Implications
The findings have significant implications for the deployment of machine learning models in healthcare, particularly in ensuring reliable segmentation outputs across diverse hospital settings. The proposed method can enhance the safety and effectiveness of clinical decision-making by providing better-calibrated predictions.
Short-Term Electricity Demand Forecasting for New England Using a Hybrid Transformer-XGBoost Framework with Weather, Calendar, and COVID-19 Indicators
Time Series
- The hybrid Transformer-XGBoost framework outperforms a tabular-only XGBoost model in short-term electricity demand forecasting (test RMSE of 8,876 vs. 9,304 MWh; MAPE of 2.05% vs. 2.21%).
- COVID-19 indicators initially improved model accuracy but became less relevant as behavioral adaptations occurred post-pandemic.
- Hyperparameter optimization using Optuna enhanced the model's performance through efficient search strategies.
- The study emphasizes the importance of considering temporal validity decay in forecasting models affected by structural changes in demand patterns.
Read more
Short-Term Electricity Demand Forecasting for New England Using a Hybrid Transformer-XGBoost Framework with Weather, Calendar, and COVID-19 Indicators
Summary
This paper addresses the challenge of accurate short-term electricity demand forecasting in New England, which is crucial for effective power system operation and market planning. The authors propose a hybrid framework that combines a Transformer encoder for temporal feature extraction with XGBoost, a gradient-boosted decision tree model, to forecast daily electricity demand. The model incorporates a variety of features, including meteorological data from six cities across New England, calendar effects, autoregressive demand lags, and COVID-19 indicators. Hyperparameter optimization is performed using Optuna, resulting in a robust model that achieves a test RMSE of 8,876 MWh, MAPE of 2.05%, and R-squared of 0.906. The study also conducts an ablation analysis to assess the impact of COVID-19 features, revealing that while these features improved training accuracy, their predictive value diminished over time as behavioral adaptations occurred post-pandemic. The findings highlight the importance of temporal validity decay in forecasting models, suggesting that reliance on outdated pandemic patterns can lead to overfitting and reduced accuracy in predictions.
Methodology
The authors developed a hybrid forecasting model that utilizes a Transformer encoder for extracting temporal features from historical demand data, which is then combined with a rich set of engineered features (including weather, calendar, and COVID-19 indicators) and fed into an XGBoost model. Hyperparameter optimization was conducted using Optuna, employing a Bayesian optimization approach to fine-tune model parameters across 500 trials.
Results
The hybrid model achieved a test RMSE of 8,876 MWh, MAPE of 2.05%, and R-squared of 0.906, outperforming a baseline XGBoost model with RMSE of 9,304 MWh and MAPE of 2.21%. An ablation study indicated that removing COVID-19 features reduced the hybrid model's RMSE by 3.2% and marginally improved the XGBoost-only model, by 1.2%. The Diebold-Mariano test confirmed that the performance difference was statistically insignificant.
Implications
The findings suggest that while hybrid models can enhance forecasting accuracy, it is crucial to continuously evaluate the relevance of features, especially those influenced by external events like the COVID-19 pandemic. This research can inform energy market operators and policymakers about the dynamics of electricity demand and the importance of adapting forecasting models to changing behavioral patterns.
Bypassing Minimization Bias: A Shift-Invariant Variance Estimator for Off-Equilibrium Local Learning Coefficients
Theory
Optimization
- Introduction of the Shift-Invariant Variance Estimator (SIVE) to bypass minimization bias in LLC estimation.
- SIVE structurally eliminates the need for the local minimum by using variance and a noise-debiasing correction.
- Controlled experiments validate SIVE's effectiveness in recovering geometric signals in off-equilibrium settings.
- SIVE is scalable to deep neural networks, enabling real-time tracking of structural phase transitions during training.
Read more
Bypassing Minimization Bias: A Shift-Invariant Variance Estimator for Off-Equilibrium Local Learning Coefficients
Summary
This paper addresses a critical limitation in the application of Singular Learning Theory (SLT) for estimating Local Learning Coefficients (LLC) during off-equilibrium training phases of neural networks. Traditional LLC estimators rely on knowing the local minimum of the loss landscape, which is often inaccessible during training. The authors introduce the Shift-Invariant Variance Estimator (SIVE), which circumvents the need for this local minimum by utilizing a variance-based approach. SIVE employs the variance operator to eliminate the unknown additive baseline that typically introduces minimization bias. The authors derive a correction based on the Law of Total Variance to separate true geometric fluctuations from noise in mini-batch evaluations. Through controlled experiments on toy models, SIVE demonstrates its ability to recover expected geometric signals where conventional mean estimators fail. When applied to deep neural networks, SIVE serves as a robust diagnostic tool for tracking structural phase transitions throughout the training process, providing insights into the dynamics of the loss landscape.
Methodology
The methodology involves formulating LLC estimation as an off-equilibrium local probe using the variance operator. The authors derive an explicit correction for mini-batch noise using the Law of Total Variance, allowing SIVE to decouple geometric fluctuations from stochastic evaluation noise. The approach is validated through controlled experiments on toy models and applied to deep neural networks.
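The two ingredients named above, shift invariance of the variance and a Law of Total Variance correction for mini-batch noise, can be illustrated with a toy sketch. This is not the authors' estimator: the data layout (several independent mini-batch loss evaluations per sampled parameter point), the function names, and the Gaussian simulation are all assumptions made for illustration.

```python
import random
from statistics import fmean, variance

def naive_variance(batch_losses):
    """Sample variance of the per-point mean losses.

    In expectation this equals the geometric variance plus
    (noise variance / m), so it is inflated by mini-batch noise.
    """
    return variance([fmean(row) for row in batch_losses])

def debiased_variance(batch_losses):
    """Law of Total Variance correction (illustrative, assumed layout).

    batch_losses[i] holds m >= 2 independent mini-batch loss evaluations
    at the i-th sampled parameter point. The average within-point
    variance estimates the noise, which is subtracted after scaling by 1/m.
    """
    m = len(batch_losses[0])
    noise = fmean([variance(row) for row in batch_losses])
    return naive_variance(batch_losses) - noise / m

def simulate(n_points=5000, m=4, geo_std=1.0, noise_std=2.0,
             baseline=0.0, seed=0):
    """Toy data: true loss = baseline + geometric fluctuation, plus noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_points):
        true_loss = baseline + rng.gauss(0.0, geo_std)
        rows.append([true_loss + rng.gauss(0.0, noise_std) for _ in range(m)])
    return rows
```

With the default toy settings, the naive variance should land near 2 (geometric variance 1 plus noise variance 4 divided by m = 4), while the debiased value should land near 1. Changing `baseline` leaves both values unchanged up to rounding, which is the shift invariance that removes the need to know the loss at the minimum.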
Results
SIVE successfully recovers the expected finite-temperature geometric signals in scenarios where traditional mean estimators fail. It also demonstrates the capability to track structural phase transitions in deep neural networks throughout the training process.
Implications
The development of SIVE provides a new tool for researchers to analyze the geometry of loss landscapes in neural networks, particularly during off-equilibrium training phases. This can enhance understanding of training dynamics and improve diagnostic capabilities in machine learning.
Fisher-Geometric Sharpness and the Implicit Bias of SGD toward Flat Minima
Theory
Optimization
- Introduces Riemannian sharpness as an invariant measure of flatness under reparametrization.
- Establishes a connection between SGD's implicit bias and Riemannian flat minima through a derived SDE.
- Demonstrates a PAC-Bayes generalization bound explicitly controlled by Riemannian sharpness.
- Empirical validation shows Riemannian sharpness better predicts generalization than Euclidean sharpness.
Summary
This paper examines the widely held view that stochastic gradient descent (SGD) favors flat minima, which are believed to generalize better in deep learning. The authors critique existing measures of flatness, such as the trace or maximum eigenvalue of the loss Hessian, because they are not invariant under reparametrizations that preserve the network function. To resolve this, they use Riemannian geometry with the Fisher Information Matrix (FIM) to define Riemannian sharpness, which is invariant under such reparametrizations. The study models the gradient noise of mini-batch SGD with a covariance proportional to the FIM, derives the stationary distribution of the resulting stochastic differential equation (SDE), and shows that its probability mass concentrates at Riemannian-flat minima. A PAC-Bayes generalization bound links Riemannian sharpness to test performance. Empirical results on MNIST and CIFAR-10 indicate that Riemannian sharpness correlates with generalization more reliably than traditional Euclidean sharpness. The work unifies invariant sharpness with SGD's implicit bias and generalization bounds, offering a theoretical account of why flat minima generalize well.
Methodology
The authors define Riemannian sharpness using the Fisher Information Matrix (FIM) to ensure invariance under reparametrization. They formalize the gradient noise of mini-batch SGD and derive the stationary distribution of the resulting stochastic differential equation (SDE). They also establish a PAC-Bayes generalization bound linked to Riemannian sharpness.
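The invariance argument can be made concrete with a short derivation. The summary does not state the paper's exact definition, so the quantity below, $\operatorname{tr}(F^{-1}H)$ with $F$ the FIM (assumed invertible) and $H$ the loss Hessian, is only an assumed stand-in for a Fisher-normalized sharpness, and the reparametrization is restricted to an invertible linear map $\theta = A\varphi$.

```latex
% Assumption: S_F = tr(F^{-1} H) is an illustrative Fisher-normalized
% sharpness, not necessarily the paper's definition.
\begin{align*}
\theta &= A\varphi, \qquad A \text{ invertible} \\
H_\varphi &= A^\top H_\theta A, \qquad F_\varphi = A^\top F_\theta A \\
\operatorname{tr}\!\left(F_\varphi^{-1} H_\varphi\right)
  &= \operatorname{tr}\!\left(A^{-1} F_\theta^{-1} A^{-\top} A^\top H_\theta A\right)
   = \operatorname{tr}\!\left(A^{-1} F_\theta^{-1} H_\theta A\right)
   = \operatorname{tr}\!\left(F_\theta^{-1} H_\theta\right) \\
\operatorname{tr}\!\left(H_\varphi\right)
  &= \operatorname{tr}\!\left(A^\top H_\theta A\right)
   \neq \operatorname{tr}\!\left(H_\theta\right) \quad \text{in general}
\end{align*}
```

The Euclidean trace changes with $A$ (for example, $A = cI$ rescales it by $c^2$), while the Fisher-normalized trace does not, because the Hessian and the FIM pick up the same factors and these cancel.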
Results
The study proves that mini-batch SGD assigns exponentially greater probability mass to Riemannian-flat minima. The empirical results on MNIST and CIFAR-10 confirm that Riemannian sharpness correlates with generalization performance, outperforming traditional Euclidean sharpness metrics.
Implications
This work provides a more robust theoretical framework for understanding the behavior of SGD in deep learning, potentially guiding the design of optimization algorithms that favor flat minima for improved generalization. It also opens avenues for further research into invariant measures of model performance.
Computational Methods and Challenges in Cell-Free DNA Analysis for Multi-Cancer Early Detection
Multimodal
- cfDNA is a promising biomarker for non-invasive multi-cancer early detection.
- The review categorizes computational methods into statistical, machine learning, and deep learning approaches.
- Multimodal ensemble approaches are identified as having the highest readiness for clinical integration.
- Standardization of evaluation protocols is crucial for future research and comparison.
Summary
This paper reviews computational methods developed between 2022 and 2025 for analyzing cell-free DNA (cfDNA) in the context of multi-cancer early detection (MCED). The authors discuss the biological basis of cfDNA signals and the significance of fragmentomics and epigenetic features in cancer detection. They evaluate classical statistical methods, machine learning approaches, and deep learning frameworks, including autoencoder-based models, emphasizing their biological interpretability and clinical readiness. The review identifies technical, computational, and methodological challenges in the field and highlights the need for standardized evaluation protocols to facilitate future research and comparisons. The findings suggest that multimodal ensemble approaches show the most promise for clinical integration, and that standardized reporting of results is needed to assess methods fairly.
Methodology
The authors conducted a comprehensive review of computational methods for cfDNA analysis, focusing on fragmentomics and epigenetic features. They analyzed classical statistical methods, machine learning techniques, and deep learning frameworks, assessing their biological interpretability and clinical validation strategies.
Results
The review indicates that while various computational methods exist for cfDNA analysis, multimodal ensemble approaches are the most promising for clinical application. It also highlights the need for standardized protocols to improve the reliability and comparability of results across studies.
Implications
The findings suggest that advancements in cfDNA analysis could lead to significant improvements in early cancer detection, potentially transforming cancer screening practices and reducing treatment costs. The emphasis on standardization may enhance the clinical adoption of these technologies.
Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates
NLP
Large Language Models
Efficient ML
- Formalizes adapter mergeability for LoRA, separating single-task utility from post-merge retention.
- Introduces MergeProbe, a lightweight predictor that estimates mergeability based on early training signals.
- Demonstrates improved retention in merging adapters across multiple domains compared to existing methods.
- Shifts the merging process from a post-hoc evaluation to an anticipatory measurement problem.
Summary
This paper addresses the challenge of merging low-rank adaptation (LoRA) updates for parameter-efficient fine-tuning of language models. The authors note that the mergeability of task-specific adapters is often assessed only after full training, which leads to costly failures when adapters that perform well independently interfere with each other after merging. To mitigate this, they formalize adapter mergeability, defining it as the ability of an adapter to maintain its utility after being merged with others. They propose a lightweight predictor called MergeProbe, which uses early training signals to forecast mergeability, allowing proactive decisions on whether to merge, reweight, prune, or route adapters. They validate the approach on the MERGE-PEFT benchmark across five domains, showing that MergeProbe outperforms existing interference-aware merge baselines while requiring significantly less overhead. The work turns merging from a reactive workflow into an anticipatory one, potentially making deployment of specialized language model adapters more efficient.
Methodology
The authors define mergeability in terms of single-task utility and post-merge retention, evaluating it at pairwise, adapter, and set levels. They identify measurable signals during early training that indicate potential conflicts when merging adapters. MergeProbe aggregates these signals to inform decisions on merging, reweighting, pruning, or routing adapters.
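The split between single-task utility and post-merge retention can be sketched as follows. The summary does not give the paper's formula, so this assumes retention is the ratio of an adapter's post-merge utility to its single-task utility; the function names and scores are hypothetical and do not reflect MergeProbe itself or the MERGE-PEFT benchmark.

```python
def retention(single_task_utility, post_merge_utility):
    """Fraction of single-task utility kept after merging (assumed definition)."""
    if single_task_utility <= 0:
        raise ValueError("single-task utility must be positive")
    return post_merge_utility / single_task_utility

def summarize_retention(single, merged):
    """Return per-adapter retention plus its average and worst case.

    `single` and `merged` map adapter names to utility scores measured
    before and after merging (hypothetical inputs).
    """
    per_adapter = {name: retention(single[name], merged[name]) for name in single}
    values = list(per_adapter.values())
    return per_adapter, sum(values) / len(values), min(values)

# Made-up scores for three adapters, for illustration only.
single = {"code": 0.80, "math": 0.50, "legal": 0.60}
merged = {"code": 0.72, "math": 0.50, "legal": 0.30}
per_adapter, average, worst = summarize_retention(single, merged)
```

Reporting the worst case alongside the average matters here: in the made-up numbers, the average retention looks acceptable even though the `legal` adapter loses half its utility after merging.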
Results
MergeProbe achieved the best average and worst-case retention rates on the MERGE-PEFT benchmark, outperforming strong interference-aware merge baselines while incurring less deployment overhead. This indicates that the early training signals effectively predict mergeability, allowing for better management of adapter updates.
Implications
The findings suggest that proactive management of adapter merging can lead to more efficient deployment of language models in various applications, reducing the risk of performance degradation when combining specialized adapters. This approach could be particularly beneficial in environments where multiple task-specific models are maintained.