AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
65 papers today · 8h update frequency · 7 days of history
Mask Is What DLLM Needs: A Masked Data Training Paradigm for Diffusion LLMs
NLP
Large Language Models
Generative Models
- Introduction of a masked data training paradigm for DLLMs that enhances reasoning capabilities.
- Development of an Information Density Driven Smart Noise Scheduler that focuses on high-density information regions.
- Implementation of Complementary Priority Masking to balance logical reasoning and syntactic structure in training.
- Empirical results show a 4% accuracy improvement on reasoning benchmarks compared to traditional methods.
Mask Is What DLLM Needs: A Masked Data Training Paradigm for Diffusion LLMs
Summary
This paper introduces a novel training paradigm for discrete diffusion language models (DLLMs) that addresses the limitations of traditional uniform random noise scheduling. The authors propose an Information Density Driven Smart Noise Scheduler that prioritizes high-density information regions in training data, allowing the model to focus on critical logical deductions rather than low-density syntactic elements. By employing Complementary Priority Masking, the training instances are decoupled into logical and syntactic samples, enhancing the model's ability to learn both reasoning and structural language features. Experimental results show a 4% improvement in accuracy across four benchmarks in code and math reasoning tasks compared to standard methods. Mechanistic analyses indicate that this approach effectively mitigates contextual collapse during training, demonstrating the potential of density-aware strategies in optimizing diffusion models with minimal annotation costs.
Methodology
The proposed methodology involves two main steps: first, identifying information-dense regions in the training data using predefined rules or large language models; second, applying Complementary Priority Masking to create two types of samples—logical and syntactic—ensuring the model learns both reasoning and language structure effectively.
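The complementary masking step can be sketched as a toy (the token names, the rule-based density flags, and the `<mask>` token here are illustrative assumptions, not the paper's exact procedure): every position is masked in exactly one of two complementary samples, with information-dense positions masked in the logical sample and the remainder in the syntactic sample.

```python
import numpy as np

# Toy sketch of Complementary Priority Masking (illustrative, not the
# paper's code): density flags mark information-dense tokens; the
# "logical" sample masks those, the "syntactic" sample masks the rest.
tokens = ["x", "=", "3", "+", "4", ";", "print", "(", "x", ")"]
dense = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0], dtype=bool)  # rule-based flags

MASK = "<mask>"
logical = [MASK if d else t for t, d in zip(tokens, dense)]     # learn reasoning
syntactic = [t if d else MASK for t, d in zip(tokens, dense)]   # learn structure
```

Each position is masked in exactly one of the two samples, so together they cover the whole sequence without overlap.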
Results
The proposed method resulted in an average accuracy improvement of 4% across four code and math reasoning benchmarks compared to models trained with standard uniform noise scheduling. Mechanistic analyses confirmed that the new noise scheduling effectively mitigated contextual collapse.
Implications
This research suggests that density-aware training paradigms can significantly enhance the performance of diffusion language models, making them more effective for complex reasoning tasks. The findings could lead to more efficient training methodologies in various applications of natural language processing and machine learning.
Cost Trade-offs in Matrix Inversion Updates for Streaming Outlier Detection
Theory
Efficient ML
Optimization
- Introduces the Christoffel function and its relevance to outlier detection.
- Derives computational costs for three matrix inversion update methods: DI, ISM, and WMI.
- Provides a simple rule for selecting the optimal update method based on matrix size and update rank.
- Validates theoretical findings with comprehensive simulations.
Cost Trade-offs in Matrix Inversion Updates for Streaming Outlier Detection
Summary
This technical note addresses the computational challenges associated with matrix inversion updates in the context of streaming outlier detection using the Christoffel function (CF) as an outlier score. The authors compare three matrix inversion update methods: Direct Inversion (DI), Iterative Sherman-Morrison (ISM), and Woodbury Matrix Identity (WMI). They derive the theoretical computational costs for each method and validate these findings through Python simulations. The study reveals that ISM is optimal for rank-1 updates, WMI is best for small updates relative to matrix size, and DI is preferable in other scenarios. This work contributes to the efficiency of online outlier detection techniques by providing a clear guideline for selecting the most suitable matrix inversion update method based on specific conditions.
Methodology
The authors derive theoretical computational costs for three matrix inversion update methods (DI, ISM, WMI) and conduct extensive Python simulations to validate these findings. The comparison focuses on the efficiency and suitability of each method in the context of streaming data.
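The rank-1 trade-off the note analyzes can be seen in a generic numpy sketch (an illustration of the identity, not the authors' code): the Sherman-Morrison update refreshes an existing inverse in O(n²) instead of re-inverting in O(n³).

```python
import numpy as np

# Rank-1 update via the Sherman-Morrison identity vs. direct inversion:
# (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u)
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned matrix
u, v = rng.standard_normal(n), rng.standard_normal(n)

A_inv = np.linalg.inv(A)                          # assumed already available
Au = A_inv @ u
ism = A_inv - np.outer(Au, v @ A_inv) / (1.0 + v @ A_inv @ u)  # O(n^2) update
di = np.linalg.inv(A + np.outer(u, v))                         # O(n^3) redo

assert np.allclose(ism, di)
```

For a stream of rank-1 observations this is exactly the regime where the note finds ISM optimal; for larger-rank batches the Woodbury identity or direct re-inversion takes over.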
Results
The analysis indicates that ISM is optimal for rank-1 updates, WMI is advantageous for small updates relative to the matrix size, and DI is the best choice for larger updates. The results provide a unified reference for practitioners in selecting matrix inversion strategies in streaming outlier detection.
Implications
The findings have significant implications for real-time anomaly detection systems, particularly in applications like fraud detection and quality control, where efficient processing of streaming data is crucial. The proposed guidelines can enhance the performance and scalability of outlier detection methods.
Discovering the Hidden Role of Gini Index In Prompt-based Classification
NLP
Large Language Models
Optimization
- The Gini Index serves as a valuable tool for detecting and optimizing class accuracy imbalances in classification tasks.
- Significant relative accuracy imbalances exist in prompt-based classification for both text and image data.
- A post-hoc bias mitigation method based on the Gini Index can effectively reduce accuracy disparities across classes.
- The proposed method is model-agnostic, making it applicable to various classification scenarios without the need for retraining.
Discovering the Hidden Role of Gini Index In Prompt-based Classification
Summary
This paper investigates the Gini Index's role in addressing accuracy imbalances in classification tasks, particularly in the context of prompt-based classification using large language models (LLMs) and vision models. The author highlights the challenge of long-tailed distributions in classification, where minority classes often yield critical predictions but suffer from low accuracy. The study introduces the Gini Index as a metric for detecting and optimizing disparities in class accuracy. Through empirical analysis, the paper demonstrates that significant relative accuracy imbalances exist across both text and image classification tasks. The author proposes a post-hoc, model-agnostic bias mitigation method that leverages the Gini Index to reduce these imbalances. Experimental results across various classification scenarios indicate that this method effectively minimizes both relative and absolute accuracy disparities, enhancing the performance of underrepresented classes while reducing the dominance of more frequent classes.
Methodology
The author empirically analyzes Gini scores in real-world LLMs and vision models, demonstrating the presence of accuracy imbalances. A post-hoc model-agnostic bias mitigation method is proposed, utilizing the Gini Index as an optimization metric to address these disparities in class predictions.
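As a concrete illustration, a standard Gini coefficient over per-class accuracies (mean absolute pairwise difference normalized by twice the mean; the paper's exact formulation may differ) is near zero for balanced classes and grows with imbalance.

```python
import numpy as np

def gini(x):
    # Standard Gini coefficient: mean absolute difference between all
    # pairs, normalized by twice the mean. Illustrative metric; the
    # paper's precise definition is not reproduced here.
    x = np.asarray(x, dtype=float)
    diffs = np.abs(x[:, None] - x[None, :])
    return diffs.mean() / (2.0 * x.mean())

balanced = gini([0.85, 0.84, 0.86, 0.85])    # near-equal class accuracies
imbalanced = gini([0.95, 0.90, 0.40, 0.15])  # long-tailed class accuracies
```

A post-hoc mitigation method can then minimize this score directly, without retraining the underlying model.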
Results
The experimental results indicate that the Gini-based bias mitigation method significantly reduces both relative and absolute accuracy imbalances in classification tasks, leading to improved performance for minority classes and a decrease in the dominance of majority classes.
Implications
The findings suggest that leveraging the Gini Index can enhance fairness in classification tasks, particularly in critical applications where minority class performance is crucial, such as medical diagnosis and fraud detection. This approach offers a cost-effective alternative to traditional data rebalancing methods.
Simplex-to-Euclidean Bijection for Conjugate and Calibrated Multiclass Gaussian Process
Theory
Efficient ML
- Introduces a conjugate and calibrated GP model for multi-class classification.
- Utilizes Aitchison geometry to map class probabilities from simplex to Euclidean space.
- Reduces the dimensionality of latent variables from K to D = K − 1 for improved efficiency.
- Compatible with sparse GP regression techniques for scalability.
Simplex-to-Euclidean Bijection for Conjugate and Calibrated Multiclass Gaussian Process
Summary
This paper presents a novel conjugate and calibrated Gaussian process (GP) model for multi-class classification by leveraging the geometry of the probability simplex. The authors utilize Aitchison geometry to transform simplex-valued class probabilities into an unconstrained Euclidean representation, thereby reformulating the classification task as a GP regression problem with reduced latent dimensions. This approach allows for conjugate inference and reliable predictive probabilities without the need for distributional approximations. The model is compatible with standard sparse GP regression techniques, facilitating scalable inference for larger datasets. Empirical evaluations demonstrate that the proposed method achieves well-calibrated and competitive performance across both synthetic and real-world datasets, outperforming existing methods in terms of calibration while maintaining a similar level of predictive accuracy.
Methodology
The authors employ the simplex-to-Euclidean bijection to map class probabilities to an unconstrained Euclidean space, allowing the use of Gaussian pseudo-observations in GP regression. This transformation enables exact posterior inference and learning through standard GP marginal likelihood, avoiding approximations typically required in multi-class GP classifiers. The model is trained using Gaussian pseudo-observations derived from class labels, leveraging the isometric log-ratio bijection for defining the GP prior.
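The isometric log-ratio (ilr) bijection at the heart of this construction can be sketched with a Helmert-type orthonormal basis (a standard choice in Aitchison geometry; the authors' specific basis is not given here): a K-class probability vector maps to an unconstrained point in R^{K−1} and back without loss.

```python
import numpy as np

def helmert_basis(k):
    # Orthonormal basis (rows) of the (k-1)-dim zero-sum hyperplane in R^k
    H = np.zeros((k - 1, k))
    for i in range(1, k):
        H[i - 1, :i] = 1.0 / np.sqrt(i * (i + 1))
        H[i - 1, i] = -i / np.sqrt(i * (i + 1))
    return H

def ilr(p):
    # simplex -> R^{k-1}: centered log-ratio, then project onto the basis
    clr = np.log(p) - np.mean(np.log(p))
    return helmert_basis(len(p)) @ clr

def ilr_inv(z):
    # R^{k-1} -> simplex: lift back to the hyperplane, exponentiate, renormalize
    clr = helmert_basis(len(z) + 1).T @ z
    x = np.exp(clr)
    return x / x.sum()
```

In the Euclidean coordinates the classification task becomes ordinary GP regression with Gaussian pseudo-observations, which is what makes conjugate inference possible.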
Results
The proposed model demonstrates significant improvements in calibration compared to existing GP classifiers, with empirical results indicating that it consistently ranks among the best methods for multi-class classification. While the performance improvement over the strong baseline of Dirichlet-based GP classification is modest, the model's ability to maintain conjugacy without distributional approximations is a notable advancement.
Implications
This work has potential implications for various applications in multi-class classification tasks, particularly in fields where reliable probability calibration is crucial, such as medical diagnosis, risk assessment, and any domain requiring uncertainty quantification in predictions. The model's scalability also opens avenues for its application in larger datasets.
On the (Generative) Linear Sketching Problem
Generative Models
Theory
Efficient ML
- Identifies orthogonal information loss as a key challenge in linear sketching.
- Introduces FLORE, a generative sketching framework that enables high-quality recovery.
- FLORE can be trained without ground-truth data, enhancing its applicability.
- Demonstrates significant performance improvements over existing sketching methods.
On the (Generative) Linear Sketching Problem
Summary
This paper addresses the challenges associated with linear sketching techniques in data streaming scenarios, where the goal is to create compact summaries of continuously arriving data while maintaining accuracy and efficiency. The authors identify a fundamental issue in existing sketching methods: orthogonal information loss, which hinders the ability to recover the original data accurately. To tackle this problem, the paper explores the use of generative models (GMs) as a means to bridge the information gap. The authors propose FLORE, a novel generative sketching framework that leverages insights from their analysis to achieve high-quality recovery with minimal computational overhead. FLORE is unique in that it can be trained without access to ground-truth data, relying instead on aggregated counters. The framework is designed to support long-term operations in high-speed, high-volume data streams. Through comprehensive evaluations, FLORE demonstrates significant improvements over previous methods, achieving up to a 10³× reduction in error and a 10²× increase in processing speed compared to existing learning-based solutions.
Methodology
The authors dissect existing sketching techniques using matrix analysis based on compressive sensing theory, revealing the limitations of current methods. They then explore the application of generative models to enhance data sketching, specifically focusing on flow-based generative models (FGMs). FLORE is developed as an invertible solver that separates and generates lost information from orthogonal subspaces, trained using an Expectation Maximization algorithm with aggregated counters.
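Orthogonal information loss itself is easy to demonstrate with a generic compressive-sensing sketch (an illustration of the failure mode, not FLORE's code): linear recovery from a fat sketching matrix returns only the row-space component of the input, and the null-space component is gone.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 10))   # fat sketching matrix: R^10 -> R^4
x = rng.standard_normal(10)        # original data vector
y = A @ x                          # compact sketch

# Pseudoinverse recovery yields only the projection of x onto the row
# space of A; the orthogonal (null-space) component is unrecoverable
# from y alone -- the information gap a generative model must fill.
x_hat = np.linalg.pinv(A) @ y
P_row = np.linalg.pinv(A) @ A      # projector onto the row space of A
```

FLORE's contribution, per the summary, is using a flow-based generative model to synthesize that lost orthogonal component.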
Results
FLORE achieves high-quality recovery and efficient summarization, outperforming previous methods by up to 10³× in error reduction and 10²× in processing speed. The framework is shown to be scalable and effective in various applications, maintaining fidelity while operating under a compact memory budget.
Implications
The findings suggest that generative models can significantly enhance the performance of sketching techniques in data streaming applications. FLORE's ability to operate without ground-truth data opens new avenues for real-time data analysis in environments where such data is unavailable. This could have implications for fields such as network monitoring, real-time analytics, and large-scale data processing.
In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks
Interpretability
- Introduces in-context symbolic regression methods for improved operator extraction in KANs.
- GSR and GMP methods enhance stability and robustness in symbolic regression.
- GSR achieves up to 99.8% reduction in median OFAT test MSE.
- GMP integrates operator selection into the training process, reducing computational costs.
In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks
Summary
This paper addresses the challenges of symbolic regression in the context of Kolmogorov-Arnold Networks (KANs), which are designed to produce interpretable analytical expressions from data. The authors identify that traditional methods for symbolic extraction, such as AutoSym, suffer from instability and error propagation due to their isolated per-edge fitting approach. To overcome these limitations, they propose two new methods: Greedy in-context Symbolic Regression (GSR) and Gated Matching Pursuit (GMP). GSR evaluates candidate operators in the context of the entire network, leading to more stable operator selection based on end-to-end loss improvement. GMP enhances this by integrating operator selection into the training process through a differentiable gated mechanism. The authors evaluate these methods using hyper-parameter robustness tests on the SRBench benchmark, demonstrating that GSR achieves a significant reduction in median one-factor-at-a-time (OFAT) test mean squared error (MSE) compared to traditional methods. This work contributes to the field of explainable AI by providing a more reliable framework for symbolic regression in scientific machine learning applications.
Methodology
The authors propose two methods: Greedy in-context Symbolic Regression (GSR), which selects operators based on end-to-end loss improvement after fine-tuning, and Gated Matching Pursuit (GMP), which uses a differentiable gating mechanism to learn operator selection during training. Both methods are evaluated through hyper-parameter sweeps on the SRBench benchmark.
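The greedy in-context idea can be shown in miniature (a toy with assumed operator names, not the GSR implementation): each candidate operator is scored by the end-to-end loss of the whole model after refitting the rest, instead of by an isolated per-edge fit.

```python
import numpy as np

# Toy greedy in-context selection: pick the operator for one edge by the
# *end-to-end* loss of the full (here: one-edge-plus-linear) model.
candidates = {"sin": np.sin, "square": np.square, "exp": np.exp}

x = np.linspace(-2, 2, 200)
target = np.sin(x) + 0.5           # ground truth uses sin

def end_to_end_mse(op):
    feats = op(x)
    # refit the remaining linear part a*feats + b, then score end-to-end
    A = np.stack([feats, np.ones_like(feats)], axis=1)
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.mean((A @ coef - target) ** 2)

best = min(candidates, key=lambda k: end_to_end_mse(candidates[k]))
```

Because the score includes the downstream refit, a locally plausible but globally wrong operator cannot win, which is the stability property GSR targets.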
Results
The experiments show that GSR significantly reduces median OFAT test MSE by up to 99.8% compared to traditional symbolic extraction methods. GMP further enhances the process by allowing operator selection to be part of the training optimization, leading to more consistent results across different hyper-parameter settings.
Implications
The findings suggest that the proposed methods can improve the reliability and interpretability of models in scientific machine learning, making them more suitable for applications requiring explainable AI. This could lead to better insights in various scientific domains where understanding functional relationships is crucial.
Bridging the High-Frequency Data Gap: A Millisecond-Resolution Network Dataset for Advancing Time Series Foundation Models
Time Series
- Introduction of a millisecond-resolution dataset for high-frequency time series data from 5G networks.
- Expansion of TSFM applicability to the wireless network domain.
- Demonstration of poor performance of existing TSFMs on high-frequency data in both zero-shot and fine-tuned scenarios.
- Highlighting the importance of high-frequency datasets for improving TSFM architectures and fine-tuning strategies.
Bridging the High-Frequency Data Gap: A Millisecond-Resolution Network Dataset for Advancing Time Series Foundation Models
Summary
This paper addresses the limitations of existing time series foundation models (TSFMs) that primarily rely on low-frequency datasets, which restrict their ability to effectively model high-frequency data. The authors introduce a novel dataset that captures millisecond-resolution data from an operational 5G wireless deployment, thereby expanding the scope of TSFMs to include high-frequency data. This dataset not only introduces a new domain—wireless networks—but also provides practical applications for short-term forecasting with prediction horizons ranging from 100 milliseconds to 9.6 seconds. The authors benchmark traditional machine learning models and TSFMs on predictive tasks using this dataset, revealing that most TSFM configurations perform inadequately in both zero-shot and fine-tuned settings. The findings emphasize the necessity of incorporating high-frequency datasets during pre-training to enhance the performance, generalization, and robustness of TSFMs in real-world applications.
Methodology
The authors developed a dataset from a 5G wireless deployment, capturing millisecond-resolution data. They benchmarked various traditional machine learning models and TSFMs on this dataset to evaluate their performance on predictive tasks, focusing on short-term forecasting capabilities.
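For short-horizon benchmarks like these, a persistence baseline is the usual sanity check (a generic illustration on synthetic data, not a method or result from the paper): predict that the next h samples equal the last observed value.

```python
import numpy as np

# Persistence ("naive") baseline for short-horizon forecasting on a toy
# high-frequency signal; any model worth deploying should beat this.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(0, 1, 1000))   # synthetic stand-in signal

h = 10                                       # forecast horizon in steps
history, future = series[:-h], series[-h:]
forecast = np.full(h, history[-1])
mae = float(np.mean(np.abs(forecast - future)))
```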
Results
The benchmarking results indicated that most TSFM configurations struggled to perform effectively on the new high-frequency dataset, both in zero-shot and fine-tuned settings. This highlights a significant gap in the current capabilities of TSFMs when applied to high-frequency data.
Implications
The introduction of this dataset and the findings from the benchmarking can lead to improved TSFM architectures and fine-tuning strategies, enhancing their applicability in real-world scenarios, particularly in domains requiring high-frequency data analysis such as telecommunications and finance.
Benchmarking Open-Source PPG Foundation Models for Biological Age Prediction
Time Series
- AI-PPG Age model fails to generalize across different clinical populations.
- Pulse-PPG outperforms AI-PPG Age in biological age prediction.
- Fusing PPG embeddings with demographic data enhances prediction accuracy.
- The PPG age gap correlates with cardiovascular risk factors.
Benchmarking Open-Source PPG Foundation Models for Biological Age Prediction
Summary
This paper investigates the performance of open-source photoplethysmography (PPG) foundation models in predicting biological age, particularly in clinical settings. The study highlights the limitations of a previously trained model (AI-PPG Age) on a large dataset, which fails to generalize to a different clinical population, resulting in a narrow prediction range. In contrast, a general-purpose foundation model, trained without age-specific objectives, outperforms the task-specific model. The research benchmarks three open-source models—Pulse-PPG, PaPaGei-S, and AI-PPG Age—using a dataset of 906 surgical patients. The results reveal that Pulse-PPG achieves a mean absolute error (MAE) of 9.28 years, outperforming AI-PPG Age in linear probe mode. The study also finds that fusing Pulse-PPG embeddings with demographic data significantly improves prediction accuracy (MAE = 8.22 years). Furthermore, the age-adjusted PPG age gap correlates with diastolic blood pressure, indicating that PPG morphology captures vascular aging information. The findings suggest that open-source PPG models can effectively encode biological aging information, making them valuable for clinical applications.
Methodology
The study benchmarks three open-source PPG models using frozen embeddings and Ridge regression on a dataset of 906 surgical patients. It employs 5-fold stratified cross-validation to evaluate model performance and compares results against traditional heart rate and heart rate variability features.
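The linear-probe setup can be sketched with synthetic stand-ins (illustrative data and a closed-form ridge solver, not the paper's embeddings or pipeline): frozen embeddings are fed to a ridge regressor and scored with 5-fold cross-validated MAE.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for frozen PPG embeddings (n patients x d dims)
# and chronological age; real embeddings would come from a frozen model.
n, d = 100, 16
X = rng.standard_normal((n, d))
age = X @ rng.standard_normal(d) + 50 + rng.normal(0, 2, n)

def ridge_fit(X, y, lam=1.0):
    # closed-form ridge regression with a bias column
    Xb = np.hstack([X, np.ones((len(X), 1))])
    k = Xb.shape[1]
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(k), Xb.T @ y)

def ridge_predict(X, w):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

# 5-fold cross-validated MAE of the frozen-embedding linear probe
idx = np.arange(n)
maes = []
for f in range(5):
    test = idx[f::5]
    train = np.setdiff1d(idx, test)
    w = ridge_fit(X[train], age[train])
    maes.append(np.mean(np.abs(ridge_predict(X[test], w) - age[test])))
mae = float(np.mean(maes))
```

Fusing demographic features, as the study does, amounts to concatenating extra columns onto X before the same probe.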
Results
Pulse-PPG achieved a mean absolute error (MAE) of 9.28 years, while AI-PPG Age in linear probe mode had an MAE of 9.72 years. The combination of Pulse-PPG embeddings with demographic features resulted in an improved MAE of 8.22 years. The age-adjusted PPG age gap showed a significant correlation with diastolic blood pressure (r = −0.188, p = 1.2 × 10⁻⁸).
Implications
The findings suggest that open-source PPG models can be effectively used for biological age prediction in clinical settings, potentially aiding in the assessment of cardiovascular health and aging-related conditions. This could enhance the accessibility and reproducibility of biological age assessment tools.
PhasorFlow: A Python Library for Unit Circle Based Computing
Theory
Optimization
Time Series
- Introduction of PhasorFlow, a library for unit circle based computing.
- Formalization of the Phasor Circuit model with a library of 22 gates.
- Development of Variational Phasor Circuits for classical machine learning tasks.
- Implementation of a DFT-based token mixing layer in the Phasor Transformer.
PhasorFlow: A Python Library for Unit Circle Based Computing
Summary
PhasorFlow is an open-source Python library that introduces a novel computational paradigm based on the S¹ unit circle. It encodes inputs as complex phasors and employs unitary wave interference gates to maintain global norm while allowing individual components to drift into complex space. The library presents three main contributions: the formalization of the Phasor Circuit model with a comprehensive 22-gate library, the introduction of Variational Phasor Circuits (VPC) for optimizing continuous phase parameters in classical machine learning tasks, and the development of the Phasor Transformer, which utilizes a DFT-based token mixing layer to replace traditional attention mechanisms. PhasorFlow is validated through various applications, including non-linear spatial classification, time-series prediction, financial volatility detection, and neuromorphic tasks. The results demonstrate that unit circle computing offers a deterministic, lightweight, and mathematically grounded alternative to classical neural networks and quantum circuits, functioning effectively on classical hardware while leveraging the principles of quantum mechanics.
Methodology
PhasorFlow employs a circuit-based programming model where users define circuits of phasor threads and apply gate operations. The library allows for analytical evaluation through direct matrix multiplication, ensuring deterministic outputs. It incorporates a variety of gate types for different computational needs and optimizes phase parameters through Variational Phasor Circuits.
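The core idea is easy to reproduce in plain numpy (a minimal sketch, not PhasorFlow's API): encode real features as unit phasors e^{iθ}, mix them with a unitary gate such as the normalized DFT, and observe that the global norm is preserved while individual components leave the unit circle.

```python
import numpy as np

# Encode phase features as phasors on the unit circle S^1
theta = np.array([0.0, np.pi / 4, np.pi / 2, np.pi])
state = np.exp(1j * theta)

# Unitary DFT "gate" (DFT matrix scaled by 1/sqrt(n)): norm-preserving
# wave interference, analogous to the library's token-mixing layer.
n = len(state)
F = np.fft.fft(np.eye(n)) / np.sqrt(n)
mixed = F @ state
```

The global 2-norm of `mixed` equals that of `state`, but the component magnitudes are no longer all 1, which is the "drift into complex space" the summary describes.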
Results
PhasorFlow was successfully validated on tasks such as non-linear spatial classification, time-series prediction, and financial volatility detection. The results indicate that unit circle computing can serve as a robust alternative to traditional neural networks, providing efficient and mathematically principled computation.
Implications
PhasorFlow has the potential to enhance computational efficiency in various domains, particularly those involving complex spatio-temporal dynamics, such as neuroscience, finance, and systems biology. Its ability to operate on classical hardware while utilizing principles from quantum mechanics opens new avenues for research and application in machine learning.
M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
NLP
Large Language Models
Efficient ML
- M2RNN introduces matrix-valued states and non-linear transitions for improved language modeling.
- The architecture overcomes limitations of traditional non-linear RNNs by expanding state size efficiently.
- Empirical results show significant performance gains in language modeling and state tracking.
- Hybrid models incorporating M2RNN layers demonstrate superior accuracy with reduced state sizes.
M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
Summary
The paper introduces the Matrix-to-Matrix RNN (M2RNN), a novel non-linear recurrent neural network architecture designed to enhance language modeling capabilities. Traditional transformer models, while highly parallel, are limited in their computational expressiveness, particularly for tasks requiring complex state tracking. M2RNN addresses these limitations by utilizing matrix-valued hidden states and non-linear state transitions, which significantly improve the model's ability to track states over long sequences. The authors demonstrate that the performance of non-linear RNNs is constrained by their state size, and they propose a mechanism for state size expansion that optimally utilizes tensor cores for efficient computation. Empirical results show that M2RNN achieves perfect state tracking generalization on unseen sequence lengths and outperforms existing models, such as Gated DeltaNet hybrids, by 0.4-0.5 perplexity points while using smaller state sizes. Additionally, integrating M2RNN layers into existing hybrid architectures yields notable accuracy improvements with minimal impact on training throughput. The findings suggest that non-linear RNNs, particularly M2RNN, are a promising avenue for developing efficient and scalable language models capable of handling complex tasks.
Methodology
The authors developed the M2RNN architecture, which employs matrix-valued hidden states and non-linear state transitions. They conducted empirical evaluations comparing M2RNN with existing models, particularly focusing on language modeling performance, state tracking capabilities, and efficiency in hybrid architectures. The methodology included experiments on various sequence lengths and tasks to assess generalization and performance metrics such as perplexity.
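The state-expansion idea can be caricatured in a few lines (a toy recurrence with assumed parameter shapes, not the paper's update rule): holding a d × d matrix as hidden state gives d² state entries, updated through a non-linear transition built from dense matrix products that map well onto tensor cores.

```python
import numpy as np

# Toy matrix-valued recurrence (illustrative only): hidden state H is a
# d x d matrix; the input vector is lifted to a rank-1 matrix update and
# mixed through a non-linear transition.
d = 8
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d)) * 0.1
B = rng.standard_normal((d, d)) * 0.1
Wx = rng.standard_normal((d, d)) * 0.1

def step(H, x):
    return np.tanh(A @ H @ B + Wx @ np.outer(x, x))

H = np.zeros((d, d))
for _ in range(5):
    H = step(H, rng.standard_normal(d))
```

The non-linearity is what separates this family from linear state-space models and enables the state-tracking behavior the paper measures.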
Results
M2RNN achieved perfect state tracking generalization at sequence lengths not seen during training. In hybrid settings, it outperformed Gated DeltaNet hybrids by 0.4-0.5 perplexity points while using 3× smaller state sizes. Additionally, models with a single M2RNN layer showed comparable accuracy gains to full Hybrid M2RNN architectures, particularly excelling in long-context generalization tasks.
Implications
The findings suggest that M2RNN can serve as a foundational building block for future scalable language models, particularly in applications requiring complex state tracking and long-context retrieval. This could lead to advancements in various NLP tasks, including code execution, entity tracking, and more efficient model architectures.
Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding
NLP
Large Language Models
Theory
- Introduces CoQE architecture to separate context and sample encoding.
- Demonstrates that dual representation spaces can alleviate ICL-IWL conflict.
- Provides theoretical and empirical validation of the proposed method.
- Shows improved performance in both in-distribution and out-of-distribution scenarios.
Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding
Summary
This paper addresses the conflict between in-context learning (ICL) and in-weight learning (IWL) in Transformer models, which has been shown to degrade performance when demonstration examples deviate from training distributions. The authors propose a novel architecture, CoQE, that separates the encoding of context and samples into distinct representation spaces: a task representation space and a sample representation space. By modeling these spaces as dual spaces under a linear representational structure, the authors provide both theoretical and empirical evidence that this separation alleviates the inherent conflict between ICL and IWL. The proposed architecture enhances ICL performance and reconciles both learning capabilities, demonstrating effectiveness in synthetic few-shot classification tasks and a newly designed pseudo-arithmetic task. The findings suggest that distinct representation spaces can improve generalization and task adaptability in Transformer models.
Methodology
The authors modified the Transformer architecture to create two distinct encoding pathways for context and samples, modeling them as dual representation spaces. They employed a linear representation hypothesis and utilized the Riesz representation theorem to compute the model output, facilitating the integration of both learning capabilities.
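The dual-space pairing can be sketched abstractly (illustrative linear encoders, not the CoQE implementation): the context is encoded into a task space, the query sample into a sample space, and by the Riesz representation theorem the task's linear functional acts on the sample as an inner product of the two encodings.

```python
import numpy as np

# Minimal dual-representation sketch: separate encoders for context and
# sample, with the model output given by their Riesz pairing.
rng = np.random.default_rng(0)
W_task = rng.standard_normal((16, 8))     # context -> task (dual) space
W_sample = rng.standard_normal((16, 8))   # sample  -> sample space

context = rng.standard_normal(8)
sample = rng.standard_normal(8)

task_repr = W_task @ context
sample_repr = W_sample @ sample
output = float(task_repr @ sample_repr)   # inner-product (Riesz) pairing

# A different context induces a different functional, hence a different
# output for the same sample -- the in-context pathway.
other_task = W_task @ rng.standard_normal(8)
other_output = float(other_task @ sample_repr)
```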
Results
The CoQE architecture achieved lower ICL error rates in both in-distribution and out-of-distribution settings. It successfully reconciled ICL and IWL capabilities across synthetic few-shot classification tasks and the newly designed pseudo-arithmetic task, demonstrating significant improvements in performance.
Implications
The findings suggest that separating representation spaces can lead to better generalization and adaptability in Transformer models, potentially influencing future research on model architectures and learning mechanisms in various applications, including NLP and beyond.
Bootstrapped Physically-Primed Neural Networks for Robust T2 Distribution Estimation in Low-SNR Pancreatic MRI
Theory
- Introduces a bootstrap-based inference strategy for robust T2 distribution estimation.
- Transforms deterministic T2 relaxometry networks into probabilistic ensemble predictors.
- Demonstrates superior performance in low-SNR conditions compared to traditional methods.
- Achieves significant improvements in clinical differentiation tasks, particularly for T1DM.
Bootstrapped Physically-Primed Neural Networks for Robust T2 Distribution Estimation in Low-SNR Pancreatic MRI
Summary
This paper addresses the challenge of estimating multi-component T2 relaxation distributions from Multi-Echo Spin Echo (MESE) MRI in the context of low Signal-to-Noise Ratio (SNR) conditions, particularly in pancreatic imaging. Traditional methods, such as regularized non-negative least squares (NNLS), struggle with noise and instability, leading to unreliable estimates. The authors propose a novel bootstrap-based inference framework that enhances the robustness of T2 distribution estimation by treating the echo acquisition as a distribution rather than a fixed input. This method involves stochastic resampling of echo subsets and aggregating predictions to reduce variance and improve the fidelity of estimates. The approach builds on the Physically Primed T2 (P2T2) architecture, transforming deterministic models into probabilistic ensemble predictors. Clinical evaluations demonstrate the method's effectiveness in distinguishing between T1 Diabetes Mellitus (T1DM) and healthy subjects, achieving lower Wasserstein distances and better sensitivity to physiological changes compared to classical NNLS and non-bootstrapped deep learning methods. The results highlight the potential of inference-time bootstrapping to enhance quantitative T2 relaxometry in low-SNR abdominal imaging, paving the way for improved non-invasive pancreatic assessments.
Methodology
The proposed method employs a bootstrap-based inference framework that involves stochastic resampling of echo subsets from the MESE data. By aggregating predictions from these subsets, the method reduces variance and enhances robustness against noise and stochastic errors. This approach is integrated with the P2T2 architecture, which encodes echo-time schedules into the network, allowing for generalization across different MESE protocols.
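The inference-time bootstrap itself reduces to resample-predict-aggregate (a toy with a trivial stand-in predictor, not the P2T2 network): each resampled echo subset yields one prediction, and the ensemble mean and spread give the estimate and its uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a noisy echo train; the "predictor" here is just the
# mean of whichever echoes it sees, in place of the trained network.
echoes = 1.0 + rng.normal(0, 0.3, size=32)

def predict(subset):
    return subset.mean()

# Inference-time bootstrap: resample echo subsets, aggregate predictions
B = 200
preds = [predict(rng.choice(echoes, size=24, replace=False)) for _ in range(B)]
estimate = float(np.mean(preds))
spread = float(np.std(preds))    # ensemble spread as an uncertainty proxy
```

Averaging over subsets is what damps the variance from noisy individual echoes, which is the mechanism the paper credits for its low-SNR robustness.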
Results
In experimental evaluations, the bootstrap-enhanced method achieved the lowest Wasserstein distances in a test-retest reproducibility study and demonstrated superior sensitivity in distinguishing between T1DM and healthy subjects. The results indicate a significant improvement in the stability and discriminative power of T2 relaxometry estimates in low-SNR abdominal imaging.
Implications
The findings suggest that the bootstrap-based approach could lead to more reliable non-invasive imaging biomarkers for early detection of pancreatic conditions, including type 1 diabetes and characterization of pancreatic lesions, thereby improving clinical outcomes and reducing the need for invasive procedures.
OMNIFLOW: A Physics-Grounded Multimodal Agent for Generalized Scientific Reasoning
Multimodal
Large Language Models
Interpretability
- OMNIFLOW is the first training-free framework for generalized fluid physical reasoning using LLMs.
- It introduces a Semantic-Symbolic Alignment mechanism for better understanding of physical structures.
- The Physics-Guided Chain-of-Thought workflow ensures adherence to physical laws during reasoning.
- Empirical results show superior performance in zero-shot and few-shot tasks compared to traditional models.
Read more
OMNIFLOW: A Physics-Grounded Multimodal Agent for Generalized Scientific Reasoning
Summary
The paper introduces OMNIFLOW, a novel neuro-symbolic architecture aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) in the context of scientific problems governed by Partial Differential Equations (PDEs). Traditional LLMs often struggle with continuous spatiotemporal dynamics, leading to non-physical outputs. OMNIFLOW addresses this by grounding LLMs in fundamental physical laws without requiring domain-specific fine-tuning. The architecture employs a Semantic-Symbolic Alignment mechanism that translates high-dimensional flow tensors into topological linguistic descriptors, allowing the model to understand physical structures. Additionally, a Physics-Guided Chain-of-Thought (PG-CoT) workflow is implemented, which incorporates dynamic constraint injection and iterative verification to ensure physical consistency. Evaluations on benchmarks related to microscopic turbulence, Navier-Stokes equations, and global weather forecasting demonstrate that OMNIFLOW significantly outperforms traditional deep learning models in zero-shot generalization and few-shot adaptation tasks, while also providing interpretable reasoning outputs. This marks a shift from black-box models to transparent scientific reasoning, enhancing the interpretability and applicability of LLMs in scientific domains.
Methodology
OMNIFLOW utilizes a neuro-symbolic architecture that integrates a Visual Symbolic Projector to convert raw flow data into semantic tokens, and a Physics-Guided Chain-of-Thought (PG-CoT) mechanism that incorporates dynamic physical constraints and iterative verification to maintain physical consistency in reasoning.
Results
The empirical evaluations indicate that OMNIFLOW achieves prediction accuracy comparable to specialized deep learning models across various benchmarks, demonstrating significant improvements in zero-shot generalization and few-shot adaptation tasks, while also producing interpretable reasoning reports.
Implications
The development of OMNIFLOW has the potential to transform scientific reasoning and decision-making processes by providing interpretable, physically grounded outputs, thus facilitating advancements in fields such as fluid dynamics, meteorology, and other areas governed by complex physical laws.
Physics-integrated neural differentiable modeling for immersed boundary systems
Theory
Efficient ML
Optimization
- Introduces a physics-integrated framework for long-horizon prediction of immersed boundary flows.
- Replaces traditional pressure projection with a learned implicit correction to reduce computational costs.
- Employs a sub-iteration strategy to enhance stability during coarse-grid rollouts.
- Achieves significant improvements in flow-field fidelity and long-horizon stability over existing models.
Read more
Physics-integrated neural differentiable modeling for immersed boundary systems
Summary
This paper presents a novel physics-integrated differentiable modeling framework aimed at improving the long-horizon prediction of immersed boundary flows, which is crucial for accurately simulating complex fluid dynamics near solid boundaries. Traditional numerical solvers often require fine grids and small time steps, leading to high computational costs, while purely data-driven models struggle with error accumulation and lack robustness in extrapolative scenarios. The proposed framework enhances existing neural PDE solvers by integrating physical principles into an end-to-end differentiable architecture. It features a PDE-based intermediate velocity module and a multi-direct forcing immersed boundary module, both designed to comply with the pressure-projection procedure for incompressible flow. A significant innovation is the replacement of the computationally intensive pressure projection step with a learned implicit correction using ConvResNet blocks, which reduces computational costs. Additionally, a sub-iteration strategy is introduced to decouple the stability requirements of the physics module from the surrogate model's time step, enabling stable coarse-grid autoregressive rollouts with larger time increments. The framework is trained using single-step supervision, significantly cutting down training time to under one hour on a single GPU. Evaluations on benchmark cases demonstrate that the proposed model consistently outperforms purely data-driven and physics-loss-constrained models, achieving an approximately 200-fold speedup in inference compared to high-resolution solvers.
Methodology
The methodology involves developing a differentiable architecture that integrates physical principles into the modeling of fluid flows. It includes a PDE-based intermediate velocity module and a multi-direct forcing immersed boundary module, utilizing ConvResNet blocks for learned implicit corrections and a sub-iteration strategy for stability.
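The sub-iteration idea can be illustrated generically: run the stability-limited physics update several times with a smaller internal step inside each large surrogate step, then apply a learned correction in place of the pressure projection. The sketch below is a toy version under those assumptions; `physics_step` and `correction` are hypothetical callables, not the paper's modules.

```python
import numpy as np

def rollout(u0, physics_step, correction, dt, n_steps, n_sub=4):
    """Coarse autoregressive rollout with sub-iterated physics steps.

    Each large step of size dt is split into n_sub internal physics
    updates of size dt / n_sub, decoupling the physics module's
    stability limit from the surrogate time step. The learned
    correction (any callable here) then plays the role the ConvResNet
    blocks play in the paper.
    """
    u = np.asarray(u0, dtype=float)
    traj = [u]
    for _ in range(n_steps):
        for _ in range(n_sub):          # stability-limited inner loop
            u = physics_step(u, dt / n_sub)
        u = correction(u)               # learned implicit correction
        traj.append(u)
    return np.stack(traj)
```

With a stiff decay rate, `dt` alone may violate the explicit-stability bound while `dt / n_sub` does not, which is exactly the decoupling the paper exploits.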
Results
The proposed model was evaluated on benchmark cases, showing superior performance in flow-field fidelity and long-horizon stability compared to purely data-driven and physics-loss-constrained models. It also achieved approximately 200-fold speedup in inference time relative to high-resolution solvers.
Implications
The framework has significant implications for engineering applications requiring efficient and accurate simulations of fluid dynamics, particularly in scenarios involving complex and moving boundaries. It can facilitate faster design iterations and optimizations in fluid control and related fields.
Generative Inverse Design with Abstention via Diagonal Flow Matching
Generative Models
Optimization
Theory
- Introduction of Diagonal Flow Matching (Diag–CFM) to improve stability and accuracy in generative inverse design.
- Development of two novel uncertainty metrics, Zero-Deviation and Self-Consistency, for assessing design reliability.
- Demonstrated significant improvements in round-trip accuracy over existing methods across various design dimensions.
- Practical capabilities include candidate selection, abstention from unreliable predictions, and out-of-distribution detection.
Read more
Generative Inverse Design with Abstention via Diagonal Flow Matching
Summary
This paper addresses the challenges in inverse design, where the goal is to generate design parameters that achieve specified performance targets. Traditional methods often struggle with the sensitivity to the ordering and scaling of design parameters and performance labels. The authors propose a novel approach called Diagonal Flow Matching (Diag–CFM), which utilizes a zero-anchoring strategy to pair design coordinates with noise and labels with zero, ensuring that the learning process is invariant to coordinate permutations. This innovation leads to significant improvements in round-trip accuracy compared to standard Conditional Flow Matching (CFM) and other baseline methods. Additionally, the authors introduce two architecture-specific uncertainty metrics, Zero-Deviation and Self-Consistency, which enhance the model's ability to assess the reliability of generated designs. These metrics enable practical capabilities such as selecting the best design candidates, abstaining from unreliable predictions, and detecting out-of-distribution targets. The effectiveness of the proposed methods is validated through comprehensive evaluations on three design tasks: airfoil aerodynamics, gas turbine combustor design, and a scalable multi-objective benchmark.
Methodology
The authors developed an invertible conditional flow matching architecture that operates bidirectionally, allowing for the generation of designs based on performance labels and vice versa. The Diag–CFM approach employs a zero-anchoring strategy to mitigate sensitivity issues related to coordinate ordering. Uncertainty quantification is achieved through the introduction of two metrics that leverage the bidirectional structure of the model.
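Under a plausible reading of the zero-anchoring strategy, one flow-matching training pair might be built as follows. This is a speculative sketch, not the authors' formulation: the pairing of design coordinates with noise and labels with zero is taken from the summary, while the linear interpolant and velocity target follow standard conditional flow matching.

```python
import numpy as np

def diag_cfm_pair(design, labels, rng=None):
    """Build one flow-matching training pair with zero-anchoring.

    Design coordinates are anchored to Gaussian noise and performance
    labels to zero, so the source endpoint x0 is agnostic to how the
    coordinates are ordered. The regression target is the straight-line
    velocity x1 - x0, as in standard conditional flow matching.
    """
    rng = np.random.default_rng(rng)
    x1 = np.concatenate([design, labels])            # data endpoint
    x0 = np.concatenate([rng.standard_normal(len(design)),
                         np.zeros(len(labels))])     # zero-anchored endpoint
    t = rng.uniform()
    xt = (1.0 - t) * x0 + t * x1                     # linear interpolant
    target_velocity = x1 - x0                        # regression target
    return t, xt, target_velocity
```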
Results
The proposed Diag–CFM method achieved order-of-magnitude improvements in round-trip accuracy compared to standard CFM and invertible neural network baselines across design dimensions up to P=100. The uncertainty metrics effectively enabled the selection of reliable design candidates and demonstrated superior performance over ensemble and general-purpose alternatives.
Implications
The advancements presented in this paper have significant implications for engineering design processes, particularly in fields such as aerodynamics and materials science, where efficient exploration of high-dimensional design spaces is crucial. The ability to generate diverse and reliable design candidates can enhance decision-making and innovation in these domains.
Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits
Large Language Models
Interpretability
- Introduces SAE-decoded probe steering for behavioral intervention in a 35B MoE model.
- Finds that all five behavioral traits primarily modulate a single agency axis.
- Demonstrates a dissociation between correlation and causal efficacy in behavioral steering.
- Establishes that behavioral commitments are computed during the prefill phase, not during autoregressive decoding.
Read more
Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits
Summary
This paper presents a novel approach to behavioral steering in a 35-billion-parameter Mixture-of-Experts (MoE) language model, Qwen 3.5-35B-A3B, using Sparse Autoencoders (SAEs) to identify and manipulate agentic behavioral traits. The author trains nine SAEs on the model's residual stream and employs linear probes on the latent activations to derive continuous steering vectors. This method circumvents the SAE's top-k discretization, allowing for precise behavioral interventions during inference without the need for retraining. The study evaluates the effectiveness of these steering vectors across 1,800 agent rollouts, revealing that autonomy steering significantly enhances the model's proactive behavior, reducing reliance on user assistance. Notably, the analysis uncovers that the five targeted traits predominantly influence a single agency axis, with specific effects manifesting only in tool-use dynamics. The findings also highlight a critical distinction between correlation and causation in behavioral steering, demonstrating that high probe accuracy does not guarantee effective steering. Additionally, the research shows that behavioral commitments are established during the prefill phase of the model's operation, rather than during autoregressive decoding, providing insights into the underlying mechanisms of behavioral decision-making in complex architectures.
Methodology
The methodology involves training nine sparse autoencoders on the residual stream of the Qwen 3.5-35B-A3B model. Linear probes are trained on the latent activations of these SAEs to predict agentic traits, and the probe weights are projected back through the SAE decoder to generate continuous steering vectors in the model's activation space. The effectiveness of these vectors is evaluated through 1,800 agent rollouts across various scenarios.
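The decode-and-steer step can be sketched as follows, assuming a linear SAE decoder matrix; shapes and the unit-norm convention are illustrative choices, not details from the paper.

```python
import numpy as np

def steering_vector(decoder, probe_w):
    """Decode a linear probe's weights into a residual-stream direction.

    decoder: (d_model, d_latent) SAE decoder matrix.
    probe_w: (d_latent,) weights of a probe trained on SAE latents.
    Projecting the probe back through the decoder yields a continuous
    direction in activation space, bypassing top-k discretization.
    """
    v = decoder @ probe_w
    return v / np.linalg.norm(v)

def steer(hidden, v, alpha=2.0):
    """Add the scaled steering direction to residual activations."""
    return hidden + alpha * v
```

Applying `steer` at inference requires no retraining, which is the practical point of the approach.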
Results
The results indicate that autonomy steering at a multiplier of α = 2 achieves a significant behavioral shift, with Cohen's d = 1.01 (p < 0.0001), moving the model away from requesting user assistance (which occurred in 78% of rollouts) toward proactive action. The analysis reveals that the five traits primarily influence a single agency axis, with distinct causal effects observed for different traits, particularly highlighting the orthogonality of risk calibration and tool-use eagerness despite similar probe accuracies.
Implications
The findings suggest that behavioral steering can be effectively implemented in large-scale language models, enhancing their autonomy and decision-making capabilities. This research has implications for the development of more interactive and responsive AI systems, particularly in applications requiring nuanced behavioral adjustments.
Laya: A LeJEPA Approach to EEG via Latent Prediction over Reconstruction
Time Series
- Laya is the first EEG foundation model utilizing the LeJEPA framework.
- The model focuses on latent prediction instead of signal reconstruction to improve representation learning.
- Laya shows improved performance over traditional reconstruction-based EEG models.
- The use of explicit geometric regularization helps prevent representation collapse.
Read more
Laya: A LeJEPA Approach to EEG via Latent Prediction over Reconstruction
Summary
The paper introduces Laya, the first EEG foundation model based on the LeJEPA framework, which utilizes latent prediction rather than signal reconstruction as its self-supervised learning (SSL) objective. The authors argue that traditional reconstruction-based methods lead to representations biased towards high-variance artifacts, limiting their effectiveness in practical applications. By employing Joint Embedding Predictive Architectures (JEPA) with explicit geometric regularization, Laya aims to learn more transferable and semantically rich EEG representations. The study demonstrates that Laya outperforms existing reconstruction-based models across various EEG benchmarks, indicating that latent predictive objectives can enhance the learning of high-level EEG features. This approach addresses the challenges of generalization and data efficiency in EEG analysis, suggesting a promising direction for future research in EEG representation learning.
Methodology
The authors developed Laya by adapting the LeJEPA framework, which employs Joint Embedding Predictive Architectures (JEPA) to predict latent representations. This method emphasizes learning from predictable structures rather than reconstructing raw EEG signals. The model incorporates explicit geometric regularization to maintain well-conditioned and isotropic representations, thus avoiding common pitfalls associated with representation collapse.
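A minimal version of a latent-prediction objective with a geometric regularizer might look like the following. This is an illustrative sketch: the covariance-to-identity penalty is a crude stand-in for LeJEPA's isotropy constraint, not the framework's actual regularizer.

```python
import numpy as np

def jepa_loss(pred_latent, target_latent, reg_weight=0.1):
    """Latent-prediction objective with a simple geometric regularizer.

    pred_latent:   (batch, d) predictor outputs from the context branch.
    target_latent: (batch, d) target-encoder embeddings, treated as
                   fixed (no gradient flows through them in training).
    The regularizer pushes the batch covariance toward the identity,
    discouraging representation collapse.
    """
    pred_err = np.mean((pred_latent - target_latent) ** 2)
    z = pred_latent - pred_latent.mean(axis=0, keepdims=True)
    cov = (z.T @ z) / max(len(z) - 1, 1)
    iso_err = np.mean((cov - np.eye(cov.shape[0])) ** 2)
    return pred_err + reg_weight * iso_err
```

The key contrast with reconstruction-based pretraining is that the target lives in latent space, so high-variance raw-signal artifacts never enter the loss.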
Results
Laya demonstrated superior performance in linear probing tasks compared to existing reconstruction-based EEG models across multiple benchmarks. The results suggest that the latent prediction approach leads to more robust and transferable EEG representations, addressing previous limitations in generalization and data efficiency.
Implications
The findings indicate that latent prediction objectives could significantly enhance the development of EEG analysis tools, improving their applicability in clinical neuroscience and brain-computer interfaces. This approach may lead to better generalization across tasks and subjects, ultimately facilitating the discovery of interpretable biomarkers and advancing the field of EEG research.
Grid-World Representations in Transformers Reflect Predictive Geometry
NLP
Large Language Models
Theory
- Transformers develop internal representations that reflect the geometry of the latent world.
- Optimal prediction in constrained random walks is based on a sufficient vector determined by position and time.
- The study shows strong alignment between learned representations and ground-truth predictive vectors.
- Low-dimensional representations emerge from training on stochastic processes.
Read more
Grid-World Representations in Transformers Reflect Predictive Geometry
Summary
This paper investigates how next-token predictors, specifically transformer models, develop internal representations that reflect the geometry of the latent world and its rules. The authors utilize a controlled setting involving constrained random walks on a two-dimensional lattice, where the objective is to reach a fixed endpoint after a predetermined number of steps. The study reveals that optimal prediction in this stochastic process relies on a sufficient vector defined by the walker's position relative to the target and the remaining time. By training decoder-only transformers on prefixes sampled from the exact distribution of these walks, the authors compare the hidden activations of the models to analytically derived sufficient vectors. The findings indicate a strong alignment between the learned representations and the ground-truth predictive vectors, often resulting in low-dimensional representations. This work provides a concrete example of how world-model-like representations can be traced back to the predictive geometry of the data, suggesting that geometric representations may be crucial for understanding how neural networks internalize grammatical and structural constraints.
Methodology
The authors employed a minimal stochastic process involving constrained random walks on a two-dimensional lattice. They computed the exact joint probability distributions of sequences for various grid walkers and trained decoder-only transformers on samples from these distributions. The hidden activations of the models were then compared to analytically derived sufficient vectors to assess alignment and dimensionality.
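The "sufficient vector" claim can be made concrete with exact path counting: conditioned on hitting the target after exactly m more steps, the next-step distribution depends only on the displacement to the target and m. A small sketch (standard lattice combinatorics, not the authors' code):

```python
from functools import lru_cache

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

@lru_cache(maxsize=None)
def n_paths(dx, dy, m):
    """Number of length-m lattice walks with net displacement (dx, dy)."""
    if m == 0:
        return 1 if (dx, dy) == (0, 0) else 0
    return sum(n_paths(dx - ax, dy - ay, m - 1) for ax, ay in MOVES)

def step_probs(dx, dy, m):
    """Exact next-step distribution of a walk constrained to reach the
    target after exactly m more steps, given displacement (dx, dy) to
    the target. Each step's probability is proportional to the number
    of completions it leaves available."""
    counts = [n_paths(dx - ax, dy - ay, m - 1) for ax, ay in MOVES]
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 4
```

Since `step_probs` consumes only `(dx, dy, m)`, any optimal next-token predictor needs (some encoding of) exactly this vector, which is what the probing analysis looks for in the transformer's activations.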
Results
The results demonstrated that the learned representations from the transformer models closely aligned with the ground-truth predictive vectors derived from the stochastic process. The representations were often low-dimensional, suggesting that the models effectively captured the underlying geometric structure of the data.
Implications
This research implies that understanding the geometric representations in neural networks could enhance interpretability and provide insights into how models internalize grammatical structures. It may also inform the design of more effective models for natural language processing by leveraging geometric insights.
Understanding Quantization of Optimizer States in LLM Pre-training: Dynamics of State Staleness and Effectiveness of State Resets
Optimization
Large Language Models
Efficient ML
- Quantization of optimizer states can lead to state-update stalling, reducing responsiveness.
- A predictive model for stalling helps understand when optimizer-state resets are beneficial.
- Resetting stale optimizer states can recover performance in low-precision settings.
- The timing of resets is crucial; applying them too early can discard useful averaging.
Read more
Understanding Quantization of Optimizer States in LLM Pre-training: Dynamics of State Staleness and Effectiveness of State Resets
Summary
This paper investigates the effects of quantizing optimizer states during the pre-training of large language models (LLMs), focusing on low-precision exponential moving average (EMA) states. The authors identify a phenomenon called state-update stalling, where nominal updates round back to the same stored value due to quantization, leading to stale states that hinder adaptation. They propose a predictive model to estimate stalling probabilities and characterize the responsiveness of quantized EMAs over time. The study reveals that optimizer-state resets can restore responsiveness when states become stale, emphasizing the importance of timing for these resets. Through controlled simulations and LLM pre-training experiments, the authors demonstrate that appropriate reset schedules can recover performance lost due to low-precision storage while significantly reducing memory usage.
Methodology
The authors developed a predictive model to analyze the dynamics of quantized EMA states, estimating one-step stalling probabilities and characterizing the responsive period post-initialization. They conducted controlled simulations and LLM pre-training experiments to validate their findings regarding the effectiveness of reset schedules.
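The stalling mechanism is easy to reproduce in a toy simulation: when the nominal EMA update is smaller than half a quantization cell, the stored value never moves. The sketch below uses uniform rounding as a simplified model of low-precision storage; the reset rule is an illustrative guess, not the paper's schedule.

```python
import numpy as np

def quantized_ema(grads, beta=0.99, step=0.05, reset_every=None):
    """Simulate a low-precision EMA whose stored value is rounded to a
    uniform grid of spacing `step` after every update.

    Stalling: when |(1 - beta) * (g - m)| is below half a grid cell,
    the nominal update rounds back to the stored value m. An optional
    periodic reset re-seeds the state from the current gradient,
    restoring responsiveness.
    """
    m, stalls, history = 0.0, 0, []
    for t, g in enumerate(grads):
        if reset_every and t > 0 and t % reset_every == 0:
            m = g                                   # state reset
        proposed = m + (1.0 - beta) * (g - m)       # nominal EMA update
        stored = round(proposed / step) * step      # low-precision store
        stalls += (stored == m and g != m)          # update rounded away
        m = stored
        history.append(m)
    return np.array(history), stalls
```

With a constant gradient of 1.0, the unreset EMA stalls at zero indefinitely, while periodic resets let the state track the signal, mirroring the recovery effect the paper reports.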
Results
The study found that suitable reset schedules significantly recover performance lost due to low-precision state storage while also reducing the memory footprint of optimizer states. The predictive model accurately characterized the dynamics of stalling and responsiveness in quantized EMAs.
Implications
The findings suggest that understanding the dynamics of quantized optimizer states can lead to improved training strategies for large language models, particularly in memory-constrained environments. This work highlights the importance of timing in optimizer-state resets, which could enhance the efficiency of large-scale pre-training processes.
Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors
Interpretability
Computer Vision
Efficient ML
- Introduces a rank-one model editing framework for rectifying unreliable neural network behaviors.
- Develops an attribution-guided method for layer localization to identify the most affected layers.
- Achieves effective model rectification with as few as one cleansed sample.
- Demonstrates robustness against neural Trojans, spurious correlations, and feature leakage.
Read more
Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors
Summary
This paper addresses the issue of unreliable behaviors in neural networks, particularly in the context of corrupted samples influenced by neural Trojans and spurious correlations. Traditional methods for rectifying these behaviors often involve labor-intensive data cleaning and model retraining, which can be computationally expensive. The authors propose a novel framework that utilizes rank-one model editing to correct these unreliable behaviors while maintaining overall model performance. They introduce an attribution-guided layer localization method that quantifies the editability of different layers in the model, allowing for targeted corrections. The proposed method is shown to effectively rectify model behaviors with minimal reliance on cleansed samples, achieving significant improvements in model reliability. Extensive experiments validate the effectiveness of the approach across various types of model unreliabilities, demonstrating its practical applicability in real-world scenarios, such as skin lesion analysis.
Methodology
The authors leverage rank-one model editing to establish a rectification framework that corrects unreliable behaviors in neural networks. They introduce an attribution-guided layer localization method to assess layer-wise editability, allowing for targeted corrections. The framework dynamically selects layers for rectification based on the identified sources of unreliability, optimizing the editing process.
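Rank-one editing in general can be sketched as an outer-product correction that remaps a single key direction while leaving orthogonal directions untouched. The update below is the generic form, not necessarily the authors' exact rule:

```python
import numpy as np

def rank_one_edit(W, k, v_target):
    """Rank-one edit forcing W' @ k == v_target.

    The correction is an outer product of the residual with the key k,
    so any input orthogonal to k passes through W unchanged. This is
    the minimal-interference property that makes rank-one editing cheap
    compared to retraining.
    """
    k = np.asarray(k, dtype=float)
    residual = v_target - W @ k
    return W + np.outer(residual, k) / (k @ k)
```

Because only one key-value association changes, a single cleansed sample can in principle supply both `k` and `v_target`, which is consistent with the few-sample rectification the paper reports.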
Results
The proposed method successfully corrected unreliable behaviors associated with neural Trojans, spurious correlations, and feature leakage. It demonstrated high performance with minimal cleansed samples, achieving effective rectification with just one sample in some cases. Experimental results indicate a significant improvement in model reliability and performance across various datasets.
Implications
This work has significant implications for enhancing the reliability of neural networks in critical applications where model robustness is essential. The proposed framework can be applied to various domains, including security-sensitive areas like medical diagnosis and autonomous systems, where unreliable model behaviors can lead to severe consequences.
DeFRiS: Silo-Cooperative IoT Applications Scheduling via Decentralized Federated Reinforcement Learning
Reinforcement Learning
Federated Learning
Optimization
- DeFRiS enables silo-cooperative scheduling while maintaining data privacy.
- The framework employs an action-space-agnostic policy for seamless knowledge transfer.
- It integrates a silo-optimized local learning mechanism to address sparse delayed rewards.
- The Dual-Track Non-IID aggregation protocol enhances robustness against adversarial threats.
Read more
DeFRiS: Silo-Cooperative IoT Applications Scheduling via Decentralized Federated Reinforcement Learning
Summary
The paper presents DeFRiS, a Decentralized Federated Reinforcement Learning framework designed for efficient scheduling of IoT applications across heterogeneous administrative silos while ensuring data privacy. Traditional scheduling methods often rely on centralized coordination or independent learning, which struggle with the challenges posed by infrastructure heterogeneity, Non-Independent and Identically Distributed (Non-IID) workload shifts, and adversarial environments. DeFRiS addresses these issues through three key innovations: (1) an action-space-agnostic policy that allows knowledge transfer across different silos, (2) a silo-optimized local learning mechanism that combines Generalized Advantage Estimation (GAE) with clipped policy updates to manage sparse delayed rewards, and (3) a Dual-Track Non-IID robust decentralized aggregation protocol that enhances knowledge transfer and anomaly detection. The framework was tested on a distributed testbed with 20 heterogeneous silos, demonstrating significant improvements over existing methods, including reduced response time, lower energy consumption, and enhanced stability in adversarial conditions.
Methodology
DeFRiS utilizes a decentralized peer-to-peer architecture for knowledge exchange among silos. It incorporates an action-space-agnostic policy for resource scoring, a local learning mechanism combining GAE with clipped updates, and a robust aggregation protocol that tracks gradients for optimization and anomaly detection.
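The GAE component of the local learner is standard and can be sketched directly; the clipped policy update and the silo-specific scheduling details are omitted here.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one episode.

    `values` carries one extra bootstrap entry (len(rewards) + 1).
    The backward lambda-weighted recursion spreads credit from sparse,
    delayed rewards across the whole trajectory, which is the role GAE
    plays in handling delayed scheduling feedback.
    """
    adv = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv
```

With `lam=1` a single terminal reward propagates to every step; with `lam=0` it stays at the final step, illustrating the bias-variance dial GAE provides.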
Results
The experiments revealed that DeFRiS reduced average response time by 6.4%, energy consumption by 7.2%, and tail-latency risk (CVaR at the 0.95 level) by 10.4% compared to the best-performing baseline. Additionally, it demonstrated over three times better performance retention as the system scales and over eight times better stability in adversarial environments.
Implications
DeFRiS has significant implications for the deployment of IoT applications in environments requiring cooperation among multiple autonomous entities, such as smart cities and industrial supply chains, while ensuring data privacy and robustness against adversarial threats.
Trajectory-Optimized Time Reparameterization for Learning-Compatible Reduced-Order Modeling of Stiff Dynamical Systems
Optimization
Theory
Efficient ML
- Introduces trajectory-optimized time reparameterization (TOTR) for ML-ROMs to address stiffness in dynamical systems.
- TOTR formulates time reparameterization as an optimization problem, improving the learnability of neural ODEs.
- Demonstrates significant improvements in training efficiency and prediction accuracy across multiple stiff benchmark problems.
- Achieves loss reductions of one to two orders of magnitude compared to existing time reparameterization methods.
Read more
Trajectory-Optimized Time Reparameterization for Learning-Compatible Reduced-Order Modeling of Stiff Dynamical Systems
Summary
This paper addresses the challenges posed by stiff dynamical systems in the context of machine-learning reduced-order models (ML-ROMs). Stiffness complicates explicit time integration, leading to instability, while implicit methods are computationally expensive. The authors propose a novel approach called trajectory-optimized time reparameterization (TOTR), which transforms the time variable to mitigate stiffness and enhance the learnability of neural ordinary differential equations (NODEs) used in ROMs. The TOTR method optimizes the time reparameterization as an arc-length problem, focusing on creating smoother dynamics during training. The effectiveness of TOTR is evaluated against three stiff benchmark problems: a parameterized stiff linear system, the van der Pol oscillator, and the HIRES chemical kinetics model. The results demonstrate that TOTR yields significantly smoother reparameterizations and improved predictions in physical time, achieving loss reductions of one to two orders of magnitude compared to existing methods, especially in highly stiff regimes. This work underscores the importance of the regularity of the time map in the performance of ML-ROMs and establishes optimization-based TR as a robust framework for modeling multiscale dynamical systems.
Methodology
The authors developed the TOTR approach by framing time reparameterization as an optimization problem in arc-length coordinates. This method selects a traversal-speed profile that penalizes acceleration in the transformed time, aiming to produce smoother dynamics during training. The effectiveness of TOTR was assessed through numerical experiments on three stiff dynamical systems, comparing its performance against existing TR strategies.
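The arc-length baseline that TOTR optimizes over can be illustrated with a discrete time map: cumulative trajectory arc length, normalized to [0, 1]. Fast transients then occupy a proportionally larger share of the new time variable. This sketch shows only the plain arc-length map, not the optimized traversal-speed profile:

```python
import numpy as np

def arc_length_reparam(y):
    """Map trajectory samples to normalized cumulative arc length.

    y: (n,) or (n, d) state samples along one trajectory. Segments
    where the state changes rapidly (stiff transients) are stretched
    in the new time variable, so the reparameterized dynamics appear
    smoother to a learned model.
    """
    y = np.asarray(y, dtype=float)
    if y.ndim == 1:
        y = y[:, None]
    seg = np.linalg.norm(np.diff(y, axis=0), axis=1)   # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])
    return s / s[-1]                                   # normalize to [0, 1]
```

TOTR's contribution, per the summary, is to replace this fixed map with one chosen by optimization, penalizing acceleration in the transformed time.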
Results
TOTR consistently outperformed other time reparameterization methods across all tested stiff benchmark problems, yielding smoother trajectories and improved physical-time predictions. The quantitative results indicated loss reductions of one to two orders of magnitude, particularly in scenarios characterized by severe stiffness.
Implications
The findings suggest that optimizing time reparameterization can significantly enhance the performance of ML-ROMs in simulating stiff dynamical systems. This approach could be applied in various fields such as chemical kinetics, fluid dynamics, and control systems, where stiffness presents a major computational challenge.
Trained Persistent Memory for Frozen Encoder-Decoder LLMs: Six Architectural Methods
NLP
Large Language Models
- Persistent memory can be integrated into frozen LLMs, allowing for information retention across sessions.
- The study introduces six architectural methods for memory management, focusing on differentiable operations within the model's latent space.
- Memory capacity significantly impacts the performance of the memory systems, with higher capacities leading to better recall.
- The concept of conversational learning is introduced, where the model accumulates knowledge over multiple sessions.
Read more
Trained Persistent Memory for Frozen Encoder-Decoder LLMs: Six Architectural Methods
Summary
This paper explores the integration of persistent memory into frozen encoder-decoder language models (LLMs), specifically focusing on the Flan-T5-XL architecture. Traditional frozen LLMs lack the ability to retain information across sessions, leading to stateless interactions. The author presents a proof-of-concept study demonstrating that it is feasible to implement persistent memory in the continuous latent space of a frozen LLM, even under significant resource constraints. The study introduces six architectural methods that utilize small trainable adapters to facilitate memory writing and reading within the model's latent space. These methods are evaluated based on their ability to accumulate and recall information across sessions, revealing that memory capacity is a critical design parameter. The results indicate that, while the stateless baseline performs poorly, the trained adapters show positive memory-recall capabilities, particularly at higher memory capacities. The paper argues for the potential of larger-scale implementations to yield even stronger results, establishing a foundation for future research in persistent memory systems for LLMs.
Methodology
The paper implements six architectural methods that span three injection points and four write mechanisms for integrating persistent memory into a frozen LLM. These methods involve training a small memory adapter while keeping the encoder and decoder frozen, allowing for differentiable memory operations during the forward pass. The evaluation is conducted using a forgetting-curve analysis on a specific dataset, measuring the memory recall capabilities of each method.
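As a toy illustration of the write/read pattern (not any of the paper's six methods specifically; the slot layout, EMA-style write rule, and all dimensions below are assumptions), a slot-style memory adapter around a frozen encoder might be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_slots = 8, 4

# Trainable pieces (illustrative): a slot memory plus write/read projections.
# The encoder and decoder themselves stay frozen.
memory = np.zeros((n_slots, d_model))
W_write = rng.normal(size=(d_model, d_model)) * 0.1
W_read = rng.normal(size=(d_model, d_model)) * 0.1

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def write(encoder_states, memory):
    """Differentiable write: slots attend over frozen encoder states."""
    attn = softmax(memory @ (encoder_states @ W_write).T)   # (n_slots, seq)
    return 0.9 * memory + 0.1 * attn @ encoder_states       # EMA-style persistent update

def read(decoder_state, memory):
    """Read: a decoder state attends over slots; the summary is injected back."""
    scores = softmax(memory @ (W_read @ decoder_state))
    return scores @ memory

# Session 1 writes, session 2 reads; no raw text is carried between sessions.
session1_states = rng.normal(size=(5, d_model))
memory = write(session1_states, memory)
summary = read(rng.normal(size=d_model), memory)
```

The point of the sketch is only the interface: memory persists between calls while every operation in the forward pass remains differentiable, so the adapter can be trained end to end.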
Results
The evaluation results show that at a 10× memory capacity, all six trained adapters outperform the stateless baseline, which scores zero. At a 1× capacity, three methods fail to perform, highlighting the importance of memory capacity in the design. The methods M.2 XAttn and M.6 Slot excel at low capacity, while M.4 Hebbian performs best at high capacity, indicating varying effectiveness based on memory size.
Implications
The findings suggest that integrating persistent memory into LLMs can significantly enhance their conversational capabilities, enabling them to recall information across sessions. This has potential applications in developing more interactive and context-aware AI systems, particularly in areas such as customer service, personal assistants, and educational tools.
MBD: A Model-Based Debiasing Framework Across User, Content, and Model Dimensions
Theory
Optimization
Efficient ML
- Introduction of a generalized debiasing framework that shifts from point-wise error minimization to distributional bias mitigation.
- Utilization of a dual-prediction model architecture that integrates distribution modeling without requiring separate serving infrastructure.
- Demonstrated effectiveness through large-scale deployment, improving long-term retention and engagement metrics.
- Flexible definition of 'unbiasedness' allows for adaptation to various personalization objectives.
Summary
The paper presents a Model-Based Debiasing (MBD) framework aimed at addressing biases in recommendation systems, particularly in the context of short-form video content. Traditional recommendation systems often aggregate behavioral signals that are influenced by various biases, leading to misalignment between model predictions and user preferences. The MBD framework proposes a solution by transforming biased signals into unbiased representations through distributional modeling. It estimates the contextual mean and variance of engagement distributions based on a flexible subset of features, allowing for personalized and adaptive debiasing. This approach enables the generation of calibrated signals, such as percentiles or z-scores, which are more reflective of user interests. The implementation of MBD is integrated into existing multi-task multi-label ranking models, requiring minimal additional infrastructure. The effectiveness of the framework is validated through large-scale deployment and rigorous online A/B testing, demonstrating significant improvements in user engagement metrics, such as time spent and session duration, by effectively decoupling preference signals from intrinsic biases.
Methodology
The MBD framework employs a dual-prediction model architecture that estimates the contextual mean and variance of engagement distributions. It utilizes a flexible partial feature set to transform biased predictions into calibrated signals, allowing for a more accurate representation of user preferences. The framework is designed to operate within existing multi-task multi-label models, minimizing engineering overhead.
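The calibration step can be sketched in a few lines. Here the contextual mean and variance come from per-context statistics rather than the paper's learned dual-prediction heads, and the contexts and engagement data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

contexts = rng.integers(0, 3, size=1000)          # e.g. device/content-type buckets
bias = np.array([0.2, 0.5, 0.8])[contexts]        # context-driven baseline engagement
preference = rng.normal(0.0, 0.1, size=1000)      # the signal we actually want
raw = bias + preference                           # observed, biased engagement

def debias(raw, contexts):
    """Turn biased signals into calibrated z-scores within each context."""
    z = np.empty_like(raw)
    for c in np.unique(contexts):
        m = contexts == c
        mu, sigma = raw[m].mean(), raw[m].std() + 1e-8
        z[m] = (raw[m] - mu) / sigma              # z-score-style calibrated signal
    return z

z = debias(raw, contexts)
```

After the transform, each context's signal has mean zero and unit scale, so cross-context comparisons reflect preference rather than the contextual baseline.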
Results
The empirical results from large-scale A/B testing indicate that the MBD framework significantly enhances user engagement metrics, including increased time spent and longer session durations, by effectively decoupling user preferences from intrinsic biases present in the recommendation system.
Implications
The MBD framework has the potential to improve recommendation systems across various platforms by providing a more accurate representation of user interests, leading to better user experiences and increased engagement. It can be applied to various content types beyond short-form videos, making it a versatile tool for addressing biases in digital content distribution.
When Stability Fails: Hidden Failure Modes of LLMs in Data-Constrained Scientific Decision-Making
Large Language Models
NLP
Theory
- Stability in LLM outputs does not ensure correctness in scientific decision-making.
- LLMs can produce consistent outputs while diverging from statistical ground truths.
- Minor changes in prompt wording can significantly affect LLM outputs.
- Relaxed statistical thresholds may lead to over-selection of gene candidates.
Summary
This paper investigates the reliability of large language models (LLMs) as decision-support tools in data-constrained scientific workflows, particularly in the context of gene prioritization tasks derived from differential expression analysis. The author critiques the common evaluation practice that emphasizes stability across repeated runs, arguing that stability does not guarantee correctness or validity when compared to established statistical ground truths. A controlled behavioral evaluation framework is introduced, separating four dimensions of LLM decision-making: stability, correctness, prompt sensitivity, and output validity. The study evaluates multiple LLMs, including ChatGPT, Google Gemini, and Claude, across various prompt regimes that manipulate statistical thresholds and wording. Results reveal that while LLMs can exhibit high stability, they may diverge significantly from statistical references, over-select genes under relaxed thresholds, and produce invalid gene identifiers. The findings underscore the necessity for explicit validation against ground truth in scientific applications of LLMs, suggesting that stability should be viewed as a complementary diagnostic rather than a proxy for correctness.
Methodology
The study employs a controlled evaluation framework to assess LLM behavior in a statistical gene prioritization task. It uses a fixed differential expression table as input and evaluates LLMs across different prompt regimes. The evaluation measures stability, correctness against a statistical reference, prompt sensitivity, and output validity through repeated runs and various statistical thresholds.
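The stability-versus-correctness distinction is easy to make concrete. In this sketch (gene lists are invented; set overlap stands in for whatever agreement metric the paper uses), three repeated runs agree with each other far more than any of them agrees with the statistical reference:

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical gene lists: three repeated LLM runs plus the statistical reference.
runs = [["TP53", "BRCA1", "EGFR"],
        ["TP53", "BRCA1", "EGFR"],
        ["TP53", "BRCA1", "MYC"]]
reference = ["TP53", "KRAS", "PTEN"]

# Stability: mean pairwise overlap across repeated runs.
pairs = [(i, j) for i in range(len(runs)) for j in range(i + 1, len(runs))]
stability = sum(jaccard(runs[i], runs[j]) for i, j in pairs) / len(pairs)

# Correctness: overlap of each run with the statistical ground truth.
correctness = sum(jaccard(r, reference) for r in runs) / len(runs)
```

High `stability` with low `correctness` is exactly the failure mode the paper warns about: consistency across runs says nothing about agreement with the ground truth.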
Results
The experiments demonstrate that LLMs can achieve high run-to-run stability while showing low agreement with statistical references. They also reveal that small prompt variations can lead to significant differences in gene prioritization outcomes, and that relaxed thresholds result in over-selection of genes. Additionally, the models may generate plausible but invalid gene identifiers not present in the input data.
Implications
The findings highlight the limitations of relying solely on stability as an indicator of model reliability in scientific contexts. They suggest that researchers and practitioners should incorporate explicit validation against statistical ground truths when using LLMs for decision-making in data-constrained environments.
HIPO: Instruction Hierarchy via Constrained Reinforcement Learning
NLP
Large Language Models
Reinforcement Learning
- HIPO formulates instruction hierarchy as a Constrained Markov Decision Process (CMDP), a novel approach in this domain.
- The algorithm employs a primal-dual safe reinforcement learning method to ensure system prompt compliance while optimizing user utility.
- Extensive experiments show that HIPO significantly enhances compliance and utility across various LLM architectures.
- Attention analysis indicates that HIPO effectively reallocates model attention towards system instruction tokens.
Summary
The paper introduces HIPO, a novel framework for Hierarchical Instruction Following (HIF) that addresses the challenges of prompting large language models (LLMs) with a priority-ordered stack of instructions. Traditional methods like Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO) struggle with HIF as they optimize for a single objective and fail to enforce system prompt compliance. HIPO formulates HIF as a Constrained Markov Decision Process (CMDP), elevating system prompts to strict algorithmic boundaries. The algorithm employs a primal-dual safe reinforcement learning approach to dynamically enforce compliance while maximizing user utility. Extensive evaluations across various model architectures, including Qwen, Phi, and Llama, demonstrate significant improvements in both system compliance and user utility. Mechanistic analysis reveals that HIPO encourages models to focus more on long-range system tokens, providing a robust foundation for deploying LLMs in complex workflows.
Methodology
HIPO utilizes a constrained optimization framework, treating system prompt compliance as an explicit constraint within a CMDP formulation. It employs a primal-dual safe reinforcement learning approach combined with a group-based policy gradient mechanism to ensure compliance while maximizing user utility. The evaluation protocol includes separate reward functions for compliance and utility, allowing for a nuanced assessment of the model's performance.
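The primal-dual loop at the core of this kind of constrained optimization can be caricatured with a scalar "policy" parameter. Everything below (the quadratic utility, the violation function, the step sizes) is an invented stand-in, not the paper's actual parametrization:

```python
# Toy primal-dual loop: maximize utility subject to a compliance constraint,
# in the spirit of a CMDP with a Lagrange multiplier.
eta_theta, eta_lam, budget = 0.05, 0.05, 0.05
theta, lam = 0.0, 0.0            # policy parameter and Lagrange multiplier

def utility_grad(theta):         # d/dtheta of utility -(theta - 1)^2
    return -2.0 * (theta - 1.0)

def violation(theta):            # compliance cost above a system-set limit
    return max(0.0, theta - 0.5)

for _ in range(2000):
    # Primal ascent on the Lagrangian: utility minus lam-weighted violation.
    g = utility_grad(theta) - (lam if theta > 0.5 else 0.0)
    theta += eta_theta * g
    # Dual ascent: raise lam while the violation exceeds the allowed budget.
    lam = max(0.0, lam + eta_lam * (violation(theta) - budget))
```

The multiplier `lam` grows until the compliance constraint binds, after which the policy settles where utility is maximal subject to `violation == budget`; dynamically trading compliance against utility this way is the essence of the primal-dual approach.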
Results
HIPO demonstrated consistent improvements in system compliance and user utility across multiple model families and sizes, with attention analysis confirming that the model learns to prioritize system instruction tokens effectively. The results indicate that HIPO outperforms baseline algorithms in both compliance and utility metrics.
Implications
The HIPO framework has significant implications for the deployment of LLMs in complex workflows, ensuring that models can adhere to strict guidelines while still effectively addressing user requests. This approach could enhance the reliability and safety of LLM applications in various domains, including customer service, content generation, and automated decision-making.
Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards
Reinforcement Learning
Large Language Models
- Noisy data significantly degrades the performance of RLVR models.
- Existing algorithmic improvements fail to mitigate the impact of data noise.
- Training on 100% incorrect annotations leads to performance similar to format-only rewards.
- Real-world annotation errors in Text2SQL tasks further illustrate the destructive impact of noise.
Summary
This paper critically examines the robustness of Reinforcement Learning with Verifiable Rewards (RLVR) against noisy data, challenging the prevailing notion that RLVR can effectively learn from incorrect annotations. The authors argue that previous studies claiming high performance from noisy data were flawed due to contamination with clean data. By implementing a rigorous re-verification pipeline, they demonstrate that noise significantly degrades RLVR performance. The study reveals that existing algorithmic improvements do not mitigate the negative effects of noise, with models trained on truly incorrect annotations performing 8-10% worse than those trained on clean data across various benchmarks, including mathematical reasoning tasks. Additionally, the authors investigate real-world annotation errors in Text2SQL tasks, finding that training on noisy datasets results in 5-12% lower accuracy compared to clean datasets. The findings underscore the necessity of high-quality data in RLVR, contradicting the belief that algorithmic enhancements can compensate for poor data quality.
Methodology
The authors constructed a truly noisy dataset for mathematical reasoning and employed a re-verification pipeline to assess the quality of annotations. They compared the performance of models trained on clean versus noisy data, including synthetic and real-world noise scenarios, and evaluated various state-of-the-art RLVR algorithms to determine their effectiveness in handling noise.
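Why label noise is so damaging to verifiable rewards is visible in a minimal sketch (the questions and corrupted label below are made up): the reward is computed by exact comparison with the annotation, so a wrong annotation silently rewards wrong answers and penalizes correct ones.

```python
# Minimal verifiable-reward sketch: reward 1 only when the model's final
# answer matches the reference annotation.
def verifiable_reward(model_answer: str, annotation: str) -> float:
    return 1.0 if model_answer.strip() == annotation.strip() else 0.0

clean = [("2+2", "4"), ("3*3", "9")]
noisy = [("2+2", "5"), ("3*3", "9")]          # first label corrupted

model_outputs = {"2+2": "4", "3*3": "9"}      # a model answering correctly

reward_clean = sum(verifiable_reward(model_outputs[q], a) for q, a in clean)
reward_noisy = sum(verifiable_reward(model_outputs[q], a) for q, a in noisy)
# Under noise the correct answer "4" earns zero reward, pushing the policy
# away from the true solution; no algorithmic change to the RL loop fixes this.
```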
Results
The study found that models trained on 100% incorrect annotations performed 8-10% worse than those trained on clean data. Even with the best-performing algorithm, training on 50% incorrect annotations yielded performance similar to the basic GRPO method and underperformed compared to clean data. In real-world Text2SQL tasks, training on noisy datasets resulted in a 5-12% decrease in accuracy compared to cleaned datasets.
Implications
The findings suggest that reliance on noisy data in RLVR could lead to suboptimal model performance, emphasizing the importance of data quality in machine learning applications. This has implications for future research and practical applications in RLVR, particularly in domains where data quality cannot be guaranteed.
The Importance of Being Smoothly Calibrated
Theory
- Introduces a new omniprediction guarantee for smoothly calibrated predictors.
- Characterizes smooth calibration in terms of earth mover's distance to perfect calibration.
- Demonstrates that upper distance to calibration cannot be estimated within a quadratic factor.
- Unifies and extends previous results on omniprediction from smooth calibration.
Summary
This paper emphasizes the significance of smooth calibration as a robust measure of calibration error in machine learning predictions. The authors generalize and unify previous findings on smooth calibration, introducing a new omniprediction guarantee for smoothly calibrated predictors across all bounded proper losses. By adding noise to the predictor, they demonstrate that the omniprediction error is bounded by the smooth calibration error and the earth mover's distance from a benchmark predictor. The paper also provides a new characterization of smooth calibration in relation to the earth mover's distance to the nearest perfectly calibrated distribution, simplifying previous proofs. Furthermore, it highlights the limitations of estimating the upper distance to calibration, contrasting it with the known impossibility of estimating the distance to calibration with a finite sample size. Overall, the work advances the understanding of calibration measures and their implications for decision-making in machine learning.
Methodology
The authors employ theoretical analysis to derive new guarantees for omniprediction from smoothly calibrated predictors. They introduce noise to the predictions to establish bounds on omniprediction error and utilize the earth mover's distance to characterize smooth calibration. The paper also discusses the sample complexity required for estimating calibration distances.
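Schematically, the new guarantee has the following shape (the notation here is ours, not the paper's): for a predictor $p$, a benchmark predictor $q$, and any bounded proper loss $\ell$,

```latex
% Schematic form of the omniprediction guarantee (notation is illustrative):
\mathrm{omni}_{\ell}(p) \;\lesssim\; \mathrm{smCE}(p) \;+\; W_1(p, q),
% where smCE(p) is the smooth calibration error of p and W_1(p, q) is the
% earth mover's distance between the outputs of p and the benchmark q.
```

This is only a sketch of the bound's structure; the precise constants and the role of the added noise are in the paper.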
Results
The main results include a new omniprediction guarantee that relates the omniprediction error to the smooth calibration error and the earth mover's distance from a benchmark. The authors also provide a clear characterization of smooth calibration and demonstrate the limitations of estimating upper distances to calibration, establishing that it cannot be done within a quadratic factor.
Implications
The findings have significant implications for the development of robust machine learning models that require reliable calibration. The results can enhance decision-making processes in various applications by ensuring that predictions are trustworthy, particularly in critical domains like healthcare and finance where decision-making relies heavily on calibrated predictions.
Residual Stream Duality in Modern Transformer Architectures
NLP
Large Language Models
Theory
- Residual pathways in Transformers are crucial for representation, not just optimization.
- A depth-wise residual attention read is equivalent to ShortSWA on the depth axis.
- Existing models like ELC-BERT and DenseFormer illustrate the benefits of learned aggregation over depth.
- Sequence-axis ShortSWA is generally more hardware-efficient than depth-axis aggregation.
Summary
This paper explores the role of the residual stream in Transformer architectures, arguing that it is integral to the model's representational capabilities rather than merely serving as an optimization mechanism. The author proposes a two-axis framework for understanding Transformers, where information evolves along the sequence position and layer depth. The paper introduces the concept of residual stream duality, demonstrating that a depth-wise residual attention read can be mathematically equivalent to causal short sliding-window attention (ShortSWA) when viewed through the lens of layer depth. This perspective clarifies existing literature on various models that utilize learned aggregation over depth, such as ELC-BERT and DenseFormer, and highlights the importance of adaptive attention mechanisms. The author emphasizes that while depth-wise attention offers a new local operator, it does not imply systems-level symmetry, as sequence-axis ShortSWA is often more efficient for hardware utilization. The paper concludes with recommendations for using Deep Delta Learning (DDL) to modify shortcuts directly when necessary, and employing sequence-axis ShortSWA for local adaptive mixing.
Methodology
The paper employs a theoretical framework to analyze the duality of residual streams in Transformers, comparing depth-wise residual attention with sequence-axis ShortSWA. It reviews existing literature and models that implement learned aggregation over depth, providing a conceptual basis for the proposed duality.
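The two views of the residual stream can be contrasted in a toy sketch (the block function, dimensions, and score rule below are invented; this only illustrates the structural analogy, not any specific model):

```python
import numpy as np

rng = np.random.default_rng(2)
d, L, window = 6, 5, 3

def block(x, l):
    """Stand-in for a Transformer block's residual-branch output."""
    return np.tanh(x + l * 0.1)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Plain residual stream: x_{l+1} = x_l + f_l(x_l), a uniform sum over depth.
x = rng.normal(size=d)
states = [x]
for l in range(L):
    states.append(states[-1] + block(states[-1], l))

# Depth-wise residual attention: each layer re-weights a sliding window of
# earlier-layer states. Structurally this is the same operator as short
# sliding-window attention, applied along the depth axis rather than the
# sequence axis.
y = rng.normal(size=d)
depth_states = [y]
for l in range(L):
    ctx = depth_states[-window:]                        # depth-axis window
    scores = softmax(np.array([s @ depth_states[-1] for s in ctx]))
    mixed = sum(w * s for w, s in zip(scores, ctx))     # learned aggregation over depth
    depth_states.append(mixed + block(mixed, l))
```

The first loop fixes the aggregation weights at 1 (the ordinary residual sum); the second learns them per layer, which is the duality the paper formalizes.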
Results
The analysis reveals that depth-wise residual attention can be treated as a local operator akin to ShortSWA, leading to insights about the efficiency and design choices in Transformer architectures. The paper identifies that while both approaches have their merits, sequence-axis ShortSWA is often more compatible with current hardware setups.
Implications
The findings suggest that Transformer architectures can be optimized by reconsidering how residual streams are designed and utilized, potentially leading to more efficient models in NLP and other applications. The recommendations for using DDL and ShortSWA could influence future architectural designs in large-scale models.
Learning Lineage-guided Geodesics with Finsler Geometry
Time Series
Theory
Optimization
- Introduction of a Finsler metric for trajectory inference that incorporates lineage information.
- Formal proof of the well-defined local geometry induced by the proposed metric.
- Demonstration of improved accuracy in trajectory interpolation tasks using the new metric.
- Integration of discrete and directed priors enhances the modeling of biological systems.
Summary
This paper addresses the challenge of trajectory inference in dynamical systems, particularly in contexts where data is observed at discrete time points. The authors propose a novel approach that integrates Finsler geometry to incorporate both continuous geometric priors and discrete, directed lineage information into the trajectory inference process. Traditional methods have primarily relied on Riemannian metrics, which may not adequately capture the structured transitions inherent in biological systems, such as lineage trees in developmental biology. The authors define a Finsler metric conditioned on a directed adjacency matrix that represents admissible transitions, allowing for the enforcement of biologically plausible trajectories. They provide formal proofs demonstrating that this metric induces a well-defined local geometric structure and can be combined with existing geometric priors to enhance trajectory accuracy. The proposed method shows improved performance on interpolation tasks across both synthetic and real-world datasets, particularly in single-cell RNA sequencing applications.
Methodology
The authors develop a Finsler metric that combines geometric and classification information, allowing for the incorporation of directed lineage priors into trajectory inference. They provide a formal mathematical framework to establish the properties of this metric and demonstrate its application in reconstructing continuous paths from discrete observations.
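The key property a Finsler-style metric adds over a Riemannian one is direction dependence, which is what lets a directed lineage prior shape geodesics. A discrete caricature (cell states, edges, and costs all invented for illustration):

```python
import heapq

# Admissible lineage transitions: moving along the tree is cheap, while the
# reverse direction (or a cross-lineage jump) is penalized. The asymmetry
# cost(u, v) != cost(v, u) is the Finsler-like ingredient.
admissible = {("stem", "progenitor"), ("progenitor", "neuron"),
              ("progenitor", "glia")}
states = ["stem", "progenitor", "neuron", "glia"]

def cost(u, v):
    return 1.0 if (u, v) in admissible else 10.0

def shortest(src, dst):
    """Dijkstra over the directed, asymmetric transition costs."""
    dist, heap = {src: 0.0}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        for v in states:
            if v != u and d + cost(u, v) < dist.get(v, float("inf")):
                dist[v] = d + cost(u, v)
                heapq.heappush(heap, (dist[v], v))
    return float("inf")

forward = shortest("stem", "neuron")     # follows the lineage tree: 1 + 1
backward = shortest("neuron", "stem")    # de-differentiation is penalized
```

A symmetric (Riemannian-like) metric would assign both directions the same length; here the directed prior makes the biologically implausible reverse path strictly more expensive.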
Results
The proposed Finsler metric significantly improves the accuracy of trajectory inference in both synthetic and real-world datasets, particularly in scenarios involving single-cell RNA sequencing. The integration of lineage information leads to more biologically plausible trajectories compared to traditional Riemannian approaches.
Implications
This work has potential implications for various fields, including developmental biology and single-cell analysis, where understanding the dynamics of cell differentiation and evolution is crucial. The methodology can enhance the interpretation of temporal data in biological systems and may be applicable to other domains requiring trajectory inference.
Mechanistic Foundations of Goal-Directed Control
Robotics
Theory
Interpretability
- Extends mechanistic interpretability to embodied control systems using infant motor learning as a model.
- Identifies critical parameters influencing the formation of causal control circuits.
- Demonstrates a clean phase transition in arbitration mechanisms during learning.
- Establishes a two-dimensional phase diagram for task-dependent route arbitration.
Summary
This paper extends the framework of mechanistic interpretability, traditionally applied to transformer circuits, to the domain of embodied control systems, specifically focusing on infant motor learning. The author demonstrates that foundational inductive biases lead to the formation of causal control circuits, where learned gating mechanisms align with theoretically defined uncertainty thresholds. A significant finding is the identification of a clean phase transition in the arbitration gate, characterized by a closed-form exponential moving-average surrogate. The study highlights the critical parameter of context window k, which influences circuit formation and gate confidence. Below a threshold (k≤4), the arbitration mechanism fails to form, while above another threshold (k≥8), gate confidence scales logarithmically with k. Additionally, a two-dimensional phase diagram illustrates task-demand-dependent route arbitration, suggesting that prospective execution is beneficial only when prediction errors are within acceptable limits. Overall, the findings provide a mechanistic understanding of how reactive and prospective control strategies develop and compete during learning, contributing to cognitive development theories and offering insights for designing interpretable embodied agents.
Methodology
The study employs a computational modeling approach to analyze the dynamics of infant motor learning, focusing on the formation of causal control circuits and the influence of inductive biases. It utilizes phase diagrams and closed-form predictions to describe the behavior of gating mechanisms and their transitions during learning.
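The exponential-moving-average surrogate for the arbitration gate can be sketched directly (the error trace, smoothing constant, and threshold are invented; only the gating pattern is the point):

```python
# EMA surrogate for the arbitration gate: the controller tracks an exponential
# moving average of its prediction error and switches from reactive to
# prospective control once that uncertainty falls below a threshold.
alpha, threshold = 0.2, 0.3
ema = 1.0                       # start fully uncertain -> reactive control
gate_log = []

# Prediction error shrinking as the forward model improves during learning.
errors = [0.9 ** t for t in range(60)]

for e in errors:
    ema = (1 - alpha) * ema + alpha * e      # closed-form EMA update
    gate_log.append("prospective" if ema < threshold else "reactive")
```

Early in learning the gate stays reactive; once the smoothed error crosses the uncertainty threshold it flips to prospective control, mirroring the phase transition described above.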
Results
The research reveals that foundational inductive biases lead to the emergence of causal control circuits with learned gating mechanisms that align with uncertainty thresholds. It identifies critical context window parameters that affect the formation of arbitration mechanisms and demonstrates how reactive and prospective control strategies compete and evolve during the learning process.
Implications
The findings have significant implications for understanding cognitive development in infants and can inform the design of more resilient and context-sensitive embodied AI systems. The mechanistic insights gained may enhance the interpretability of AI models and guide future research in cognitive robotics.
Integrating Weather Foundation Model and Satellite to Enable Fine-Grained Solar Irradiance Forecasting
Multimodal
Time Series
Optimization
- Introduces Baguan-solar, a two-stage multimodal framework for solar irradiance forecasting.
- Combines global weather foundation model forecasts with high-resolution satellite imagery.
- Achieves a 16.08% reduction in RMSE compared to strong baseline models.
- Effectively resolves fine-scale cloud structures and improves long-term forecasting accuracy.
Summary
The paper addresses the challenge of accurate day-ahead solar irradiance forecasting, which is crucial for integrating solar energy into the power grid. Traditional methods struggle with either low resolution or performance degradation over longer lead times. The authors propose a novel two-stage multimodal framework named Baguan-solar, which combines forecasts from Baguan, a global weather foundation model, with high-resolution geostationary satellite imagery. This framework produces 24-hour irradiance forecasts at a kilometer scale. The first stage forecasts continuous intermediates like cloud cover, while the second stage infers irradiance, effectively merging fine-scale cloud structures from satellite data with large-scale environmental constraints from Baguan. Evaluated over East Asia, Baguan-solar demonstrates significant improvements over existing methods, reducing RMSE by 16.08% and better capturing cloud-induced transients. The operational deployment of Baguan-solar has been in place since July 2025, supporting solar power forecasting in an eastern province of China.
Methodology
The methodology involves a two-stage process where the first stage forecasts day-night continuous intermediates such as cloud cover, and the second stage infers solar irradiance. The framework integrates data from the Baguan weather foundation model and high-resolution satellite imagery, utilizing a decoupled design that preserves both fine-scale cloud structures and large-scale atmospheric constraints.
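The decoupled two-stage shape can be caricatured in a few lines (both functions and all constants are invented placeholders; in the paper each stage is a learned model over gridded fields, not a scalar formula):

```python
# Stage 1: predict a day-night continuous intermediate (cloud cover) by
# blending coarse foundation-model guidance with fine-scale satellite input.
def stage1_cloud_cover(nwp_feature: float, satellite_feature: float) -> float:
    return 0.5 * nwp_feature + 0.5 * satellite_feature

# Stage 2: infer irradiance from the intermediate; clouds attenuate the
# clear-sky value.
def stage2_irradiance(clear_sky: float, cloud_cover: float) -> float:
    return clear_sky * (1.0 - 0.75 * cloud_cover)

cloud = stage1_cloud_cover(nwp_feature=0.4, satellite_feature=0.6)
irr = stage2_irradiance(clear_sky=800.0, cloud_cover=cloud)
```

The design point is the decoupling: stage one carries both the satellite's fine-scale structure and the model's large-scale constraints into a single intermediate, and stage two never sees the raw inputs.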
Results
Baguan-solar outperformed several strong baseline models, including ECMWF IFS and SolarSeer, achieving a 16.08% reduction in RMSE. The framework also demonstrated improved resolution of cloud-induced transients, which are critical for accurate solar irradiance forecasting.
Implications
The successful implementation of Baguan-solar has significant implications for the integration of solar energy into power grids, enhancing the reliability and efficiency of solar power forecasting. This framework could be adapted for other meteorological forecasting applications, potentially improving energy management and grid stability.
Grokking as a Variance-Limited Phase Transition: Spectral Gating and the Epsilon-Stability Threshold
Optimization
Theory
- Introduction of the Spectral Gating mechanism that regulates the transition from memorization to generalization.
- Identification of a stability condition that constrains grokking, requiring accumulated gradient variance to access the generalizing solution.
- Categorization of three complexity regimes that describe different learning dynamics.
- Refutation of the 'Flat Minima' hypothesis, emphasizing the role of anisotropic noise in achieving generalization.
Summary
This paper addresses the phenomenon of grokking, where neural networks exhibit generalization long after training convergence, challenging traditional optimization theories. The authors analyze the dynamics of the AdamW optimizer on modular arithmetic tasks, introducing a 'Spectral Gating' mechanism that governs the transition from memorization to generalization. They identify that grokking is constrained by a stability condition, where the generalizing solution exists in a sharp basin that is initially inaccessible under low-variance conditions. The study categorizes three complexity regimes: Capacity Collapse, Variance-Limited Regime, and Stability Override, each reflecting different stages of learning and generalization. The authors also refute the 'Flat Minima' hypothesis, demonstrating that anisotropic noise rectification by adaptive optimizers is essential for achieving grokking, as opposed to isotropic noise injection. Overall, the paper provides a unified framework that connects geometric and stability perspectives in understanding grokking dynamics.
Methodology
The authors model the training dynamics of the AdamW optimizer as a continuous stochastic process, treating its updates as a Stochastic Differential Equation (SDE). They analyze the interaction between noise and geometry in the loss landscape, focusing on how the optimizer's internal noise structure influences the learning process.
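A toy caricature of the variance-gating argument (not the paper's SDE analysis; the gradient stream, curvature value, and stability criterion are all illustrative): Adam-style accumulation of the second moment shrinks the effective per-coordinate step, and a sharp basin only becomes stably reachable once that step falls below the curvature-set limit.

```python
import numpy as np

rng = np.random.default_rng(3)
beta2, lr = 0.999, 1e-3
curvature = 1000.0                       # sharpness of the generalizing basin
v, stable = 0.0, []

for t in range(3000):
    g = rng.normal()                     # stand-in stochastic gradient
    v = beta2 * v + (1 - beta2) * g * g  # accumulated gradient variance
    step = lr / (np.sqrt(v) + 1e-12)     # Adam-style effective step size
    # Gradient descent in a quadratic basin of this curvature is only stable
    # for steps below 2 / curvature:
    stable.append(step < 2.0 / curvature)
```

Early on, `v` is small, the effective step is large, and the sharp basin is inaccessible; as variance accumulates the "gate" opens, which is the delayed transition the paper formalizes.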
Results
The study reveals that grokking occurs when the accumulated gradient variance allows the optimizer to enter a high-curvature geometry of the solution space. The authors find that below a certain complexity threshold, structural learning is hindered, and generalization is delayed until the spectral gate opens. Their experiments demonstrate that anisotropic noise rectification is critical for achieving generalization, contrasting with isotropic noise methods.
Implications
The findings have significant implications for the design of optimizers in machine learning, suggesting that adaptive optimizers with anisotropic noise structures may be more effective in promoting generalization in complex tasks. This understanding could enhance the training of large foundation models and improve their reasoning capabilities post-training convergence.
Decoding the Critique Mechanism in Large Reasoning Models
Large Language Models
Interpretability
Theory
- LRMs exhibit a hidden critique ability that allows for self-correction despite errors in intermediate reasoning.
- Injecting arithmetic mistakes reveals that LRMs can still produce correct final answers, indicating internal error detection mechanisms.
- A critique vector is identified that captures the model's ability to detect and correct errors without explicit verbalization.
- Steering the critique vector improves the model's performance on error detection tasks without requiring extra training.
Summary
This paper investigates the critique mechanism in Large Reasoning Models (LRMs), focusing on their ability to detect and correct errors during reasoning processes. The authors hypothesize that LRMs possess a hidden critique ability that allows them to recover from mistakes made in intermediate reasoning steps. By deliberately injecting arithmetic errors into the reasoning process, the study reveals that LRMs can still arrive at the correct final answers despite incorrect intermediate conclusions. This phenomenon indicates an internal mechanism for error detection and self-correction, which the authors refer to as the hidden critique ability. Through feature space analysis, the researchers identify a critique vector that represents this behavior, demonstrating that steering latent representations with this vector enhances error detection capabilities without additional training costs. The findings provide insights into improving the self-verification mechanisms of LRMs, suggesting new avenues for enhancing their reasoning performance.
Methodology
The authors injected arithmetic errors into the reasoning processes of LRMs and analyzed the resulting outputs to assess the models' recovery capabilities. They conducted feature space analysis to identify a critique vector representing the hidden critique ability and evaluated the impact of steering this vector on the models' performance in error detection tasks.
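A common way to extract and apply such a direction is difference-of-means activation steering; this sketch uses that recipe with synthetic hidden states (the Gaussian data, dimension, and the scale `alpha` are all assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 16

# Estimate the "critique vector" as the mean difference between hidden states
# from runs where the model caught an injected error and runs where it missed it.
caught = rng.normal(0.5, 1.0, size=(100, d))       # states during detected errors
missed = rng.normal(0.0, 1.0, size=(100, d))       # states during missed errors
critique_vec = caught.mean(axis=0) - missed.mean(axis=0)

def steer(hidden, alpha=2.0):
    """Shift a hidden state along the critique direction; alpha sets strength."""
    return hidden + alpha * critique_vec

h = rng.normal(0.0, 1.0, size=d)
h_steered = steer(h)
proj_before = h @ critique_vec
proj_after = h_steered @ critique_vec   # projection onto the direction grows
```

No extra training is involved: the vector is computed once from contrasting runs and added at inference time, which matches the training-free flavor of the intervention described above.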
Results
The experiments demonstrated that LRMs often recovered from injected errors and produced correct final answers. The critique vector was shown to effectively enhance the models' error detection capabilities, leading to improved performance on various tasks without additional training costs.
Implications
The findings suggest that enhancing the critique mechanism in LRMs could lead to more reliable reasoning processes, reducing the occurrence of errors in complex tasks. This could have significant implications for applications requiring high accuracy in logical reasoning, such as automated decision-making systems and advanced AI applications.
A federated learning framework with knowledge graph and temporal transformer for early sepsis prediction in multi-center ICUs
Federated Learning
Graph Learning
Time Series
- Integration of federated learning with knowledge graphs and temporal transformers for sepsis prediction.
- Preservation of patient privacy through decentralized model training without sharing raw data.
- Significant performance improvements over traditional centralized and federated learning models.
- Utilization of a knowledge graph to enhance model interpretability and capture complex clinical relationships.
Summary
This paper presents a novel framework for early sepsis prediction in intensive care units (ICUs) by integrating federated learning (FL) with a medical knowledge graph and a temporal transformer model, enhanced by meta-learning capabilities. The authors address the challenges of data fragmentation across healthcare institutions and the complex temporal nature of medical data while adhering to privacy constraints. The proposed framework allows for collaborative model training across multiple hospitals without sharing raw patient data, thus preserving privacy. It utilizes a knowledge graph to incorporate structured medical relationships and employs a temporal transformer to capture long-range dependencies in clinical time-series data. Additionally, a model-agnostic meta-learning (MAML) strategy is incorporated to facilitate rapid adaptation of the global model to local data distributions. The framework was evaluated on the MIMIC-IV and eICU datasets, achieving an area under the curve (AUC) of 0.956, which represents significant improvements over conventional centralized models and standard federated learning approaches. This work provides a reliable and privacy-preserving solution for multi-center collaborative early warning of sepsis.
Methodology
The methodology involves a federated learning framework that combines a medical knowledge graph for structured relationships, a temporal transformer for capturing long-range dependencies in time-series data, and a model-agnostic meta-learning strategy for adapting the global model to local data. The knowledge graph is constructed from established medical ontologies, and the temporal transformer employs self-attention mechanisms to handle irregular sampling rates and missing values in clinical data.
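The federated meta-learning loop can be sketched on toy data: each site adapts the global model locally and sends back only parameter deltas, and the server applies a first-order (Reptile-style) outer update. The linear model, learning rates, and synthetic per-site data are illustrative stand-ins for the clinical features described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for per-hospital data: each "site" fits a slightly different
# linear task (real inputs would be knowledge-graph and time-series features).
def make_site(shift):
    X = rng.normal(size=(100, 5))
    w_true = np.ones(5) + shift
    y = X @ w_true + 0.1 * rng.normal(size=100)
    return X, y

sites = [make_site(s) for s in (-0.5, 0.0, 0.5)]

def grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

def adapt(w, X, y, lr=0.2, steps=5):
    # MAML inner loop: a few local gradient steps on a site's own data.
    for _ in range(steps):
        w = w - lr * grad(w, X, y)
    return w

# Federated meta-training: only parameter deltas leave a site, never raw
# patient data; the server applies a first-order meta-update.
w_global = np.zeros(5)
for _ in range(200):
    deltas = [adapt(w_global, X, y) - w_global for X, y in sites]
    w_global = w_global + 0.05 * np.mean(deltas, axis=0)

# After brief local adaptation, the personalized model fits its site far
# better than the unadapted global model does.
X0, y0 = sites[0]
w_pers = adapt(w_global, X0, y0)
mse_pers = np.mean((X0 @ w_pers - y0) ** 2)
mse_glob = np.mean((X0 @ w_global - y0) ** 2)
assert mse_pers < 0.5
assert mse_pers < mse_glob
```

The privacy property comes from the communication pattern: raw `(X, y)` never leaves a site, only the adapted-minus-global delta does.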
Results
The proposed framework achieved an AUC of 0.956 on the MIMIC-IV and eICU datasets, indicating a 22.4% improvement over conventional centralized models and a 12.7% improvement over standard federated learning approaches, demonstrating its strong predictive capability for early sepsis detection.
Implications
The findings suggest that this framework can significantly enhance early sepsis prediction in multi-center ICUs while maintaining patient privacy. It opens avenues for collaborative healthcare analytics without compromising sensitive data, potentially leading to improved patient outcomes and survival rates in critical care settings.
PiGRAND: Physics-informed Graph Neural Diffusion for Intelligent Additive Manufacturing
Graph Learning
- Introduction of PiGRAND, a physics-informed graph neural diffusion framework for heat transport modeling.
- Development of an efficient graph construction method for transforming thermal images into graph data.
- Integration of physical principles and sub-learning models to enhance prediction accuracy.
- Significant improvements in computational performance and accuracy over traditional methods.
Summary
The paper introduces PiGRAND, a novel framework that integrates physics-informed graph neural networks to model heat transport in additive manufacturing processes, particularly 3D printing. The authors highlight the difficulty of accurately modeling heat transfer, whose dynamic and non-linear behavior is both critical to production quality and hard to capture. Traditional numerical methods like finite element and finite volume methods, while accurate, are computationally intensive. PiGRAND addresses these limitations by proposing an efficient graph construction method that transforms thermal images into graph-structured data, facilitating better representation of spatial relationships. The framework incorporates explicit Euler and implicit Crank-Nicolson methods for continuous heat transport modeling and introduces sub-learning models to enhance diffusion accuracy across nodes. Additionally, the use of transfer learning is emphasized to improve computational efficiency and reduce retraining costs. The evaluation of PiGRAND on thermal images from 3D printing demonstrates significant improvements in prediction accuracy and computational performance compared to existing methods like GRAND and physics-informed neural networks (PINNs). The integration of physical principles into the learning model is shown to be a key factor in these enhancements.
Methodology
The methodology involves creating a graph structure from thermal images, utilizing explicit Euler and implicit Crank-Nicolson methods for modeling heat transport, and incorporating sub-learning models for connectivity and energy dissipation. The framework also employs transfer learning to leverage pre-trained models, enhancing computational efficiency.
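The core pipeline, image to graph to explicit-Euler heat diffusion, can be sketched minimally: build a 4-neighbor pixel graph from a toy thermal frame and integrate graph heat diffusion with its Laplacian. PiGRAND's actual graph construction and learned diffusion are far richer; this only illustrates the underlying dynamics.

```python
import numpy as np

# Toy "thermal image": a hot spot on a small grid, a stand-in for one frame
# of a 3D-printing melt pool.
n = 8
T = np.zeros((n, n))
T[3:5, 3:5] = 100.0
u = T.flatten()

# 4-neighbor grid graph over pixels and its combinatorial Laplacian L = D - A.
N = n * n
A = np.zeros((N, N))
for i in range(n):
    for j in range(n):
        for di, dj in ((1, 0), (0, 1)):
            ii, jj = i + di, j + dj
            if ii < n and jj < n:
                a, b = i * n + j, ii * n + jj
                A[a, b] = A[b, a] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Explicit Euler integration of graph heat diffusion du/dt = -alpha * L u,
# the continuous-dynamics view that graph neural diffusion builds on.
alpha, dt = 0.2, 0.1
for _ in range(50):
    u = u - dt * alpha * (L @ u)

assert np.isclose(u.sum(), 400.0)   # diffusion conserves total heat
assert u.max() < 100.0              # and flattens the peak
```

The implicit Crank-Nicolson variant mentioned above would instead solve a linear system per step, trading cost for unconditional stability.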
Results
The evaluation of PiGRAND on thermal images from 3D printing processes shows substantial improvements in prediction accuracy and computational performance compared to traditional GRAND and PINNs, attributed to the effective integration of physical principles and innovative graph learning techniques.
Implications
The advancements presented in PiGRAND have significant implications for optimizing additive manufacturing processes, potentially leading to improved production quality and efficiency in various engineering applications. The framework's open-source nature encourages further exploration and adaptation in related fields.
SemRep: Generative Code Representation Learning with Code Transformations
Large Language Models
Generative Models
Optimization
- SEMREP improves code transformation by learning generative code representations.
- The framework utilizes semantics-preserving transformations as intermediate representations.
- SEMREP outperforms existing methods in correctness, performance, generalization, and robustness.
- The approach is particularly effective for evolutionary search in code optimization.
Summary
The paper introduces SEMREP, a novel framework aimed at enhancing code transformation processes through generative code representation learning. Traditional approaches to code transformation often treat it as an end-to-end learning task, which can lead to suboptimal representations and reliance on rigid compiler abstractions. SEMREP addresses this by utilizing semantics-preserving transformations as intermediate representations, which serve both as a generative mid-training task and guidance for specific code transformations. The framework allows for the generation of semantically equivalent code snippets, which can then be transformed according to specific instructions. This dual approach not only improves transformation quality but also enables broader exploration of the space of possible transformations. The authors demonstrate that SEMREP outperforms existing fine-tuned baselines in various metrics, including correctness, performance, generalization, and robustness, while also being more efficient in terms of computational resources. The framework's ability to effectively explore code transformations makes it particularly suitable for evolutionary search applications, leading to optimizations that larger-weight baselines fail to achieve.
Methodology
SEMREP employs a two-step process where it first generates semantically equivalent code snippets and then applies transformations based on user instructions. This is achieved through a mid-training task that focuses on generative code representation learning, allowing the model to explicitly learn and reason about code semantics.
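The first step depends on semantics-preserving transformations. A minimal example of such a transformation is alpha-renaming via Python's `ast` module: it changes the surface form while provably preserving behavior. The snippet and names below are illustrative, not part of SEMREP's transformation catalogue.

```python
import ast

# A simple semantics-preserving transformation: rename local variables.
class Rename(ast.NodeTransformer):
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

src = """
def total(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc
"""

tree = ast.parse(src)
new_tree = ast.fix_missing_locations(
    Rename({"acc": "running_sum", "x": "item"}).visit(tree))
variant = ast.unparse(new_tree)  # requires Python 3.9+

# Both versions compute the same function: a semantically equivalent pair of
# the kind that could serve as a generative training instance.
ns_a, ns_b = {}, {}
exec(src, ns_a)
exec(variant, ns_b)
assert ns_a["total"]([1, 2, 3]) == ns_b["total"]([1, 2, 3]) == 6
assert "running_sum" in variant
```

Generating many such equivalent variants, then transforming them per instruction, is the two-step shape the methodology describes.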
Results
SEMREP demonstrates a 6.9% improvement in correctness, 1.1× better performance, 13.9% enhancement in generalization, and 6.7% increase in robustness compared to extensively fine-tuned baselines, all while using the same training budget. Additionally, it achieves the same performance as larger-weight models with 25% less inference compute.
Implications
The findings suggest that SEMREP could significantly enhance automated code transformation tools, making them more effective in software development and maintenance. Its ability to discover optimizations not found by larger models indicates potential for broader applications in software engineering, particularly in optimizing GPU kernels and other performance-critical code.
PLUME: Building a Network-Native Foundation Model for Wireless Traces via Protocol-Aware Tokenization
NLP
Large Language Models
Efficient ML
- Plume is a 140M-parameter foundation model tailored for 802.11 wireless traces, emphasizing protocol-aware tokenization.
- The model achieves higher accuracy and efficiency compared to larger generalist LLMs, with >600× fewer parameters.
- A novel protocol- and timing-aware tokenizer enhances information density and reduces sequence length significantly.
- Plume supports on-premises deployment, ensuring privacy and facilitating root cause analysis without reliance on external APIs.
Summary
The paper introduces Plume, a compact 140M-parameter foundation model specifically designed for analyzing 802.11 wireless packet traces. Unlike traditional models that treat packet data as flat strings, Plume utilizes a protocol-aware tokenizer that respects the structured nature of wireless data, including layered headers and timing gaps. This tokenizer generates significantly shorter sequences with higher information density compared to standard methods like Byte Pair Encoding (BPE). Plume is trained on a curated dataset and demonstrates impressive performance, achieving 74-97% next-packet token accuracy across various failure categories and an area under the receiver operating characteristic curve (AUROC) of ≥0.99 for zero-shot anomaly detection. The model's compact size allows it to run on-premises, preserving privacy and enabling efficient root cause analysis without the need for cloud-based APIs. The authors emphasize the importance of structured data representation and quality in training, proposing a proactive data capture strategy that enhances the model's effectiveness. Overall, Plume represents a significant advancement in the field of network analysis, providing a specialized tool that aligns closely with the inherent structure of wireless communication protocols.
Methodology
Plume employs a protocol-aware tokenizer that splits packet data according to the structured hierarchy of protocol fields, incorporates gap tokens for timing, and normalizes identifiers. The model is trained on a curated dataset derived from structured PDML dissections, utilizing advanced sampling techniques to ensure high-quality training data. The architecture is designed to operate efficiently on a single GPU, enabling real-time analysis of wireless traces.
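A protocol- and timing-aware tokenizer can be sketched as follows: one token per protocol field following the header hierarchy, a quantized gap token for inter-packet timing, and normalized role tokens for identifiers. The field names, gap buckets, and vocabulary below are illustrative assumptions, not Plume's actual scheme.

```python
import math

def gap_token(delta_us):
    """Quantize inter-packet gaps into logarithmic buckets, one token each."""
    bucket = min(int(math.log2(delta_us + 1)), 15)
    return f"<GAP_{bucket}>"

def tokenize(packet, prev_time_us):
    toks = [gap_token(packet["time_us"] - prev_time_us)]
    # One token per protocol field, following the layered header structure
    # instead of treating the trace as a flat byte string.
    for field in ("type", "subtype", "retry"):
        toks.append(f"<{field}={packet[field]}>")
    # Normalize identifiers (e.g., MAC addresses) to role tokens.
    toks.append(f"<addr={packet['addr_role']}>")
    return toks

trace = [
    {"time_us": 0, "type": "mgmt", "subtype": "beacon", "retry": 0, "addr_role": "AP"},
    {"time_us": 1024, "type": "data", "subtype": "qos-data", "retry": 1, "addr_role": "STA1"},
]
tokens, prev = [], 0
for pkt in trace:
    tokens += tokenize(pkt, prev)
    prev = pkt["time_us"]

# Each packet becomes a short, fixed set of high-density tokens, which is the
# source of the sequence-length savings over byte-level BPE.
assert tokens[0] == "<GAP_0>"
assert "<retry=1>" in tokens
assert len(tokens) == 10
```

Because every token carries a whole field or timing bucket, sequences stay short and information-dense compared with character- or byte-level tokenization of the same trace.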
Results
Plume achieves 74-97% accuracy in next-packet token prediction across five real-world failure categories and demonstrates an AUROC of ≥0.99 for zero-shot anomaly detection. The model's compact size allows it to process approximately 200 packets per second on a single NVIDIA A10G GPU, significantly reducing operational costs compared to cloud-based solutions.
Implications
Plume's design allows for effective on-premises analysis of wireless network data, making it suitable for environments with strict data privacy requirements. Its ability to accurately predict packet behavior and detect anomalies can enhance network management and troubleshooting processes, leading to improved reliability and performance in wireless communications.
Joint Routing and Model Pruning for Decentralized Federated Learning in Bandwidth-Constrained Multi-Hop Wireless Networks
Federated Learning
Optimization
Efficient ML
- Introduction of a joint routing and model pruning framework for D-FL.
- Establishment of convergence guarantees under communication constraints.
- Development of a routing algorithm that improves multi-hop transmission efficiency.
- Significant reductions in transmission latency and improvements in model accuracy demonstrated through simulations.
Summary
This paper addresses the challenges of decentralized federated learning (D-FL) in bandwidth-constrained multi-hop wireless networks by proposing a novel joint routing and model pruning framework. The authors analyze the impact of model biases on D-FL convergence and formulate an optimization problem that maximizes model retention rates while adhering to communication constraints. The framework allows clients to optimize their routing paths and pruning rates based on latency and bandwidth limitations. The study reveals that the model retention rate is path-dependent, leading to the development of a routing algorithm that enhances communication efficiency. Through simulations, the proposed framework demonstrates a significant reduction in average transmission latency by 27.8% and an improvement in testing accuracy by approximately 12%, compared to unpruned systems. Additionally, when compared to standard benchmark routing algorithms, the proposed method further enhances accuracy by about 8%. This work is the first to simultaneously consider multi-hop routing and model pruning in D-FL, providing a comprehensive approach to improve convergence performance in resource-limited environments.
Methodology
The authors formulated an optimization problem to maximize model retention rates while minimizing communication latency. They analyzed the dependency of model retention on routing paths and developed a routing algorithm that incorporates node priorities and client-aware link weights to enhance transmission efficiency.
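The path-dependence of the retention rate can be made concrete with a small sketch: the bottleneck link on a multi-hop path bounds how much of an update can be transmitted, and magnitude pruning realizes that budget. The per-hop budgets are hypothetical, and magnitude pruning is one simple instance of the pruning step, not necessarily the paper's exact rule.

```python
import numpy as np

rng = np.random.default_rng(2)

def prune_update(update, retention):
    """Keep only the largest-magnitude fraction of a model update."""
    flat = np.abs(update).flatten()
    k = max(1, int(retention * flat.size))
    thresh = np.sort(flat)[-k]
    mask = np.abs(update) >= thresh
    return update * mask, mask.mean()

# Hypothetical per-hop capacity along a multi-hop path: the bottleneck link
# dictates how much of the update survives end-to-end, which is why the
# retention rate is path-dependent.
hop_budgets = [0.9, 0.5, 0.7]   # fraction of parameters each hop can carry
retention = min(hop_budgets)

update = rng.normal(size=(64, 64))
pruned, kept = prune_update(update, retention)

assert abs(kept - retention) < 0.05
# High-magnitude entries survive, so most of the update's energy is retained.
assert np.sum(pruned**2) / np.sum(update**2) > retention
```

Choosing the route therefore trades latency against retention: a longer path with wider links can carry a less-pruned, less-biased update.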
Results
The proposed framework reduced average transmission latency by 27.8% and improved testing accuracy by about 12% compared to unpruned systems. Additionally, it achieved an 8% accuracy improvement over standard benchmark routing algorithms.
Implications
The findings suggest that optimizing routing and pruning strategies can significantly enhance the performance of decentralized federated learning systems, particularly in environments with limited communication resources. This approach can be applied to various applications requiring efficient model training and aggregation in heterogeneous networks.
Training-Free Generation of Protein Sequences from Small Family Alignments via Stochastic Attention
Generative Models
- Introduces stochastic attention (SA) for training-free protein sequence generation.
- SA effectively samples from a Boltzmann distribution derived from stored protein sequences.
- Generates sequences with low KL divergence and high structural plausibility.
- Outperforms traditional generative models in maintaining sequence identity to canonical family folds.
Summary
This paper addresses the challenge of generating novel protein sequences from small family alignments, a task that typically requires extensive training on large datasets. The authors introduce a novel method called stochastic attention (SA), which operates without the need for training or external data. SA leverages the modern Hopfield energy model to treat a set of stored protein sequences as a Boltzmann distribution, allowing for the generation of protein sequences through Langevin dynamics. The method is computationally efficient, requiring only O(dK) operations per iteration, where d is the sequence representation dimension and K is the number of stored sequences. The authors validate SA across eight Pfam families, demonstrating that it produces sequences with low amino acid Kullback–Leibler divergence, substantial novelty, and structural plausibility as confirmed by ESMFold and AlphaFold2. The generated sequences maintain a higher identity to canonical family folds compared to those produced by traditional models, achieving significant improvements in template modeling scores. Additionally, SA operates quickly on standard hardware, making it accessible for small protein families that are often overlooked by conventional deep learning approaches. The method's ability to capture covariance structures and generalize well across families is further supported by cross-validation and independent validation against existing datasets.
Methodology
The methodology involves using the modern Hopfield energy model to define an energy landscape for a set of stored protein sequences. The score function is derived from a single softmax attention operation, allowing for sampling via Langevin dynamics without the need for a trained neural network. This approach enables the generation of protein sequences directly from small family alignments.
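The sampling loop described above is short enough to sketch directly: the score of the modern Hopfield energy is a single softmax attention over the stored patterns, and unadjusted Langevin dynamics follows that score plus noise. Toy continuous vectors stand in for real encoded sequences, and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stored patterns: rows stand in for embedded sequences from a small family
# alignment (toy continuous vectors rather than real amino-acid encodings).
K, d = 32, 16
X = rng.normal(size=(K, d))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def score(xi, beta=1.0):
    """Score of the modern Hopfield energy E(xi) = 0.5*||xi||^2 - lse(beta, X xi):
    a single softmax attention over the K stored patterns, O(dK) per call."""
    attn = softmax(beta * (X @ xi))
    return X.T @ attn - xi

# Unadjusted Langevin dynamics on the energy landscape: drift along the score
# plus Gaussian noise, with no trained network and no external data.
eps = 0.01
xi = rng.normal(size=d)
for _ in range(2000):
    xi = xi + eps * score(xi) + np.sqrt(2 * eps) * rng.normal(size=d)

assert np.isfinite(xi).all()
# The sampler stays in the region spanned by the stored patterns.
assert np.linalg.norm(xi) < 3 * np.linalg.norm(X, axis=1).mean()
```

The O(dK) cost per iteration quoted in the summary is visible here: one matrix-vector product with the stored patterns per score evaluation.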
Results
The results indicate that SA generates protein sequences with low amino acid KL divergence (KL ≤ 0.06) and substantial novelty (0.40–0.65). The structural plausibility of the generated sequences is confirmed by ESMFold and AlphaFold2, with mean predicted local distance difference test (pLDDT) scores reaching up to 95. In six out of eight tested families, generated sequences achieved significantly higher template modeling scores compared to natural family members (p < 0.05).
Implications
The implications of this work are significant for the field of protein engineering and computational biology. By enabling the generation of protein sequences from small datasets, this method can facilitate the exploration of orphan protein families and enhance the design of proteins with desired properties, particularly in therapeutic and industrial applications.
A Stability-Aware Frozen Euler Autoencoder for Physics-Informed Tracking in Continuum Mechanics (SAFE-PIT-CM)
Computer Vision
Theory
Interpretability
- Introduces SAFE-PIT-CM, an autoencoder that integrates a frozen PDE operator for stability-aware latent-space transitions.
- The SAFE operator enables accurate parameter recovery by addressing numerical stability issues in temporal data.
- Supports zero-shot inference, allowing learning from a single simulation without ground-truth labels.
- Demonstrates effectiveness on the heat equation and reverse heat equation, recovering transport coefficients accurately.
Summary
The paper introduces SAFE-PIT-CM, a novel autoencoder architecture designed for recovering material parameters and temporal field evolution from videos of physical processes in continuum mechanics. Unlike traditional autoencoders, SAFE-PIT-CM employs a frozen PDE operator, termed the SAFE operator, to govern the latent-space transitions, ensuring numerical stability and accurate parameter recovery. The architecture consists of a convolutional encoder that maps video frames to a latent field, followed by the SAFE operator that propagates this latent field forward in time using sub-stepped finite differences. The decoder then reconstructs the video from the evolved latent field. A key innovation is the ability to perform zero-shot inference, allowing the model to learn transport coefficients from a single simulation without requiring ground-truth labels, leveraging only the physics residual as supervision. The paper demonstrates the effectiveness of SAFE-PIT-CM on both the heat equation and the reverse heat equation, showcasing its capability to recover transport coefficients across different regimes. The method is inherently explainable, as the learned parameters are directly linked to physical quantities, making it suitable for inverse problems in continuum mechanics.
Methodology
The methodology involves a convolutional encoder that maps video frames to a latent field, which is then evolved over time using the SAFE operator that implements sub-stepped finite differences to ensure numerical stability. The model is trained using the physics residual as supervision, allowing for parameter recovery without the need for labeled data.
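The sub-stepping idea behind the SAFE operator can be sketched with an explicit-Euler heat step: a large frame-to-frame dt is split so every internal step respects the 2-D stability limit h <= dx^2 / (4 * alpha). Grid size, dt, and the 0.9 safety factor are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def safe_step(u, alpha, dt, dx=1.0):
    """Frozen heat operator that sub-steps dt to stay within the CFL-type bound."""
    dt_max = 0.9 * dx**2 / (4.0 * alpha)    # stability bound with safety margin
    n_sub = max(1, int(np.ceil(dt / dt_max)))
    h = dt / n_sub
    for _ in range(n_sub):
        lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
               np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u) / dx**2
        u = u + h * alpha * lap             # frozen, differentiable physics step
    return u

u0 = np.zeros((16, 16))
u0[8, 8] = 1.0

# A single naive Euler step at dt=2.0 would violate the stability limit and
# blow up; sub-stepping keeps the propagation bounded, which is what keeps
# physics-residual training of the latent field usable.
u1 = safe_step(u0, alpha=1.0, dt=2.0)

assert np.isfinite(u1).all()
assert np.isclose(u1.sum(), 1.0)   # periodic Laplacian conserves mass
assert u1.max() < 1.0              # the hot spot spreads rather than diverging
```

Because `alpha` enters the step as a plain multiplicative parameter, gradients with respect to it flow through the operator, which is how transport coefficients can be recovered from the physics residual alone.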
Results
SAFE-PIT-CM successfully recovers transport coefficients in both diffusion and mobility regimes, achieving accuracy comparable to pre-trained models even in zero-shot scenarios. The architecture outperforms standard black-box video segmentation approaches, providing precise temporal and spatial segmentation of evolving fields.
Implications
The implications of this work extend to various applications in scientific machine learning, particularly in inverse problems within continuum mechanics. The explainable nature of the model enhances its utility in fields requiring interpretability and physical insight, such as materials science and engineering.
Adaptive regularization parameter selection for high-dimensional inverse problems: A Bayesian approach with Tucker low-rank constraints
Efficient ML
Theory
Computer Vision
- Introduces a variational Bayesian method with Tucker decomposition for high-dimensional inverse problems.
- Employs adaptive regularization through per-mode precision parameters for anisotropic structures.
- Estimates noise levels from data, eliminating reliance on prior noise knowledge.
- Demonstrates consistent performance improvements over traditional methods in various experimental settings.
Summary
This paper presents a novel variational Bayesian method that employs Tucker decomposition to address high-dimensional inverse problems efficiently. The proposed method reduces computational complexity by transforming variational inference from a high-dimensional space to a lower-dimensional core tensor space, leveraging the low-rank structure of the data. A significant innovation is the introduction of per-mode precision parameters, which facilitate adaptive regularization tailored to anisotropic structures. For example, in directional image deblurring, the learned parameters align with physical anisotropy, applying stronger regularization in critical directions. The method also estimates noise levels directly from the data, removing the dependency on prior knowledge of noise parameters, which is a limitation of traditional methods like the discrepancy principle. Experimental results demonstrate that the proposed approach consistently outperforms existing methods across various tasks, including 2D deblurring, 3D heat conduction, and Fredholm integral equations, showing improvements in quantitative metrics such as PSNR and SSIM. The method scales effectively to problems with up to 110,000 variables, achieving notable performance gains in deblurring and heat conduction tasks. However, the approach is sensitive to rank selection in Tucker decomposition and requires further theoretical analysis. Future work will focus on automated rank selection and establishing theoretical guarantees, bridging Bayesian theory with scalable computation for practical applications in imaging, remote sensing, and scientific computing.
Methodology
The methodology involves variational Bayesian inference combined with Tucker decomposition to reduce the dimensionality of the problem. The approach incorporates per-mode precision parameters for adaptive regularization, allowing for tailored solutions based on the anisotropic nature of the data. Noise levels are estimated directly from the data, enhancing the robustness of the method.
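The dimensionality-reduction step rests on the Tucker decomposition. A minimal sketch is truncated higher-order SVD (HOSVD), one standard way to compute a Tucker factorization; the paper embeds this structure in a variational Bayesian scheme with per-mode precisions, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

def tucker_hosvd(T, ranks):
    """Truncated HOSVD: per-mode factors from unfolding SVDs, then a small core."""
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = T
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)
    return core, factors

def tucker_rebuild(core, factors):
    T = core
    for mode, U in enumerate(factors):
        T = mode_product(T, U, mode)
    return T

# A genuinely low-multilinear-rank 3-way tensor: HOSVD at the true ranks
# recovers it exactly, which is why inference in the small core space can be
# lossless when the low-rank assumption holds.
A, B, C = rng.normal(size=(20, 3)), rng.normal(size=(25, 3)), rng.normal(size=(30, 3))
G = rng.normal(size=(3, 3, 3))
T = tucker_rebuild(G, [A, B, C])

core, factors = tucker_hosvd(T, ranks=(3, 3, 3))
err = np.linalg.norm(tucker_rebuild(core, factors) - T) / np.linalg.norm(T)
assert err < 1e-8
assert core.shape == (3, 3, 3)
```

The size gap is the point: variational inference runs over the 3x3x3 core rather than the 20x25x30 ambient tensor, and the per-mode precision parameters attach one regularization strength to each factor direction.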
Results
The proposed method shows significant improvements in quantitative metrics such as PSNR and SSIM across multiple experimental setups, outperforming traditional methods by up to 2.09 dB in deblurring tasks and 6.75 dB in 3D heat conduction. The approach effectively handles problems with up to 110,000 variables.
Implications
This work has potential implications for various fields, including medical imaging, remote sensing, and scientific computing, where high-dimensional inverse problems are prevalent. The integration of Bayesian methods with low-rank constraints offers a practical solution for large-scale data analysis and enhances the reliability of inverse problem solutions.
Dynamic Meta-Layer Aggregation for Byzantine-Robust Federated Learning
Federated Learning
- FedAOT introduces a meta-learning-based aggregation strategy that dynamically assigns client weights to enhance robustness against Byzantine attacks.
- The adaptive optimization mechanism reduces the impact of unreliable updates without relying on prior attack assumptions.
- FedAOT provides a unified defense against untargeted poisoning and label-flipping attacks in non-IID and heterogeneous data distributions.
- Empirical evaluations show that FedAOT outperforms existing Byzantine-robust approaches in terms of resilience and convergence.
Summary
This paper addresses the vulnerabilities of Federated Learning (FL) systems to Byzantine adversaries, which can compromise model performance through malicious updates. The authors propose a novel defense mechanism called FedAOT (Federated Adaptive Optimal Tuning), which utilizes a meta-learning-inspired adaptive aggregation framework. Unlike existing methods that rely on static thresholds or specific attack assumptions, FedAOT dynamically weights client updates based on their reliability, effectively countering multi-label flipping and untargeted poisoning attacks. The framework is designed to generalize across diverse datasets and various attack types, maintaining robust performance even in previously unseen scenarios. Experimental results demonstrate that FedAOT significantly enhances model accuracy and resilience while ensuring computational efficiency, making it a scalable solution for secure federated learning.
Methodology
The authors developed FedAOT, which employs a meta-learning framework to adaptively aggregate client updates based on their reliability. This method involves dynamically adjusting weights for each client's contribution to the global model, allowing for effective suppression of adversarial influences without predefined thresholds or assumptions about attack types.
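Reliability-weighted aggregation can be illustrated with a fixed rule: score each client update by its distance to the coordinate-wise median and turn the scores into soft weights. FedAOT learns its weighting adaptively via meta-optimization, so this hand-set rule is only an illustrative stand-in, with invented client data.

```python
import numpy as np

rng = np.random.default_rng(5)

def adaptive_aggregate(updates, temperature=1.0):
    """Soft-weight client updates by distance to a robust reference point."""
    U = np.stack(updates)
    ref = np.median(U, axis=0)                  # robust reference update
    dists = np.linalg.norm(U - ref, axis=1)
    logits = -dists / (temperature * dists.mean())
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ U, w

# 8 honest clients near the true update, 2 Byzantine clients sending poison.
true_update = np.ones(50)
honest = [true_update + 0.1 * rng.normal(size=50) for _ in range(8)]
byzantine = [10.0 * rng.normal(size=50) for _ in range(2)]
agg, weights = adaptive_aggregate(honest + byzantine)

# Malicious updates are down-weighted without any fixed threshold, so the
# aggregate tracks the honest clients far better than a plain average would.
assert weights[:8].sum() > 0.9
naive = np.mean(honest + byzantine, axis=0)
assert np.linalg.norm(agg - true_update) < np.linalg.norm(naive - true_update)
```

No attack-specific assumption enters the weighting; anomalous updates lose influence simply by being far from the bulk, which is the property the adaptive scheme above generalizes.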
Results
Experimental evaluations indicate that FedAOT significantly improves model accuracy and resilience against Byzantine attacks compared to traditional aggregation methods. The framework maintains computational efficiency and demonstrates robust performance across various datasets and attack scenarios.
Implications
The findings suggest that FedAOT can be effectively applied in privacy-sensitive domains such as healthcare and finance, where secure federated learning is crucial. The adaptive nature of the aggregation mechanism may also inspire further research into intelligent defenses against evolving adversarial strategies in decentralized learning environments.
Generalization and Memorization in Rectified Flow
Generative Models
Theory
Efficient ML
- Development of three test statistics for Membership Inference Attacks tailored for Rectified Flow models.
- Significant performance improvements in MIA metrics, indicating enhanced understanding of memorization dynamics.
- Identification of a peak susceptibility to MIA at the midpoint of integration during training.
- Proposed substitution of uniform timestep sampling with a Symmetric Exponential distribution to mitigate memorization risks.
Summary
This paper investigates the memorization behaviors of Rectified Flow (RF) models, a prominent generative model for image synthesis. While RF has been optimized for generation quality, its dynamics of memorization remain underexplored. The authors develop three test statistics for Membership Inference Attacks (MIA), culminating in a complexity-calibrated metric (Tmc_cal) that effectively separates intrinsic image complexity from genuine memorization signals. This calibration leads to significant performance improvements in MIA, with attack AUC increasing by up to 15% and the privacy-critical TPR@1%FPR metric improving by up to 45%. The study reveals a distinct temporal pattern in memorization dynamics, showing that susceptibility to MIA peaks at the midpoint of integration during standard uniform temporal training. The authors mathematically justify this phenomenon and propose a method to reduce memorization risk by substituting uniform timestep sampling with a Symmetric Exponential distribution. Extensive evaluations on CIFAR-10, SVHN, and TinyImageNet datasets confirm that this approach effectively minimizes memorization while maintaining generative fidelity.
Methodology
The authors constructed three test statistics (Tnaive, Tmc, Tmc_cal) to evaluate the memorization risk of RF models. They employed theoretical derivations and empirical evaluations to justify the effectiveness of these metrics. Additionally, they analyzed the temporal dynamics of memorization and proposed a new sampling strategy to reduce memorization exposure during training.
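The proposed Symmetric Exponential timestep sampling is not fully specified in this summary; the sketch below shows one plausible construction that matches its stated purpose, concentrating mass near t=0 and t=1 and thinning it at the memorization-prone midpoint. The parameterization is an assumption, not the paper's exact distribution.

```python
import numpy as np

rng = np.random.default_rng(6)

def symmetric_exponential_t(size, lam=4.0):
    """Sample t in [0, 1]: Exp(lam) truncated to [0, 0.5] via inverse CDF,
    then mirrored to either endpoint with probability 1/2."""
    u = rng.uniform(size=size)
    half = -np.log(1 - u * (1 - np.exp(-lam * 0.5))) / lam
    side = rng.integers(0, 2, size=size)
    return np.where(side == 0, half, 1.0 - half)

t = symmetric_exponential_t(100_000)
mid_mass = np.mean(np.abs(t - 0.5) < 0.1)   # mass near the vulnerable midpoint
uniform_mass = 0.2                          # what uniform sampling would place there

assert t.min() >= 0.0 and t.max() <= 1.0
assert np.abs(t.mean() - 0.5) < 0.01        # symmetric around the midpoint
assert mid_mass < uniform_mass              # midpoint is down-weighted
```

Swapping this sampler for uniform timestep sampling leaves the training objective otherwise unchanged; only the exposure of the MIA-vulnerable midpoint region is reduced.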
Results
The study demonstrated that the complexity-calibrated metric Tmc_cal significantly improved MIA performance, with AUC increasing by up to 15% and TPR@1%FPR by up to 45%. The analysis revealed that the model's susceptibility to MIA peaked at the midpoint of the integration process, and the new sampling strategy effectively reduced memorization while preserving generative quality across three datasets.
Implications
The findings have important implications for the design of generative models, particularly in addressing privacy concerns related to data memorization. The proposed metrics and sampling strategies can be utilized to enhance the privacy of generative models while maintaining their performance, which is crucial for applications in sensitive domains.
HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification
Theory
Optimization
Large Language Models
- HorizonMath provides a benchmark of over 100 unsolved mathematical problems, immune to data contamination.
- The framework automates verification of solutions using high-precision numerical comparisons.
- The benchmark aims to measure AI's capability for novel mathematical discovery rather than just problem-solving.
- Initial results show that GPT 5.4 Pro proposed solutions that may improve upon existing mathematical results.
Summary
The paper introduces HorizonMath, a benchmark designed to evaluate AI's ability to make autonomous mathematical discoveries by focusing on over 100 unsolved problems across eight domains in computational and applied mathematics. The authors argue that existing benchmarks primarily assess problem-solving capabilities rather than the ability to generate novel mathematical insights. HorizonMath addresses this gap by providing a contamination-proof benchmark where solutions are unknown, thus ensuring that any correct solution generated by AI indicates genuine reasoning ability. The framework includes automated verification methods that leverage high-precision numerical comparisons and deterministic constraint-checkers, enabling efficient evaluation of proposed solutions. The authors report that using the GPT 5.4 Pro model, two problems were identified where the model proposed solutions that improve upon the best-known results, suggesting potential novel contributions pending expert review. HorizonMath is presented as an open-source resource, inviting community contributions and aiming to facilitate the measurement of AI progress in mathematical discovery.
Methodology
The authors developed HorizonMath as a benchmark consisting of unsolved mathematical problems, utilizing automated verification techniques that include high-precision numerical comparisons and deterministic constraint-checkers to assess the correctness of proposed solutions efficiently.
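As a rough illustration of this verification style, the sketch below checks a candidate numeric answer against a reference (or best-known) value at high decimal precision. Function names and tolerances are illustrative, not taken from the paper's codebase.

```python
from decimal import Decimal, getcontext

def verify_numeric(candidate: str, reference: str, precision: int = 50,
                   tol: str = "1e-30") -> bool:
    """Check a proposed numeric answer against a reference at high precision."""
    getcontext().prec = precision
    return abs(Decimal(candidate) - Decimal(reference)) <= Decimal(tol)

def improves_bound(candidate: str, best_known: str, minimize: bool = True) -> bool:
    """For open optimization problems: does the candidate beat the best-known value?"""
    c, b = Decimal(candidate), Decimal(best_known)
    return c < b if minimize else c > b

# A candidate matching a reference to 30 decimal places passes verification.
print(verify_numeric("3.141592653589793238462643383279",
                     "3.141592653589793238462643383280"))
```

Exact deterministic checks like these are what make the benchmark cheap to evaluate: no human grading is needed unless a candidate actually beats a best-known bound.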
Results
The study found that the GPT 5.4 Pro model proposed solutions for two optimization problems that potentially improve upon the best-known published results, indicating the model's capability for novel contributions to mathematical literature.
Implications
HorizonMath could significantly advance the field of AI-driven mathematical discovery by providing a standardized method to evaluate AI systems' capabilities in generating novel insights, potentially leading to breakthroughs in various mathematical domains.
FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data
Efficient ML
Theory
Optimization
- FEAT introduces a linear-complexity model for structured data, overcoming the limitations of quadratic self-attention.
- The architecture combines local and global modeling strategies to maintain expressive representations.
- Robustness is enhanced through a hybrid structural causal model and stable reconstruction objectives.
- FEAT shows significant improvements in zero-shot performance and inference speed on real-world datasets.
Read more
FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data
Summary
The paper introduces FEAT, a novel foundation model designed to handle extremely large structured datasets with linear complexity. Traditional large structured-data models (LDMs) struggle with the quadratic complexity of self-attention mechanisms, limiting their scalability and performance on real-world datasets. FEAT addresses this by employing a multi-layer dual-axis encoding architecture that integrates two linear-complexity encoding layers: adaptive-fusion bi-Mamba-2 (AFBM) for local dependencies and convolutional gated linear attention (Conv-GLA) for global memory. This innovative design allows for efficient cross-sample modeling while maintaining expressive representations. Additionally, FEAT incorporates a hybrid structural causal model for robust pre-training, accommodating the heavy-tailed distributions typical in real-world data. The authors demonstrate that FEAT significantly outperforms existing models across 11 real-world structured datasets, achieving up to 40× faster inference and improved zero-shot performance, thus showcasing its potential for diverse applications in fields reliant on large structured data.
Methodology
FEAT employs a multi-layer dual-axis encoding architecture, utilizing adaptive-fusion bi-Mamba-2 (AFBM) for local dependencies and convolutional gated linear attention (Conv-GLA) for global memory. This approach allows for linear-complexity modeling while preserving the expressive capacity of representations. The model is pre-trained using a hybrid structural causal model pipeline tailored for heavy-tailed structured distributions.
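The paper's AFBM and Conv-GLA layers are specialized architectures; as a generic illustration of why linear-complexity attention avoids the N×N score matrix, here is a minimal causal linear-attention pass in NumPy. The feature map and normalization are standard linear-attention choices, not FEAT's.

```python
import numpy as np

def linear_attention(q, k, v):
    """Causal linear attention in O(N): accumulate running sums of k_i v_i^T
    instead of materializing the N x N score matrix.
    q, k, v: (N, d) arrays; feature map phi(x) = elu(x) + 1 keeps scores positive."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    q, k = phi(q), phi(k)
    N, d = q.shape
    S = np.zeros((d, v.shape[1]))   # running sum of k_i v_i^T
    z = np.zeros(d)                 # running sum of k_i (normalizer)
    out = np.empty_like(v)
    for i in range(N):
        S += np.outer(k[i], v[i])
        z += k[i]
        out[i] = q[i] @ S / (q[i] @ z + 1e-9)
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
print(linear_attention(q, k, v).shape)  # (8, 4)
```

The per-step state (S, z) has fixed size regardless of sequence length, which is the same property that lets Mamba-style and gated-linear-attention layers scale to very large structured inputs.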
Results
FEAT consistently outperformed several baseline models across 11 real-world structured datasets, demonstrating superior zero-shot performance and achieving inference speeds up to 40 times faster than traditional models.
Implications
The development of FEAT has significant implications for various domains such as healthcare, finance, and e-commerce, where large structured datasets are prevalent. Its ability to efficiently process and model these datasets can enhance decision-making, predictive analytics, and automated systems.
Stochastic Resetting Accelerates Policy Convergence in Reinforcement Learning
Reinforcement Learning
Optimization
Theory
- Stochastic resetting accelerates policy convergence in reinforcement learning environments.
- Resetting improves learning efficiency even when it does not reduce search time for a random walker.
- The mechanism of resetting biases learning towards shorter trajectories and efficient reward propagation.
- Stochastic resetting preserves the optimal policy while enhancing convergence speed.
Read more
Stochastic Resetting Accelerates Policy Convergence in Reinforcement Learning
Summary
This paper investigates the effects of stochastic resetting on policy convergence in reinforcement learning (RL). Stochastic resetting, a process where a system is intermittently returned to a fixed reference state, has been shown to optimize first-passage properties in static processes. However, its interaction with adaptive learning processes in RL has not been thoroughly explored. The authors demonstrate that stochastic resetting can significantly accelerate policy convergence in various RL environments, including tabular grid environments and continuous control tasks using deep reinforcement learning. They find that resetting enhances learning efficiency by truncating long, uninformative trajectories, thereby improving value propagation without altering the optimal policy. The study establishes stochastic resetting as a tunable mechanism for accelerating learning in adaptive systems, linking concepts from statistical mechanics to reinforcement learning dynamics.
Methodology
The authors employed stochastic resetting as an external intervention in reinforcement learning algorithms, specifically Q-learning in tabular environments and Deep Q-Networks in continuous control tasks. They analyzed the effects of resetting on policy convergence and sample efficiency across different environments, measuring performance based on the cumulative number of training steps and evaluating the learned value functions.
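A toy version of this intervention, assuming a simple 1-D chain environment (not one of the paper's benchmarks): plain tabular Q-learning where, at each step, the agent is teleported back to the start state with a small probability.

```python
import numpy as np

def q_learning_with_reset(n_states=8, reset_prob=0.01, episodes=400,
                          alpha=0.5, gamma=0.95, eps=0.2, seed=0):
    """Tabular Q-learning on a 1-D chain with reward only at the right end.
    Independently of the environment dynamics, the agent is returned to the
    start state with probability reset_prob at every step (stochastic resetting)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))  # actions: 0 = left, 1 = right
    for _ in range(episodes):
        s = 0
        for _ in range(500):
            if rng.random() < eps or Q[s, 0] == Q[s, 1]:
                a = int(rng.integers(2))      # explore (ties broken randomly)
            else:
                a = int(np.argmax(Q[s]))
            s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
            r = 1.0 if s2 == n_states - 1 else 0.0
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            if s2 == n_states - 1:
                break
            s = 0 if rng.random() < reset_prob else s2  # resetting step
    return Q

Q = q_learning_with_reset()
print(int(np.argmax(Q[0])))  # greedy action at the start state
```

Note the Q-update still applies before the reset, so resetting reshapes the distribution of training trajectories without changing the Bellman target or the optimal policy, consistent with the paper's finding.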
Results
The study found that stochastic resetting robustly accelerates policy convergence in both tabular GridWorld and stochastic cliff environments, as well as in the continuous-state MountainCar benchmark. In larger environments, resetting improved both search efficiency and learning speed. In smaller systems, faster convergence was observed even when resetting did not reduce the mean first-passage time for a random walker. The results indicate that resetting enhances learning dynamics by modifying the distribution of training trajectories without affecting the learned value function.
Implications
The findings suggest that stochastic resetting can be a valuable tool for improving the efficiency of reinforcement learning algorithms, particularly in environments where exploration is challenging and rewards are sparse. This approach could lead to more effective training strategies in various applications, including robotics and complex decision-making tasks.
RaDAR: Relation-aware Diffusion-Asymmetric Graph Contrastive Learning for Recommendation
Graph Learning
- RaDAR addresses structural semantics degradation and limited relational expressiveness in recommendation systems.
- The framework combines graph generative and relation-aware denoising models for enhanced robustness.
- Innovations include asymmetric contrastive learning and diffusion-guided augmentation.
- Extensive experiments show superior performance over existing methods in sparse and noisy conditions.
Read more
RaDAR: Relation-aware Diffusion-Asymmetric Graph Contrastive Learning for Recommendation
Summary
The paper presents RaDAR, a novel framework for recommendation systems that addresses two significant challenges in collaborative filtering: structural semantics degradation and limited relational expressiveness. RaDAR integrates Graph Neural Networks (GNNs) with Graph Contrastive Learning (GCL) to enhance recommendation accuracy, particularly in sparse and noisy environments. The framework employs a dual-view generation mechanism that includes a graph generative model for capturing global structures and a relation-aware denoising model for refining noisy edges. Key innovations include asymmetric contrastive learning with global negative sampling to maintain semantic alignment, diffusion-guided augmentation for robustness through progressive noise injection and denoising, and relation-aware edge refinement that dynamically adjusts edge weights based on latent node semantics. Experimental results on multiple public benchmarks demonstrate that RaDAR consistently outperforms state-of-the-art methods, particularly in scenarios characterized by high data sparsity and noise, indicating its effectiveness in improving recommendation performance.
Methodology
RaDAR employs a dual-view generation architecture that integrates a graph generative model based on variational autoencoders and a relation-aware denoising model. It utilizes asymmetric contrastive learning with global negative sampling and implements a diffusion-guided augmentation strategy that applies Gaussian noise to node representations, enhancing robustness while preserving semantic integrity.
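The contrastive piece can be illustrated in isolation: two noise-perturbed views of the same node embeddings scored with an InfoNCE loss, where each row's counterpart in the other view is the positive and all other rows are negatives. This is a generic sketch; RaDAR's actual views come from its generative and relation-aware denoising branches.

```python
import numpy as np

def info_nce(z1, z2, tau=0.2):
    """InfoNCE between two views: z1[i] and z2[i] are positives, other rows negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                     # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # pull positives, push negatives

rng = np.random.default_rng(0)
h = rng.normal(size=(64, 16))                    # node embeddings
sigma = 0.1
view1 = h + sigma * rng.normal(size=h.shape)     # diffusion-style Gaussian perturbation
view2 = h + sigma * rng.normal(size=h.shape)
print(info_nce(view1, view2))
```

Progressively increasing sigma and denoising back, as in diffusion-guided augmentation, trades off view diversity against the semantic alignment this loss rewards.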
Results
RaDAR was tested on various public datasets, including Last.FM, Yelp, and Beer-Advocate, as well as multi-behavior datasets like Tmall and RetailRocket. The results indicate that RaDAR consistently outperforms state-of-the-art recommendation methods, particularly under conditions of high sparsity and noise, demonstrating its effectiveness in improving recommendation accuracy.
Implications
The RaDAR framework has significant implications for the development of more robust recommendation systems that can effectively handle sparse and noisy data. Its ability to capture complex relational patterns can enhance user experience in various applications, including e-commerce, content recommendation, and social networking.
Optimal uncertainty bounds for multivariate kernel regression under bounded noise: A Gaussian process-based dual function
Theory
Optimization
Robotics
- Introduces a tight, distribution-free uncertainty bound for multivariate kernel regression.
- Addresses limitations of existing bounds that are either conservative or difficult to apply in multi-output cases.
- Utilizes a Gaussian process-based dual function framework for deriving the uncertainty bounds.
- Demonstrates the application of the proposed method through a quadrotor dynamics learning example.
Read more
Optimal uncertainty bounds for multivariate kernel regression under bounded noise: A Gaussian process-based dual function
Summary
This paper addresses the challenge of providing reliable uncertainty bounds for multivariate kernel regression, particularly in the context of learning-based control where accurate predictions from noisy data are crucial. Traditional methods, such as Gaussian process regression, often rely on strong assumptions about noise distributions or yield conservative bounds that do not scale well for multi-output scenarios. The authors propose a novel, tight, distribution-free uncertainty bound derived from an unconstrained, duality-based formulation. This new bound is designed to be easily integrated into optimization pipelines, making it applicable for real-world scenarios. The methodology extends existing results to accommodate multivariate cases and intersections of ellipsoidal noise bounds, thus generalizing previous deterministic bounds. The paper includes a numerical comparison of the proposed bounds against existing methods, demonstrating its effectiveness through an application inspired by quadrotor dynamics learning under wind disturbances. Overall, the work contributes to enhancing the reliability of predictions in uncertain systems, which is vital for safe control in various applications.
Methodology
The authors derive the uncertainty bounds using an optimization problem that incorporates the worst-case realization of a latent multivariate function, subject to bounded noise constraints. They utilize Gaussian process regression techniques and formulate the problem in terms of a dual function, allowing for a straightforward integration into optimization tasks.
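For context, the standard GP posterior that such bounds refine takes only a few lines. The snippet below is vanilla GP regression with an RBF kernel, where the bounded-noise radius simply plays the role of the noise scale; it does not reproduce the paper's dual-function bound.

```python
import numpy as np

def gp_posterior(X, y, Xs, lengthscale=0.3, noise_bound=0.1):
    """GP posterior mean/std with an RBF kernel. 'noise_bound' stands in for the
    bounded-noise radius here (illustrative only, not the paper's tight bound)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)
    K = k(X, X) + noise_bound**2 * np.eye(len(X))
    Ks, Kss = k(Xs, X), k(Xs, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y via Cholesky
    mean = Ks @ alpha
    Vs = np.linalg.solve(L, Ks.T)
    std = np.sqrt(np.clip(np.diag(Kss) - (Vs**2).sum(0), 0, None))
    return mean, std

X = np.linspace(0, 1, 10)[:, None]
y = np.sin(4 * X[:, 0])
Xs = np.linspace(0, 1, 5)[:, None]
mean, std = gp_posterior(X, y, Xs)
print(mean.shape, std.shape)
```

A naive certified interval would then be mean ± c·std for some scaling c; the paper's contribution is deriving a tight, distribution-free c-like quantity from a dual formulation rather than from Gaussian assumptions.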
Results
The proposed uncertainty bound is shown to be tighter and less conservative compared to existing methods. The numerical experiments illustrate the practical applicability of the bounds in a scenario involving quadrotor dynamics, highlighting the method's robustness in handling direction-dependent disturbances.
Implications
The findings have significant implications for safe learning-based control systems, where reliable uncertainty quantification is essential. The methodology can be applied in various fields, including robotics, autonomous systems, and any domain requiring accurate predictions from noisy data.
Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling
Generative Models
Computer Vision
Optimization
- Adaptive moment estimation significantly stabilizes noisy likelihood scores in guided diffusion sampling.
- The proposed method achieves state-of-the-art results in image restoration and class-conditional generation tasks.
- Performance improves consistently across varying task difficulties, suggesting robustness.
- The approach is simple and computationally efficient compared to more complex alternatives.
Read more
Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling
Summary
This paper addresses the challenges of guided diffusion sampling, which often suffers from noise due to the approximation of intractable likelihood scores. The authors propose a novel approach that utilizes adaptive moment estimation, a technique from stochastic optimization, to stabilize these noisy likelihood scores during the sampling process. By maintaining exponential moving averages of the first and second moments of the likelihood score estimates, the method dampens noise while preserving the guidance signal. The empirical results demonstrate that this simple modification leads to state-of-the-art performance in image restoration and class-conditional generation tasks, outperforming more complex and computationally expensive methods. The authors also analyze how task difficulty affects performance, revealing that their method consistently improves results across varying levels of degradation, highlighting the need for comprehensive evaluation standards in plug-and-play methods.
Methodology
The authors employ adaptive moment estimation to stabilize the guidance gradients in plug-and-play diffusion sampling. This involves maintaining exponential moving averages of the first and second moments of the likelihood score estimates across sampling steps, which helps to mitigate gradient noise and improve the alignment of the sampling process.
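The core trick is Adam's moment estimation applied to guidance gradients rather than model parameters. A minimal reimplementation with standard bias correction (the paper's exact schedule may differ):

```python
import numpy as np

def smooth_guidance(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style stabilization of a sequence of noisy guidance gradients:
    exponential moving averages of first/second moments with bias correction."""
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    out = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g       # first moment (direction)
        v = beta2 * v + (1 - beta2) * g**2    # second moment (scale)
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        out.append(m_hat / (np.sqrt(v_hat) + eps))
    return out

rng = np.random.default_rng(0)
true_dir = np.ones(4)
noisy = [true_dir + 2.0 * rng.normal(size=4) for _ in range(200)]  # SNR < 1
smoothed = smooth_guidance(noisy)
print(np.round(smoothed[-1], 2))
```

The EMA suppresses the per-step noise while the second-moment normalization keeps the guidance magnitude stable, which is exactly the dampening effect the paper exploits across sampling steps.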
Results
The proposed method outperforms existing plug-and-play guidance techniques, achieving state-of-the-art results in various tasks, including image restoration and class-conditional generation. The performance remains robust even as task difficulty increases, with adaptive moment estimation consistently enhancing the baseline performance of existing methods.
Implications
The findings suggest that adaptive moment estimation can be a valuable tool for improving the stability and performance of generative models, particularly in applications requiring conditional generation. This could lead to advancements in fields such as image processing, audio generation, and other areas reliant on diffusion models.
Manifold-Matching Autoencoders
Theory
Efficient ML
- Introduction of Manifold-Matching Autoencoder (MMAE) for unsupervised dimensionality reduction.
- MMAE aligns pairwise distances in latent space with input data distances using mean squared error.
- Demonstrated superior performance in preserving topological features compared to existing methods.
- MMAE provides a scalable alternative to Multi-Dimensional Scaling (MDS).
Read more
Manifold-Matching Autoencoders
Summary
This paper introduces the Manifold-Matching Autoencoder (MMAE), an unsupervised regularization technique aimed at enhancing the performance of autoencoders in preserving the geometric and topological structures of high-dimensional data. The core idea of MMAE is to align the pairwise distances in the latent space with those in the input data space by minimizing the mean squared error (MSE) between the two distance matrices. This approach allows for flexibility in dimensionality reduction, as it decouples the dimensionality of the reference space from the latent space. The authors demonstrate that MMAE outperforms existing methods in preserving closest neighbor distances and topological features, as assessed through persistence homology measures. Additionally, MMAE serves as a scalable approximation of Multi-Dimensional Scaling (MDS), effectively recovering complex structures in datasets such as the nested spheres. The paper includes experiments on both synthetic and real-world datasets, showcasing the method's competitive performance against other topological and geometric autoencoder variants.
Methodology
The methodology involves adding a regularization term (MM-reg) to the standard autoencoder objective, which minimizes the MSE between the pairwise distance matrix of the latent space and a reference distance matrix derived from the input data. This allows for the preservation of global geometry and topology in the latent representation.
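The regularizer itself is compact. A batch-level sketch of the distance-matrix MSE, assuming Euclidean distances in the input space as the reference:

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix for the rows of X."""
    sq = (X**2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.sqrt(np.clip(d2, 0, None))

def mm_reg(Z, X):
    """MM-reg term: MSE between latent pairwise distances and input-space distances."""
    return np.mean((pairwise_dist(Z) - pairwise_dist(X)) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 50))   # input batch
Z = rng.normal(size=(32, 2))    # latent codes from the encoder
print(mm_reg(Z, X))             # large: random latents ignore input geometry
print(mm_reg(X, X))             # zero: identical geometry incurs no penalty
```

In training this term is added to the reconstruction loss; because it compares distance matrices rather than coordinates, the latent dimensionality is free to differ from the reference space, which is the decoupling the paper emphasizes.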
Results
The results indicate that MMAE significantly improves the preservation of topological and geometric structures in the latent space, outperforming traditional autoencoders and other topological variants in various metrics. The method effectively recovers complex data structures, such as the nested spheres, and shows competitive performance on benchmark datasets.
Implications
The findings suggest that MMAE can be a valuable tool for tasks requiring dimensionality reduction while maintaining the integrity of data structures, such as anomaly detection, data visualization, and generative modeling. Its scalability makes it suitable for large datasets where traditional methods may struggle.
Novelty-Driven Target-Space Discovery in Automated Electron and Scanning Probe Microscopy
Optimization
Robotics
Theory
- Introduction of the BEACON framework for novelty-driven exploration in microscopy.
- Benchmarking against classical acquisition strategies to evaluate exploration quality.
- Successful transition from offline validation to real experimental implementation.
- Provision of reproducible notebooks for community use and adaptation.
Read more
Novelty-Driven Target-Space Discovery in Automated Electron and Scanning Probe Microscopy
Summary
This paper addresses the challenge of discovering new scientific information in automated microscopy, where critical insights often lie beyond immediately visible features. The authors propose a novel deep-kernel-learning framework called BEACON, designed to actively explore the target space by learning structure-property relationships during experiments. The methodology was validated using pre-acquired datasets, allowing for benchmarking against traditional acquisition strategies. The authors established a set of monitoring functions to evaluate exploration quality and target-space coverage. The framework was successfully transitioned from offline validation to real experimental implementation in Scanning Transmission Electron Microscopy (STEM). The authors provide accessible notebooks for the broader community to reproduce and adapt the workflows, promoting further research and application in automated microscopy.
Methodology
The BEACON framework employs deep-kernel learning to model structure-property relationships dynamically during experiments. It utilizes pre-acquired datasets for benchmarking and establishes monitoring functions to assess exploration quality and target-space coverage. The methodology integrates machine learning with automated microscopy to prioritize measurement locations based on evolving spectral data.
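Stripped of the deep-kernel model, the acquisition logic reduces to scoring candidate locations by how unlike anything already measured their predicted response is. A minimal nearest-neighbor novelty score (illustrative only; BEACON's predictions come from a learned deep-kernel GP):

```python
import numpy as np

def novelty_acquire(pred_targets, measured_targets):
    """Pick the candidate whose predicted target-space response is farthest
    from every response measured so far."""
    d = np.linalg.norm(pred_targets[:, None, :] - measured_targets[None, :, :],
                       axis=-1)
    novelty = d.min(axis=1)          # distance to the nearest measured response
    return int(np.argmax(novelty)), novelty

measured = np.array([[0.0, 0.0], [1.0, 0.0]])
candidates = np.array([[0.1, 0.0], [0.5, 0.0], [0.0, 3.0]])
idx, scores = novelty_acquire(candidates, measured)
print(idx)  # the [0.0, 3.0] response is farthest from anything measured
```

In the real loop, the chosen location is measured next, its spectrum is appended to the measured set, and the structure-property model is retrained, so the notion of "novel" evolves with the experiment.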
Results
The implementation of the BEACON framework demonstrated improved exploration of target spaces in STEM, effectively identifying diverse response regimes and enhancing the discovery of relevant features in complex materials. The benchmarking against classical methods showed superior performance in terms of exploration quality and target-space coverage.
Implications
The proposed framework has significant implications for materials science, enabling researchers to discover new materials and properties more efficiently. The approach can be adapted to various microscopy techniques, fostering broader applications in automated scientific discovery and enhancing the capabilities of existing microscopy tools.
Unlearning-based sliding window for continual learning under concept drift
Computer Vision
Theory
Efficient ML
- Introduces UIL, a framework that combines machine unlearning with continual learning to address concept drift.
- Demonstrates that unlearning outdated data followed by incremental adaptation can be computationally efficient.
- Empirical results show UIL's effectiveness in image classification tasks with concept drift.
- Establishes a theoretical foundation connecting machine unlearning and concept drift mitigation.
Read more
Unlearning-based sliding window for continual learning under concept drift
Summary
This paper addresses the challenge of continual learning in nonstationary environments where data distributions evolve over time, known as concept drift. Traditional methods often rely on sliding window techniques, which require retraining models from scratch on the most recent data, leading to high computational costs. The authors propose a novel framework called UIL (Unlearned and Iteratively trained cLassifier) that leverages machine unlearning to efficiently manage the influence of outdated samples. By removing the impact of obsolete data and incrementally adapting to new data, UIL provides a targeted forgetting mechanism that preserves model performance while reducing computational demands. The paper presents a theoretical analysis demonstrating that this approach can approximate the predictive performance of full retraining at a lower cost. Empirical evaluations on image classification tasks across various drift scenarios show that UIL outperforms traditional sliding-window methods, offering a competitive and efficient alternative for continual learning under concept drift.
Methodology
The authors propose the UIL framework, which utilizes machine unlearning to remove the influence of outdated samples from a trained model. This is followed by incremental updates with new data, allowing for efficient adaptation to evolving distributions without the need for full retraining. The methodology includes theoretical analysis and empirical testing on image classification benchmarks under various drift scenarios.
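For models with additive sufficient statistics, "unlearn then incrementally update" is exact and cheap. A toy stand-in using linear least squares (the paper targets neural classifiers with approximate unlearning, so this only captures the idea in miniature):

```python
import numpy as np

class SlidingLinearModel:
    """Linear least squares over a sliding window with exact unlearning:
    removing a sample subtracts its contribution from the sufficient statistics."""
    def __init__(self, d, reg=1e-6):
        self.XtX = reg * np.eye(d)
        self.Xty = np.zeros(d)
    def learn(self, x, y):
        self.XtX += np.outer(x, x); self.Xty += y * x
    def unlearn(self, x, y):
        self.XtX -= np.outer(x, x); self.Xty -= y * x
    def weights(self):
        return np.linalg.solve(self.XtX, self.Xty)

rng = np.random.default_rng(0)
X_old = rng.normal(size=(50, 3)); y_old = X_old @ np.array([1., 2., 3.])
X_new = rng.normal(size=(50, 3)); y_new = X_new @ np.array([3., 2., 1.])  # drift

m = SlidingLinearModel(3)
for x, y in zip(X_old, y_old): m.learn(x, y)
for x, y in zip(X_new, y_new): m.learn(x, y)
for x, y in zip(X_old, y_old): m.unlearn(x, y)   # forget the pre-drift window
print(np.round(m.weights(), 3))  # → [3. 2. 1.]
```

After unlearning the pre-drift window, the model matches full retraining on the new window exactly, with no retraining cost; UIL's contribution is achieving an approximate version of this for deep classifiers.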
Results
The proposed UIL framework demonstrated superior performance compared to traditional sliding-window retraining methods in terms of both predictive accuracy and computational efficiency. Experiments showed that UIL effectively managed concept drift while maintaining lower resource consumption.
Implications
The findings suggest that integrating machine unlearning into continual learning frameworks can significantly enhance the adaptability of models in dynamic environments. This approach could be applied in various real-world applications where data streams are nonstationary, such as finance, healthcare, and autonomous systems.
SympFormer: Accelerated attention blocks via Inertial Dynamics on Density Manifolds
NLP
Large Language Models
Optimization
- Introduction of accelerated attention blocks based on inertial dynamics.
- Tokens are represented with both spatial features and velocity variables.
- Demonstrated faster convergence rates than classical attention mechanisms.
- Preservation of elliptically contoured probability distributions.
Read more
SympFormer: Accelerated attention blocks via Inertial Dynamics on Density Manifolds
Summary
The paper introduces SympFormer, a novel architecture for transformers that enhances the efficiency of attention blocks by leveraging inertial dynamics on density manifolds. The authors interpret attention mechanisms as interacting particle systems and propose a framework where tokens are represented with both spatial features and velocity variables. By applying Nesterov-type acceleration dynamics, the authors derive Hamiltonian momentum attention blocks that approximate a Stein variational gradient flow, preserving elliptically contoured probability distributions. The proposed architecture demonstrates faster convergence rates compared to classical attention blocks while maintaining the same number of oracle calls. This work not only advances the theoretical understanding of transformers but also provides practical algorithms that can be implemented in existing transformer models.
Methodology
The authors extend the classical transformer architecture by incorporating Nesterov's acceleration method into the attention blocks. They model the attention mechanism as a gradient flow in a probability density space and derive a second-order dynamical system that includes momentum. This results in Hamiltonian dynamics for the tokens, allowing for accelerated convergence in training.
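A toy version of the second-order dynamics, assuming identity query/key/value maps and treating the attention output as a force on each token, with heavy-ball-style damping standing in for the paper's Hamiltonian formulation:

```python
import numpy as np

def softmax_attention(X):
    """Standard self-attention step used as the 'force' on tokens
    (identity W_q, W_k, W_v for simplicity)."""
    S = X @ X.T / np.sqrt(X.shape[1])
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ X

def momentum_attention_rollout(X, steps=10, dt=0.1, gamma=0.5):
    """Second-order (momentum) token dynamics: velocities accumulate the
    attention update instead of applying it as a first-order residual step."""
    V = np.zeros_like(X)               # each token carries a velocity variable
    for _ in range(steps):
        force = softmax_attention(X) - X       # drift toward the attention output
        V = (1 - gamma * dt) * V + dt * force  # damped velocity update
        X = X + dt * V                         # position update
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
print(momentum_attention_rollout(X).shape)  # (6, 4)
```

The contrast with a standard transformer block is the extra state V: in the first-order view each layer is one gradient-flow step, while the momentum variant corresponds to a Nesterov-type accelerated flow, which is where the faster convergence claim originates.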
Results
The experimental results indicate that the SympFormer architecture converges faster than traditional attention blocks, achieving improved performance in terms of cross-entropy loss while preserving the number of oracle calls. The theoretical framework supports the preservation of certain probability distributions, enhancing the robustness of the model.
Implications
The findings suggest that incorporating inertial dynamics into transformer architectures can significantly improve training efficiency and model performance. This approach could be applied to various tasks in natural language processing and beyond, potentially leading to more efficient large-scale models.
More Test-Time Compute Can Hurt: Overestimation Bias in LLM Beam Search
NLP
Large Language Models
Theory
- Wider beam search can introduce systematic overestimation bias that degrades output quality.
- The maximum useful beam width (k̂) is determined by the signal-to-noise ratio of the scoring mechanism.
- Perplexity scoring shows no benefit at any beam width, while PRM scoring can yield significant performance gains.
- A principled approach for beam width selection is proposed, focusing on output quality rather than inference efficiency.
Read more
More Test-Time Compute Can Hurt: Overestimation Bias in LLM Beam Search
Summary
This paper investigates the effects of beam search width on the performance of large language models (LLMs) during reasoning tasks. While wider beam search is generally believed to enhance reasoning quality, the authors reveal that it can lead to systematic overestimation bias, particularly when the scoring mechanism is noisy. Grounded in Extreme Value Theory, the study derives a maximum useful beam width (k̂) that depends on the signal-to-noise ratio of the scorer. The analysis shows that as the candidate pool size increases, the overestimation bias also grows, potentially degrading output quality. The authors validate their theoretical findings through experiments comparing perplexity-guided and Process Reward Model (PRM)-guided beam search across multiple models and domains. Results indicate that perplexity scoring, characterized by high noise, yields no benefits at any beam width, while PRM scoring demonstrates significant improvements at wider beam widths. The paper concludes with recommendations for selecting beam width based on diagnostic indicators, emphasizing the importance of scorer quality over merely increasing beam width.
Methodology
The authors employ Extreme Value Theory to analyze the relationship between beam width and output quality in LLM beam search. They derive a theoretical framework to determine the maximum useful beam width based on the signal-to-noise ratio of the scoring mechanism. Empirical validation is conducted by comparing perplexity-guided and PRM-guided beam search across three 7B-parameter models and ten reasoning domains using a dataset of 5,975 questions.
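The overestimation effect is easy to reproduce: when all candidates are equally good and the scorer is noisy, the winner's score grows like σ√(2 ln k) rather than reflecting real quality. A small Monte-Carlo check:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_score_bias(k, noise_std, trials=20000):
    """True quality is 0 for every candidate; the scorer adds Gaussian noise.
    Selecting the best of k candidates by noisy score overestimates quality
    by roughly noise_std * sqrt(2 ln k) (Extreme Value Theory)."""
    noisy = noise_std * rng.normal(size=(trials, k))
    return noisy.max(axis=1).mean()   # average (over)estimate of the winner

for k in (1, 4, 16, 64):
    print(k, round(max_score_bias(k, noise_std=1.0), 2))
```

The bias grows monotonically with k while true quality stays at zero, which is the mechanism by which a noisy scorer (like perplexity) turns wider beams into worse outputs.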
Results
The study finds that perplexity scoring results in a maximum useful beam width of k̂ = 1, indicating no performance benefit from widening the beam. In contrast, PRM scoring allows for a maximum useful beam width of k̂ ≥ 4, with performance improvements of up to 8.9 percentage points. The results underscore the critical role of scorer quality in determining the effectiveness of beam search.
Implications
The findings suggest that practitioners should carefully consider the scoring mechanism used in beam search for LLMs, as it significantly impacts the effectiveness of wider search strategies. The proposed diagnostic indicators for beam width selection can help optimize output quality in practical applications.
FederatedFactory: Generative One-Shot Learning for Extremely Non-IID Distributed Scenarios
Federated Learning
Generative Models
Computer Vision
- FederatedFactory recovers centralized performance under extreme single-class silo conditions.
- The framework operates without dependencies on external pretrained models, relying solely on localized generative priors.
- It achieves one-shot communication efficiency, reducing the need for iterative updates.
- The framework allows for exact modular unlearning, enhancing privacy and data management.
Read more
FederatedFactory: Generative One-Shot Learning for Extremely Non-IID Distributed Scenarios
Summary
The paper introduces FederatedFactory, a novel framework for Federated Learning (FL) that addresses the challenges posed by extremely non-IID data distributions (data that are not independent and identically distributed), particularly in scenarios where clients possess mutually exclusive label sets. Traditional FL methods struggle under these conditions due to conflicting optimization trajectories and reliance on pretrained foundation models, which can introduce biases. FederatedFactory inverts the federation unit from discriminative parameters to generative priors, allowing clients to exchange generative modules in a single communication round. This approach enables the synthesis of universally class-balanced datasets from localized data distributions, effectively eliminating gradient conflicts and external biases. The framework demonstrates significant improvements in model performance across various medical imaging benchmarks, achieving centralized performance levels even under extreme label skew. Additionally, it supports exact modular unlearning, allowing for the deterministic removal of specific generative modules without compromising the overall system integrity.
Methodology
FederatedFactory employs a zero-dependency architecture that utilizes localized generative priors instead of traditional discriminative parameters. Each client trains a generative model and transmits it in a single communication round, allowing the server to synthesize class-balanced datasets from a common latent space. This method avoids the projection errors associated with external foundation models and directly leverages true local data distributions.
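The inversion from parameters to priors can be caricatured with per-class Gaussians: each single-class silo ships a fitted generative module once, and the server samples a class-balanced training set from the received modules. This is a deliberately simple stand-in for the paper's learned generative priors.

```python
import numpy as np

def fit_generator(X):
    """Client-side 'generative module': a diagonal Gaussian fit to one silo's
    single class (toy stand-in for a learned generative prior)."""
    return X.mean(axis=0), X.std(axis=0) + 1e-6

def synthesize(modules, n_per_class, rng):
    """Server-side one-shot aggregation: sample a class-balanced dataset from
    the received modules; no raw data or gradients ever leave the clients."""
    Xs, ys = [], []
    for label, (mu, sigma) in modules.items():
        Xs.append(mu + sigma * rng.normal(size=(n_per_class, mu.size)))
        ys += [label] * n_per_class
    return np.vstack(Xs), np.array(ys)

rng = np.random.default_rng(0)
# Three single-class silos with mutually exclusive labels (extreme non-IID).
silos = {c: rng.normal(loc=3.0 * c, size=(40, 5)) for c in range(3)}
modules = {c: fit_generator(X) for c, X in silos.items()}
X_syn, y_syn = synthesize(modules, n_per_class=100, rng=rng)
print(X_syn.shape, np.bincount(y_syn))  # balanced: 100 samples per class
```

Because the server trains on the balanced synthetic set rather than averaging client gradients, the single-class silos never produce conflicting update directions, and dropping one client's module from the dictionary removes its influence exactly (modular unlearning).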
Results
Evaluations under pathological heterogeneity show accuracy rising from 11.36% to 90.57% on CIFAR-10 and AUROC from 47.31% to 90.57% on the ISIC2019 medical imaging benchmark. These results indicate that FederatedFactory can achieve performance comparable to centralized models even in challenging non-IID scenarios.
Implications
FederatedFactory has significant implications for applications in medical imaging and other fields where data privacy is paramount and label distributions are highly skewed. Its ability to synthesize balanced datasets without raw data exchange can facilitate collaborative learning across institutions while maintaining data sovereignty.
GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators
Graph Learning
Theory
Efficient ML
- GIST achieves end-to-end O(N) complexity using random projections.
- The architecture preserves gauge invariance through inner-product-based attention.
- GIST enables discretization-invariant learning, facilitating parameter transfer across different mesh resolutions.
- Empirical results demonstrate state-of-the-art performance on both graph and mesh-based benchmarks.
Summary
The paper introduces GIST (Gauge-Invariant Spectral Transformers), a novel graph transformer architecture designed to address the computational and gauge invariance challenges associated with adapting transformers to graph-structured data. Traditional spectral methods for graph embeddings require expensive eigendecomposition, which scales poorly with the number of nodes, while approximate methods often sacrifice gauge invariance, leading to poor generalization in inductive learning tasks. GIST resolves these issues by employing random projections to achieve linear complexity (O(N)) while preserving gauge invariance through inner-product-based attention on projected embeddings. The authors demonstrate that GIST enables discretization-invariant learning, allowing for parameter transfer across various mesh resolutions, which is crucial for neural operator applications. Empirical results show that GIST matches state-of-the-art performance on standard graph benchmarks and excels in mesh-based neural operator tasks, achieving significant results on the DrivAerNet datasets.
Methodology
GIST utilizes random projections to compute spectral embeddings efficiently while maintaining gauge invariance. The architecture restricts attention operations to inner products of projected embeddings, which remain approximately invariant under gauge transformations. This approach allows GIST to achieve linear complexity and ensures that learned features are robust across different graph structures and discretizations.
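The gauge-invariance argument is easy to check numerically: attention logits built from inner products are unchanged when the spectral basis is rotated or sign-flipped, since (UQ)(UQ)ᵀ = UQQᵀUᵀ = UUᵀ for orthogonal Q. The snippet below verifies this identity with a random stand-in for a spectral embedding; it illustrates the invariance property only, not GIST's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 50, 8

# Toy "spectral embedding": rows are per-node eigenvector coordinates.
U = rng.standard_normal((N, k))

# A gauge transformation: any orthogonal mixing of the eigenbasis
# (sign flips of individual eigenvectors are the simplest special case).
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
U_gauged = U @ Q

# Attention logits built from inner products are gauge-invariant.
logits = U @ U.T
logits_gauged = U_gauged @ U_gauged.T

gauge_gap = np.abs(logits - logits_gauged).max()
```

Restricting attention to such inner products is what lets learned parameters transfer across meshes: the features never depend on which arbitrary eigenbasis a particular discretization produced.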
Results
GIST achieved a micro-F1 score of 99.50% on the PPI benchmark and demonstrated state-of-the-art performance in aerodynamic predictions on the DrivAerNet and DrivAerNet++ datasets, scaling effectively to graphs with up to 750K nodes.
Implications
The development of GIST has significant implications for the field of graph learning and neural operators, particularly in applications requiring robust generalization across varying graph structures and discretizations. This could enhance the performance of models in computational fluid dynamics, structural mechanics, and other domains relying on mesh-based representations.
Multimodal Deep Learning for Early Prediction of Patient Deterioration in the ICU: Integrating Time-Series EHR Data with Clinical Notes
Multimodal
Time Series
NLP
- Introduces a multimodal deep learning model that combines structured EHR data with clinical notes for predicting ICU patient deterioration.
- Achieves a test AUROC of 0.7857, outperforming traditional models that rely solely on structured data.
- Demonstrates that clinical notes significantly enhance predictive performance, improving AUROC by 2.5 percentage points.
- Provides a systematic review of 31 studies, revealing gaps in the integration of clinical text in existing models.
Summary
This paper addresses the critical challenge of early identification of patients at risk for clinical deterioration in the ICU. The authors propose a multimodal deep learning approach that integrates structured time-series data (such as vital signs and laboratory values) with unstructured clinical notes to predict patient deterioration within a 24-hour window. Utilizing the MIMIC-IV database, the study constructs a cohort of 74,822 ICU stays and generates 5.7 million hourly prediction samples. The proposed architecture employs a bidirectional LSTM encoder to capture temporal patterns in physiological data and Clinical-BERT embeddings for clinical notes, fused through a cross-modal attention mechanism. The study also includes a systematic review of 31 existing studies on ICU deterioration prediction, highlighting the limitations of models that rely solely on structured data. The multimodal model achieves a test AUROC of 0.7857 and AUPRC of 0.1908 on held-out samples, demonstrating significant improvements over structured-only baselines. The findings emphasize the importance of integrating clinical notes into predictive models and validate the effectiveness of deep learning approaches in this domain.
Methodology
The study employs a multimodal deep learning architecture that integrates bidirectional LSTM networks for encoding time-series physiological data and Clinical-BERT for processing clinical notes. A cross-modal attention mechanism is used to fuse these data sources. The model is trained and evaluated using a large dataset from the MIMIC-IV database.
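The fusion step can be sketched with plain numpy. Here the BiLSTM outputs and the Clinical-BERT note embedding are replaced by random placeholders, and the single-query attention below is a simplified stand-in for the paper's cross-modal attention mechanism; shapes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
T_steps, d = 24, 32                        # 24 hourly states, width-32 features
h_ts = rng.standard_normal((T_steps, d))   # stand-in for BiLSTM outputs
h_note = rng.standard_normal(d)            # stand-in for a Clinical-BERT note embedding

# Cross-modal attention: the note embedding queries the time-series states,
# producing a note-conditioned summary of the physiology.
scores = h_ts @ h_note / np.sqrt(d)
weights = np.exp(scores - scores.max())    # numerically stable softmax
weights /= weights.sum()

# Joint representation passed to the deterioration classifier head.
fused = np.concatenate([weights @ h_ts, h_note])
```

The note embedding thus steers which hours of physiology dominate the fused vector, which is one plausible reading of why adding notes improves AUROC over structured-only baselines.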
Results
The multimodal model achieved a test AUROC of 0.7857 and an AUPRC of 0.1908 on 823,641 held-out samples. Ablation studies indicated that the inclusion of clinical notes improved AUROC by 2.5 percentage points and AUPRC by 39.2% compared to a structured-only baseline. Classical models like XGBoost and logistic regression performed worse than the deep learning approach.
Implications
The findings suggest that integrating unstructured clinical notes into predictive models can significantly enhance the early detection of patient deterioration in the ICU, potentially leading to improved patient outcomes. This work also sets a foundation for future research in multimodal machine learning applications in healthcare.
Online Semi-infinite Linear Programming: Efficient Algorithms via Function Approximation
Optimization
Theory
Efficient ML
- Introduces a novel formulation for Online Semi-infinite Linear Programming (OSILP) using function approximation.
- Establishes regret bounds that are independent of the number of constraints, enhancing scalability.
- Develops a two-stage algorithm that achieves improved regret bounds under specific assumptions.
- Demonstrates superior performance of the proposed algorithms in experiments compared to existing methods.
Summary
This paper addresses the challenge of Online Semi-infinite Linear Programming (OSILP), which arises in dynamic resource allocation problems where the decision space is finite-dimensional but must satisfy a potentially infinite number of constraints revealed through streaming data. The authors propose a novel linear programming formulation that employs function approximation to reduce the number of constraints to a constant, thereby overcoming limitations of traditional online linear programming algorithms that suffer from poor performance due to their dependence on the number of constraints. They develop a dual-based algorithm that is applicable across various scenarios by selecting appropriate potential functions. The authors analyze the algorithm under two classical input models—stochastic input and random permutation—establishing regret bounds that are independent of the number of constraints. They also introduce a two-stage algorithm that improves upon the regret bounds under stricter assumptions. Experimental results demonstrate that their algorithms outperform existing methods when dealing with a large number of constraints, showcasing the practical applicability of their approach in high-dimensional streaming data contexts.
Methodology
The authors employ function approximation to parameterize the dual space, allowing for optimization over non-negative weights. They design a Gradient Descent algorithm for the stochastic input model and extend it to a Mirror Descent framework for both stochastic and random permutation models. The two-stage algorithm further refines the dual variable's convergence to improve regret bounds.
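For intuition, here is the classical single-constraint dual-descent allocator that this line of work builds on: accept a request when its reward beats the current shadow price, and take a gradient step on the price toward the budget rate. The paper's contribution, approximating the dual over infinitely many constraints with a weighted sum of potential functions, is not reproduced here; the reward model, budget, and step size are illustrative choices.

```python
import random

rng = random.Random(5)
T, budget_per_step, eta = 2000, 0.3, 0.05

price, reward_total, spend_total = 1.0, 0.0, 0.0
for _ in range(T):
    r, a = rng.random(), rng.random()      # revealed reward and resource use
    accept = r > price * a                 # primal decision via the dual price
    if accept:
        reward_total += r
        spend_total += a
    # Gradient step on the dual: raise the price when spending over budget.
    price = max(0.0, price + eta * ((a if accept else 0.0) - budget_per_step))

avg_spend = spend_total / T                # should hover near budget_per_step
```

In the paper's setting the scalar `price` is replaced by a function of the constraint index, parameterized by a constant number of non-negative weights, which is what makes the regret independent of the number of constraints.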
Results
The proposed algorithms achieve regret bounds of O(q√T) for the stochastic input model and O((q + q log T)√T) for the random permutation model. The two-stage algorithm improves the regret to O(q log T + q/ε) under stricter assumptions, demonstrating significant performance enhancements.
Implications
The findings have potential applications in robust optimization, spatio-temporal resource allocation, and real-time control systems, where managing a large number of constraints is crucial. The approach can be applied to various fields that involve high-dimensional streaming data.
Age Predictors Through the Lens of Generalization, Bias Mitigation, and Interpretability: Reflections on Causal Implications
Theory
Interpretability
- Chronological age predictors often struggle with out-of-distribution generalization due to bias from exogenous attributes.
- The paper introduces a framework for learning invariant representations to mitigate bias and enhance fairness.
- An interpretable neural network model based on adversarial representation learning is proposed and evaluated.
- Results are consistent with prior studies of Elamipretide's effects on muscle tissues, reinforcing the model's predictive validity.
Summary
This paper addresses the challenges of chronological age prediction in machine learning, particularly focusing on out-of-distribution (OOD) generalization. The authors argue that traditional age predictors often fail due to the influence of exogenous attributes such as race, gender, and tissue type, which can introduce bias and confounding effects. To tackle these issues, the paper proposes a framework for learning invariant representations that mitigate bias and enhance fairness. The authors explore these concepts through theoretical analysis and present an interpretable neural network model based on adversarial representation learning. Using publicly available mouse transcriptomic datasets, they demonstrate the model's performance compared to conventional machine learning approaches. The findings indicate that the proposed model aligns with previous studies on the effects of Elamipretide on muscle tissues, while also highlighting the limitations of deriving causal interpretations from purely predictive models. The paper emphasizes the importance of addressing bias and confounding in predictive modeling to improve generalization and fairness in age prediction tasks.
Methodology
The authors employ adversarial representation learning to create an interpretable neural network model that learns invariant representations. They analyze the model's performance using mouse transcriptomic datasets and compare it with conventional machine learning models. Theoretical rigor is applied to explore concepts of generalization, bias mitigation, and interpretability.
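The adversarial-probe idea behind this kind of representation learning can be illustrated directly: a representation is invariant to an exogenous attribute when an adversary recovering that attribute from the representation performs no better than chance, and adversarial training minimizes exactly this recoverability. The threshold classifier and synthetic data below are deliberately crude stand-ins, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4000
nuisance = rng.integers(0, 2, n)     # exogenous attribute, e.g. sex: 0/1
age = rng.standard_normal(n)         # the signal we want to keep

# Two candidate representations of the same subjects:
z_biased = age + 0.8 * nuisance      # leaks the nuisance attribute
z_invariant = age.copy()             # invariant by construction

def adversary_accuracy(z, s):
    """Probe: a threshold classifier recovering s from z. ~0.5 = invariant."""
    pred = (z > np.median(z)).astype(int)
    return max(np.mean(pred == s), np.mean(pred != s))

acc_biased = adversary_accuracy(z_biased, nuisance)
acc_invariant = adversary_accuracy(z_invariant, nuisance)
```

An adversarial learner would push the representation from the first case toward the second, which is also why such models generalize better out of distribution: predictions can no longer ride on the confounding attribute.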
Results
The proposed model demonstrates improved performance in predicting chronological age while addressing bias and confounding effects. The results are consistent with previous findings regarding the impact of Elamipretide on muscle tissues, indicating the model's robustness and predictive validity.
Implications
The findings suggest that improving OOD generalization and fairness in age prediction models can have significant implications for biomedical research and applications, particularly in understanding aging processes and developing interventions. The framework can be adapted for other predictive tasks where bias and confounding are concerns.
Federated Learning for Privacy-Preserving Medical AI
Federated Learning
- Proposes a site-aware data partitioning strategy for realistic federated learning scenarios.
- Introduces an Adaptive Local Differential Privacy mechanism to enhance privacy-utility trade-off.
- Demonstrates that FedProx can match or exceed centralized training performance while ensuring privacy.
- Achieves up to 80.4% accuracy in Alzheimer's classification with improved training stability.
Summary
This dissertation explores the application of federated learning (FL) in the context of medical AI, specifically focusing on the classification of Alzheimer's disease using three-dimensional MRI data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The research addresses significant limitations in existing methodologies, such as unrealistic data partitioning, inadequate privacy guarantees, and insufficient benchmarking, which hinder practical deployment in healthcare. A novel site-aware data partitioning strategy is proposed, which preserves institutional boundaries and reflects real-world collaborations. Additionally, an Adaptive Local Differential Privacy (ALDP) mechanism is introduced, which dynamically adjusts privacy parameters based on training progression, significantly enhancing the privacy-utility trade-off compared to traditional fixed-noise approaches. Empirical evaluations across multiple client federations demonstrate that advanced federated optimization algorithms, particularly FedProx, can match or exceed the performance of centralized training while ensuring robust privacy protection. The ALDP mechanism achieved up to 80.4% accuracy in a two-client configuration, outperforming fixed-noise Local Differential Privacy by 5–7 percentage points and exhibiting greater training stability. The dissertation also includes comprehensive ablation studies and benchmarking, establishing quantitative standards for privacy-preserving collaborative medical AI and providing practical guidelines for real-world implementation. This work significantly advances the state-of-the-art in federated learning for medical imaging, laying the methodological and empirical groundwork necessary for the future adoption of privacy-compliant AI in healthcare.
Methodology
The research employs a site-aware data partitioning strategy that reflects real-world institutional collaborations and data heterogeneity. It also introduces an Adaptive Local Differential Privacy (ALDP) mechanism that adjusts privacy parameters dynamically based on training progression. The methodology includes systematic empirical evaluations using advanced federated optimization algorithms, particularly FedProx, across multiple client federations.
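The adaptive-noise idea can be sketched as a schedule that injects heavy noise early, when raw gradients leak the most, and decays as training converges. The linear schedule, its endpoints, and the Gaussian perturbation below are illustrative assumptions; the dissertation's actual ALDP schedule and its formal (ε, δ) calibration are not reproduced here.

```python
import random

def aldp_noise_scale(round_idx, total_rounds, sigma_max=1.0, sigma_min=0.2):
    """Hypothetical adaptive schedule: noise decays with training progress."""
    progress = round_idx / max(total_rounds - 1, 1)
    return sigma_max - (sigma_max - sigma_min) * progress

def privatize(update, sigma, rng):
    # Local DP flavor: each client perturbs its update BEFORE transmission.
    return [u + rng.gauss(0.0, sigma) for u in update]

rng = random.Random(0)
schedule = [aldp_noise_scale(r, 10) for r in range(10)]
noisy = privatize([0.5, -0.2, 0.1], schedule[0], rng)
```

Compared with a fixed-noise baseline, a decaying sigma spends less of the privacy budget distorting the late, fine-grained updates, which is consistent with the reported gain in both accuracy and training stability.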
Results
The ALDP mechanism achieved an accuracy of up to 80.4% in a two-client configuration, surpassing fixed-noise Local Differential Privacy by 5–7 percentage points. The evaluations showed that advanced federated optimization algorithms could equal or surpass the performance of centralized training while maintaining rigorous privacy protections.
Implications
This research provides a foundation for the practical deployment of federated learning in healthcare, particularly in medical imaging applications. It offers a framework for developing privacy-preserving AI systems that can facilitate collaborative research while ensuring patient data confidentiality.
Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys
NLP
Large Language Models
Efficient ML
- Introduces a unified optimization paradigm for KV cache management that integrates compression and sparsity.
- Develops a sign-based 1-bit vector quantization scheme for efficient token retrieval in compressed domains.
- Eliminates the need for external indices, reducing memory overhead and improving scalability.
- Demonstrates compatibility with existing frameworks, ensuring low latency and high performance.
Summary
The paper addresses the inefficiencies in the KV cache of self-attention mechanisms in large language models (LLMs), which pose significant challenges during long-context and large-batch inference. Traditional methods often separate sparsity prediction and compression, leading to redundant overhead and limited scalability. The authors propose a novel approach called Self-Indexing KVCache, which treats compressed key representations as self-indexing structures to enable efficient sparse attention. By introducing a sign-based 1-bit vector quantization (VQ) scheme, the method integrates compression and retrieval into a single, hardware-friendly format, eliminating the need for external indices or complex predictors. This design is optimized for hardware efficiency and is compatible with existing frameworks like FlashAttention. Experimental results demonstrate that the proposed method achieves significant improvements in both efficiency and effectiveness, addressing the key challenges of memory usage, latency, and inference accuracy in KV cache management.
Methodology
The authors propose the Self-Indexing KVCache, which combines dynamic sparsity prediction and quantization into a single framework. The method utilizes a 1-bit vector quantization approach for token selection based on cosine similarity, allowing for fast retrieval directly from the compressed key representation. Custom CUDA kernels are implemented to optimize performance and ensure compatibility with attention acceleration frameworks.
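The "compressed keys double as their own index" idea can be sketched in numpy: keep only the sign of each key coordinate, and use sign agreement with the query as a cheap proxy for cosine similarity when predicting which tokens deserve full attention. This is a conceptual illustration under random data, not the paper's CUDA implementation; sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, top_k = 256, 64, 16

K = rng.standard_normal((T, d))          # cached keys for T tokens
q = rng.standard_normal(d)               # current query

# 1-bit "compressed" keys: keep only the sign of each coordinate.
K_sign = np.sign(K)
q_sign = np.sign(q)

# Sign agreement approximates cosine similarity, so the compressed
# cache is self-indexing: no external index structure is needed.
proxy = K_sign @ q_sign                  # scores in [-d, d]
selected = np.argsort(proxy)[-top_k:]    # tokens predicted to matter

# Exact cosine ranking, for comparison only.
exact = (K @ q) / (np.linalg.norm(K, axis=1) * np.linalg.norm(q))
exact_top = set(np.argsort(exact)[-top_k:])
overlap = len(exact_top & set(selected)) / top_k
```

In a real deployment the sign bits would be packed (64 coordinates into one machine word) so the proxy score reduces to popcounts, which is what makes the scheme hardware-friendly and compatible with kernels like FlashAttention.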
Results
The experimental results indicate that the Self-Indexing KVCache outperforms existing KV cache management strategies in memory efficiency, speed, and accuracy, effectively balancing the trade-offs among memory usage, latency, and inference quality.
Implications
This research has significant implications for the deployment of large language models in resource-constrained environments, enabling more efficient inference for applications requiring long-context processing. The unified approach may also inspire further innovations in optimizing memory usage and computational efficiency in various machine learning tasks.
Evidential Domain Adaptation for Remaining Useful Life Prediction with Incomplete Degradation
Time Series
- EviAdapt addresses the limitations of existing domain adaptation methods in RUL prediction with incomplete degradation data.
- The method segments data into distinct degradation stages for accurate stage-wise alignment.
- Evidential uncertainty alignment is introduced to handle varying degradation patterns across domains.
- Extensive experiments show EviAdapt significantly outperforms existing methods.
Summary
This paper addresses the challenge of accurately predicting the Remaining Useful Life (RUL) of industrial systems when faced with incomplete degradation data in the target domain. Traditional domain adaptation (DA) methods struggle with this scenario, particularly due to misalignment of degradation stages and varying degradation patterns across domains. The authors propose a novel approach called EviAdapt, which utilizes evidential learning to enhance domain adaptation. EviAdapt segments both source and target domain data into distinct degradation stages based on degradation rates, allowing for stage-wise alignment that ensures accurate matching of samples from corresponding stages. Additionally, it introduces an evidential uncertainty alignment technique that estimates and aligns uncertainty across matched stages. The effectiveness of EviAdapt is validated through experiments on several datasets, including C-MAPSS, N-CMAPSS, and PHM2010, demonstrating significant performance improvements over state-of-the-art methods in RUL prediction under incomplete degradation conditions.
Methodology
The proposed EviAdapt method segments source and target domain data into distinct degradation stages based on degradation rates. It then performs stage-wise alignment to ensure accurate matching of samples from corresponding degradation stages. Additionally, it employs an evidential uncertainty alignment technique to estimate and align uncertainty levels across matched stages.
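The first step, segmenting a run-to-failure trajectory into stages by degradation rate, can be illustrated on a synthetic health indicator. The indicator, the rate threshold, and the two-stage split below are our illustrative choices; EviAdapt's actual segmentation rules and the evidential alignment step are not reproduced.

```python
import numpy as np

t = np.arange(200, dtype=float)

# Synthetic health indicator: flat while healthy, then accelerating decay
# after an onset around t = 100.
health = 1.0 - 4e-4 * np.maximum(t - 100.0, 0.0) ** 1.5

# Stage segmentation by degradation RATE: fast-degrading samples must be
# aligned with fast-degrading samples from the other domain.
rate = -np.gradient(health, t)
stage = np.where(rate >= 1e-4, 1, 0)     # 0 = healthy, 1 = degrading
change_point = int(np.argmax(stage))
```

With incomplete target-domain data, the target trajectory may stop mid-stage; matching stage labels rather than raw time indices is what lets source and target samples from corresponding degradation phases be aligned.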
Results
The experiments conducted on C-MAPSS, N-CMAPSS, and PHM2010 datasets indicate that EviAdapt significantly outperforms existing state-of-the-art domain adaptation methods for RUL prediction, particularly in scenarios with incomplete degradation data.
Implications
The findings suggest that EviAdapt can be effectively used in prognostics and health management of industrial systems, enhancing reliability and maintenance decision-making in environments where complete degradation data is not available.
Introducing Feature-Based Trajectory Clustering, a clustering algorithm for longitudinal data
Time Series
- Introduction of Feature-Based Trajectory Clustering (FBTC) for longitudinal data.
- Two-step methodology: feature extraction followed by clustering using Spectral Clustering.
- Utilization of twenty trajectory measures to represent and cluster time-dependent variables.
- Demonstration of FBTC on several datasets, including its ability to detect non-convex clusters that proximity-based methods miss.
Summary
This paper introduces Feature-Based Trajectory Clustering (FBTC), a novel algorithm designed for clustering longitudinal data, which consists of time-dependent observations for individuals. The authors aim to identify clusters of individuals whose underlying time-dependent variables exhibit common characteristic features. The methodology involves two main steps: first, transforming each trajectory into a point in a 20-dimensional Euclidean space using specific mathematical measures that capture various features of the trajectories; second, applying the Spectral Clustering algorithm to this point cloud to identify clusters. The paper details the computation of twenty trajectory measures, which serve as approximations of functional measures of the underlying time-dependent variables. The authors provide examples demonstrating the effectiveness of FBTC on various datasets, highlighting its ability to detect non-convex clusters, which sets it apart from traditional methods like K-means and latent class models that focus solely on proximity.
Methodology
The methodology consists of transforming trajectories into a 20-dimensional Euclidean space using specific mathematical measures that capture various features of the trajectories. The resulting point cloud is then clustered using the Spectral Clustering algorithm, which is adept at identifying non-convex clusters.
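The two-step pipeline is straightforward to sketch. Below, each synthetic trajectory is mapped to a small feature vector (the paper uses twenty measures; four toy ones stand in here), and a one-feature threshold plays the role of the clustering step for brevity, where FBTC would run Spectral Clustering on the full point cloud.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 30)

# Two synthetic groups of longitudinal trajectories: flat-noisy vs. rising.
flat   = [rng.normal(0.0, 0.1, t.size) for _ in range(10)]
rising = [2.0 * t + rng.normal(0.0, 0.1, t.size) for _ in range(10)]

def features(y):
    """Step 1: map one trajectory to a point in feature space."""
    slope = np.polyfit(t, y, 1)[0]
    return np.array([y.mean(), y.std(), slope, y.max() - y.min()])

X = np.array([features(y) for y in flat + rising])  # the point cloud

# Step 2 would cluster X with Spectral Clustering; here the slope
# feature alone already separates the two groups.
labels = (X[:, 2] > 1.0).astype(int)
```

Because clustering happens in feature space rather than on raw curves, individuals with similar shapes but shifted timings can still land in the same cluster, which is what enables the non-convex clusters the paper highlights.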
Results
The paper presents examples illustrating the performance of FBTC on different datasets, demonstrating its ability to effectively cluster individuals based on shared characteristics in their time-dependent variables. The results indicate that FBTC outperforms traditional clustering methods in terms of detecting complex cluster shapes.
Implications
The implications of this research extend to various fields where longitudinal data analysis is crucial, such as healthcare, social sciences, and environmental studies. FBTC can facilitate better understanding of patterns and trends in time-dependent variables, leading to improved decision-making and targeted interventions.
Dual Consensus: Escaping from Spurious Majority in Unsupervised RLVR via Two-Stage Vote Mechanism
Reinforcement Learning
Large Language Models
NLP
- DCRL mitigates spurious majority bias through a two-stage vote mechanism.
- The method operates entirely without external models or supervision.
- It generates more reliable learning signals by balancing dominant and diverse responses.
- Extensive experiments show consistent performance improvements across multiple benchmarks.
Summary
The paper introduces Dual Consensus Reinforcement Learning (DCRL), a novel self-supervised training method designed to enhance the performance of large language models (LLMs) in complex reasoning tasks without relying on external supervision or human-annotated datasets. Current label-free RLVR approaches often converge on spurious majority answers, limiting their effectiveness. DCRL addresses this by employing a two-stage consensus mechanism: first, the model acts as an anchor to produce dominant responses, and then as an explorer to generate diverse auxiliary signals through a temporary unlearning process. The final training target is derived from the harmonic mean of these two signal sets, balancing reliability and diversity. The authors validate DCRL through extensive experiments across eight benchmarks, demonstrating consistent improvements in reasoning performance and more stable training dynamics compared to majority vote methods. This approach establishes a scalable path for unsupervised reinforcement learning, enabling LLMs to evolve and improve without labeled data.
Methodology
The methodology involves a two-stage process where the model first generates dominant responses (anchor stage) and then explores diverse responses (explorer stage) through a temporary unlearning process. The final reward signal is computed using the harmonic mean of the consensus scores from both stages, which helps mitigate the adverse effects of spurious majority votes.
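The reward combination can be sketched in a few lines of pure Python. The harmonic mean is zero whenever either stage rejects the anchor's dominant answer, which is how a spurious majority gets damped; the sampling of answers and the exact scoring details below are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

def consensus(answers):
    """Plurality answer and the fraction of samples agreeing with it."""
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / len(answers)

def dual_consensus_reward(anchor_answers, explorer_answers, eps=1e-9):
    """Harmonic mean of anchor and explorer consensus on the anchor's
    dominant answer: high only when the answer is BOTH dominant and
    survives diverse exploration."""
    a_ans, a_score = consensus(anchor_answers)
    e_score = sum(ans == a_ans for ans in explorer_answers) / len(explorer_answers)
    return 2 * a_score * e_score / (a_score + e_score + eps)

# Spurious majority: the anchor agrees on "42" but explorers never do.
r_spurious = dual_consensus_reward(["42"] * 8 + ["7", "9"],
                                   ["7", "9", "13", "7", "21"])
# Genuine consensus: both stages back the same answer.
r_genuine = dual_consensus_reward(["42"] * 8 + ["7", "9"],
                                  ["42", "42", "7", "42", "42"])
```

A plain majority vote would assign both cases the same 0.8 reward; the dual-consensus signal zeroes out the spurious one, which matches the stability improvements reported over majority-vote baselines.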
Results
DCRL was tested on the large-scale DAPO-14K-Math dataset and evaluated across eight established benchmarks, showing consistent improvements in Pass@1 metrics over majority vote methods. The results indicate enhanced reasoning capabilities and more stable training dynamics, validating the effectiveness of the proposed approach.
Implications
The findings suggest that DCRL can significantly enhance the reasoning abilities of LLMs in unsupervised settings, making it applicable in domains where labeled data is scarce or unavailable. This could lead to advancements in various AI applications that require robust reasoning capabilities without the need for extensive human annotation.