AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 69
Update frequency: 8h
Days of history: 7
Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training
Large Language Models
Optimization
Efficient ML
- Identification of the Rank-1 Subspace phenomenon in merged model trajectories.
- Introduction of Extra-Merge, a training-free method for loss minimization.
- Theoretical grounding of the merging process in the context of optimization landscapes.
- Demonstrated effectiveness across various model scales and optimizers.
Summary
This paper investigates model merging during Large Language Model (LLM) pre-training, revealing a Rank-1 Subspace characteristic in the optimization trajectories of merged checkpoints. The authors demonstrate that while raw optimization steps exhibit chaotic behavior, merged checkpoints converge onto a stable, approximately one-dimensional linear manifold. This observation is theoretically grounded in a river-valley landscape analysis, suggesting that averaging serves as a geometric low-pass filter that mitigates high-curvature noise, thereby revealing the optimal descent direction. Building on this insight, the authors propose Extra-Merge, a training-free strategy that extrapolates along this subspace to minimize loss without requiring additional gradient updates. Extensive experiments across various model scales (GPT-2 and LLaMA) show that Extra-Merge consistently outperforms standard merging baselines, achieving significant zero-shot accuracy improvements on downstream tasks with Pythia-12B. The method also demonstrates robustness across different optimizers, suggesting that the Rank-1 Subspace may be a universal property of LLM training.
Methodology
The authors conducted empirical analyses using Principal Component Analysis (PCA) to investigate the geometric properties of merged checkpoints. They developed the Extra-Merge algorithm, which leverages the identified Rank-1 Subspace to extrapolate model parameters without additional training. The theoretical framework is based on river-valley loss landscape analysis, providing a justification for the observed phenomena.
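As a rough illustration of the idea described above, the sketch below averages raw checkpoints, estimates the dominant direction of the merged trajectory with power iteration (a simple stand-in for PCA), and then steps along that direction. The function names, the uniform averaging, and the step size `alpha` are assumptions made for illustration; this is not the authors' Extra-Merge algorithm.

```python
import math

def merge(checkpoints):
    """Uniformly average a list of parameter vectors (hypothetical merging rule)."""
    n = len(checkpoints)
    return [sum(vals) / n for vals in zip(*checkpoints)]

def top_direction(points, iters=200):
    """Leading principal direction of `points`, via power iteration on the covariance."""
    mean = merge(points)
    centered = [[x - m for x, m in zip(p, mean)] for p in points]
    dim = len(mean)
    # Uniform start vector; assumes it is not orthogonal to the leading direction.
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # w = C v, where C = sum_i c_i c_i^T is the unnormalised covariance.
        w = [0.0] * dim
        for c in centered:
            proj = sum(ci * vi for ci, vi in zip(c, v))
            for j in range(dim):
                w[j] += proj * c[j]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def extrapolate(merged_traj, alpha):
    """Step from the last merged point along the dominant direction.

    The sign is chosen so the step continues the most recent progress.
    Requires at least two merged points.
    """
    d = top_direction(merged_traj)
    last, prev = merged_traj[-1], merged_traj[-2]
    progress = sum((l - p) * di for l, p, di in zip(last, prev, d))
    sign = 1.0 if progress >= 0 else -1.0
    return [l + sign * alpha * di for l, di in zip(last, d)]
```

In a toy trajectory whose raw steps oscillate in one coordinate while drifting steadily in another, averaging consecutive checkpoints cancels the oscillation, and the extrapolated point keeps moving along the drift direction only.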
Results
The experiments showed that Extra-Merge consistently outperformed standard merging techniques across different model sizes, yielding notable improvements in zero-shot downstream accuracy with Pythia-12B. The method was validated across various learning-rate schedulers and generalized to the Muon optimizer.
Implications
The findings suggest that model merging can be optimized further by understanding the geometric properties of optimization trajectories, potentially leading to more efficient training processes for LLMs. This could reduce computational costs and improve model performance without the need for extensive additional training.
MTL-FNO: A Lightweight Multi-Task Fourier Neural Operator for Sparse Field Reconstruction
Efficient ML
- Introduction of MTL-FNO, a lightweight multi-task framework for sparse field reconstruction.
- Utilizes hard parameter sharing to efficiently capture common features across multiple tasks.
- Implements low-rank terms for task-specific parameters to achieve model compression.
- Develops a decoupled optimization scheme for spectral weights to reduce task conflicts.
Summary
The paper presents MTL-FNO, a lightweight multi-task Fourier neural operator designed for efficient sparse field reconstruction in aerospace applications. Traditional deep learning models for single-field reconstruction face challenges when scaling to multiple fields, particularly in terms of model size and the inability to leverage cross-field correlations. MTL-FNO addresses these issues through an end-to-end joint training framework that employs hard parameter sharing, allowing the model to capture common features across multiple tasks while maintaining task-specific characteristics. The architecture includes shared and task-specific parameters, with the latter implemented as low-rank terms to enhance model compression. The authors introduce a decoupled optimization scheme for the spectral weights of the Fourier neural operator, utilizing polar decomposition to separate phase and amplitude components, which helps mitigate task conflicts. The Cayley transform is also employed to maintain geometric fidelity during training. The effectiveness of MTL-FNO is validated through experiments on two engineering cases, demonstrating that it achieves reconstruction accuracy comparable to or better than traditional FNO models while significantly reducing model size by 76% and 60% in the respective cases.
Methodology
The methodology involves a multi-task learning framework with hard parameter sharing, where parameters are divided into shared and task-specific components. The spectral weights are optimized using a decoupled approach based on polar decomposition, allowing for separate optimization of phase and amplitude. The Cayley transform is used to ensure geometric fidelity during training.
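To make the two ingredients above more concrete, here is a minimal sketch under strong simplifying assumptions. It treats a single complex spectral weight as an amplitude times a unit-modulus phase (a scalar analogue of a polar decomposition), uses the scalar Cayley transform to keep the phase on the unit circle, and models a task-specific weight as a shared matrix plus a rank-1 term. All function names and the rank-1 form are hypothetical; the paper's matrix-level formulation may differ.

```python
def cayley_phase(s):
    """Scalar Cayley transform: maps any real s to a unit-modulus complex number."""
    return (1 - 1j * s) / (1 + 1j * s)

def spectral_weight(amplitude, s):
    """One complex spectral weight in polar form: amplitude times a unit phase.

    Amplitude and phase parameter can be updated separately, and the phase
    stays on the unit circle for every real value of s.
    """
    return abs(amplitude) * cayley_phase(s)

def task_weight(shared, u, v):
    """Task-specific matrix as the shared matrix plus a rank-1 correction u v^T."""
    return [[shared[i][j] + u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]

def param_counts(n, num_tasks, rank):
    """Parameter counts: one n x n matrix per task vs. shared matrix plus low-rank terms."""
    separate = num_tasks * n * n
    shared_low_rank = n * n + num_tasks * 2 * n * rank
    return separate, shared_low_rank
```

For example, `param_counts(64, 3, 4)` compares three independent 64 x 64 matrices with one shared matrix plus three rank-4 corrections, which shows where the compression from low-rank task-specific terms comes from.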
Results
The experiments show that MTL-FNO achieves reconstruction accuracy that is comparable to or exceeds that of traditional Fourier neural operators, while reducing the total model size by 76% and 60% in two different engineering scenarios.
Implications
The proposed MTL-FNO framework has significant implications for the deployment of deep learning models in resource-constrained environments, such as onboard aerospace vehicles, where efficient multi-field reconstruction is critical for autonomous operations.
Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete Choice
Theory
- TFMs often violate economic principles in discrete choice predictions.
- A two-stage adapter is proposed to integrate TFM predictions within a utility-maximization framework.
- The adapter guarantees economic consistency while recovering accuracy gains from TFMs.
- On tested datasets, the adapter outperformed standard multinomial logit models significantly.
Summary
This paper addresses the issue of economic validity in tabular foundation models (TFMs) used for discrete choice prediction tasks. While TFMs demonstrate high accuracy, they often produce predictions that violate economic principles, such as increasing demand with rising prices and generating implausible willingness-to-pay estimates. The authors propose a two-stage adapter that integrates TFM predictions into a utility-maximization framework. In the first stage, a standard choice model is estimated with parameters constrained to adhere to economic theory. In the second stage, these parameters are fixed, and a correction term is trained using TFM predictions as additional information. This approach ensures that the model maintains economic consistency while leveraging the accuracy of TFMs. The proposed method was tested on two transportation datasets, showing significant improvements in accuracy while adhering to economic constraints.
Methodology
The authors conducted a behavioral audit to assess the economic validity of TFMs by testing for monotonicity, value of time, and availability compliance. They then developed a two-stage behavioral adapter that first estimates a standard choice model with constrained parameters and subsequently incorporates TFM predictions as a correction term, ensuring that the economic structure remains intact during training.
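The audit and the constrained first stage can be sketched as follows. This is a minimal illustration, assuming a multinomial logit with only cost and time attributes, sign constraints enforced through `-exp(theta)`, and an optional per-alternative `correction` offset standing in for the second-stage term. These names and the parameterization are assumptions, not the authors' implementation.

```python
import math

def softmax(utilities):
    """Numerically stable softmax over alternative utilities."""
    m = max(utilities)
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def choice_probs(costs, times, theta_cost, theta_time, correction=None):
    """MNL probabilities with cost and time coefficients forced negative."""
    beta_cost = -math.exp(theta_cost)
    beta_time = -math.exp(theta_time)
    correction = correction or [0.0] * len(costs)
    utils = [beta_cost * c + beta_time * t + r
             for c, t, r in zip(costs, times, correction)]
    return softmax(utils)

def value_of_time(theta_cost, theta_time):
    """Willingness to pay per unit of time saved: beta_time / beta_cost."""
    return math.exp(theta_time) / math.exp(theta_cost)

def passes_monotonicity(costs, times, theta_cost, theta_time, alt, bump=1.0):
    """Audit check: raising one alternative's cost must not raise its probability."""
    base = choice_probs(costs, times, theta_cost, theta_time)[alt]
    bumped = list(costs)
    bumped[alt] += bump
    return choice_probs(bumped, times, theta_cost, theta_time)[alt] <= base
```

Because both coefficients are negative by construction, the value of time is always positive and the monotonicity check passes for the uncorrected model; the same check could be run against any black-box predictor to audit it.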
Results
The proposed adapter achieved up to 13 percentage points improvement in accuracy over a standard multinomial logit model on the Swissmetro dataset while maintaining perfect economic validity. On the LPMC dataset, it gained 2 percentage points, demonstrating the adapter's effectiveness in preserving structural guarantees.
Implications
This work has significant implications for policy evaluation and decision-making in various fields, including transportation and economics. By ensuring that predictive models adhere to economic principles, the findings can lead to more reliable forecasts and better-informed policy interventions.
Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning
Multimodal
Large Language Models
NLP
- PRISM introduces a lightweight plugin design that separates algorithm development from MLLM backbone implementation.
- The framework supports a unified benchmarking suite, facilitating fair comparisons across different methods.
- PRISM enhances scalability and reproducibility in MCIT research by integrating widely used large-scale training pipelines.
- The modular architecture allows for easy integration of new methods and benchmarks as standalone plugins.
Summary
The paper introduces PRISM, a novel plug-in infrastructure designed to facilitate scalable Multimodal Continual Instruction Tuning (MCIT) for Multimodal Large Language Models (MLLMs). Current MCIT research faces significant engineering challenges due to the complexity of MLLM architectures, which often leads to fragmented codebases and difficulties in method comparison. PRISM addresses these issues by providing a lightweight plugin registration mechanism that decouples algorithmic development from the MLLM backbone, allowing for easier integration of new methods without altering the core codebase. This modular design promotes code reuse and accelerates the development of new strategies. Additionally, PRISM supports a unified benchmarking suite and large-scale training pipelines, ensuring reproducibility and scalability in experiments. The framework is built on widely used libraries such as PyTorch and DeepSpeed, making it extensible and compatible with various multimodal backbones. Overall, PRISM aims to streamline the research process in MCIT, enabling researchers to focus on algorithmic innovation while maintaining a robust and reproducible infrastructure.
Methodology
PRISM employs a modular infrastructure that decomposes complex workflows into reusable components, including methods, benchmarks, and evaluation modules. It utilizes a lightweight plugin registration mechanism to integrate new strategies without modifying the underlying MLLM codebase. The framework supports distributed optimization techniques and is built on top of established libraries like PyTorch and DeepSpeed for efficient training.
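A plugin registration mechanism of the kind described above is often built with a decorator-based registry. The sketch below shows that general pattern only; `METHOD_REGISTRY`, `register_method`, `build_method`, and `ReplayMethod` are hypothetical names and do not reflect PRISM's actual API.

```python
from typing import Callable, Dict

# Hypothetical global registry mapping plugin names to their classes.
METHOD_REGISTRY: Dict[str, Callable] = {}

def register_method(name: str):
    """Decorator that registers a method class under `name` without touching core code."""
    def decorator(cls):
        if name in METHOD_REGISTRY:
            raise ValueError(f"method '{name}' is already registered")
        METHOD_REGISTRY[name] = cls
        return cls
    return decorator

@register_method("replay")
class ReplayMethod:
    """Toy continual-learning method; stands in for a real plugin."""
    def __init__(self, buffer_size: int = 100):
        self.buffer_size = buffer_size

def build_method(name: str, **kwargs):
    """Instantiate a registered method by name, e.g. from a config file entry."""
    if name not in METHOD_REGISTRY:
        raise KeyError(f"unknown method '{name}'")
    return METHOD_REGISTRY[name](**kwargs)
```

With this pattern, a new method lives in its own file and becomes available by name as soon as the module is imported, for example `build_method("replay", buffer_size=8)`.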
Results
The paper demonstrates that PRISM effectively addresses the engineering bottlenecks present in existing MCIT frameworks by providing a unified backbone design and supporting large-scale experiments. It showcases improved algorithmic coverage and streamlined workflows, enabling researchers to conduct reproducible and scalable MCIT experiments.
Implications
PRISM has the potential to significantly advance the field of continual learning in multimodal contexts by providing a robust infrastructure that encourages innovation and reproducibility. It can be applied in various domains requiring continuous adaptation to new tasks, such as robotics, interactive AI systems, and adaptive user interfaces.
Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization
Large Language Models
Optimization
- Step-TP provides step-level supervision for tensor program optimization, enhancing LLM reasoning capabilities.
- The dataset is designed around principles that ensure token efficiency and interpretable decision-making.
- Structured chain-of-thought reasoning is integrated to facilitate reliable multi-step optimization.
- The dataset aims to overcome limitations of existing datasets that primarily focus on outcome-only supervision.
Summary
The paper introduces Step-TP, a novel dataset designed to enhance the optimization of tensor programs using large language models (LLMs). Traditional approaches to tensor program optimization have struggled due to the lack of step-level supervision and interpretability, which are crucial for making precise transformation decisions. Step-TP addresses these challenges by providing a structured dataset that supports grounded, atomic, step-level supervision with chain-of-thought reasoning. The dataset is constructed around four guiding principles: it offers a token-efficient intermediate representation that deterministically lowers to TVM TIR, employs atomic and composable optimization strategies for interpretable decision-making, incorporates structured CoT supervision with explicit state transitions, and utilizes strategy filtering to maintain coverage while avoiding shortcut exploitation. This design enables LLMs to perform reliable multi-step optimizations rather than merely imitating outcomes. The authors argue that such a dataset is essential for transforming LLMs into effective optimization agents capable of reasoning about complex tensor programs in a systematic manner.
Methodology
The authors developed Step-TP by creating a dataset that includes grounded intermediate program states and atomic optimization actions. They emphasized structured chain-of-thought reasoning and explicit state transitions to support iterative decision-making in tensor optimization. The dataset was designed to be token-efficient and to cover a diverse range of optimization strategies.
Results
The introduction of Step-TP is expected to significantly improve the performance of LLMs in tensor program optimization tasks by providing the necessary step-level supervision and structured reasoning capabilities. The dataset allows for more reliable and interpretable optimization processes compared to existing datasets.
Implications
Step-TP has the potential to advance the field of automated tensor program optimization, enabling more efficient execution of deep learning models on GPUs. It could lead to better resource utilization and reduced latency in deploying large-scale machine learning applications.
Optimizing Digital Therapeutic Interventions: Online Learning under Endogenous Adherence
Optimization
Theory
- Introduces a decision support framework for digital therapeutics that models both treatment recommendations and patient adherence.
- Utilizes a linear dynamical system to capture the time-varying nature of patient engagement and its effects on adherence.
- Presents the UCB-BOLD algorithm, which achieves sublinear regret in online treatment selection.
- Demonstrates significant performance improvements over existing benchmarks in managing patient adherence.
Summary
This paper addresses the challenge of sustaining long-term patient health in chronic disease management through digital therapeutics (DTs). The authors identify a critical gap in existing decision support frameworks for DTs, which typically model only treatment recommendation effects or treat adherence as an exogenous factor. To fill this gap, they propose a decision support framework that integrates both recommendation and adherence effects using a linear dynamical system (LDS). This model captures the time-varying capacity for patient engagement, linking it endogenously to adherence behavior through a logit function. The authors introduce an optimism-based algorithm, UCB-BOLD, for online treatment selection, demonstrating that it achieves sublinear regret. The framework is evaluated through ablation studies on synthetic patient cohorts generated from micro-randomized trial data, showing that UCB-BOLD significantly outperforms benchmark algorithms. The findings suggest that incorporating dynamical models into DT decision support tools can enhance resource allocation and improve patient health outcomes, particularly for long-term behavioral interventions.
Methodology
The authors develop a decision support framework using a linear dynamical system (LDS) that models the patient's engagement capacity as a function of previous recommendations and adherence. They establish finite-time identification guarantees for this model and propose the UCB-BOLD algorithm for online treatment selection, proving its effectiveness in minimizing regret.
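The coupling between engagement and adherence can be illustrated with a scalar toy model. In the sketch below, a one-dimensional engagement state evolves linearly with the previous recommendation and adherence, and adherence probability is a logistic function of that state. The coefficients `a`, `b`, `c`, the scalar state, and the use of expected rather than sampled adherence are all assumptions for illustration; this is not the UCB-BOLD algorithm itself.

```python
import math

def adherence_prob(state, weight=1.0, bias=0.0):
    """Logit link: probability of adhering given the current engagement state."""
    return 1.0 / (1.0 + math.exp(-(weight * state + bias)))

def step(state, recommendation, adhered, a=0.9, b=-0.3, c=0.2):
    """Linear state update: decay, burden from the recommendation, boost from adherence."""
    return a * state + b * recommendation + c * adhered

def expected_trajectory(state, recommendations):
    """Adherence probabilities over time, propagating expected adherence (mean-field)."""
    probs = []
    for u in recommendations:
        p = adherence_prob(state)
        probs.append(p)
        state = step(state, u, p)
    return probs
```

Under these hypothetical coefficients, a sequence of demanding recommendations lowers engagement and therefore later adherence, which is the endogenous effect a treatment-selection policy would need to account for.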
Results
The UCB-BOLD algorithm outperformed benchmark algorithms by achieving 2-3 times lower conditional value-at-risk regret in a synthetic patient cohort study, demonstrating the effectiveness of the proposed framework in optimizing digital therapeutic interventions.
Implications
The findings suggest that digital therapeutic decision support tools can benefit from incorporating dynamical models to improve patient engagement and health outcomes. This approach allows clinicians to make more informed treatment recommendations, ultimately enhancing the management of chronic diseases.
Linear and Neural Dueling Bandits with Delayed Feedback
Reinforcement Learning
Theory
Optimization
- Introduces two novel algorithms for dueling bandits with delayed feedback: LDB-DF and NDB-DF.
- Utilizes an Inverse Probability Weighting mechanism to ensure unbiased estimation despite delayed feedback.
- Establishes theoretical regret bounds for both linear and neural settings.
- Demonstrates superior performance of proposed methods through extensive experiments on simulated and real-world datasets.
Summary
This paper addresses the challenge of contextual dueling bandits in scenarios where feedback is delayed, which is common in real-world applications such as recommender systems and large language model alignment. Traditional algorithms assume immediate feedback, leading to biases when this assumption is violated. The authors formalize the problem of Contextual Dueling Bandits with Stochastic Delayed Feedback and propose two novel algorithms: Linear Dueling Bandits with Delayed Feedback (LDB-DF) and Neural Dueling Bandits with Delayed Feedback (NDB-DF). A key innovation is the integration of an Inverse Probability Weighting (IPW) mechanism into the loss function, which corrects for biases introduced by delayed feedback. Theoretical analysis establishes regret bounds of $\tilde{O}(d\sqrt{T})$ for the linear case and sub-linear guarantees for the neural case. Extensive experiments demonstrate the effectiveness of the proposed methods over baseline approaches that do not account for feedback delays.
Methodology
The authors formalize the problem of contextual dueling bandits with stochastic delayed feedback and propose LDB-DF and NDB-DF algorithms. They incorporate an Inverse Probability Weighting mechanism into the loss function to correct for biases due to delayed feedback. Theoretical analysis is conducted to derive regret bounds, and experiments are performed on various datasets to validate the effectiveness of the proposed methods.
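One common way to apply inverse probability weighting to delayed feedback is sketched below: only duels whose feedback has already arrived contribute to the loss, and each is reweighted by the inverse probability that feedback would have arrived within the elapsed time. The dictionary keys, the known `delay_cdf`, and the logistic pairwise loss are assumptions for illustration and may differ from the estimators used in LDB-DF and NDB-DF.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def duel_loss(theta, x_first, x_second, first_wins):
    """Logistic (Bradley-Terry style) loss for one pairwise comparison."""
    z = sum(t * (a - b) for t, a, b in zip(theta, x_first, x_second))
    p_first = sigmoid(z)
    return -math.log(p_first if first_wins else 1.0 - p_first)

def ipw_loss(theta, duels, now, delay_cdf):
    """IPW-weighted loss over duels whose feedback has arrived by time `now`.

    `delay_cdf(e)` is the (assumed known) probability that feedback arrives
    within e time steps; dividing by it offsets the missing delayed duels.
    """
    total = 0.0
    for duel in duels:
        elapsed = now - duel["start"]
        if duel["delay"] > elapsed:
            continue  # feedback not yet observed, so it cannot be used
        weight = 1.0 / delay_cdf(elapsed)
        total += weight * duel_loss(
            theta, duel["x_first"], duel["x_second"], duel["first_wins"])
    return total
```

The weighting is what keeps the estimate unbiased in expectation: a duel observed with probability p counts 1/p times, compensating for similar duels that are still waiting on feedback.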
Results
The proposed algorithms achieve a regret bound of $\tilde{O}(d\sqrt{T})$ for the linear setting and sub-linear guarantees for the neural setting. Experimental results show that LDB-DF and NDB-DF outperform baseline methods that ignore feedback delays, indicating their effectiveness in real-world applications.
Implications
The findings have significant implications for improving preference-based decision-making systems, particularly in applications where feedback is delayed, such as recommender systems and human-in-the-loop reinforcement learning. The proposed methods can enhance the reliability and accuracy of these systems.
Metropolis-Scale Resilient and Trustworthy Traffic Flow Inference Using Multi-Source Data
Time Series
Graph Learning
Optimization
- Introduction of TA-ANP framework for traffic state inference.
- Effective fusion of multi-source data (FCD and fixed-detector measurements).
- Rapid adaptation to changes in sensing configurations without retraining.
- Joint handling of multiple GTSI sub-tasks with minimized interference.
Summary
This paper addresses the challenge of inferring network-wide traffic states from sparse observations, which is crucial for intelligent transportation systems (ITS). The authors propose a novel framework called the Task-Aware Attentive Neural Process (TA-ANP) that integrates floating car data (FCD) with fixed-detector measurements to enhance global traffic state inference (GTSI). By treating GTSI as a stochastic process, TA-ANP utilizes meta-learning properties to adapt quickly to changes in sensing configurations without requiring retraining. The framework incorporates a multi-query attention module to jointly handle three sub-tasks (real-time estimation at unobserved locations, forecasting at observed locations, and forecasting at unobserved locations) while minimizing cross-task interference. Additionally, the authors employ Monte Carlo Dropout for uncertainty quantification, capturing both aleatoric and epistemic uncertainties. To validate their approach, they introduce the Metropolitan Multi-Source Traffic Dataset (MMTD), which combines various data sources over a large urban network. Experimental results demonstrate that TA-ANP outperforms existing methods across all sub-tasks, providing well-calibrated uncertainties that facilitate more efficient sensor placement and showcasing resilience to disturbances in the sensing environment.
Methodology
The authors developed the TA-ANP framework, which employs a task-aware multi-query attention mechanism to address three GTSI sub-tasks. The framework utilizes meta-learning for rapid adaptation and combines neural processes with Monte Carlo Dropout for uncertainty quantification. The Metropolitan Multi-Source Traffic Dataset (MMTD) was created to support large-scale evaluation.
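Monte Carlo Dropout, mentioned above, keeps dropout active at inference time and treats the spread of repeated stochastic forward passes as an uncertainty estimate. The sketch below shows the mechanism on a toy linear predictor; the model, `keep_prob`, and the number of passes are hypothetical and unrelated to the TA-ANP architecture.

```python
import random
import statistics

def dropout_forward(x, weights, keep_prob, rng):
    """One stochastic pass: each input is kept with probability keep_prob.

    Kept inputs are rescaled by 1 / keep_prob (inverted dropout) so the
    expected output matches the deterministic prediction.
    """
    return sum(w * xi / keep_prob
               for w, xi in zip(weights, x)
               if rng.random() < keep_prob)

def mc_dropout_predict(x, weights, keep_prob=0.8, passes=200, seed=0):
    """Mean and standard deviation over repeated dropout passes."""
    rng = random.Random(seed)
    samples = [dropout_forward(x, weights, keep_prob, rng) for _ in range(passes)]
    return statistics.mean(samples), statistics.pstdev(samples)
```

With `keep_prob=1.0` every pass is identical and the reported spread is zero; lowering `keep_prob` produces a nonzero spread that can be read as a rough uncertainty signal.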
Results
TA-ANP achieved state-of-the-art performance across all GTSI sub-tasks under both deterministic and probabilistic metrics. The framework provided well-calibrated uncertainty estimates, enabling more efficient sensor placement and demonstrating superior resilience in adapting to unseen sensing configurations.
Implications
The findings suggest that TA-ANP can significantly improve traffic state inference in urban environments, leading to better decision-making in transportation management. The framework's resilience to disturbances may enhance the reliability of intelligent transportation systems, potentially influencing future sensor deployment strategies.
Capture-Calibrate-Coach: A Graph-Based Framework for Knowledge Monitoring Estimation and Adaptive Feedback
Graph Learning
- Introduces the 3C framework for adaptive learning support focusing on knowledge monitoring.
- Utilizes large language models to extract learners' perceptions from open-ended self-reports.
- Employs a heterogeneous graph neural network for inferring latent perceived states.
- Demonstrates high accuracy in predicting knowledge states and positive feedback reception from users.
Summary
The paper presents the Capture-Calibrate-Coach (3C) framework aimed at enhancing adaptive learning support by addressing the metacognitive aspect of knowledge monitoring. The framework consists of three phases: Capture, Calibrate, and Coach. In the Capture phase, learners' perceived knowledge states are extracted from open-ended self-reports to create a heterogeneous graph linking learners and knowledge concepts. The Calibrate phase employs a heterogeneous graph neural network (HGNN) to infer latent perceived states for concepts not explicitly mentioned, thereby enabling a systematic assessment of knowledge monitoring. Finally, the Coach phase classifies learners into five metacognitive patterns and provides personalized feedback that targets knowledge gaps and calibration errors. In an evaluation involving 684 students, the framework achieved an AUC of 85.21% in predicting latent perceived states, outperforming baseline methods. A user study with 47 participants indicated a positive reception of the feedback quality; participants particularly valued concrete guidance on knowledge gaps and actionable study strategies. This work contributes to the development of AI-based learning systems that promote accurate self-awareness and support knowledge growth.
Methodology
The methodology involves three main phases: 1) Capture - extracting learners' perceptions using large language models to create a heterogeneous graph; 2) Calibrate - using a heterogeneous graph neural network to infer latent perceived states as a link prediction problem; and 3) Coach - classifying learners into metacognitive patterns and delivering personalized feedback based on their knowledge monitoring abilities.
Results
The framework achieved an AUC of 85.21% in predicting latent perceived states during evaluation with 684 students, significantly outperforming baseline methods. A user study with 47 participants revealed a favorable perception of the feedback quality, highlighting the value of specific feedback on knowledge gaps and actionable guidance.
Implications
The findings suggest that the 3C framework can enhance self-regulated learning by improving learners' metacognitive awareness and providing tailored feedback. This approach could be integrated into educational technologies to support personalized learning experiences and foster better academic outcomes.
MULTISEISMO: A Multimodal Seismic Dataset and Model for Cross-Modal Seismic Understanding
Multimodal
Time Series
- MULTISEISMO is a large-scale multimodal seismic dataset integrating waveform data, geographical imagery, and metadata.
- The dataset includes over 16,000 seismic events spanning 13 years, formatted in a standardized JSON structure.
- MISCE, a multimodal instruction set, enables effective training and evaluation of GMMs on seismic tasks.
- SeisModal, the first domain-specific multimodal model for seismic analysis, shows superior performance compared to general-purpose models.
Summary
The paper introduces MULTISEISMO, a comprehensive multimodal seismic dataset designed to enhance cross-modal understanding in seismology. It addresses the limitations of existing datasets that primarily focus on single modalities, such as time-series data, by integrating waveform recordings, geographical imagery, and contextual metadata into a unified structure. The dataset comprises over 16,000 seismic events collected over 13 years (2010-2023) from various geographical regions, formatted in JSON for ease of use. Additionally, the authors develop a multimodal instruction set, MISCE, which facilitates the training and evaluation of generalist multimodal models (GMMs) on seismic reasoning tasks. They augment an existing model, Unified-IO 2, with a specialized time-series encoder to create SeisModal, the first domain-specific multimodal model tailored for seismic analysis. Evaluation shows that SeisModal significantly outperforms general-purpose models on seismic reasoning tasks, demonstrating the effectiveness of the dataset and the model adaptations. This work not only provides a valuable resource for seismic research but also sets a benchmark for future multimodal studies in specialized scientific domains.
Methodology
The authors developed MULTISEISMO by systematically curating seismic data from the USGS National Earthquake Information Center, integrating multiple modalities into a standardized format. They also created MISCE to guide the training of GMMs on various seismic reasoning tasks. SeisModal was developed by enhancing Unified-IO 2 with a specialized time-series encoder to effectively process seismic waveform data.
Results
Evaluation of SeisModal against other GMMs demonstrated its superior performance in seismic reasoning tasks, particularly in handling time-series data and cross-modal analysis. The results highlight the challenges faced by general-purpose models in processing seismic data and validate the effectiveness of the dataset and the architectural adaptations made.
Implications
The MULTISEISMO dataset and SeisModal model provide essential resources for advancing multimodal AI applications in seismology. The methodologies established can be adapted for other specialized scientific domains, promoting further research and development in multimodal machine learning.
Evolving Robustness–Exploration Trade-off in Online Reinforcement Learning via Quantile Bayesian Risk MDPs
Reinforcement Learning
Theory
Optimization
- Introduces a quantile Bayesian risk-aware MDP framework to manage the robustness-exploration trade-off in online RL.
- Establishes a theoretical foundation for the impact of quantile levels on decision-making under uncertainty.
- Proposes an adaptive quantile schedule that shifts focus from robustness to exploration as data accumulates.
- Demonstrates strong empirical performance in environments with varying exploration demands.
Summary
This paper addresses the challenge of balancing robustness and exploration in online reinforcement learning (RL) under conditions of epistemic uncertainty due to limited data. The authors introduce a quantile Bayesian risk-aware Markov decision process (BR-MDP) framework, where the quantile level modulates how uncertainty influences the decision-making process. They derive an asymptotic normality result that characterizes the relationship between the quantile BR-MDP value and the true environment value, revealing that upper and lower-tail quantiles induce optimism and pessimism, respectively, towards uncertainty. The proposed online Bayesian risk-aware algorithm features an adaptive quantile schedule that prioritizes robustness in the early stages of learning and gradually shifts focus towards exploration of less-visited state-action pairs as data accumulates. The authors establish sublinear Bayesian regret bounds for both the true optimal value and the optimal BR-MDP robust value, demonstrating the effectiveness of their approach through numerical experiments in various environments that require different levels of exploration.
Methodology
The authors develop a quantile Bayesian risk-aware Markov decision process (BR-MDP) that incorporates a quantile level to control the influence of epistemic uncertainty on the Bellman backup. They analyze the robustness-exploration trade-off through theoretical results and propose an online Bayesian risk-aware algorithm with an adaptive quantile schedule.
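A quantile-based Bellman backup of the kind described above can be sketched as follows: for one state-action pair, compute the usual backup under each posterior sample of the transition probabilities, then take a quantile across samples. The empirical-quantile rule, the explicit list of posterior samples, and the exponential `quantile_schedule` are assumptions for illustration, not the authors' algorithm.

```python
import math

def empirical_quantile(values, q):
    """Simple empirical q-quantile (0 < q <= 1) using the sorted order statistics."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[idx]

def quantile_backup(reward, gamma, next_values, posterior_samples, q):
    """Quantile over posterior samples of r + gamma * E_p[V(s')].

    Each element of `posterior_samples` is one sampled transition
    distribution over next states (hypothetical input format).
    """
    backups = [reward + gamma * sum(p * v for p, v in zip(probs, next_values))
               for probs in posterior_samples]
    return empirical_quantile(backups, q)

def quantile_schedule(visits, q_low=0.1, q_high=0.9, scale=10.0):
    """Hypothetical schedule: pessimistic at first, approaching q_high with more data."""
    return q_high - (q_high - q_low) * math.exp(-visits / scale)
```

A low quantile picks a pessimistic backup (robustness), while a high quantile picks an optimistic one that favors exploring uncertain state-action pairs, which matches the direction of the adaptive schedule described in the summary.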
Results
The proposed algorithm achieves sublinear Bayesian regret bounds relative to both the true optimal value and the optimal BR-MDP robust value. Numerical experiments show that the algorithm performs well in both exploration-demanding and exploration-costly environments, effectively balancing robustness and exploration.
Implications
This work has significant implications for applications in high-stakes decision-making scenarios, such as public health interventions and inventory management, where balancing exploration and robustness is crucial. The adaptive approach can enhance learning efficiency in environments with limited interaction budgets.
Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling
Time Series
- Falcon-X addresses the limitations of existing TSFMs by enabling effective cross-variate modeling.
- The model utilizes a Unified Prototype Diff-Attention mechanism for improved semantic alignment.
- Latent Entity Attention allows for efficient cross-variate interactions in a unified latent space.
- Falcon-X demonstrates state-of-the-art performance on benchmark datasets for time series forecasting.
Summary
Falcon-X introduces a novel approach to time series forecasting by addressing the limitations of existing time series foundation models (TSFMs) that primarily focus on univariate data. The paper highlights the challenges of semantic alignment and relational expressivity in cross-variate modeling, where raw-space group mixing fails to adequately capture the complex interactions between heterogeneous physical quantities. To overcome these issues, Falcon-X decouples variates from the raw space and maps them into a unified latent prototype space. This model employs a Unified Prototype Diff-Attention mechanism that evaluates both positive and negative semantic affinities, allowing for more effective alignment of heterogeneous variates. Cross-variate interactions are facilitated through Latent Entity Attention within this shared space, enabling zero-shot structural transfer. Additionally, a Variate Reassembly Router reconstructs variate-specific trajectories through a request-and-dispatch mechanism. Extensive evaluations on the GIFT-Eval and fev-bench benchmarks demonstrate that Falcon-X achieves state-of-the-art forecasting performance, establishing a scalable and principled framework for complex multivariate environments. The model is publicly released to support future research.
Methodology
Falcon-X employs a novel encoder-only architecture with 591 million parameters, decoupling variates into a fixed semantic space. It utilizes a Unified Prototype Diff-Attention mechanism to capture both synergistic and antagonistic relationships among variates. Cross-variate interactions are performed in a unified latent space using Latent Entity Attention, while a Variate Reassembly Router reconstructs original variate trajectories.
Results
Falcon-X achieved state-of-the-art forecasting performance on the GIFT-Eval and fev-bench benchmarks, outperforming existing models in capturing complex multivariate dynamics and demonstrating effective cross-domain transfer capabilities.
Implications
The advancements presented by Falcon-X could significantly enhance forecasting accuracy in various applications, including finance, healthcare, and environmental monitoring, where understanding complex multivariate relationships is crucial.
RotMoLE: Enhancing Mixture of Low-Rank Experts through Rotational Gating Mechanism
NLP
Large Language Models
Efficient ML
- RotMoLE introduces a rotational gating mechanism to enhance expert selection in MoE architectures.
- The framework allows for complex spatial transformations of expert outputs, improving representation and generalization.
- Empirical results show significant performance improvements in multi-task and multilingual learning scenarios.
- RotMoLE leverages low-rank structures to maintain parameter efficiency while enhancing model capabilities.
Summary
The paper introduces RotMoLE, a novel framework that enhances the Mixture of Low-Rank Experts (MoE-LoRA) by incorporating a rotational gating mechanism. Traditional MoE architectures typically utilize scalar gating, which limits their ability to represent complex knowledge effectively. RotMoLE addresses this limitation by implementing a rotation gate for each selected expert, allowing for more sophisticated transformations beyond mere scaling. This approach is particularly beneficial in scenarios with limited expert candidates, such as multi-task and multilingual learning. The authors demonstrate that RotMoLE significantly improves the performance of LLMs in complex tasks by enabling better expert specialization and exploitation. Empirical evaluations confirm the effectiveness of RotMoLE in enhancing the representational capacity of LLMs, particularly in challenging multi-task and multilingual contexts.
Methodology
The authors propose a new MoE framework, RotMoLE, which integrates a rotation gate alongside traditional scaling gates for each expert. This allows for a two-dimensional rotation of expert outputs, enhancing the model's ability to exploit and specialize in diverse data. The framework is built upon the low-rank adaptation principles of MoE-LoRA, ensuring parameter efficiency while expanding the representational capabilities of the model.
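To make the idea of a rotation gate alongside a scaling gate more concrete, here is a minimal sketch that rotates consecutive coordinate pairs of each expert output by a per-expert angle before the scaled sum. The pairing scheme and the names `rotate_pairs` and `gated_expert_mix` are assumptions for illustration; the paper's actual parameterization may differ.

```python
import numpy as np

def rotate_pairs(x, theta):
    """Rotate coordinate pairs (x[0], x[1]), (x[2], x[3]), ... by angle theta."""
    x = np.asarray(x, dtype=float)
    pairs = x.reshape(-1, 2)  # requires an even number of coordinates
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    # Row vectors times rot.T applies the rotation to every pair.
    return (pairs @ rot.T).reshape(x.shape)

def gated_expert_mix(expert_outputs, scales, angles):
    """Combine expert outputs using a scalar gate and a rotation gate per expert."""
    return sum(g * rotate_pairs(h, a)
               for h, g, a in zip(expert_outputs, scales, angles))
```

Setting every angle to zero recovers ordinary scalar gating, which is the baseline the summary contrasts against.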
Results
Empirical evaluations show that RotMoLE outperforms existing MoE architectures in complex multi-task and multilingual training scenarios. The results indicate that the rotational gating mechanism improves the model's ability to adapt and generalize across diverse tasks compared to scalar-gated approaches.
Implications
RotMoLE points towards more efficient and capable LLMs, particularly in applications that require integrating specialized knowledge across multiple domains, such as natural language processing settings where complex task adaptation is essential.
Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training
Reinforcement Learning
Large Language Models
Efficient ML
- Pilot-Commit framework improves rollout allocation efficiency in group-based RL.
- The framework uses a two-stage process to evaluate prompt informativeness and allocate resources accordingly.
- Pilot-Commit achieves baseline accuracy with significantly fewer rollouts compared to existing methods.
- The proposed method adapts to the evolving policy, optimizing the learning signal from prompts.
Summary
This paper addresses the computational inefficiencies in reinforcement learning (RL) for post-training large language models (LLMs), particularly in the context of group-based policy optimization methods. The authors highlight that traditional rollout allocation methods waste computational resources by allocating budgets to prompts with low reward variance, which do not contribute significantly to learning. To mitigate this, they propose a novel framework called Pilot-Commit, which consists of two stages: a pilot stage that estimates the informativeness of prompts using a fraction of the rollout budget, and a commit stage that allocates the remaining budget to high-leverage prompts. This approach allows for more efficient use of rollouts by focusing on prompts that yield the most informative gradients. The paper demonstrates the effectiveness of Pilot-Commit across various math reasoning benchmarks and model sizes, showing that it can achieve baseline accuracy with significantly fewer rollouts compared to existing methods. The results indicate that Pilot-Commit can reach target accuracy up to 1.9 times faster than GRPO and 4.0 times faster than DAPO, highlighting its potential to improve the efficiency of RL post-training.
Methodology
The authors introduce the Pilot-Commit framework, which consists of a pilot stage for estimating prompt informativeness and a commit stage for allocating rollouts to high-leverage prompts. This method decouples evaluation from exploitation, allowing for dynamic adjustment of rollout allocation based on the current policy's performance. The framework is designed to maximize the learning signal by focusing on prompts with high reward variance.
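The two-stage idea can be sketched as follows: a small pilot batch of rollouts per prompt gives a reward-variance estimate, and the remaining budget is split in proportion to that variance. The proportional rule and the function name `pilot_commit_allocation` are assumptions for this example, not the paper's exact allocation scheme.

```python
import numpy as np

def pilot_commit_allocation(pilot_rewards, total_budget):
    """Return extra rollouts per prompt for the commit stage (hypothetical sketch).

    pilot_rewards: list of 1-D arrays, the pilot-stage rewards for each prompt.
    total_budget: total rollouts available, including those spent in the pilot.
    """
    pilot_used = sum(len(r) for r in pilot_rewards)
    remaining = total_budget - pilot_used
    if remaining < 0:
        raise ValueError("pilot stage already exceeds the total budget")
    variances = np.array([np.var(r) for r in pilot_rewards], dtype=float)
    if variances.sum() == 0:
        # No prompt shows any reward variance, so nothing is worth committing to.
        return np.zeros(len(pilot_rewards), dtype=int)
    shares = variances / variances.sum()
    extra = np.floor(shares * remaining).astype(int)
    # Hand rounding leftovers to the highest-variance prompts first.
    leftover = remaining - extra.sum()
    for i in np.argsort(-variances)[:leftover]:
        extra[i] += 1
    return extra
```

Prompts that the current policy always solves or always fails have zero reward variance in the pilot and therefore receive no additional rollouts.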
Results
Pilot-Commit consistently matches baseline accuracy while using up to 1.9 times fewer rollouts than GRPO and 4.0 times fewer than DAPO across different model scales (1.5B to 14B parameters). The method demonstrates significant reductions in sampling costs and faster convergence to target accuracy.
Implications
The Pilot-Commit framework has the potential to enhance the efficiency of RL post-training for large language models, making it feasible to adapt these models to specific objectives with reduced computational resources. This could lead to broader applications in areas requiring fine-tuning of LLMs, such as personalized AI systems and complex decision-making tasks.
AME-TS: Anchored Mixture-of-Experts for Time Series Forecasting
Time Series
- AME-TS utilizes a structure-guided approach to improve expert specialization in time series forecasting.
- The model employs a regime predictor to derive interpretable temporal descriptors that inform expert routing.
- AME-TS achieves a strong accuracy-efficiency tradeoff, outperforming existing models at smaller scales and remaining competitive at larger scales.
- The routing mechanism in AME-TS is more interpretable and stable compared to traditional MoE architectures.
Summary
The paper introduces AME-TS, a novel time series forecasting model that leverages a structure-guided Mixture-of-Experts (MoE) approach to enhance forecasting accuracy and efficiency. Traditional time series models often utilize a shared dense computation path, which fails to account for the diverse temporal structures inherent in different time series data. AME-TS addresses this limitation by employing a lightweight regime predictor that estimates series-level descriptors such as forecastability, seasonality, trend, and sparsity. These descriptors are then used to create a soft structural prior that guides expert routing during training, promoting specialization aligned with the temporal characteristics of the data. The authors demonstrate that AME-TS significantly outperforms existing models on the GIFT-Eval benchmark, particularly at smaller scales, while remaining competitive at larger scales. Additionally, AME-TS exhibits improved interpretability in routing and more stable expert specialization compared to standard MoE models, suggesting that structure-aware routing can effectively enhance the performance of sparse expert models in time series forecasting.
Methodology
AME-TS employs a lightweight regime predictor to extract interpretable temporal descriptors from time series data. These descriptors are mapped to a soft structural prior over experts, which guides token-level routing during training through a prior-alignment loss. This approach encourages specialization aligned with the temporal structure of the data while maintaining flexibility in expert routing.
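One plausible form of a prior-alignment loss is a KL divergence between the soft structural prior and the router's expert distribution, averaged over tokens. The KL direction, the function names, and the array shapes below are assumptions for illustration; the summary does not state the exact loss.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prior_alignment_loss(router_logits, prior, eps=1e-12):
    """Mean KL(prior || router) over tokens (hypothetical sketch).

    router_logits: array of shape (tokens, experts).
    prior: array of shape (experts,) or (tokens, experts), rows summing to 1.
    """
    probs = softmax(np.asarray(router_logits, dtype=float))
    prior = np.broadcast_to(np.asarray(prior, dtype=float), probs.shape)
    kl = np.sum(prior * (np.log(prior + eps) - np.log(probs + eps)), axis=-1)
    return float(kl.mean())
```

Because the prior is soft, a term like this would typically be added to the forecasting loss with a small weight so the router can still deviate from the prior when the data calls for it.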
Results
On the GIFT-Eval benchmark, AME-TS shows a significant improvement in forecasting accuracy and efficiency, outperforming existing small-scale time series foundation models and remaining competitive with larger models. The model also demonstrates more interpretable routing and greater stability in expert specialization during fine-tuning on the M5 dataset.
Implications
The findings suggest that structure-aware routing can enhance the effectiveness of sparse expert models in time series forecasting, making AME-TS a valuable tool for applications in retail demand planning, cloud operations, healthcare, and other domains reliant on accurate time series predictions.
PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection
Large Language Models
NLP
- Prompt sensitivity significantly affects the performance of LLMs in vulnerability detection.
- Standard chain-of-thought prompting outperforms other strategies in operational performance.
- Few-shot prompting benefits are model-dependent and most effective for prompt-sensitive models.
- Adaptive chain-of-thought and self-consistency can lead to reduced recall and increased abstention.
Summary
The paper introduces PromptAudit, a framework designed to evaluate the sensitivity of large language models (LLMs) in the context of vulnerability detection. The authors highlight the variability in LLM performance based on different prompt formulations, which has not been adequately characterized in previous studies. By fixing the dataset, decoding parameters, and evaluation pipeline while varying only the prompting strategy, they conduct a systematic analysis using five prompting strategies across five open-weight models on a dataset of 1,000 Common Vulnerabilities and Exposures (CVEs). The study reveals that standard chain-of-thought prompting yields the best overall performance, while few-shot prompting shows model-dependent advantages. However, adaptive chain-of-thought prompting often leads to reduced recall, and self-consistency can cause excessive abstention, negatively impacting effective performance. The findings underscore the importance of prompt sensitivity as a critical aspect of LLM evaluation and deployment, suggesting that it should be treated as a first-class property in vulnerability detection tasks.
Methodology
The authors developed PromptAudit, a controlled evaluation framework that isolates the effects of different prompting strategies by fixing other variables such as dataset and decoding parameters. They evaluated five prompting strategies (zero-shot, few-shot, chain-of-thought, adaptive chain-of-thought, and self-consistency) across five open-weight models using a dataset of 1,000 CVEs, measuring performance metrics including accuracy, recall, abstention, coverage, and effective F1.
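A small sketch of how abstention-aware metrics of this kind could be computed is shown below. The summary does not define "effective F1", so the definition used here (an abstention on a vulnerable sample counts as a missed detection) is an assumption, as is the function name `audit_metrics`.

```python
def audit_metrics(labels, predictions):
    """Coverage, abstention, recall and an assumed effective F1.

    labels: list of bool, True when the sample is vulnerable.
    predictions: list of True, False, or None (None means the model abstained).
    """
    n = len(labels)
    coverage = sum(p is not None for p in predictions) / n
    tp = sum(1 for y, p in zip(labels, predictions) if p is True and y)
    fp = sum(1 for y, p in zip(labels, predictions) if p is True and not y)
    # Assumption: abstaining on a vulnerable sample is treated as a miss.
    fn = sum(1 for y, p in zip(labels, predictions) if y and p is not True)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"coverage": coverage, "abstention": 1.0 - coverage,
            "recall": recall, "effective_f1": f1}
```

Under this definition a strategy that abstains often can look precise on the samples it answers while still scoring poorly overall, which matches the behaviour the summary attributes to self-consistency.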
Results
The study found that standard chain-of-thought prompting achieved the highest operational performance, while few-shot prompting provided benefits that varied by model. Adaptive chain-of-thought prompting often suppressed recall, and self-consistency led to excessive abstention, negatively affecting effective performance. The results highlighted that prompt sensitivity is a significant factor influencing vulnerability detection outcomes.
Implications
The findings suggest that prompt design is crucial for the reliability of LLM-based vulnerability detection systems. By understanding and characterizing prompt sensitivity, developers can improve the operational security of these systems and enhance their effectiveness in real-world applications. The PromptAudit framework also facilitates reproducible research and comparison across studies in this domain.
Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storage
Optimization
Efficient ML
- Introduces a reparametrization of preconditioners in Shampoo-based methods to support BFP16 storage.
- Reduces computational overhead by updating only part of the basis through QR decomposition in a subspace.
- Improves performance of SOAP and KL-SOAP methods, closing the performance gap with KL-Shampoo.
- Compatible with various subspace selection strategies, enhancing flexibility in optimization.
Summary
This paper addresses the computational inefficiencies associated with Shampoo-based optimization methods, specifically KL-Shampoo and SOAP, which rely on QR decomposition for training neural networks. Traditional QR implementations necessitate single-precision (FP32) arithmetic, making them costly in terms of time and memory, particularly with large preconditioning matrices. The authors propose a novel reparametrization of the preconditioner that allows for BFloat16 (BFP16) storage without sacrificing performance. By updating only a subset of the basis vectors through QR decomposition in a subspace, the proposed method reduces computational overhead and mitigates performance degradation linked to BFP16 storage. This reparametrization is compatible with various subspace selection strategies and enhances the performance of SOAP and KL-SOAP, enabling KL-SOAP to match or exceed the performance of KL-Shampoo. Overall, the approach significantly improves the efficiency of Shampoo-based methods in terms of memory and computation.
Methodology
The authors propose a reparametrization that stores {λ_i, Q_i, P_i} instead of {λ_i, Q_i, S_i}, where P_i is derived from the preconditioning matrix S_i. This allows for efficient updates of a subset of orthogonal basis vectors in Q_i via QR decomposition in subspaces of P_i, enabling the use of BFP16 storage while maintaining performance. The method is empirically validated across various Shampoo-based optimization techniques.
Results
Empirical results show that the reparametrized methods outperform traditional implementations, particularly in scenarios utilizing BFP16 storage. The performance of KL-SOAP is notably improved, allowing it to match or exceed KL-Shampoo's performance, demonstrating the effectiveness of the proposed reparametrization.
Implications
This work has significant implications for optimizing neural network training, particularly in resource-constrained environments where memory efficiency is critical. The proposed methods can lead to faster training times and lower memory usage, making advanced optimization techniques more accessible for large-scale applications.
Innovation: An Almost Characterization of Hallucination
NLP
Large Language Models
Theory
- Introduces 'innovation' as a simpler property related to hallucination in LLMs.
- Establishes a relationship between innovation and hallucination, showing they are nearly equivalent.
- Provides new lower bounds on hallucination rates based on the innovation rate.
- Demonstrates that increasing training data does not eliminate hallucination once innovation occurs.
Summary
This paper addresses the phenomenon of hallucination in large language models (LLMs), which refers to the generation of plausible but factually incorrect statements. Building on the probabilistic framework established by Kalai and Vempala, the authors introduce a new property called 'innovation' that quantifies a model's tendency to produce outputs not present in the training data. The paper establishes that innovation is closely related to hallucination, such that hallucination implies innovation and vice versa, with high probability. The authors provide lower bounds on the hallucination rate based on the innovation rate, demonstrating that increasing training data alone cannot eliminate hallucination once innovation occurs. Furthermore, they relate the innovation rate back to the concept of missing mass, yielding new lower bounds on hallucination that extend previous results. This work not only simplifies the understanding of hallucination in LLMs but also provides a framework for future research on mitigating this issue.
Methodology
The authors develop a theoretical framework that builds on the work of Kalai and Vempala, using probabilistic modeling to define and analyze the concepts of calibration, innovation, and hallucination. They derive implications and bounds through rigorous mathematical reasoning and statistical analysis.
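For intuition about the quantities involved, the sketch below computes the classical Good-Turing estimate of missing mass (the fraction of observations that occur exactly once) and a simple empirical analogue of an innovation rate (the fraction of generated outputs absent from the training data). Both function names are hypothetical, and the paper's formal definitions are probabilistic rather than these sample statistics.

```python
from collections import Counter

def good_turing_missing_mass(observations):
    """Good-Turing estimate: share of observations whose value was seen only once."""
    if not observations:
        raise ValueError("need at least one observation")
    counts = Counter(observations)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observations)

def empirical_innovation_rate(generated, training):
    """Fraction of generated outputs that never appear in the training data."""
    if not generated:
        raise ValueError("need at least one generated output")
    seen = set(training)
    return sum(1 for g in generated if g not in seen) / len(generated)
```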
Results
The main results include the establishment of innovation as a nearly characterizing property of hallucination, the derivation of lower bounds on hallucination rates based on the innovation rate, and the demonstration that hallucination cannot be eliminated merely by increasing the size of the training corpus.
Implications
The findings suggest that understanding and mitigating hallucination in LLMs requires a focus on the innovation rate, rather than solely on calibration or corpus size. This could influence future research directions in LLM development and the design of training datasets.
Variational Inference for Evidential Deep Learning
Theory
Interpretability
Computer Vision
- Introduces a principled variational framework for Evidential Deep Learning (VI-EDL).
- Derives an Evidence Lower Bound (ELBO) to control evidence growth and enhance uncertainty quantification.
- Establishes theoretical generalization guarantees, validating the heuristic parameter setting in conventional EDL.
- Demonstrates state-of-the-art performance in various applications, including out-of-distribution detection and noise detection.
Summary
This paper addresses the limitations of conventional Evidential Deep Learning (EDL), which quantifies uncertainty in predictions by modeling class probabilities as a Dirichlet distribution. The authors identify two main issues: the Kullback-Leibler (KL) penalty only suppresses evidence of negative classes, leading to inflated evidence and reduced uncertainty quantification, and the lack of theoretical justification for the Dirichlet parameter setting. To overcome these challenges, the authors propose a new framework called Variational Inference Evidential Deep Learning (VI-EDL). This framework reformulates evidential learning using variational inference, deriving an Evidence Lower Bound (ELBO) that prevents excessive evidence growth. The paper rigorously establishes a generalization bound, demonstrating how predicted uncertainty, feature complexity, and network complexity influence this bound. The authors validate their approach through extensive experiments on visual and medical datasets, showing that VI-EDL achieves state-of-the-art performance in tasks such as out-of-distribution detection and noise detection, particularly in safety-critical applications like autonomous driving.
Methodology
The authors reformulate the EDL framework using variational inference, deriving an ELBO that serves as a mathematically rigorous objective. They also introduce a cosine prototype layer to regulate evidence magnitude and provide a theoretical basis for the Dirichlet parameterization through Bayesian conjugate updating.
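For readers unfamiliar with the evidential setup, the sketch below shows the conventional EDL mapping from non-negative evidence to Dirichlet parameters, expected class probabilities, and a scalar uncertainty. It illustrates the baseline formulation that the paper revisits, not the VI-EDL objective itself; the function name is an assumption.

```python
import numpy as np

def dirichlet_from_evidence(evidence):
    """Conventional EDL: alpha = evidence + 1, uncertainty = K / sum(alpha)."""
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0            # Dirichlet concentration parameters
    strength = alpha.sum()            # total Dirichlet strength
    probs = alpha / strength          # expected class probabilities
    uncertainty = len(alpha) / strength
    return probs, uncertainty
```

With zero evidence the uncertainty equals 1 and the prediction is uniform; as evidence for any class grows, the uncertainty shrinks, which is why unchecked evidence growth leads to overconfident outputs.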
Results
The proposed VI-EDL framework outperforms conventional EDL and other methods on standard visual and medical datasets, achieving significant improvements in out-of-distribution detection and noise detection tasks. The theoretical analysis confirms that the choice of Dirichlet parameter minimizes generalization error.
Implications
The findings suggest that VI-EDL can be effectively applied in safety-critical domains such as autonomous driving and biomedical applications, where accurate uncertainty quantification is crucial for decision-making.
Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets
Optimization
- Introduction of SILO, a framework for protein design under oracle budgets.
- Utilization of hierarchical edit policy for structured mutation proposals.
- Implementation of incremental stochastic beam search and UCB-based selection for candidate evaluation.
- Demonstrated superior performance across multiple protein fitness landscapes compared to existing methods.
Summary
This paper presents SILO, a novel framework for protein sequence optimization that addresses the challenges of working under tight oracle budgets. Traditional reinforcement learning and generative approaches often struggle with surrogate noise and position-agnostic mutations, which can disrupt critical residues. SILO introduces a hierarchical edit policy that breaks down mutations into position and residue choices, enhancing the exploration of the protein sequence space. The framework employs incremental stochastic beam search (SBS) for trajectory generation and utilizes a UCB-based proxy ensemble combined with an alanine-scan fitness score (AFS) to prioritize functionally relevant edits for evaluation. The policy is updated using next-action cross-entropy imitation based on the best oracle-labeled trajectories, avoiding the pitfalls of value-function estimation. The empirical results demonstrate that SILO outperforms five strong baselines across eight protein fitness landscapes, achieving the highest maximum and top-100 mean fitness scores. It shows resilience in low-data and noisy-proxy scenarios, maintaining competitive performance where other methods degrade. The findings suggest that the combination of structured sampling and trajectory-level imitation learning significantly enhances protein design efficiency under strict evaluation constraints.
Methodology
SILO employs a hierarchical edit policy that decomposes mutations into position and residue choices. It utilizes incremental stochastic beam search (SBS) for generating diverse candidate trajectories and a UCB-based acquisition function augmented by an alanine-scan fitness score (AFS) to prioritize functionally relevant mutations. The policy is updated through next-action cross-entropy imitation learning based on the best-performing trajectories from oracle evaluations.
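A UCB-style acquisition over a proxy ensemble can be sketched as the ensemble mean plus a multiple of the ensemble standard deviation, with an optional additive bonus standing in for a score such as AFS. The additive combination, the `beta` weight, and the function name `ucb_select` are assumptions for this example rather than SILO's exact rule.

```python
import numpy as np

def ucb_select(ensemble_predictions, k, beta=1.0, bonus=None):
    """Pick the k candidates with the highest UCB score (hypothetical sketch).

    ensemble_predictions: array of shape (n_models, n_candidates).
    bonus: optional array of shape (n_candidates,), e.g. a biological prior score.
    """
    preds = np.asarray(ensemble_predictions, dtype=float)
    score = preds.mean(axis=0) + beta * preds.std(axis=0)
    if bonus is not None:
        score = score + np.asarray(bonus, dtype=float)
    top = np.argsort(-score)[:k]  # indices sorted by descending score
    return top, score
```

Only the selected candidates would then be sent to the oracle, which is how a scheme like this keeps the number of expensive evaluations within budget.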
Results
SILO achieved the highest maximum and top-100 mean fitness on all eight evaluated protein fitness landscapes, outperforming five strong baselines. In scenarios with low data and noisy proxies, SILO maintained competitive performance, demonstrating its robustness and efficiency in protein design tasks.
Implications
The findings suggest that SILO can significantly improve the efficiency of protein design processes, particularly in applications requiring rapid and accurate evaluations under strict resource constraints. This could have implications for therapeutic protein development and enzyme engineering.
Geometric Flow Matching for Molecular Conformation Generation via Manifold Decomposition
Generative Models
- GO-Flow introduces a manifold-aware approach to molecular conformation generation, decomposing the process into translation, rotation, and conformation subspaces.
- The method employs tailored flow matching objectives that respect the geometry of each subspace, avoiding the need for models to relearn basic geometric constraints.
- GO-Flow achieves state-of-the-art performance in generating molecular conformations, demonstrating high fidelity with as few as 50 sampling steps.
- The framework encourages rotation-consistent generation and improves geometric validity by integrating physical inductive biases.
Summary
The paper addresses the challenge of generating accurate 3D molecular conformations, which is crucial for computational chemistry and drug discovery. Traditional methods often treat molecules as unstructured point clouds in Cartesian space, neglecting the hierarchical nature of molecular mechanics, which can lead to physically implausible structures. The authors propose a novel framework called GO-Flow, which aligns generative modeling with molecular geometry through manifold decomposition. GO-Flow decomposes the generation process into three subspaces: translation, rotation, and conformation, each modeled with tailored flow matching objectives. This approach incorporates geometric inductive biases, enhancing the model's ability to generate chemically valid conformations. The paper demonstrates that GO-Flow achieves state-of-the-art generation quality on benchmark datasets, enabling high-fidelity sampling with significantly fewer steps compared to existing methods, thus improving both structural precision and computational efficiency.
Methodology
The authors propose GO-Flow, which decomposes the molecular generation process into three physically motivated subspaces: translation space using linear optimal transport, rotation space modeled with geodesic flows on the SO(3) manifold, and conformation space utilizing entropic optimal transport. This decomposition allows for the injection of geometric inductive biases into the generative process, enhancing the model's alignment with molecular mechanics.
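The rotation component relies on geodesics of SO(3). As a minimal sketch of what such an interpolation path looks like, the code below moves from one rotation matrix to another along the shortest rotation, using the matrix exponential and logarithm in closed form. The helper names are assumptions, and the logarithm shown is not numerically robust for rotation angles close to 180 degrees.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix."""
    x, y, z = w
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def so3_exp(w):
    """Rodrigues formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = hat(np.asarray(w, dtype=float) / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def so3_log(R):
    """Rotation matrix -> rotation vector (assumes angle is not near pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-12:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * axis / (2.0 * np.sin(theta))

def so3_geodesic(R0, R1, t):
    """Point at time t in [0, 1] on the geodesic from R0 to R1."""
    return R0 @ so3_exp(t * so3_log(R0.T @ R1))
```

In a flow matching setup, points along such a path (and the constant angular velocity that generates them) would serve as regression targets for the rotation subspace.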
Results
Extensive experiments on the GEOM-Drugs and GEOM-QM9 datasets show that GO-Flow achieves state-of-the-art generation quality, with the ability to generate high-fidelity molecular conformations using significantly fewer sampling steps compared to existing methods.
Implications
The proposed framework has potential applications in computational chemistry and drug discovery, where accurate molecular conformation generation is critical. By improving the efficiency and accuracy of generative models, GO-Flow could facilitate faster drug design and virtual screening processes.
Deep ZakaiJ: Structured Filtering for Jump-Diffusion Time Series Forecasting
Time Series
- Deep ZakaiJ integrates the Zakai nonlinear filtering equation into a neural framework for structured inference in jump-diffusion systems.
- The encoder employs a three-step process for belief updating, achieving first-order accuracy in filtering evolution.
- The decoder is designed to parameterize key dynamics conditioned on filtered beliefs, enhancing interpretability.
- Empirical results show improved distributional quality and well-calibrated predictive intervals across various datasets.
Summary
The paper introduces Deep ZakaiJ, a novel framework for forecasting time series characterized by unobserved latent states and abrupt jump discontinuities. Traditional jump-diffusion models are limited by rigid parametric forms, while recent neural models often fail to infer hidden states. Deep ZakaiJ addresses this gap by embedding the Zakai nonlinear filtering equation into a neural encoder-decoder architecture. The encoder updates beliefs about the latent state through a three-step process: prior propagation, diffusion innovation, and jump innovation, achieving a first-order accurate approximation of the filtering evolution. The decoder, a structured jump-diffusion model, conditions on the filtered belief to maintain a clear distinction between continuous dynamics and discrete shocks. Experiments on synthetic, financial, and oceanographic datasets demonstrate that Deep ZakaiJ enhances distributional forecasts, maintains competitive point accuracy, and achieves well-calibrated predictive intervals while recovering interpretable latent structures.
Methodology
Deep ZakaiJ employs a neural encoder-decoder architecture where the encoder discretizes the Zakai equation using Strang splitting into three interpretable steps. The decoder is a structured jump-diffusion model that learns dynamics conditioned on the filtered latent belief, allowing for effective forecasting of jump-diffusion processes.
Results
The framework demonstrated significant improvements in distributional forecasting quality while remaining competitive in point accuracy. It achieved well-calibrated predictive intervals near the nominal 90% level and successfully recovered interpretable latent structures in various datasets, including synthetic, financial, and oceanographic data.
Implications
Deep ZakaiJ has potential applications in fields where time series data exhibit abrupt changes, such as finance and environmental monitoring. Its ability to provide interpretable latent structures and accurate forecasts can enhance decision-making processes in these domains.
Few-shot Cross-country Generalization of Tabular Machine Learning and Foundation Models for Childhood Anemia Prediction under Distribution Shift
Theory
Interpretability
- TabPFN outperforms traditional models in data-scarce environments for childhood anemia prediction.
- Predictive performance is constrained by cross-population heterogeneity rather than model architecture.
- The study highlights the importance of addressing population-level structural challenges in global health.
- Feature importance analysis identifies child age, altitude, and height-for-age as key predictors of anemia.
Summary
This paper addresses the challenge of predicting childhood anemia, which affects approximately 40% of children aged 6–59 months globally, particularly in data-scarce settings. The authors explore the effectiveness of a transformer-based tabular foundation model (TabPFN) in comparison to traditional supervised machine learning methods (Logistic Regression, XGBoost, and LightGBM) across diverse countries. Utilizing Demographic and Health Surveys (DHS) data from 16 countries, the study evaluates model performance under varying data conditions, particularly focusing on few-shot learning scenarios. The findings reveal that TabPFN consistently outperforms traditional models in data-scarce environments, demonstrating superior discrimination and probabilistic calibration. The study emphasizes that predictive performance is more influenced by cross-population heterogeneity and data characteristics than by the choice of model architecture. The results suggest that foundation models like TabPFN can significantly enhance anemia prediction in low-resource settings, shifting the focus from model selection to addressing structural challenges in population health.
Methodology
The study conducted a multi-country prediction analysis using harmonized data from Demographic and Health Surveys (DHS) involving 68,856 children aged 6–59 months. Various machine learning models were trained and evaluated, including Logistic Regression, XGBoost, and LightGBM, alongside the TabPFN model in an in-context learning setting. Performance metrics included AUC-ROC, Brier score, and expected calibration error, with robustness assessed through few-shot learning and leave-one-country-out validation.
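For readers unfamiliar with the two calibration metrics named above, the following is a minimal sketch of how they are commonly computed for a binary outcome. The function names, the equal-width binning, and the default of 10 bins are assumptions made for illustration; the summary does not say which ECE variant the study used.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width binned ECE (an assumed variant): the weighted mean of
    |observed positive rate - mean predicted probability| over non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        confidence = sum(p for _, p in members) / len(members)
        ece += (len(members) / n) * abs(observed - confidence)
    return ece
```

Lower is better for both metrics: a Brier score of 0 means every predicted probability matched its outcome exactly, and an ECE of 0 means predicted probabilities match observed frequencies within every bin.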
Results
TabPFN demonstrated superior performance in data-scarce conditions, achieving a mean Brier score of 0.203 and an expected calibration error of 0.042 across 16 countries. Traditional models showed rapid performance degradation with fewer training samples. Under full-data conditions, AUC-ROC values ranged from 0.59 to 0.76, with significant variability across countries. The study found that predictive performance was primarily influenced by the target country rather than the model used.
Implications
The findings suggest that transformer-based foundation models can significantly improve predictive capabilities for childhood anemia in low-resource settings. This approach may lead to better health outcomes by providing more accurate risk assessments and guiding interventions in diverse epidemiological contexts.
Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret, Geometric Barrier, and Bandit Feedback
Optimization
Theory
- OGD achieves O(√T) regret for hidden-convex losses under exact gradient feedback, matching the optimal rate for adversarial online convex optimization.
- The paper introduces a necessary-and-sufficient Hessian compatibility condition, expanding the class of reparameterizations that can be used.
- A lower bound is established, demonstrating that without Hessian compatibility, OGD can incur Ω(T) regret.
- The analysis is extended to bandit feedback, achieving a regret bound of O(T^(3/4)).
Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret, Geometric Barrier, and Bandit Feedback
Summary
This paper investigates adversarial online learning with hidden-convex losses, which are nonconvex losses that can be transformed into convex ones through a nonlinear reparameterization. Previous work by Ghai et al. (2022) established that online gradient descent (OGD) on these nonconvex losses can simulate online mirror descent (OMD) on the corresponding convex losses, yielding a regret bound of O(T^(2/3)). Building on this, the authors demonstrate that OGD can achieve the optimal regret of O(√T) under certain geometric and smoothness conditions. They refine the understanding of the necessary geometric conditions for algorithmic equivalence by replacing the diagonal-Jacobian condition with a more general Hessian compatibility condition, thus broadening the class of admissible reparameterizations. The authors also provide a lower bound showing that Hessian compatibility is essential for achieving sublinear regret, and they extend their analysis to one-point bandit feedback, proving a regret bound of O(T^(3/4)).
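The setting described above can be written compactly as follows. The symbols f_t, g_t, c, and K are chosen here for illustration and may differ from the paper's own notation.

```latex
% Hidden convexity: each nonconvex loss is a convex function of reparameterized variables
f_t(x) = g_t\bigl(c(x)\bigr), \qquad g_t \text{ convex}, \quad c \text{ a nonlinear reparameterization}

% Regret of the iterates x_1, \dots, x_T against the best fixed point in the feasible set
\mathrm{Regret}_T = \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x)
```

A regret bound that grows sublinearly in T means the average per-round gap to the best fixed point vanishes as T grows, whereas linear regret means the learner never closes that gap.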
Methodology
The authors employ a sharper discrete-time algorithmic equivalence argument to analyze the performance of OGD on hidden-convex losses. They establish necessary geometric conditions for the equivalence with OMD and provide both upper and lower bounds on regret. The analysis includes a detailed examination of Hessian compatibility and its implications for regret performance.
Results
The main results include proving that OGD achieves O(√T) regret for hidden-convex losses under specific conditions, establishing a necessary and sufficient condition for Hessian compatibility, and demonstrating that without this condition, OGD can incur linear regret. Additionally, the authors provide a regret bound of O(T^(3/4)) for bandit feedback scenarios.
Implications
The findings have significant implications for online learning algorithms, particularly in settings where losses exhibit hidden convexity, such as neural network training and reinforcement learning. The results suggest that with appropriate reparameterizations, first-order methods can achieve optimal performance even in nonconvex scenarios.
SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation
Time Series
- SL-BiLEM effectively integrates behavioral dynamics into epidemic modeling, addressing feedback loops caused by human responses to disease spread.
- The framework shows a 76% improvement over neural-mechanistic baselines and significantly reduces out-of-distribution degradation.
- SL-BiLEM provides 100% bootstrap confidence interval coverage across synthetic counterfactual experiments, demonstrating its robustness.
- The model achieves Treatment Effect Accuracy exceeding 0.85, indicating strong performance in counterfactual policy evaluation.
SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation
Summary
The paper introduces SL-BiLEM, a novel framework for epidemic forecasting and policy evaluation that addresses the challenge of human behavioral responses to disease spread, which can create feedback loops and distribution shifts that undermine traditional data-driven models. SL-BiLEM incorporates physical constraints to ensure robust extrapolation and decomposes effective transmission into interpretable factors, allowing for both accurate forecasting and counterfactual analysis. The framework is validated on three real-world datasets (cruise ship, school influenza, and school-district COVID-19 surveillance) and demonstrates significant improvements over existing models, particularly in handling policy-induced shifts. The authors highlight the importance of learnable behavioral dynamics, interpretability, and reproducibility in public health decision-making, positioning SL-BiLEM as a valuable tool for accurate predictions and principled intervention planning.
Methodology
SL-BiLEM employs a structured approach to epidemic modeling by decomposing effective transmission into multiple factors influenced by policy, media, and compliance. It incorporates physical constraints to ensure that the learned compliance function adheres to realistic behavioral patterns. The model is calibrated using real-world case data, allowing it to adapt to changing dynamics while maintaining interpretability and transparency.
Results
The framework demonstrated a 76% improvement in forecasting accuracy compared to neural-mechanistic baselines, with only 53% out-of-distribution degradation versus 1142% for traditional neural models. Additionally, it achieved 100% bootstrap confidence interval coverage in synthetic counterfactual experiments and a Treatment Effect Accuracy greater than 0.85, validating its effectiveness in both forecasting and policy evaluation.
Implications
SL-BiLEM has significant implications for public health decision-makers, providing a reliable tool for epidemic forecasting and policy evaluation. Its ability to accurately predict outcomes and evaluate counterfactual scenarios can enhance resource allocation and intervention planning during health crises.
Balancing Plasticity and Stability with Fast and Slow Successor Features
Reinforcement Learning
Robotics
Theory
- Introduces a continual RL setup with smooth, continuous non-stationarity.
- Demonstrates that performance degradation under non-stationarity is primarily due to instability rather than insufficient plasticity.
- Proposes a framework integrating Successor Features with multi-timescale synaptic consolidation.
- Utilizes cross-attention over SFs to provide insights into the distribution of stability and plasticity across temporal dimensions.
Balancing Plasticity and Stability with Fast and Slow Successor Features
Summary
This paper addresses the challenges faced by deep Reinforcement Learning (RL) agents in adapting to non-stationary environments, particularly those that evolve gradually rather than through abrupt changes. The authors introduce a novel evaluation protocol for continual RL that incorporates naturalistic, continuous non-stationarity, using modified 3D Miniworld and MuJoCo environments. They systematically investigate the balance between stability and plasticity, revealing that methods emphasizing stability, such as synaptic consolidation, outperform those focused solely on plasticity, like parameter resetting. The study further explores the use of Successor Features (SFs) as consolidation targets, demonstrating that applying neuro-inspired synaptic consolidation to SFs enhances performance in continually changing settings. The findings indicate that stability is crucial in gradual learning scenarios, and that multi-timescale consolidation of predictive representations is an effective strategy for managing the stability-plasticity dilemma in RL.
Methodology
The authors modified existing RL environments to create scenarios with gradual non-stationarity, employing stochastic drift processes. They compared various mechanisms that enhance plasticity against those that maintain stability, focusing on the integration of Successor Features with synaptic consolidation across multiple timescales.
Results
The study found that stability-focused methods significantly outperformed plasticity-focused methods in environments with continuous non-stationarity. The integration of SFs with multi-timescale consolidation yielded superior performance, highlighting the importance of stability in gradual learning contexts.
Implications
These findings suggest that RL agents can be designed to better handle real-world, continually changing environments by prioritizing stability and employing multi-timescale learning strategies. This could enhance the adaptability of AI systems in dynamic settings, such as robotics and autonomous systems.
Towards the Connection between Activation Sparsity and Flat Minima
Theory
Efficient ML
- Activation sparsity is linked to the flatness of loss landscapes in deep networks.
- The authors introduce the concept of derivative sparsity, which aids in pruning during backpropagation.
- Proposed modifications can effectively enhance activation sparsity and reduce computational costs.
- Empirical results show at least 36% improvement in inference sparsity and 50% in training sparsity over standard Transformers.
Towards the Connection between Activation Sparsity and Flat Minima
Summary
This paper investigates the relationship between activation sparsity in multi-layer perceptron (MLP) blocks of Transformers and the concept of flat minima in loss landscapes. The authors argue that existing explanations for activation sparsity rely on strong assumptions that do not hold in standard training of deep networks. They propose that the flatness of loss landscapes is a more general factor influencing activation sparsity. The study establishes that activation sparsity can be expressed as a ratio involving 'augmented flatness' and the product of input norm and activation gradient. This ratio decreases during training, leading to increased sparsity in activations. The authors introduce the concept of derivative sparsity, which is more stable than activation sparsity and facilitates pruning during backpropagation. They suggest several modifications to encourage activation sparsity, including adding bias vectors to input tokens, constraining LayerNorm parameters, and introducing a new activation function called JSReLU. These modifications effectively reduce the ratio and enhance activation sparsity. Experiments on ImageNet-1K and C4 datasets demonstrate significant improvements in inference and training sparsity, indicating potential for reduced computational costs without sacrificing performance.
Methodology
The authors analyze the relationship between activation sparsity and flat minima through theoretical derivations and empirical experiments. They propose modifications to the training process, including the addition of bias vectors, constraints on LayerNorm parameters, and the introduction of a new activation function (JSReLU) to enhance sparsity. The effectiveness of these methods is validated through experiments on standard datasets.
Results
The proposed methods resulted in at least a 36% increase in inference sparsity and a 50% increase in training sparsity compared to vanilla Transformers, demonstrating the effectiveness of the approach in reducing computational costs.
Implications
The findings suggest that understanding and leveraging activation sparsity can lead to more efficient training and inference in deep learning models, potentially reducing energy consumption and computational requirements in various applications.
Ratio-Variance Regularized Policy Optimization
Reinforcement Learning
Large Language Models
Robotics
- R2VPO eliminates the need for binary hard clipping in policy optimization.
- The method preserves critical gradient signals while down-weighting stale data.
- R2VPO shows significant performance improvements across diverse tasks, especially with smaller models.
- The approach enhances sample efficiency by effectively utilizing off-policy data.
Ratio-Variance Regularized Policy Optimization
Summary
This paper introduces Ratio-Variance Regularized Policy Optimization (R2VPO), a novel approach to on-policy reinforcement learning that addresses the limitations of traditional clipping methods used in policy optimization. Standard methods like Proximal Policy Optimization (PPO) utilize heuristic clipping to enforce trust regions, which can lead to the loss of valuable gradient information from high-return updates. R2VPO proposes a principled alternative by constraining the variance of policy ratios, allowing for a more nuanced approach to policy updates. This method acts as a 'soft brake' on updates, preserving critical information while down-weighting stale data. The authors employ a primal-dual optimization framework to implement this variance constraint, enabling efficient use of off-policy data. Extensive evaluations across various tasks, including large language models (LLMs) and robotic control, demonstrate that R2VPO significantly outperforms traditional clipping-based methods, achieving substantial performance gains and improved sample efficiency, particularly in sparse-reward and dynamic environments.
Methodology
The paper formulates a primal-dual optimization framework that constrains the variance of policy ratios instead of relying on hard clipping. This approach allows for a more flexible and efficient update mechanism, preserving important gradient information while managing the divergence of policy updates.
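To make the idea of a variance constraint handled by primal-dual optimization more concrete, here is a generic Lagrangian sketch. It is not the paper's objective: the loss form, the threshold `delta`, the dual learning rate `lr`, and all function names are assumptions made for illustration only.

```python
def ratio_variance(ratios):
    """Population variance of the policy ratios pi_new(a|s) / pi_old(a|s)."""
    mean = sum(ratios) / len(ratios)
    return sum((r - mean) ** 2 for r in ratios) / len(ratios)


def lagrangian_loss(ratios, advantages, lam, delta):
    """Primal objective (to minimize): negative importance-weighted advantage
    plus lam times the violation of the assumed constraint Var[ratio] <= delta."""
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)
    return -surrogate + lam * (ratio_variance(ratios) - delta)


def dual_ascent_step(lam, ratios, delta, lr):
    """Dual update: raise lam when the constraint is violated, lower it otherwise,
    and keep the multiplier non-negative."""
    return max(0.0, lam + lr * (ratio_variance(ratios) - delta))
```

In this sketch no sample is ever clipped out of the gradient. Instead, the multiplier grows while the ratios spread beyond the threshold, which penalizes large deviations smoothly, in the spirit of the "soft brake" described in the summary.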
Results
R2VPO achieves a macro-average relative gain of +35% across various benchmarks, with improvements of up to +138% on smaller models. It consistently outperforms PPO and other strong baselines in both mathematical reasoning tasks and continuous control environments, demonstrating superior exploration capabilities and stability.
Implications
The findings suggest that ratio-variance regularization can serve as a robust foundation for more stable and data-efficient policy optimization in reinforcement learning, with potential applications in fine-tuning large language models and enhancing robotic control systems.
SilIF: Silhouette-Augmented Isolation Forest for Unsupervised Transaction Fraud Detection
Theory
Interpretability
Efficient ML
- SilIF enhances Isolation Forest by adding a silhouette-based scoring layer.
- Demonstrated a statistically significant improvement in fraud detection performance on the IEEE-CIS benchmark.
- Characterizes conditions under which the silhouette augmentation is beneficial or not.
- Provides open-source code for reproducibility of results.
SilIF: Silhouette-Augmented Isolation Forest for Unsupervised Transaction Fraud Detection
Summary
This paper introduces SilIF, a novel augmentation to the Isolation Forest (IF) algorithm aimed at enhancing unsupervised transaction fraud detection. The primary motivation behind SilIF is to leverage the structural information discarded in the traditional IF scoring process. SilIF incorporates a silhouette-based scoring mechanism that evaluates how well a transaction fits within its assigned cluster compared to the nearest alternative. This is achieved by extracting path lengths from each tree in the forest, clustering these 'fingerprints', and calculating silhouette scores. The integration of this silhouette score with the base IF score is controlled by a hyperparameter, α. The proposed method is evaluated on the IEEE-CIS Fraud Detection benchmark, where SilIF demonstrates a statistically significant improvement over the standard IF, achieving an average increase of +0.0080 AUC-PR across five experimental seeds. However, on a synthetic dataset (Sparkov), the silhouette augmentation did not yield improvements, prompting a discussion on the conditions that influence the effectiveness of the augmentation. The paper emphasizes the ease of deployment and tunability of SilIF, providing code and experimental scripts for reproducibility.
Methodology
SilIF modifies the Isolation Forest algorithm by adding a post-hoc silhouette scoring layer. It computes path lengths from each tree in the forest for each transaction, clusters these path lengths into structural groups, and calculates silhouette scores to assess the fit of each transaction within its cluster. The silhouette score is then combined with the base IF score using a hyperparameter α, allowing for flexibility in the contribution of the silhouette signal.
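The silhouette step can be sketched as below, assuming the per-tree path-length fingerprints and their cluster labels have already been computed (with at least two clusters). The silhouette formula is the standard one; the combination rule in `combine_scores` is a hypothetical placeholder, since the summary does not state how the two scores are actually merged.

```python
import math


def silhouette_scores(points, labels):
    """Standard silhouette s(i) = (b - a) / max(a, b) with Euclidean distance.
    Assumes at least two distinct cluster labels are present."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def combine_scores(if_scores, silhouettes, alpha):
    """Hypothetical rule (an assumption, not the paper's): treat higher scores as
    more anomalous and let a poor cluster fit (low silhouette) raise the score."""
    return [s - alpha * sil for s, sil in zip(if_scores, silhouettes)]
```

With `alpha = 0` this placeholder reduces to the plain IF score, which mirrors the tunable role the summary attributes to the hyperparameter.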
Results
On the IEEE-CIS Fraud Detection benchmark, SilIF with α = 1.0 improved the AUC-PR by +0.0080 on average across five seeds, with statistical significance (p = 0.046). In contrast, on the Sparkov dataset, the silhouette augmentation did not improve performance compared to the plain IF, highlighting the variability in effectiveness across different datasets.
Implications
SilIF presents a practical enhancement to existing anomaly detection methods, particularly in the context of transaction fraud detection where labeled data is scarce. Its tunable nature allows practitioners to adapt the method to specific datasets and conditions, potentially improving fraud detection capabilities in financial institutions.
From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models
Generative Models
Theory
Efficient ML
- Introduction of the GADD algorithm, achieving O(polylog(ε^(-1))) sampling complexity.
- No additional training required beyond standard score estimation.
- Demonstrated practical advantages in various generative tasks.
- General framework for analyzing predictor-corrector methods in discrete diffusion models.
From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models
Summary
This paper introduces the Gibbs-Accelerated Discrete Diffusion (GADD) algorithm, a novel approach to enhance the efficiency of uniform-rate discrete diffusion models, which are commonly used in generative modeling for discrete data. Traditional methods for accelerating these models often require additional training or suffer from slow mixing, leading to inefficient sampling processes. GADD addresses these issues by leveraging the structure of the concrete score function to construct Gibbs posterior likelihoods directly, eliminating the need for extra training beyond standard score estimation. The authors demonstrate that GADD achieves a sampling complexity of O(polylog(ε^(-1))), marking a significant improvement over existing methods that only reach O(poly(ε^(-1))). Through numerical experiments, the paper showcases GADD's practical advantages in various applications, including synthetic data generation, zero-shot text sampling, and conditional music generation. The results indicate that GADD not only enhances sample quality but also improves wall-clock efficiency compared to standard baselines. Additionally, the authors present a general framework for analyzing predictor-corrector methods in discrete diffusion models, which may have broader implications for future research in this area.
Methodology
The GADD algorithm utilizes a Gibbs-based corrector that constructs posterior likelihoods from the concrete score function, allowing for efficient sampling without additional training. The theoretical analysis employs an induction argument to track error propagation across iterations, providing a framework for controlling global error in predictor-corrector methods.
Results
GADD achieves a sampling complexity of O(polylog(ε^(-1))), outperforming previous methods that only achieve O(poly(ε^(-1))). Numerical experiments confirm that GADD improves sample quality and efficiency in various applications, including zero-shot text and music generation.
Implications
The findings suggest that GADD can significantly enhance the efficiency of discrete diffusion models in generative tasks, potentially leading to broader applications in natural language processing, music generation, and other domains that rely on discrete data generation.
Nonlinear Data Integration via Kernel Methods for Data Collaboration Analysis
Federated Learning
Optimization
Theory
- Proposes a kernel-based integration framework for Data Collaboration analysis.
- Introduces Linear Kernel Integration (LKI) and Nonlinear Kernel Integration (NKI) to handle nonlinear dimensionality reduction.
- Incorporates graph regularization and centering constraints to enhance representation quality.
- Demonstrates improved classification accuracy in image classification tasks using NKI.
Nonlinear Data Integration via Kernel Methods for Data Collaboration Analysis
Summary
This paper addresses the challenges of collaborative analysis of decentralized confidential datasets, which is crucial in fields like healthcare and finance but often hindered by privacy concerns. The authors propose a novel framework for Data Collaboration (DC) analysis that transforms original datasets into privacy-preserving intermediate representations using party-specific obfuscation functions. They identify limitations in existing methods that rely on linear transformations, which can increase reconstruction risk and fail to accurately align representations from nonlinear dimensionality reductions. To overcome these issues, the authors introduce Linear Kernel Integration (LKI) as a linear integration method and then extend it to Nonlinear Kernel Integration (NKI) through kernelization. NKI allows for the integration of intermediate representations obtained via nonlinear dimensionality reduction while ensuring a globally optimal solution through kernel ridge regression and an eigenvalue problem. Additionally, the authors incorporate graph regularization and a centering constraint to enhance the target representation by capturing geometric and target-variable information. Experimental results on image classification tasks demonstrate that NKI significantly improves classification accuracy compared to existing linear methods, particularly under nonlinear dimensionality reduction, while also highlighting the impact of dimensionality reduction choices on accuracy and reconstruction risk.
Methodology
The authors first formulate a linear integration method (LKI) and then kernelize it to create a nonlinear integration method (NKI). They utilize kernel ridge regression and solve an eigenvalue problem to ensure a globally optimal solution. Additionally, they introduce graph regularization and a centering constraint to incorporate geometric and target-variable information into the integration process.
Results
The experimental results indicate that NKI outperforms existing linear integration methods in terms of classification accuracy on image classification tasks, especially when using nonlinear dimensionality reduction. The study also reveals that the choice of dimensionality reduction techniques significantly affects both classification accuracy and reconstruction risk.
Implications
The proposed methods can facilitate secure and effective collaborative analysis of decentralized datasets across various domains, enhancing predictive performance while maintaining data privacy. This framework can be particularly beneficial in sensitive fields like healthcare and finance, where data confidentiality is paramount.
LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models
Theory
Efficient ML
- LUCoS improves predictive performance in low-label tabular learning by selecting context instances based on latent embeddings.
- The method outperforms random selection and traditional tabular space methods across multiple datasets and metrics.
- Instance selection is crucial for TFMs, especially in cold-start scenarios where no labels are available.
- LUCoS demonstrates that defining representativeness in a meaningful representation geometry is essential for effective context selection.
LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models
Summary
The paper addresses the challenge of instance selection in low-label tabular learning, particularly in the context of Tabular Foundation Models (TFMs) like TabPFN. It highlights that the performance of TFMs is highly sensitive to the choice of labeled context instances, especially in cold-start scenarios where no labels are available. The authors propose LUCoS (Latent Unsupervised Context Selection), which utilizes the latent geometry from embeddings generated by an unsupervised Prior-Fitted Network (PFN) to select representative instances as context. This method contrasts with traditional selection methods that operate in the original tabular space, which often leads to unreliable distance metrics due to the heterogeneous nature of tabular data. The authors evaluate LUCoS on 67 OpenML-CC18 datasets across various low-label budgets, demonstrating that it outperforms random selection and other methods under multiple metrics, including mean AUC, accuracy, and F1 score. The findings suggest that effective unsupervised context selection relies more on the representational geometry than on the complexity of the selection method itself.
Methodology
LUCoS employs a four-stage process: (1) embedding unlabeled training data into a high-dimensional latent space using an unsupervised PFN, (2) selecting representative instances based on geometric criteria in the latent space, (3) mapping selected instances back to the original tabular space for labeling, and (4) using these labeled instances as context for predictions in a supervised TFM.
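Stage (2) can be illustrated with a simple coverage-based selector. The summary does not specify which geometric criteria LUCoS uses, so the greedy k-center rule below is a stand-in chosen for illustration, and the function name and inputs are hypothetical. It assumes the latent embeddings from stage (1) are already available as a list of vectors.

```python
import math


def select_context(embeddings, budget):
    """Greedy k-center selection in latent space (an assumed criterion):
    start from the point nearest the centroid, then repeatedly add the
    unselected point farthest from everything selected so far."""
    n = len(embeddings)
    dim = len(embeddings[0])
    centroid = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    first = min(range(n), key=lambda i: math.dist(embeddings[i], centroid))
    selected = [first]
    # min_dist[i]: distance from point i to its closest selected point
    min_dist = [math.dist(e, embeddings[first]) for e in embeddings]
    while len(selected) < min(budget, n):
        candidates = [i for i in range(n) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], math.dist(e, embeddings[nxt]))
    return selected
```

The returned indices identify which rows of the original table would be sent for labeling in stage (3) and then used as context in stage (4).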
Results
LUCoS was evaluated on 67 datasets and consistently ranked first in mean AUC, accuracy, and F1 score across various low-label budgets. The results indicate that the method effectively mitigates the failures of original feature space selection, with performance gains attributed to improved coverage and representation space.
Implications
The findings suggest that LUCoS can significantly enhance the performance of TFMs in real-world applications where labeled data is scarce, such as in healthcare and finance. The method's reliance on latent space geometry may also inspire further research into unsupervised learning techniques for tabular data.
Latent Q-Barrier Shielding for Safe In-Context Reinforcement Learning
Reinforcement Learning
- Introduction of a latent Q-Barrier shield for safe ICRL deployment.
- The shield utilizes learned context representations and cost critics without parameter updates.
- Proven theoretical guarantees for budget-safe continuations using Q-Barrier conditions.
- Empirical results show improved reward-safety tradeoffs in multiple benchmarks.
Latent Q-Barrier Shielding for Safe In-Context Reinforcement Learning
Summary
This paper introduces a novel approach to safe in-context reinforcement learning (ICRL) by proposing a latent Q-Barrier shield that enhances the reward-safety tradeoff during deployment. The authors identify limitations in existing safe ICRL methods, which primarily rely on pretraining objectives and cost-conditioning without explicit action-level checks against the remaining safety budget. The proposed method learns a context representation, latent dynamics, and an ensemble cost critic before deployment, allowing it to filter or reweight candidate actions based on the remaining budget and predicted future costs. The authors demonstrate a theoretical foundation for their approach through a conditional error-decomposed barrier-margin result, showing that actions satisfying the Q-Barrier condition lead to approximately budget-safe continuations. Empirical evaluations across five benchmarks reveal that the Q-Barrier shield significantly improves deployment-time reward-safety tradeoffs, achieving higher returns in four out of five benchmarks while maintaining or lowering average episode costs across all environments.
Methodology
The authors propose a latent Q-Barrier shielding method that learns context representations, latent dynamics, and an ensemble cost critic prior to deployment. This method filters or reweights actions based on the remaining safety budget and predicted future costs, without requiring parameter updates during deployment. Theoretical analysis is provided to establish the conditions under which the shielded actions maintain budget safety.
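The filtering behaviour described above can be sketched as a deployment-time action filter. This is a simplified illustration rather than the paper's Q-Barrier condition: the pessimistic maximum over the ensemble, the `margin` parameter, the fallback rule, and the data layout are all assumptions.

```python
def shield_action(candidates, remaining_budget, margin=0.0):
    """Pick an action without any parameter updates.

    candidates: list of (action, reward_value, cost_estimates), where
    cost_estimates holds the ensemble cost critic's predictions of future
    cumulative cost for that action (hypothetical layout).
    """
    def pessimistic_cost(costs):
        # Assumed rule: trust the most pessimistic ensemble member.
        return max(costs)

    # Keep only actions whose predicted future cost fits the remaining budget.
    safe = [c for c in candidates if pessimistic_cost(c[2]) + margin <= remaining_budget]
    if safe:
        # Among safe actions, take the one with the highest predicted reward value.
        return max(safe, key=lambda c: c[1])[0]
    # Fallback when nothing passes: take the least costly action.
    return min(candidates, key=lambda c: pessimistic_cost(c[2]))[0]
```

For example, with candidates `("fast", 10.0, [4.0, 6.0])`, `("slow", 5.0, [1.0, 2.0])`, and `("stop", 0.0, [0.0, 0.5])`, a remaining budget of 3.0 selects `"slow"`, because `"fast"` has a pessimistic cost of 6.0 and is filtered out.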
Results
The Q-Barrier shielding method outperformed a strong baseline in safe ICRL across five benchmarks. It achieved higher returns in four out of five environments while matching or lowering the average episode cost in all cases. The method demonstrated robustness in maintaining budget constraints, particularly in scenarios with varying budget levels.
Implications
The proposed method has significant implications for deploying reinforcement learning agents in real-world scenarios where safety constraints are critical. It enhances the adaptability of agents to new tasks while ensuring compliance with safety budgets, making it suitable for applications in robotics, autonomous systems, and other safety-sensitive domains.
Pretrained Approximators for Low-Thrust Trajectory Cost and Reachability
Optimization
- Introduction of machine learning surrogates for low-thrust trajectory design to reduce computational costs.
- Demonstration of a scaling law where performance improves with increased dataset size and model capacity.
- Development of a large-scale dataset using a homotopy-ray strategy for mission design.
- Implementation of a self-similar transformation for generalization across diverse orbital scenarios.
Pretrained Approximators for Low-Thrust Trajectory Cost and Reachability
Summary
This paper addresses the challenges of low-thrust trajectory design, which traditionally requires extensive computational resources for optimal control solutions. The authors propose using machine learning surrogates to approximate critical performance indicators such as fuel consumption and transfer feasibility, enabling rapid evaluations across various mission scenarios. They demonstrate that the performance of low-thrust trajectory optimization improves linearly with the logarithm of both dataset size and model capacity, indicating a scaling law without saturation in the explored range. To leverage this finding, a large-scale dataset is constructed using a homotopy-ray strategy tailored for mission design. A novel self-similar transformation is introduced, allowing the same neural model to generalize across different orbital environments without the need for retraining. The proposed models show high accuracy in predicting optimal fuel consumption and minimum transfer times for both single- and multi-revolution transfers. Their effectiveness is validated on a public dataset, a multi-asteroid flyby problem, and an asteroid rendezvous mission. The authors also release their models and datasets as open-source resources to benefit the space community.
Methodology
The authors constructed a large-scale dataset using a homotopy-ray data generation strategy and introduced a self-similar transformation to enable generalization across different mission scenarios. They employed neural networks as approximators to predict trajectory-level performance indicators, focusing on fuel consumption and transfer time.
Results
The proposed models achieved superior accuracy in predicting optimal fuel consumption and minimum transfer times compared to existing methods. Their performance was validated on various scenarios, including a public dataset and specific mission designs, demonstrating effective generalization capabilities.
Implications
The findings suggest that machine learning can significantly enhance the efficiency of low-thrust trajectory design, making it more accessible for mission planners. The open-source release of models and datasets fosters collaboration and innovation within the space research community.
The Kalman Evolve: Closing the Gap in Kalman Filtering via Interpretable Algorithm Discovery
Time Series
Optimization
Interpretability
- Kalman Evolve framework optimizes both noise parameters and update structure of the Kalman Filter.
- Introduces interpretable, non-affine modifications to the classical Kalman filter.
- Demonstrates that affine updates are structurally suboptimal in nonlinear sensing models.
- Achieves up to 12% reduction in RMSE compared to strong baselines.
Summary
This paper addresses the limitations of the Kalman Filter in state estimation, particularly in nonlinear sensing environments such as Doppler radar and LiDAR, where traditional assumptions of linearity and Gaussian noise often fail. The authors propose a novel framework called Kalman Evolve, which aims to discover improved filtering algorithms by jointly optimizing both noise parameters and the update structure of the Kalman Filter. By leveraging large language models (LLMs) as a structured prior over program space, the framework generates interpretable, non-affine modifications to the classical Kalman filter while maintaining its recursive nature. The paper provides analytical evidence of the suboptimality of affine estimators in common nonlinear sensing scenarios, motivating the need for structure-aware updates. The proposed algorithms were tested across various synthetic and real-world tracking benchmarks, demonstrating consistent improvements over established baselines, including the Optimized Kalman Filter, with reductions in root mean square error (RMSE) of up to 12%. The findings suggest that optimizing the structure of the Kalman filter, alongside its parameters, is a practical and interpretable approach to enhance state estimation performance.
Methodology
The authors employ an LLM-assisted evolutionary search to discover improved update structures for the Kalman Filter. They first estimate the process and measurement noise covariances to establish a competitive baseline and then optimize both the noise parameters and the filter structure to enhance performance in realistic scenarios with unknown dynamics.
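For context, the classical filter that the search starts from can be sketched in scalar form. This is the textbook Kalman filter, not one of the discovered variants; the random-walk state model, the function names, and the default noise values are illustrative assumptions. Note that the update is affine in the measurement `z`, which is the structure the paper argues is suboptimal under nonlinear sensing.

```python
def kalman_step(x, p, z, q, r):
    """One predict/update cycle of a scalar Kalman filter.

    Assumes a random-walk state model (x_k = x_{k-1} + noise) and a direct
    measurement (z_k = x_k + noise); both are illustrative choices.
    q is the process noise variance, r the measurement noise variance.
    """
    # Predict: the state carries over, uncertainty grows by q.
    x_pred = x
    p_pred = p + q
    # Update: the gain weighs prediction uncertainty against measurement noise.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)  # affine in the measurement z
    p_new = (1.0 - k) * p_pred
    return x_new, p_new


def run_filter(measurements, q=1e-3, r=0.5, x0=0.0, p0=1.0):
    """Filter a sequence of scalar measurements and return the estimates."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        x, p = kalman_step(x, p, z, q, r)
        estimates.append(x)
    return estimates
```

In this framing, tuning `q` and `r` corresponds to the noise-parameter baseline, while changing the form of the `x_new` line corresponds to the structural search described above.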
Results
The discovered algorithms consistently outperform strong baselines, including the Optimized Kalman Filter, achieving up to a 12% reduction in RMSE across various synthetic and real-world benchmarks, while maintaining comparable computational costs.
Implications
The findings indicate that a combined approach of optimizing both the structure and parameters of the Kalman Filter can significantly enhance state estimation in practical applications, particularly in complex sensing environments. This could lead to improved performance in fields such as robotics, autonomous vehicles, and signal processing.
Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training
NLP
Large Language Models
Efficient ML
- Identification of two orthogonal failure modes in QAT: amax saturation and catastrophic forgetting.
- Introduction of a max-algorithm DTS strategy for improved scale estimation.
- Development of a two-phase training protocol to stabilize learning and preserve pretrained knowledge.
- Comprehensive failure analysis revealing the limitations of existing scaling methods.
Summary
This paper investigates the challenges of quantization-aware training (QAT) using the HiF8 W8A8 format for large language models (LLMs), specifically focusing on the OpenPangu-Embedded-1B model. The authors identify two critical failure modes that arise during QAT: amax saturation, which corrupts knowledge-sensitive representations due to forward-pass clipping, and catastrophic forgetting, where an aggressive learning rate overwrites pretrained knowledge. These issues are not detectable through standard training loss metrics. To address amax saturation, the authors propose a conservative max-algorithm DTS strategy that utilizes a 64-step history window for scale estimation. Additionally, they introduce a two-phase training protocol that includes a 500-step BF16 warmup followed by QAT at a reduced learning rate of 10^-5 to mitigate catastrophic forgetting. The paper documents a comprehensive failure analysis across eight controlled experiments, demonstrating that the interaction between scale stability and learning dynamics is crucial for effective HiF8 QAT. The final configuration achieves near-lossless quantization with minimal degradation in performance metrics compared to a BF16 baseline.
Methodology
The authors conducted a systematic investigation involving eight controlled experiments to analyze the effects of different scaling strategies and learning rates on the performance of HiF8 W8A8 QAT. They implemented a max-algorithm DTS strategy for scale estimation and a two-phase training protocol to stabilize the model's learning process.
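The windowed-max idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, the `fmt_max` placeholder for the largest representable magnitude of the 8-bit format, and the choice to include the current step's amax in the window are all assumptions. Only the 64-step window comes from the summary.

```python
from collections import deque


class MaxWindowScale:
    """Tracks recent amax values and derives a scale from their maximum."""

    def __init__(self, window=64, fmt_max=1.0):
        # window=64 follows the summary; fmt_max is a placeholder for the
        # largest magnitude the target format can represent (assumed).
        self.history = deque(maxlen=window)
        self.fmt_max = fmt_max

    def update(self, tensor_values):
        """Record this step's amax and return the scale to use."""
        amax = max(abs(v) for v in tensor_values)
        self.history.append(amax)
        # Taking the max over the window is conservative: one large value
        # keeps the scale wide for up to `window` steps, which guards
        # against clipping when activations spike again.
        window_amax = max(self.history)
        return window_amax / self.fmt_max if window_amax > 0 else 1.0
```

A scale that only tracks the most recent amax would shrink immediately after a spike, which is the saturation risk the windowed maximum is meant to reduce.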
Results
The proposed configuration resulted in a 0.43% drop in MMLU, a 0.58% drop in HellaSwag, and a 0.22% drop in ARC-Challenge compared to a matched BF16 baseline, with a training-loss absolute percentage error (APE) of only 0.11% over 10,000 steps. The findings indicate that both the DTS fix and the learning rate reduction are necessary to reach this result.
Implications
The findings suggest that careful management of scaling strategies and learning rates is essential for effective quantization-aware training of large language models, especially in resource-constrained environments. This work could inform future research and applications in deploying LLMs on edge devices and in real-time inference scenarios.
Localizing Memorized Regions in Diffusion Models via Coordinate-Wise Curvature Differences
Generative Models
- Introduces a geometric characterization of local memorization in diffusion models.
- Proposes curvature-difference methods to isolate overfitting-driven memorization.
- Derives a score-difference proxy that unifies existing memorization detection metrics.
- Empirical results show improved localization of memorized regions in Stable Diffusion.
Summary
This paper addresses the issue of memorization in diffusion models, which can lead to privacy and copyright concerns. While existing methods for detecting memorization often rely on global signals, they fail to provide insights into the specific regions of generated images that exhibit memorization. The authors propose a geometric characterization of local memorization as a coordinate-wise variance collapse, which can be confused with intrinsic data constraints. To isolate overfitting-driven memorization, they introduce curvature-difference methods that subtract curvature from an underfitted baseline model. Additionally, they derive a score-difference proxy that offers a geometric interpretation of the score-difference-based detection metric. The proposed methods are empirically validated on Stable Diffusion, demonstrating superior performance in localizing memorized regions compared to previous attention-based methods.
Methodology
The authors extend existing geometric frameworks by characterizing local memorization as a coordinate-wise variance collapse. They introduce curvature-difference methods that subtract curvature from underfitted models to isolate overfitting effects. They also provide a geometric interpretation of the widely used score-difference detection metric.
Results
The proposed curvature-difference methods significantly improve the localization of memorized regions in generated images from Stable Diffusion, outperforming the previous attention-based localization method. The empirical evaluation is supported by ground-truth memorization masks.
Implications
This work has important implications for enhancing the interpretability and reliability of diffusion models in terms of privacy and copyright concerns. By providing a method to localize memorization, it can help developers and researchers better understand and mitigate the risks associated with model memorization.
Trust Region Q Adjoint Matching
Reinforcement Learning
Optimization
Theory
- TRQAM introduces a trust-region parameter to stabilize off-policy fine-tuning of pretrained flow policies.
- The method effectively controls the KL divergence between fine-tuned and pretrained policies, preventing destructive drift.
- Theoretical results demonstrate that the KL divergence can be explicitly modeled as a function of the trust-region parameter.
- TRQAM outperforms existing methods in offline RL tasks, achieving a notable success rate of 68%.
Summary
This paper addresses the challenges of off-policy reinforcement learning (RL) with pretrained flow policies, particularly the instability caused by multi-step sampling processes. The authors introduce Trust Region Q-Adjoint Matching (TRQAM), a novel algorithm that enhances the stability of off-policy fine-tuning by incorporating a trust-region parameter into the stochastic optimal control (SOC) dynamics. This approach allows for adaptive control of the path-space Kullback-Leibler (KL) divergence between the fine-tuned and pretrained policies, mitigating the amplification of critic errors that can lead to model collapse. The theoretical foundation of TRQAM is established through Girsanov's theorem, demonstrating that the KL divergence can be expressed as a closed-form function of the trust-region parameter. Empirical evaluations on 50 OGBench tasks reveal that TRQAM significantly outperforms existing methods in both offline and offline-to-online RL settings, achieving a success rate of 68% compared to the strongest baseline's 46%.
Methodology
The authors propose TRQAM, which integrates a trust-region parameter into the SOC dynamics and adapts it using projected dual descent. This method enforces a target KL bound at the sampling level, rather than relying on conventional loss-level penalties, thereby providing a more robust control over policy deviations.
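A generic projected update on a Lagrange multiplier gives a feel for how a KL target can be enforced adaptively. This is a sketch under assumptions: how TRQAM maps its trust-region parameter to such a multiplier is not stated in the summary, and `dual_update`, `step_size`, and `lam_min` are hypothetical names.

```python
def dual_update(lam, observed_kl, target_kl, step_size=0.1, lam_min=0.0):
    """Projected step on a dual variable for the constraint KL <= target_kl."""
    # Raise lam when the constraint is violated, lower it when there is slack.
    lam = lam + step_size * (observed_kl - target_kl)
    # Projection keeps the multiplier in the feasible set [lam_min, inf).
    return max(lam_min, lam)
```

Called once per training iteration with the measured KL, the multiplier tightens the constraint while the policy drifts beyond the target and relaxes it otherwise.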
Results
TRQAM was tested on 50 OGBench tasks, consistently outperforming prior methods in both offline RL and offline-to-online RL scenarios. The algorithm achieved an overall success rate of 68% in offline RL, significantly surpassing the strongest baseline's success rate of 46%.
Implications
The findings suggest that TRQAM could be applied to enhance the stability and performance of off-policy RL in various applications, particularly those involving pretrained models in complex environments. This could lead to more reliable and effective deployment of RL systems in real-world scenarios.
HRVConformer: Neonatal Hypoxic-Ischemic Encephalopathy Classification from the Heart Rate signals
Time Series
- HRVConformer processes raw heart rate signals directly, eliminating the need for handcrafted features.
- The architecture integrates convolutional and Transformer components to enhance classification performance.
- The model was trained on a comprehensive dataset, demonstrating robust performance metrics.
- HRVConformer outperformed existing baseline models in HIE classification tasks.
Summary
The paper introduces HRVConformer, a novel deep learning architecture designed for classifying hypoxic-ischemic encephalopathy (HIE) using instantaneous heart rate (HR) signals. Unlike traditional methods that depend on handcrafted features, HRVConformer processes raw HR signals in an end-to-end fashion, leveraging a hybrid Convolution-Transformer framework to capture both local and long-range dependencies. This architecture combines convolutional layers for local feature extraction with Transformer-based attention mechanisms for global context modeling, enhancing the representation and classification of HR signals. The model was trained on a large dataset comprising 1,573 one-hour epochs, including both expert-annotated and weakly labeled data, with a robust validation set of 314 hours and an independent test set of 215 hours. The heart rate signals were extracted from ECG recordings using an improved Pan-Tompkins algorithm, which significantly improved signal quality. Experimental results showed that HRVConformer achieved an AUC of 83.23% and an accuracy of 74.56% on the test set, outperforming baseline models including Transformer, ResNet50, and fully convolutional networks. This work represents a significant advancement towards automated and accurate assessment of HIE using HR signals.
Methodology
The HRVConformer architecture employs a hybrid Convolution-Transformer framework, utilizing convolutional layers for local feature extraction and Transformer-based attention mechanisms for global context modeling. The model was trained using supervised learning on a large dataset of heart rate signals extracted from ECG recordings, processed with an enhanced Pan-Tompkins algorithm.
Results
HRVConformer achieved an AUC of 83.23% and an accuracy of 74.56% on the independent test set, surpassing the performance of baseline models such as Transformer, ResNet50, and fully convolutional networks.
Implications
The proposed method has the potential to facilitate more accurate and automated assessments of hypoxic-ischemic encephalopathy in neonates, which could lead to improved clinical decision-making and better long-term outcomes for affected infants.
Symbolic Regression via Latent Iterative Refinement
Theory
Interpretability
Optimization
- Introduces Latent Equation Embedding (LEE) for symbolic regression.
- Addresses the amortization gap in neural symbolic regression through iterative inference.
- Combines iterative refinement with gradient descent for improved robustness.
- Achieves 2-10 times simpler expressions compared to leading baselines.
Summary
This paper introduces a novel framework called Latent Equation Embedding (LEE) for symbolic regression (SR), which aims to derive closed-form mathematical expressions that accurately fit observed data. Traditional neural SR methods utilize a one-shot prediction approach, which often leads to an 'amortization gap' between the predicted expressions and the true posterior. LEE addresses this issue through iterative amortized inference in a functionally-grounded latent space. The framework comprises three main components: an encoder that embeds symbolic tokens and numerical observations into a shared latent vector, an expression decoder that reconstructs formulas from this latent vector, and an evaluation decoder that predicts function values, ensuring the latent space is grounded in functional behavior. During inference, LEE refines the latent representation iteratively by re-encoding decoded expressions alongside observations, progressively improving the estimate. The methodology also incorporates continuous gradient descent with discrete re-encoding, enhancing robustness against noise. Experimental results on the SRBench benchmark demonstrate that LEE generates expressions that are 2-10 times simpler than those produced by leading accuracy-focused baselines, effectively advancing the low-complexity region of the accuracy-complexity Pareto frontier.
Methodology
LEE employs a shared latent space for symbolic and numerical representations, utilizing an encoder to embed data, an expression decoder for formula reconstruction, and an evaluation decoder for function value prediction. The iterative refinement process involves re-encoding expressions with observations to improve latent estimates, interleaved with continuous gradient descent for enhanced optimization.
Results
On the SRBench benchmark, LEE produced expressions that were significantly simpler (2-10 times) than those from the best accuracy-oriented baselines, demonstrating its effectiveness in achieving low-complexity solutions while maintaining accuracy.
Implications
The findings suggest that LEE can be applied to various domains requiring interpretable models, such as scientific modeling, data analysis, and automated theorem proving, where simplicity and interpretability of mathematical expressions are crucial.
PRISM: Position-encoded Regressive Inverse Spectral Model for Multilayer Thin-Film Design
Optimization
Generative Models
Efficient ML
- PRISM integrates discrete material selection and continuous thickness regression in a single autoregressive transformer model.
- Introduces spectrum prefix conditioning and cumulative-depth Rotary Position Embeddings to enhance model efficiency and accuracy.
- Achieves over 50% reduction in MAE compared to other transformer baselines while using only one-fifth of the parameters.
- State-of-the-art performance with an MAE of 0.010 on in-distribution validation benchmarks.
Summary
The paper introduces PRISM, a novel autoregressive transformer model designed to tackle the inverse problem of multilayer thin-film optical coatings design, which is characterized by a complex combinatorial-continuous optimization challenge. PRISM innovatively combines discrete material selection and continuous thickness regression within a single decoder-only architecture. Key architectural advancements include spectrum prefix conditioning, which allows for efficient target spectrum integration, and cumulative-depth Rotary Position Embeddings (RoPE), which encode the physical depth of the film stack into the positional representation. These innovations enable PRISM to generate thin-film designs layer by layer using causal self-attention, significantly improving efficiency and accuracy compared to traditional optimization methods. The model demonstrates a substantial reduction in mean absolute error (MAE) and outperforms existing neural and classical optimization techniques, making it a promising solution for real-time and high-throughput applications in photonics.
Methodology
PRISM employs a decoder-only autoregressive transformer architecture that utilizes causal self-attention for generating multilayer thin-film designs. It incorporates spectrum prefix conditioning for target spectrum integration and cumulative-depth Rotary Position Embeddings to encode physical depth, allowing for effective joint prediction of material and thickness.
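One way to read cumulative-depth RoPE is that the rotary angle is driven by the physical depth of a layer instead of its token index. The sketch below assumes that reading: the depth convention (top surface of each layer), the frequency base, and the function names are illustrative and not taken from the paper.

```python
import math


def cumulative_depths(thicknesses):
    """Depth of each layer's top surface: running sum of earlier thicknesses."""
    depths, total = [], 0.0
    for t in thicknesses:
        depths.append(total)
        total += t
    return depths


def rope_rotate(vec, position, base=10000.0):
    """Apply a rotary embedding to an even-length vector at a real position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Each coordinate pair rotates at its own frequency.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out
```

Because each pair is a plain rotation, the dot product between a rotated query and a rotated key depends only on the difference of their positions, so attention scores would reflect the physical separation between layers under this reading.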
Results
The PRISM-13M model achieved a mean absolute error reduction of over 50% compared to existing transformer models, while the 44M-parameter variant reached state-of-the-art performance with an MAE of 0.010 on validation benchmarks. The model demonstrated significantly faster inference times than classical optimization methods.
Implications
PRISM's advancements in multilayer thin-film design could lead to more efficient and accurate design processes in photonics, enabling real-time applications and high-throughput manufacturing of optical coatings. Its approach may also inspire further research in neural network applications for complex optimization problems.
Probabilistic Recurrent Intention Switching Model
Reinforcement Learning
Robotics
Interpretability
- PRISM replaces traditional memoryless models with a recurrent neural network for intention switching.
- The EM objective decomposes into independent subproblems, allowing for efficient reward recovery.
- PRISM demonstrates high performance on diverse tasks, recovering interpretable intentions without supervision.
- The framework is the first to apply multi-intention IRL to a large-scale robotic manipulation dataset.
Summary
The paper introduces the Probabilistic Recurrent Intention Switching Model (PRISM), a novel framework for inverse reinforcement learning (IRL) that addresses the limitations of traditional methods which assume a single stationary reward function. PRISM utilizes a lightweight recurrent neural network to model intention switching dynamically, allowing for the recovery of multiple goals within a single episode. The authors demonstrate that the expectation-maximization (EM) objective can be decomposed into independent per-intention reward subproblems, each solvable in closed form, leading to efficient computation. PRISM is evaluated across three diverse domains: a non-Markovian gridworld, a mouse labyrinth, and the BridgeData V2 robotic manipulation dataset. The results indicate that PRISM effectively captures temporally coherent intentions and achieves superior log-likelihood scores compared to existing methods, suggesting its applicability in both biological and artificial agents.
Methodology
PRISM employs a recurrent neural network to process observation history and output intention distributions at each timestep. The model's EM objective is proven to decompose into independent per-intention reward subproblems, which can be solved in closed form using inverse action-value iteration, resulting in an efficient O(nK) E-step.
Results
PRISM outperformed existing methods in terms of held-out log-likelihood across all evaluated domains. In the mouse labyrinth, it successfully identified three biologically relevant intentions. The frustration gridworld validated its capability to capture non-Markovian switching, while the BridgeData V2 dataset showcased its ability to discover coherent manipulation phases without supervision.
Implications
The findings suggest that PRISM can be a powerful tool for understanding complex sequential behaviors in both biological and artificial systems. Its ability to recover interpretable reward functions and intentions could enhance applications in robotics, behavioral analysis, and human-robot interaction.
TGFormer: Towards Temporal Graph Transformer with Auto-Correlation Mechanism
Graph Learning
Time Series
- TGFormer redefines temporal graph learning by treating it as a time-series analysis problem.
- The Series Transformer layer effectively captures long-term dependencies using a Transformer-based architecture.
- The auto-correlation mechanism enhances the model's ability to capture periodic patterns with reduced computational complexity.
- TGFormer consistently outperforms state-of-the-art methods across multiple real-world datasets.
Summary
The paper introduces TGFormer, a novel Transformer architecture designed specifically for temporal graphs, addressing the limitations of existing Temporal Graph Neural Networks (TGNNs) in capturing long-term dependencies and periodic patterns. TGFormer employs a trajectory framework aligned with time series analysis principles, allowing for a systematic analysis of historical interactions to derive node representations. A key innovation is the auto-correlation mechanism (ACoM), which utilizes stochastic process theory to uncover periodic dependencies in node interactions. This mechanism enhances the model's ability to perform dependency discovery and representation aggregation at sub-interaction levels, improving efficiency and accuracy compared to traditional attention mechanisms. Experimental results on six public benchmarks demonstrate that TGFormer achieves up to 9.35% precision improvement over state-of-the-art methods, validating its effectiveness in modeling complex temporal dynamics.
Methodology
TGFormer employs a Transformer-based architecture to model temporal graphs, integrating a Series Transformer layer for long-term dependency capture and an auto-correlation mechanism to identify periodic patterns. The auto-correlation mechanism transforms time series data into the frequency domain using Fast Fourier Transform (FFT), allowing the model to focus on specific frequency components and preserve high-frequency signals.
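The FFT route to autocorrelation rests on the Wiener-Khinchin relation: the inverse transform of the power spectrum is the (circular) autocorrelation. The sketch below shows that generic computation in pure Python; it is not TGFormer's ACoM layer, and the helper names and the power-of-two length restriction are assumptions made to keep the example short.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT, unnormalized; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def circular_autocorrelation(series):
    """Autocorrelation at every lag via the power spectrum."""
    n = len(series)
    spectrum = fft([complex(v) for v in series])
    power = [s * s.conjugate() for s in spectrum]
    corr = fft(power, inverse=True)
    return [c.real / n for c in corr]  # divide by n to normalize the inverse


def top_lags(series, k=1):
    """Return the k nonzero lags with the highest autocorrelation."""
    corr = circular_autocorrelation(series)
    lags = range(1, len(series) // 2 + 1)
    return sorted(lags, key=lambda lag: corr[lag], reverse=True)[:k]
```

Computing all lags this way costs O(n log n) rather than the O(n^2) of a direct sum, and the highest-scoring lags are natural candidates for periodic dependencies.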
Results
TGFormer demonstrated superior performance in experiments conducted on six real-world datasets, achieving up to 9.35% improvement in precision compared to existing state-of-the-art TGNN methods, validating its capability to effectively model both long-term dependencies and periodic structures.
Implications
The advancements presented in TGFormer have significant implications for various applications involving temporal graphs, such as social network analysis, traffic flow prediction, and real-time recommendation systems, where understanding complex temporal dynamics is crucial.
Separate Aggregation of Split Network for Personalized Federated Learning
Federated Learning
- PGFedSplit improves personalization and global generalization in federated learning.
- The framework utilizes a split architecture with adaptive aggregation scheduling.
- Clients benefit from a mix of local and synthetic representations to enhance robustness.
- Extensive experiments show consistent performance improvements over state-of-the-art PFL methods.
Summary
The paper presents PGFedSplit, a novel personalized federated learning (PFL) framework designed to enhance both personalization and global generalization in scenarios with significant client data heterogeneity. Traditional federated learning approaches often struggle with the trade-off between global model sharing and local specialization, leading to performance degradation in personalized tasks. PGFedSplit addresses this issue by implementing a split architecture that allows for adaptive aggregation scheduling tailored to the distinct roles of model components. The framework aggregates the representation layer in every communication round while synchronizing the personalization layer through an adaptive periodic scheme. This approach stabilizes knowledge sharing and maintains client-specific adaptations. Additionally, clients utilize a combination of locally extracted and server-generated synthetic representations to improve robustness against label imbalance and missing class conditions. The authors conducted extensive experiments on datasets such as Fashion MNIST, CIFAR 10, CIFAR 100, and Tiny ImageNet, demonstrating that PGFedSplit consistently outperforms existing PFL methods, achieving stable convergence and superior personalization in highly heterogeneous environments.
Methodology
The proposed PGFedSplit framework employs a split architecture that decouples the synchronization of representation and personalization layers. It aggregates the representation layer in every communication round while adaptively scheduling the synchronization of the personalization layer. This two-stage personalization strategy constructs a low-variance global personalization head and allows for local adaptation using distribution-aware synthetic representations derived from server-side Gaussian statistics.
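The decoupled schedule can be sketched with plain parameter lists. This is a simplified illustration: it uses a fixed `head_period` where the paper adapts the schedule, weights all clients equally, and omits the synthetic representations entirely. The dictionary keys and function names are hypothetical.

```python
def average(vectors):
    """Element-wise mean of equally weighted parameter vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def aggregate_round(clients, round_idx, head_period):
    """Aggregate one communication round in place.

    clients: list of dicts with 'rep' and 'head' parameter lists.
    head_period: rounds between head synchronizations (fixed here; the
    adaptive scheduling described above is not modeled).
    """
    # The shared representation layer is averaged every round.
    global_rep = average([c["rep"] for c in clients])
    for c in clients:
        c["rep"] = list(global_rep)
    # The personalization head is synchronized only periodically.
    if (round_idx + 1) % head_period == 0:
        global_head = average([c["head"] for c in clients])
        for c in clients:
            c["head"] = list(global_head)
    return clients
```

Between head synchronizations each client keeps its own head, which is where client-specific adaptation is preserved.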
Results
The experimental results indicate that PGFedSplit achieves significant improvements in personalization and global generalization across various datasets, including Fashion MNIST, CIFAR 10, CIFAR 100, and Tiny ImageNet. The framework demonstrates stable convergence and outperforms existing PFL methods, particularly in highly heterogeneous client settings.
Implications
The findings suggest that PGFedSplit can be effectively applied in real-world federated learning scenarios where client data distributions are diverse and non-IID. This framework could enhance applications in areas such as mobile device intelligence, healthcare, and any domain requiring privacy-preserving collaborative learning.
ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling
NLP
Large Language Models
- Identifies wrong-majority failure as a critical issue in consensus decoding of language models.
- Introduces ARBITER, a framework that accumulates same-model evidence for challenger basins while treating consensus as a prior.
- Demonstrates that many direct correction strategies degrade performance, while sparse additive evidence can improve accuracy.
- Achieves notable improvements in accuracy across various models and benchmarks, indicating recoverable headroom in sampled outputs.
Summary
The paper introduces ARBITER, a model-agnostic framework designed to address the issue of wrong-majority failures in test-time sampling of language models. When generating multiple reasoning trajectories, these trajectories often cluster into a few dominant reasoning basins, leading to a majority vote that may select a stable but incorrect answer. The authors demonstrate that existing correction strategies often degrade performance instead of improving it. ARBITER operates under a conservative principle where consensus remains the prior, and a challenger basin is only selected when supported by additional same-model evidence. The framework groups sampled trajectories into basins and collects evidence to support challenger basins relative to the dominant one. The results show that ARBITER can recover a significant portion of correct answers that are outvoted, achieving consistent gains across multiple model families and benchmarks without relying on external information.
Methodology
ARBITER groups sampled trajectories into reasoning basins and constructs compact representations of their structures. It collects additional same-model evidence by asking the model to reinterpret and compare solutions under competing basin hypotheses. The approach operates strictly within a zero-external-information framework, relying solely on the model's own outputs and internal representations.
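The conservative principle, consensus as the prior with an override only under sufficient evidence, can be sketched as a small decision rule. This is an illustration under assumptions: basins are approximated by grouping on the final answer, and `challenger_evidence` and `margin` are hypothetical stand-ins for the evidence the framework collects and the threshold it applies.

```python
from collections import Counter


def select_answer(sampled_answers, challenger_evidence, margin=2):
    """Majority vote with a conservative override.

    sampled_answers: final answers from K sampled trajectories; grouping by
        final answer is a simplification of reasoning basins.
    challenger_evidence: hypothetical dict mapping an answer to the number
        of extra same-model checks that favored it over the majority answer.
    margin: evidence a challenger needs before it displaces the consensus.
    """
    counts = Counter(sampled_answers)
    ranked = counts.most_common()
    majority = ranked[0][0]
    best, best_score = majority, 0
    for answer, _ in ranked[1:]:
        score = challenger_evidence.get(answer, 0)
        # Only a challenger that clears the margin can replace the prior.
        if score >= margin and score > best_score:
            best, best_score = answer, score
    return best
```

With no evidence the rule reduces to plain majority voting, so it can only deviate from the consensus when a challenger is actively supported.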
Results
On GSM8K with Qwen3-4B, ARBITER reaches accuracy in the mid-94% range with K=24 samples, while a top-2 oracle reaches the mid-96% range. For Llama-3.1-8B on MMLU-HS-Math, accuracy improves from the mid-78% range to the mid-82% range, recovering about 22% of the available oracle headroom. The framework yields gains across three model families and three math benchmarks, with no net-negative cases.
Implications
ARBITER's findings suggest that language models can improve their accuracy by better leveraging internal evidence from their own outputs, which could enhance applications in various NLP tasks where consensus-based decision-making is critical. This approach may lead to more robust language models capable of self-correcting in real-time applications.
ChainzRule: Sample-Efficient, Robust Deep Learning Across Tabular, NLP, and Vision Tasks
Efficient ML
Interpretability
NLP
- ChainzRule architecture utilizes learnable polynomial layers to enhance model flexibility and efficiency.
- Differential Regularization (DREG) imposes a sensitivity budget, improving robustness and interpretability.
- CR achieves competitive results across diverse domains with significantly less training data compared to traditional models.
- The model maintains a consistent gradient tail ratio, indicating reliability and stability during inference.
Summary
The paper introduces ChainzRule (CR), a novel neural architecture designed to address the challenges faced by production deep learning systems, particularly in enterprise settings where labeled data is scarce, inference budgets are limited, and model interpretability is crucial. CR replaces traditional activation functions with learnable polynomial layers and employs Differential Regularization (DREG), which imposes a layer-wise Jacobian penalty during the forward pass. This approach encourages the network to learn low-frequency, structurally stable representations, enhancing sample efficiency, robustness to distribution shifts, and providing a gradient-based measure of model behavior. The architecture is evaluated across five domains, demonstrating competitive performance against traditional models while requiring significantly less labeled data. The results indicate that CR not only achieves state-of-the-art accuracy but also maintains a reliable gradient tail ratio, which serves as a proxy for model reliability in deployment. The findings suggest that CR can effectively bridge the gap between academic benchmarks and real-world applications, making it a promising solution for various machine learning tasks.
Methodology
ChainzRule replaces standard activation functions with learnable cubic polynomial layers and employs Differential Regularization (DREG) to penalize the Jacobian of each layer. This allows for simultaneous propagation of activations and their derivatives during the forward pass, enabling the model to learn more stable representations with fewer parameters.
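To make the idea of propagating activations and derivatives together more concrete, here is a minimal pure-Python sketch. It assumes an elementwise cubic with four coefficients and uses the mean squared derivative as a stand-in for a Jacobian penalty; the names `cubic_layer` and `sensitivity_penalty` are hypothetical, and the paper's actual DREG formulation may differ.

```python
def cubic_layer(x, coeffs):
    """Apply p(x) = a0 + a1*x + a2*x^2 + a3*x^3 elementwise.

    Returns both the activations and dp/dx, so the derivative is available
    in the same forward pass.
    """
    a0, a1, a2, a3 = coeffs
    values = [a0 + a1 * v + a2 * v**2 + a3 * v**3 for v in x]
    derivs = [a1 + 2 * a2 * v + 3 * a3 * v**2 for v in x]
    return values, derivs

def sensitivity_penalty(derivs):
    """Mean squared derivative: a stand-in for a layer-wise Jacobian penalty."""
    return sum(d * d for d in derivs) / len(derivs)

values, derivs = cubic_layer([0.0, 1.0, 2.0], coeffs=(0.0, 1.0, 0.0, 0.5))
penalty = sensitivity_penalty(derivs)
print(values, derivs, penalty)
```

In a training loop, a term like `penalty` would be added to the task loss with some weight, so that large input sensitivities are discouraged.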
Results
CR achieved 85.71% accuracy on the Pima Diabetes dataset, outperforming SVM and XGBoost. In sentiment classification on SST-5, it reached 46.20% accuracy with a frozen encoder, using only 5% of the training data required by previous benchmarks. Additionally, it achieved 70.17% on Yelp Full ordinal regression and improved mean corruption accuracy on CIFAR-10-C by 2.32%. All results were statistically significant with p-values below the α = 0.05 threshold.
Implications
The findings suggest that ChainzRule can significantly enhance the efficiency and reliability of deep learning models in production environments, particularly in domains where data is limited and interpretability is essential. Its successful deployment in real-world applications indicates its potential for widespread adoption across various industries.
Context-Instrumental Data Distillation for Kubernetes Manifest Generation: Method and Experimental Evaluation
NLP
Large Language Models
Efficient ML
- Introduction of context-instrumental data distillation for SLMs in generating Kubernetes manifests.
- Focus on the importance of data quality and validation over quantity in training datasets.
- Achieved a high accuracy rate of 91.5% in generating valid Kubernetes configurations.
- Demonstrated that strict output format requirements significantly impact result quality.
Summary
This paper presents a novel approach for generating Kubernetes manifests using Small Language Models (SLMs) specialized through a method termed context-instrumental data distillation. The authors highlight the challenges of generating domain-specific language (DSL) artifacts, particularly in the context of Kubernetes, where configuration errors can lead to significant operational risks. The proposed method involves creating a source corpus through synthetic generation and reverse instruction generation from real Kubernetes YAML files, ensuring that only validated examples are included in the training set. This approach differs from traditional knowledge distillation methods by focusing on supervised fine-tuning with instrumentally verified examples rather than relying solely on KL-divergence. The experimental evaluation demonstrates the effectiveness of the method using a pilot implementation that achieved a full-pass accuracy of 91.5% in generating valid Kubernetes YAML configurations. The findings suggest that the quality of generated outputs is influenced more by strict adherence to output format requirements than by the sheer volume of training data, emphasizing the importance of data quality in DSL tasks.
Methodology
The methodology involves creating a training corpus through synthetic generation and reverse instruction generation from real Kubernetes YAML files. The training examples are validated through external validators to ensure they match the domain context model. The SLM is fine-tuned using a supervised approach on these verified examples, diverging from traditional KL-divergence knowledge distillation methods.
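The validation step can be pictured as a filter over candidate training pairs. The sketch below is an assumption-heavy toy: `is_valid_manifest` only checks a few required keys on an already-parsed manifest, whereas the paper relies on external validators that are not specified here, and the field names of the corpus entries are hypothetical.

```python
REQUIRED_KEYS = ("apiVersion", "kind", "metadata")

def is_valid_manifest(manifest):
    """Toy validator: a real pipeline would call external tools instead."""
    if not isinstance(manifest, dict):
        return False
    if any(key not in manifest for key in REQUIRED_KEYS):
        return False
    metadata = manifest["metadata"]
    return isinstance(metadata, dict) and bool(metadata.get("name"))

def filter_corpus(examples):
    """Keep only (instruction, manifest) pairs whose manifest validates."""
    return [ex for ex in examples if is_valid_manifest(ex["manifest"])]

corpus = [
    {"instruction": "Create a pod named web",
     "manifest": {"apiVersion": "v1", "kind": "Pod", "metadata": {"name": "web"}}},
    {"instruction": "Create a service",
     "manifest": {"kind": "Service", "metadata": {}}},  # missing apiVersion and name
]
clean = filter_corpus(corpus)
print(len(clean))
```

Only the surviving pairs would then be used for supervised fine-tuning.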
Results
The experimental results showed that the fine-tuned model achieved a full-pass accuracy of 91.5% on the K8s-Distill-Pilot corpus, indicating a strong capability in generating valid Kubernetes YAML configurations under resource-constrained conditions.
Implications
The proposed method has significant implications for automating the generation of infrastructure as code (IaC) configurations, potentially reducing operational risks associated with configuration errors. It also highlights the viability of using SLMs in environments with limited computational resources, making advanced AI capabilities more accessible.
CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning
NLP
Large Language Models
Reinforcement Learning
- CurveRL formulates prompt reweighting as context distribution control, enhancing understanding of optimal weights.
- The approach uses a distribution-aware utility function in pass-rate quantile space to derive optimal weights.
- Extensive experiments show CurveRL's superior performance in improving pass@1 and pass@k metrics.
- The study emphasizes the significance of context-distribution control in RLVR algorithm design.
Summary
The paper introduces CurveRL, a novel approach to prompt reweighting in Reinforcement Learning with Verifiable Rewards (RLVR) aimed at enhancing the reasoning capabilities of large language models (LLMs). The authors identify a gap in understanding the principles behind optimal prompt weighting and propose a unified framework that formulates prompt reweighting as a functional derivative of a utility function defined in the pass-rate function space. This framework accommodates existing methods like REINFORCE and GRPO while providing a principled basis for prompt distribution control. CurveRL employs a distribution-aware prompt reweighting strategy based on quantile coordinate transformation, where the weights assigned to prompts depend on their rank and density rather than absolute pass rates. Extensive experiments demonstrate that CurveRL consistently outperforms GRPO and other RLVR baselines across multiple reasoning benchmarks, highlighting the importance of context-distribution control in the design of prompt-reweighted RLVR algorithms.
Methodology
The authors develop a unified optimality framework for prompt reweighting in RLVR, deriving optimal weights as functional derivatives of a utility function. CurveRL is instantiated using a distribution-aware utility that reflects the distributional structure of pass rates, focusing on rank and density information. The methodology includes extensive experimental validation across various benchmarks to assess performance improvements.
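A rough way to see what weighting in quantile coordinates means is the sketch below. It is illustrative only: the weighting curve `shape` (peaking at the median) is a hypothetical choice rather than the utility derived in the paper, and this toy uses rank information only, not density.

```python
def quantile_ranks(pass_rates):
    """Map each pass rate to its mid-rank empirical quantile in (0, 1)."""
    n = len(pass_rates)
    ranks = []
    for p in pass_rates:
        below = sum(1 for q in pass_rates if q < p)
        equal = sum(1 for q in pass_rates if q == p)
        ranks.append((below + 0.5 * equal) / n)
    return ranks

def quantile_weights(pass_rates, shape=lambda u: 4.0 * u * (1.0 - u)):
    """Weight prompts by a function of their quantile, normalized to sum to 1.

    `shape` is a hypothetical weighting curve that peaks at the median.
    """
    raw = [shape(u) for u in quantile_ranks(pass_rates)]
    total = sum(raw)
    return [w / total for w in raw]

# Two batches with different absolute pass rates but the same ordering
# receive identical weights, because only rank information is used.
w_easy = quantile_weights([0.70, 0.80, 0.90, 0.95])
w_hard = quantile_weights([0.05, 0.10, 0.20, 0.60])
print(w_easy == w_hard)
```

The point of the example is the invariance: shifting all pass rates up or down does not change the weights as long as the ordering is preserved.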
Results
The results indicate that CurveRL consistently improves the trade-off between pass@1 and pass@k metrics compared to standard baselines like GRPO. The empirical findings validate the effectiveness of the proposed distribution-aware prompt reweighting approach.
Implications
The findings suggest that context-distribution control is a critical factor in designing effective RLVR algorithms. This could lead to advancements in LLM reasoning capabilities and inform future research on prompt reweighting strategies in reinforcement learning.
Explainable Comparison of Feature-Based and Deep Learning Models for TROPOMI Methane Plume Screening
Computer Vision
Interpretability
- Comparison of feature-based and image-based machine learning models for methane plume classification.
- Identification of retrieval artifacts in TROPOMI data that resemble methane emissions.
- Use of SHAP for explainability to interpret model decisions.
- Evaluation of models under both balanced and imbalanced conditions.
Summary
This paper addresses the challenge of accurately detecting methane emissions from satellite observations, specifically using data from the TROPOMI instrument on the Sentinel-5 Precursor satellite. The authors highlight that many plume-like features detected in TROPOMI data are actually retrieval artifacts rather than genuine methane emissions. Previous methodologies relied on a Support Vector Machine (SVM) classifier trained on expert-defined features, which limited the model's ability to capture spatial relationships and potentially led to information loss. In this study, the authors compare the performance of various feature-based models (SVC, Random Forest, XGBoost) against image-based models (ResNet-18, ResNet-34) for methane plume-artifact classification. They evaluate these models under both balanced and imbalanced settings to reflect real-world conditions. Additionally, they utilize SHAP (SHapley Additive exPlanations) for model interpretability, allowing for a better understanding of the factors influencing model performance. The findings suggest that image-based models may offer superior performance in distinguishing between actual methane emissions and artifacts, providing insights for operational methane screening workflows such as the CAMS Methane Hotspot Explorer.
Methodology
The study employs a comparative analysis of feature-based models (SVC, Random Forest, XGBoost) and image-based models (ResNet-18, ResNet-34) for classifying methane plumes. The models are evaluated in both balanced and imbalanced settings to assess their performance in realistic scenarios. SHAP is used to provide interpretability for both model types, allowing for an understanding of the contributing factors to their predictions.
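One reason to evaluate under both balanced and imbalanced settings is that precision depends strongly on class prevalence even when the classifier itself is unchanged. The short calculation below illustrates this with made-up rates; the TPR, FPR, and prevalence values are hypothetical and are not taken from the paper.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by fixed TPR/FPR at a given positive-class share."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

balanced = precision_at_prevalence(tpr=0.9, fpr=0.1, prevalence=0.5)
imbalanced = precision_at_prevalence(tpr=0.9, fpr=0.1, prevalence=0.05)
print(round(balanced, 3), round(imbalanced, 3))
```

The same detector that looks precise on a balanced test set can produce mostly false alarms when true plumes are rare.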
Results
The results indicate that image-based models outperform feature-based models in accurately classifying methane plumes versus retrieval artifacts. The use of SHAP for explainability reveals key factors influencing model decisions, enhancing the interpretability of the results. The findings support the development of more effective automated detection methods for methane emissions.
Implications
The study's findings have significant implications for improving the accuracy of methane plume detection from satellite data, which is crucial for monitoring and mitigating global warming. The insights gained can enhance operational workflows like the CAMS Methane Hotspot Explorer, reducing the need for manual inspections and increasing the efficiency of methane emission monitoring.
Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization
Optimization
Theory
Efficient ML
- ProMoT generalizes existing smoothing methods by allowing a broader class of symmetric unimodal kernels and introducing ratio-monotone transformations.
- The framework preserves the global maximizer and ensures convergence of stationary points near the true optimum without a decreasing smoothing schedule.
- A leave-one-out variance reduction technique is introduced, improving the iteration complexity of gradient estimation.
- ProMoT demonstrates improved robustness to hyperparameter tuning compared to traditional Gaussian smoothing methods.
Summary
This paper addresses the challenges of global optimization in high-dimensional and non-convex landscapes, where traditional methods often struggle due to hyperparameter sensitivity and computational overhead. The authors propose a novel framework called Probabilistic Smoothing with Ratio-Monotone Transforms (ProMoT), which combines flexible symmetric unimodal kernels with monotonic ratio-based transformations. This approach preserves the global maximizer of the original objective and ensures that stationary points of the smoothed objective converge near the true optimum without needing a decreasing smoothing schedule. The paper also introduces a variance-reduced variant, ProMoT-loo, which enhances the efficiency of Monte Carlo gradient estimators. Theoretical guarantees are provided for the proposed methods, demonstrating their robustness and competitive performance in various optimization scenarios, including high-dimensional benchmarks and adversarial attacks.
Methodology
The authors develop a general smoothing framework that utilizes symmetric unimodal kernels and ratio-monotone transformations. They analyze the theoretical properties of these methods, showing that they maintain global optimality and improve convergence rates. Additionally, they introduce a variance-reduction technique based on leave-one-out estimation to enhance the efficiency of gradient calculations.
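As a point of reference, the sketch below shows a leave-one-out baseline applied to a plain Gaussian-smoothing gradient estimator in one dimension. It is not ProMoT itself: the broader kernel class and the ratio-monotone transform are not shown, and the function name, sample count, and toy objective are assumptions made for illustration.

```python
import random

def loo_smoothed_gradient(f, x, sigma=0.5, n_samples=2000, seed=0):
    """Monte Carlo gradient of F(x) = E[f(x + sigma * u)], u ~ N(0, 1).

    Each sample's baseline is the mean of the *other* samples' values
    (leave-one-out), which lowers variance without introducing bias.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    values = [f(x + sigma * u) for u in noise]
    total = sum(values)
    grad = 0.0
    for u, v in zip(noise, values):
        baseline = (total - v) / (n_samples - 1)
        grad += (v - baseline) * u
    return grad / (n_samples * sigma)

def objective(z):
    return -(z - 1.0) ** 2  # toy 1-D objective, maximized at z = 1

estimate = loo_smoothed_gradient(objective, x=0.0)
print(estimate)  # the exact smoothed gradient at x = 0 is 2
```

Because each baseline excludes its own sample, it is independent of that sample's noise draw, so subtracting it leaves the expectation of the estimator unchanged.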
Results
The proposed ProMoT framework achieves ε-approximate global localization guarantees similar to previous methods while being more robust to hyperparameter settings. The leave-one-out variance reduction method results in a smaller second-moment bound, leading to improved iteration complexity. Experimental results indicate that ProMoT outperforms existing methods in terms of robustness and efficiency across various optimization benchmarks.
Implications
The findings suggest that ProMoT can be effectively applied to complex optimization problems in machine learning and engineering, particularly in scenarios where traditional methods fail due to sensitivity to initialization and hyperparameters. This framework could enhance optimization strategies in fields such as computer vision, robotics, and adversarial machine learning.
Curriculum Learning for Safety Alignment
NLP
Large Language Models
Optimization
- Introduction of Staged-Competence framework for safety alignment in LLMs.
- Demonstrated 16% reduction in OOD harmful response rates and 20% reduction in jailbreak attack success rates.
- Improved data efficiency, achieving baseline safety performance with only 75% of the training data.
- Identified inconsistencies in popular safety datasets and provided a cleaned dataset for training.
Summary
This paper investigates the application of Curriculum Learning to enhance the robustness of Direct Preference Optimization (DPO) for safety alignment in large language models (LLMs). DPO is a prevalent method for aligning model behavior with human preferences, but it has been shown to be brittle and struggles with out-of-distribution (OOD) generalization. The authors introduce a novel framework called Staged-Competence, which organizes preference data by difficulty and employs competence-based sampling to progressively update the reference model during training. The framework aims to improve the model's ability to distinguish between safe and unsafe responses while maintaining general capabilities. The authors conduct experiments across three model families, demonstrating that Staged-Competence reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while also achieving better data efficiency by matching baseline safety performance with only 75% of the training data. The study also reveals inconsistencies in existing safety datasets and introduces a cleaned dataset for DPO-based safety training. Overall, the findings suggest that Curriculum Learning can significantly enhance safety alignment in LLMs.
Methodology
The authors propose a curriculum training algorithm called Staged-Competence, which organizes preference data by difficulty using a preference alignment margin. The training process involves competence-based sampling and progressive updates to the reference model, allowing the model to learn robust features for safety alignment more effectively.
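Competence-based sampling over difficulty-sorted data might look roughly like the following. This is a sketch under stated assumptions: the square-root competence schedule, the convention that a larger margin means an easier pair, and the names `competence` and `sample_batch` are all hypothetical, and the progressive reference-model updates are not shown.

```python
import math
import random

def competence(step, total_steps, c0=0.1):
    """Fraction of the difficulty-sorted data available at a given step.

    Square-root schedule, an assumed choice; grows from c0 to 1.0.
    """
    progress = min(1.0, step / total_steps)
    return min(1.0, math.sqrt(progress * (1.0 - c0**2) + c0**2))

def sample_batch(pairs, step, total_steps, batch_size, rng):
    """Sample from the easiest `competence` fraction of preference pairs.

    pairs: list of (example_id, margin); a larger margin is treated as easier.
    """
    ordered = sorted(pairs, key=lambda item: item[1], reverse=True)
    cutoff = max(1, int(competence(step, total_steps) * len(ordered)))
    return [rng.choice(ordered[:cutoff]) for _ in range(batch_size)]

rng = random.Random(0)
pairs = [(i, 1.0 - i / 100) for i in range(100)]  # id 0 is easiest
early = sample_batch(pairs, step=0, total_steps=1000, batch_size=8, rng=rng)
late = sample_batch(pairs, step=1000, total_steps=1000, batch_size=8, rng=rng)
print([ex_id for ex_id, _ in early])
```

Early batches draw only from the easiest pairs, and the pool widens to the full dataset as training progresses.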
Results
The Staged-Competence framework led to a 16% reduction in OOD harmful response rates and a 20% decrease in jailbreak attack success rates across three model families. It also achieved approximately three times greater separation between safe and unsafe responses and matched baseline safety performance with 25% less training data.
Implications
The findings suggest that integrating Curriculum Learning into safety alignment strategies can lead to more robust and efficient models. This approach may have broader applications in various alignment domains beyond safety, enhancing the overall reliability of large language models.
ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference
Large Language Models
Efficient ML
- ReMoE enhances expert reuse in Mixture-of-Experts models through router fine-tuning.
- The method promotes temporal stability in routing to align with cache locality constraints.
- Experiments show a 26% increase in expert reuse without sacrificing task performance.
- Real-system evaluations indicate an 8.4% improvement in throughput and significant reductions in processing time.
Summary
The paper introduces ReMoE, a router fine-tuning framework aimed at enhancing expert reuse in memory-constrained inference scenarios for Mixture-of-Experts (MoE) models. MoE architectures activate only a subset of experts per token, which is efficient for computation but poses challenges in memory-limited environments where frequent expert switching can lead to high I/O overhead. ReMoE addresses this issue by modifying the routing behavior of the MoE model to favor recently selected experts, thereby promoting temporal stability in routing and better aligning with cache locality constraints. This approach does not alter the model architecture or inference processes but fine-tunes the router to improve expert overlap and reduce the number of expert fetches from external storage. Experimental results demonstrate that ReMoE significantly increases expert reuse by 26% while maintaining performance on downstream tasks. Real-world evaluations show an 8.4% improvement in output throughput and a substantial reduction in total processing time on various platforms, indicating its effectiveness in enhancing the efficiency of MoE models in edge-device applications.
Methodology
ReMoE employs a lightweight fine-tuning approach that modifies the routing behavior of MoE models using two objectives: encouraging temporal locality for expert reuse and applying a Trust-KL loss to maintain alignment with the pretrained router. This method focuses on optimizing the routing trace to reduce cache turnover and I/O overhead during inference.
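Expert reuse can be quantified on a routing trace as the overlap between the expert sets chosen for consecutive tokens. The sketch below uses one plausible definition; the function name and the exact normalization are assumptions, and the paper's metric may be defined differently.

```python
def mean_expert_overlap(routing_trace):
    """Average fraction of experts shared between consecutive tokens.

    routing_trace: list of sets, the expert ids selected for each token.
    """
    if len(routing_trace) < 2:
        return 0.0
    overlaps = []
    for prev, curr in zip(routing_trace, routing_trace[1:]):
        overlaps.append(len(prev & curr) / len(curr))
    return sum(overlaps) / len(overlaps)

unstable = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]  # every token needs new experts
stable = [{0, 1}, {0, 2}, {0, 2}, {2, 3}]    # experts are often reused
print(mean_expert_overlap(unstable), mean_expert_overlap(stable))
```

A higher overlap means fewer experts have to be fetched from external storage between tokens, which is the behavior the fine-tuned router is meant to encourage.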
Results
The implementation of ReMoE resulted in a 26.4% increase in expert overlap for the DeepSeek model and a 27.2% increase for the Qwen model. Additionally, real-system tests showed an 8.4% improvement in output throughput and a 43.6–49.8% reduction in time per output token (TPOT) on the Jetson Orin NX platform, corresponding to a decode speedup of 1.77–1.99 times across various workloads.
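The reported latency reduction and decode speedup are consistent with each other if the reduction is read as a cut in per-token decode latency: a fractional reduction r implies a speedup of 1 / (1 - r). The helper name below is hypothetical; the arithmetic is a quick check, not part of the paper.

```python
def speedup_from_reduction(reduction):
    """Speedup implied by cutting per-token latency by `reduction` (0..1)."""
    return 1.0 / (1.0 - reduction)

low = round(speedup_from_reduction(0.436), 2)   # 1.77
high = round(speedup_from_reduction(0.498), 2)  # 1.99
print(low, high)
```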
Implications
ReMoE has significant implications for deploying large language models on edge devices, where memory constraints and I/O overhead are critical challenges. By improving expert reuse, it enhances the efficiency and responsiveness of AI applications in real-time scenarios, such as translation and photo editing.
SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors
NLP
Large Language Models
- SemanticZip defines a new approach to lossy text compression using LLMs for semantic decompression.
- The framework distinguishes between protected commitment-preserving channels and lossy channels for safe context compression.
- A small experimental harness is presented to compare various text representation formats.
- Results show a coherent gradient of compression and recoverability, with structured representations performing best.
Summary
The paper introduces SemanticZip, a novel framework for lossy text compression that leverages large language models (LLMs) as semantic decompressors. Unlike traditional compression methods that aim for exact reconstruction, SemanticZip focuses on compressing text into compact codes that retain task-relevant meanings, allowing LLMs to expand these codes into coherent outputs. The authors formalize the concept of LLM-mediated decompression and propose a hybrid architecture that distinguishes between protected channels for critical commitments and lossy channels for less sensitive information. They evaluate six representation regimes across five diagnostic cases, measuring performance through metrics such as Critical Atom Recall (CAR) and Weighted Atom Recall (WAR). The results indicate that structured prose achieves the highest recoverability, while SemanticZip ASCII offers significant compression but with lower semantic fidelity. The paper emphasizes the need for further research to establish benchmark claims and improve the understanding of lossy text compression in LLM contexts.
Methodology
The authors formalized the concept of lossy text compression with LLMs as decompressors and constructed a small experimental framework to evaluate different representation formats. They conducted experiments on six representation regimes and measured performance using metrics like CAR and WAR, focusing on the ability to recover task-relevant semantic information.
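The summary does not define the metrics precisely, so the following is only a plausible reading: Weighted Atom Recall as the importance-weighted share of semantic atoms that survive decompression, and token reduction as the fraction of tokens saved. The atom names, weights, and token counts are hypothetical.

```python
def weighted_atom_recall(atoms, recovered):
    """Weighted share of semantic atoms found in the decompressed text.

    atoms:     dict mapping atom id -> importance weight (hypothetical).
    recovered: set of atom ids judged present after decompression.
    """
    total = sum(atoms.values())
    kept = sum(w for atom, w in atoms.items() if atom in recovered)
    return kept / total

def token_reduction(original_tokens, compressed_tokens):
    """Fraction of tokens saved by the compressed representation."""
    return 1.0 - compressed_tokens / original_tokens

atoms = {"deadline": 3.0, "owner": 2.0, "tone": 1.0}
war = weighted_atom_recall(atoms, recovered={"deadline", "owner"})
saved = token_reduction(original_tokens=200, compressed_tokens=120)
print(round(war, 3), round(saved, 3))
```

Under this reading, losing a low-weight atom such as tone costs less recall than losing a critical one, which matches the distinction between protected and lossy channels.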
Results
The experiments revealed that structured prose had the highest recoverability (WAR=0.956) with a modest token reduction of 19.1%. CCL-Core followed closely with a WAR of 0.948 but offered minimal compression. CCL-Min provided a balanced trade-off, with a 39.4% token reduction and a WAR of 0.874. SemanticZip ASCII achieved the largest compression at 46.5% but with a lower WAR of 0.802, while the emoji-based representation performed the worst in both compression and recovery.
Implications
The findings suggest that lossy text compression can be effectively utilized in LLM applications, particularly for contexts where exact reconstruction is not critical. This approach could lead to more efficient use of context windows in LLMs, potentially reducing costs and improving performance in various applications such as chatbots, automated summarization, and information retrieval systems.
WINDQuant: Weight-Informed Neural Decision-Making for Global Mixed-Precision LLM Quantization
Large Language Models
Reinforcement Learning
Efficient ML
- WINDQuant reformulates mixed-precision quantization as a sequential decision-making problem, allowing for adaptive bit-width allocation.
- The framework operates at a fine-grained column-chunk level, improving precision assignment flexibility.
- WINDQuant achieves competitive performance in ultra-low-bit settings on LLaMA models without requiring full retraining.
- The approach integrates reinforcement learning with activation-aware mechanisms for effective quantization.
Summary
WINDQuant introduces a novel approach to quantizing large language models (LLMs) by leveraging reinforcement learning to optimize the allocation of bit-widths for model weights. Traditional quantization methods often face challenges in maintaining model performance, particularly in ultra-low-bit settings. Existing post-training quantization (PTQ) techniques can lead to significant accuracy loss, while quantization-aware training (QAT) methods require extensive retraining and resources. WINDQuant addresses these issues by formulating the quantization process as a sequential decision-making problem, allowing for fine-grained control over the precision of weight matrices. The framework operates at the column-chunk level, enabling adaptive bit-width assignments under a global storage budget. By integrating Proximal Policy Optimization (PPO) with activation-aware calibration and effective-bit accounting, WINDQuant achieves competitive performance on LLaMA models without the need for full model retraining. The results demonstrate that WINDQuant significantly reduces computational overhead while maintaining model accuracy, showcasing the potential of reinforcement learning in adaptive mixed-precision quantization.
Methodology
WINDQuant employs a reinforcement learning framework to optimize the quantization of LLMs. It defines the quantization task as a finite-horizon sequential decision problem, where an agent selects actions (quantization levels) based on observed features of model units. The training utilizes Proximal Policy Optimization (PPO) combined with budget-aware action masking and a quality penalty based on perplexity, enabling fine-grained control over the allocation of bit-widths across model layers.
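Budget-aware action masking can be illustrated with a small feasibility check. This sketch is an assumption about how such a mask could work, not the paper's implementation: the function name, the uniform unit size, and the rule of reserving the minimum bit-width for all remaining units are hypothetical.

```python
def allowed_bit_widths(choices, remaining_bits, units_left, unit_size):
    """Mask actions that would make the global storage budget infeasible.

    A bit-width is allowed only if, after spending it on the current unit,
    every remaining unit can still receive at least the smallest bit-width.
    """
    min_bits = min(choices)
    reserve = (units_left - 1) * unit_size * min_bits
    return [b for b in choices if b * unit_size + reserve <= remaining_bits]

choices = [2, 3, 4]
# 4 units of 128 weights each under an average budget of 2.5 bits per weight.
budget = int(2.5 * 4 * 128)
first = allowed_bit_widths(choices, budget, units_left=4, unit_size=128)
# After spending 4 bits per weight on the first unit:
second = allowed_bit_widths(choices, budget - 4 * 128, units_left=3, unit_size=128)
print(first, second)
```

A policy would then sample only from the unmasked bit-widths at each step, so every completed episode respects the global budget by construction.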
Results
Experiments conducted on various LLaMA models (1B, 3B, 8B, 70B parameters) demonstrate that WINDQuant achieves competitive performance in ultra-low-bit quantization settings, significantly reducing optimization overhead compared to retraining-based approaches. The results indicate that the proposed method effectively maintains model accuracy while optimizing resource usage.
Implications
The findings suggest that WINDQuant can be applied to enhance the deployment of large language models in resource-constrained environments, making it feasible to utilize advanced models on edge devices. The adaptive quantization strategy could lead to broader applications in various domains requiring efficient model inference.
Function-Valued Causal Influence in Nonlinear Time Series
Time Series
- Scalar edge scores in nonlinear causal discovery obscure the true nature of causal relationships.
- Function-valued causal influence provides a more nuanced understanding of causal effects that vary across states.
- The proposed framework allows for direct estimation of causal response functions from trained models.
- Synthetic experiments reveal that similar scalar scores can correspond to qualitatively different causal mechanisms.
Summary
This paper addresses the limitations of scalar edge scores in summarizing causal relationships derived from nonlinear time series models. The authors argue that these scalar summaries obscure the true nature of causal influence, which is inherently state-dependent and can vary across different regimes and contexts. They formalize the concept of function-valued causal influence for additive, contribution-decomposable autoregressive models, highlighting that scalar scores represent a significant information bottleneck. The paper introduces a practical framework based on Individual Conditional Expectation (ICE) to estimate causal response functions directly from trained models. Through synthetic experiments, the authors demonstrate that edges with similar scalar scores can exhibit qualitatively different causal behaviors. An applied case study on democratic development illustrates that function-valued analysis can reveal regime-specific and asymmetric causal structures that are often missed by traditional score-centric approaches. This reframing of causal influence is particularly relevant in theory-driven domains, such as social sciences, where understanding the mechanisms and contexts of causality is crucial.
Methodology
The authors formalize function-valued causal influence for a class of additive autoregressive models and introduce a framework based on Individual Conditional Expectation (ICE) for estimating causal response functions. They conduct controlled synthetic experiments to illustrate the differences in causal mechanisms and apply their approach to a panel dataset on democratic development.
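Individual Conditional Expectation curves are computed by sweeping one input over a grid while holding the rest of each observed state fixed. The sketch below shows that mechanic on a toy predictor; `toy_model` and the feature names `x_lag` and `y_lag` are hypothetical stand-ins for a trained autoregressive model.

```python
def ice_curves(model, rows, feature, grid):
    """One response curve per row: vary `feature` over `grid`, fix the rest."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            probe = dict(row)
            probe[feature] = value
            curve.append(model(probe))
        curves.append(curve)
    return curves

def toy_model(state):
    # Hypothetical fitted predictor: thresholded effect of lagged x on y.
    return 0.5 * state["y_lag"] + (1.0 if state["x_lag"] > 0.0 else 0.0)

rows = [{"x_lag": -1.0, "y_lag": 0.0}, {"x_lag": 2.0, "y_lag": 2.0}]
curves = ice_curves(toy_model, rows, feature="x_lag", grid=[-1.0, 0.0, 1.0])
print(curves)
```

Each curve exposes the shape of the response (here a step at zero), which a single scalar edge score would collapse into one number.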
Results
The study shows that edges with indistinguishable scalar scores can exhibit qualitatively different functional behaviors, such as monotonic, thresholded, and saturating effects. The analysis of democratic development data reveals regime-dependent causal behaviors that are not captured by scalar scores.
Implications
This work has significant implications for causal discovery in nonlinear time series, particularly in fields like social sciences where understanding the context and mechanisms of causality is critical. It encourages the adoption of function-valued approaches to improve the interpretability and utility of causal models.
RealBench: Benchmarking Data-Driven Numerical Weather Forecasting Under Operational Conditions and Extreme Event Challenges
Time Series
- RealBench provides a more accurate evaluation framework for AI weather forecasting models under operational conditions.
- The benchmark eliminates data leakage by using a strictly out-of-distribution test set for 2025.
- It integrates real-time operational analysis and extensive in-situ observations for direct performance evaluation.
- Event-specific metrics are introduced to better assess the forecasting of high-impact extreme weather events.
Summary
The paper introduces RealBench, a novel benchmark designed to evaluate AI-based weather forecasting models under realistic operational conditions, addressing the limitations of existing benchmarks that primarily rely on reanalysis products like ERA5. These traditional benchmarks often fail to reflect real-time forecasting constraints and can lead to a mismatch between benchmark performance and actual operational readiness. RealBench features a strictly out-of-distribution test set for the year 2025, which eliminates data leakage and captures recent atmospheric regimes. It integrates multiple data sources, including low-latency operational analysis and a comprehensive global in-situ observation dataset with over 10,000 stations, allowing for direct evaluation against actual atmospheric measurements. The benchmark also emphasizes high-impact extreme weather events, providing event-specific metrics that align with real-world forecasting priorities. The evaluation results reveal significant discrepancies between reanalysis-based metrics and real-world performance, particularly for extreme events, highlighting the need for a more operationally relevant evaluation framework. The authors argue that strong scores from ERA5 do not guarantee operational readiness, as station observations and extreme-event metrics reveal substantial performance gaps. This work establishes a rigorous foundation for advancing AI weather forecasting systems, with the benchmark implementation available for public use.
Methodology
RealBench employs a comprehensive evaluation protocol that includes a strictly out-of-distribution test set, real-time operational analysis data, and a large-scale global observation dataset. It emphasizes the assessment of extreme weather events through targeted metrics, moving away from reliance on historical reanalysis data.
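As one example of an event-oriented score evaluated against station observations, the sketch below computes the critical success index for threshold exceedance. The choice of this metric, the threshold, and the sample values are illustrative assumptions; the summary does not state which event-specific metrics RealBench actually uses.

```python
def critical_success_index(forecasts, observations, threshold):
    """CSI = hits / (hits + misses + false alarms) for threshold exceedance."""
    hits = misses = false_alarms = 0
    for fc, ob in zip(forecasts, observations):
        predicted, occurred = fc >= threshold, ob >= threshold
        if predicted and occurred:
            hits += 1
        elif occurred:
            misses += 1
        elif predicted:
            false_alarms += 1
    denom = hits + misses + false_alarms
    return hits / denom if denom else float("nan")

station_obs = [5.0, 32.0, 41.0, 12.0, 30.0]  # hypothetical observed values
model_fc = [7.0, 35.0, 28.0, 31.0, 33.0]     # hypothetical forecasts
csi = critical_success_index(model_fc, station_obs, threshold=30.0)
print(csi)
```

Unlike an average error over all grid points, a score of this kind ignores correct non-events, so it is driven entirely by how well rare exceedances are captured.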
Results
The evaluation results demonstrate substantial discrepancies between the performance metrics derived from reanalysis data and those obtained from real-world observations, particularly for extreme weather events. This indicates that traditional benchmarks may not accurately reflect the operational readiness of forecasting models.
Implications
The establishment of RealBench has significant implications for the development and deployment of AI weather forecasting systems, ensuring that models are rigorously evaluated under conditions that closely mimic real-world scenarios. This can lead to improved forecasting accuracy and reliability, ultimately benefiting sectors reliant on accurate weather predictions.
DriftingMol: Decoder-Coupled Drift for One-Pass Property-Conditional Molecular Generation
Generative Models
- Introduction of a SELFIES latent drifting pipeline for efficient molecular generation.
- Development of decoder-coupled drift, which utilizes a frozen VAE decoder for gradient preservation.
- Demonstration of superior performance in property control and uniqueness compared to other drift variants.
- Validation of the method through extensive ablation studies and protocol-matched conditioning.
Summary
The paper presents DriftingMol, a novel framework for property-conditional molecular generation that aims to produce valid and diverse molecules efficiently. The authors introduce a two-stage approach that utilizes a frozen SELFIES β-VAE to create a latent space, where the decoder's hidden representation serves as a drift feature map. The key innovation is the decoder-coupled drift, which maintains fixed decoder weights while allowing gradients to be backpropagated through the decoder feature map to a DiT generator. This method enables a single generator evaluation followed by a frozen decoder pass, significantly reducing sampling costs. The authors demonstrate that their approach achieves high uniqueness and correlation with desired properties on the ZINC250K dataset, outperforming various controlled variants. The results indicate that preserving the gradient path through decoder features is crucial for effective property control, highlighting the advantages of the proposed method over traditional drifting models that rely on external features.
Methodology
The methodology consists of a two-stage pipeline: first, training a SELFIES β-VAE to create a continuous latent space for molecular representation. In the second stage, a DiT-style generator is trained with a drifting model objective, where decoder weights are fixed, and gradients are backpropagated through the decoder feature map to optimize the generator. The drift field is computed using decoder hidden features, allowing for efficient property-conditional generation.
Results
The default DriftingMol model achieved a QED Spearman correlation of 0.493 with 94.7% uniqueness, while the best decoder-coupled condition reached 0.510. Under protocol-matched four-property conditioning, the method achieved a mean Spearman correlation of up to 0.598. The results from 15 controlled variants consistently showed that models preserving the gradient path through decoder features outperformed other drift variants.
Implications
The findings suggest that decoder-coupled drift can significantly enhance the efficiency and effectiveness of property-conditional molecular generation, making it a valuable tool for drug discovery and materials science. The approach could lead to faster and more accurate generation of molecules that meet specific physicochemical properties.
TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models
Time Series
- First study on pretraining contamination auditing specifically for Time Series Foundation Models (TSFMs).
- Introduces TSFMAudit, a framework leveraging probe adaptation dynamics to infer contamination risk.
- Demonstrates that contaminated datasets show faster loss reduction and less parameter movement during fine-tuning.
- Evaluated on 6 TSFMs and 187 datasets, outperforming 10 existing contamination auditing methods.
Summary
The paper introduces TSFMAudit, a novel framework for auditing data contamination in Time Series Foundation Models (TSFMs). As TSFMs are increasingly pretrained on extensive datasets, concerns arise regarding the potential exposure of evaluation datasets during pretraining, which can lead to inflated performance estimates. The authors formalize the problem of pretraining contamination auditing and propose a method based on probe adaptation dynamics. The key insight is that contaminated datasets exhibit faster loss reduction and less parameter movement during fine-tuning compared to clean datasets. TSFMAudit is evaluated on six TSFMs across 187 datasets, using documented training source evidence as supervision, and is compared against ten competitive baselines adapted from the large language model literature. The results demonstrate that TSFMAudit consistently outperforms these baselines, highlighting its effectiveness in detecting contamination in time series data.
Methodology
TSFMAudit utilizes probe adaptation dynamics to assess contamination risk by analyzing loss reduction rates and parameter displacement during fine-tuning. The method formalizes the auditing process at the dataset level and employs documented training source evidence for supervision.
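The two signals named above can be sketched as simple summary statistics of a short fine-tuning run. This is a minimal illustration, not the authors' implementation: the function name `adaptation_features`, the per-step normalisation of the loss drop, and the use of Euclidean distance for parameter movement are all assumptions made here.

```python
import math


def adaptation_features(losses, theta_start, theta_end):
    """Summarise a short probe fine-tuning run (hypothetical helper).

    losses      : loss recorded before the first step and after each step
    theta_start : flat list of probe parameters before fine-tuning
    theta_end   : flat list of probe parameters after fine-tuning
    Returns (relative loss reduction per step, Euclidean parameter displacement).
    """
    steps = len(losses) - 1
    if steps < 1 or losses[0] <= 0:
        raise ValueError("need at least one step and a positive initial loss")
    # Fraction of the initial loss removed, averaged over the number of steps.
    loss_reduction_rate = (losses[0] - losses[-1]) / (losses[0] * steps)
    # How far the probe parameters moved during adaptation.
    displacement = math.sqrt(
        sum((b - a) ** 2 for a, b in zip(theta_start, theta_end))
    )
    return loss_reduction_rate, displacement


rate, moved = adaptation_features([1.0, 0.5, 0.25], [0.0, 0.0], [3.0, 4.0])
```

According to the summary, a dataset seen during pretraining would tend to show a higher reduction rate together with a smaller displacement. How the two features are combined into a risk score is not described here, so the sketch stops at feature extraction.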
Results
TSFMAudit was tested on six TSFMs and 187 datasets, achieving superior detection performance compared to ten baseline methods adapted from the LLM domain. The results indicate that TSFMAudit effectively identifies contaminated datasets, thereby improving the reliability of performance evaluations in time series forecasting.
Implications
The findings underscore the importance of contamination auditing in time series forecasting, which can enhance the validity of model evaluations and ensure that performance metrics reflect true generalization capabilities. This framework can be applied to improve the integrity of benchmarking practices in various domains relying on time series data.
Agile Online Model Selection: Resolving Adaptation Lag via Safeguarded Large Learning Rates
Theory
Optimization
- Introduces a novel OOMD algorithm that utilizes safeguarded large learning rates to enhance adaptation speed.
- Employs a dynamic penalty mechanism to manage the risks associated with large learning rates, ensuring theoretical robustness.
- Demonstrates a significant reduction in adaptation lag, outperforming existing tuning-free algorithms on various datasets.
- Achieves near-optimal worst-case guarantees while allowing for aggressive adaptation to distribution shifts.
Summary
This paper addresses the challenge of maintaining predictive accuracy in non-stationary environments through online model selection (OMS). Traditional tuning-free algorithms face a trade-off between robustness and agility, as they restrict learning rates to small constants to ensure dynamic regret bounds, leading to significant adaptation lag during abrupt distribution shifts. The authors propose a novel optimistic online mirror descent (OOMD) algorithm that employs safeguarded large learning rates, allowing rates to scale up to Θ(T), where T is the number of rounds. A key innovation is a post-hoc penalty mechanism that dynamically monitors unstable updates and excludes learning rates that incur excessive regret, thus eliminating the need for restrictive a priori constraints. The algorithm achieves a cumulative penalty of O(log T), enabling it to maintain near-optimal worst-case guarantees while outperforming tuning-free baselines in empirical evaluations across synthetic and eleven real-world datasets. The proposed method significantly reduces adaptation lag from hundreds of rounds to just a few, demonstrating its effectiveness in rapidly adapting to distribution shifts.
Methodology
The authors developed an optimistic online mirror descent algorithm that allows for large learning rates up to Θ(T). They implemented a penalty mechanism that monitors updates for instability and excludes those that would lead to excessive regret, thus enabling faster adaptation without sacrificing theoretical guarantees.
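The link between learning rate and adaptation lag can be seen in a toy setting. The sketch below is a generic two-model instance of optimistic online mirror descent with a Euclidean regulariser, where projection onto the simplex reduces to clipping one weight to [0, 1] and the previous gradient serves as the hint. It does not implement the paper's penalty mechanism or learning-rate exclusion; the function names and the synthetic loss sequence are assumptions chosen for illustration.

```python
def clip01(x):
    """Project a scalar onto the interval [0, 1]."""
    return min(1.0, max(0.0, x))


def optimistic_omd_two_models(loss_rounds, eta):
    """Return the weight played on model 0 in each round.

    loss_rounds : list of (loss_model_0, loss_model_1) pairs
    eta         : learning rate
    """
    p_hat = 0.5   # secondary iterate: weight on model 0
    hint = 0.0    # optimistic guess of the next gradient
    played = []
    for loss0, loss1 in loss_rounds:
        # Optimistic step: move the secondary iterate along the hint.
        p = clip01(p_hat - eta * hint)
        played.append(p)
        # Gradient of the linear loss p*loss0 + (1-p)*loss1 with respect to p.
        grad = loss0 - loss1
        # Mirror step on the secondary iterate with the observed gradient.
        p_hat = clip01(p_hat - eta * grad)
        hint = grad
    return played


def adaptation_lag(played, shift):
    """Rounds after `shift` until model 1 receives more than half the weight."""
    for t in range(shift, len(played)):
        if played[t] < 0.5:
            return t - shift
    return None


# Model 0 is best for 100 rounds, then the distribution shifts and model 1 is best.
rounds = [(0.0, 1.0)] * 100 + [(1.0, 0.0)] * 100
lag_small = adaptation_lag(optimistic_omd_two_models(rounds, eta=0.01), shift=100)
lag_large = adaptation_lag(optimistic_omd_two_models(rounds, eta=0.5), shift=100)
```

With the small rate the learner needs tens of rounds to switch models, while the large rate switches almost immediately. The price of a large rate on noisier loss sequences is instability, which is the risk the paper's safeguard is described as controlling.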
Results
Empirical evaluations showed that the proposed algorithm significantly reduces adaptation lag from hundreds of rounds to just a few rounds, consistently outperforming tuning-free baselines across synthetic and diverse real-world datasets.
Implications
The findings suggest that the proposed method can enhance the robustness and efficiency of predictive models in dynamic environments, making it suitable for applications in areas such as finance, healthcare, and real-time data analysis where rapid adaptation to changes is critical.
Label-NTK Alignments and A Tighter Convergence Bound in the NTK Regime
Theory
Optimization
- Introduces Label-NTK and Residual-NTK alignments to improve convergence bounds in the NTK regime.
- Demonstrates that existing convergence guarantees are overly pessimistic due to reliance on the smallest NTK eigenvalue.
- Establishes a refined convergence bound that closely matches empirical training dynamics.
- Provides theoretical justification for the observed alignments under mild assumptions.
Summary
This paper addresses the limitations of existing convergence guarantees in the Neural Tangent Kernel (NTK) framework, which often rely on the smallest NTK eigenvalue, leading to overly pessimistic predictions that do not align with empirical training dynamics. The authors introduce two key phenomena: Label-NTK alignment and Residual-NTK alignment, which describe how the projections of data labels and residuals onto NTK eigenvectors correlate with the corresponding eigenvalues. By leveraging these alignments, the authors derive a refined convergence bound that utilizes the full NTK spectrum, significantly improving upon classical worst-case results. The paper provides both empirical evidence and theoretical justification for these findings under mild data assumptions, demonstrating that the convergence behavior predicted by their model closely matches practical training dynamics observed in experiments with multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs) across various datasets.
Methodology
The authors analyze the relationship between data labels and the NTK eigen-spectrum, identifying the alignment phenomena. They derive a new convergence bound based on these alignments and provide empirical validation through experiments on neural network architectures.
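To see why a bound based only on the smallest eigenvalue is pessimistic, consider the standard constant-kernel idealisation of gradient descent on squared loss. This is a textbook simplification used here for illustration, not the paper's refined bound. The symbols are assumptions of this sketch: K is the NTK Gram matrix with eigenpairs (λ_i, v_i), r_t is the residual after t steps, and η is the step size.

```latex
\begin{align*}
K &= \sum_{i=1}^{n} \lambda_i v_i v_i^{\top}, \qquad
r_{t+1} = (I - \eta K)\, r_t, \qquad 0 < \eta \le 1/\lambda_{\max}, \\
r_t &= \sum_{i=1}^{n} (1 - \eta \lambda_i)^{t}\, (v_i^{\top} r_0)\, v_i, \\
\|r_t\|_2^2 &= \sum_{i=1}^{n} (1 - \eta \lambda_i)^{2t}\, (v_i^{\top} r_0)^2
\;\le\; (1 - \eta \lambda_{\min})^{2t}\, \|r_0\|_2^2 .
\end{align*}
```

The inequality is tight only when r_0 lies along the eigenvector of λ_min. When most of the mass of r_0 sits on directions with large λ_i, which is the kind of alignment the summary describes, the exact sum decays much faster than the worst-case factor suggests.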
Results
The paper presents a refined convergence bound that depends on the full NTK spectrum rather than solely on the smallest eigenvalue, leading to significantly tighter convergence guarantees. Empirical results show that this new bound aligns closely with actual training dynamics, outperforming classical predictions.
Implications
The findings suggest that understanding the alignment between labels and NTK eigenvalues can lead to better optimization strategies for training deep neural networks, potentially improving training efficiency and generalization in practical applications.
SIKA-GP: Accelerating Gaussian Process Inference with Sparse Inducing Kernel Approximations for Bayesian Deep Learning
Efficient ML
Theory
Computer Vision
- SIKA-GP reduces the computational complexity of GP inference to O(log M) using sparse inducing kernel approximations.
- The method integrates seamlessly with Bayesian neural networks, enhancing scalability for high-dimensional feature representations.
- Empirical results show significant speedups in training and inference while maintaining predictive accuracy.
- The approach is applicable to deep feature learning, addressing challenges posed by large-scale datasets.
Summary
The paper introduces SIKA-GP, a novel approach aimed at accelerating Gaussian Process (GP) inference through the use of sparse inducing kernel approximations. Traditional GPs face significant computational challenges due to their cubic complexity in the number of training points, which limits their scalability to large datasets. SIKA-GP addresses this issue by employing a dyadic ordered template basis, resulting in a logarithmic complexity dependence on the number of inducing points. This method constructs compact and expressive kernel representations from sparsely activated bases, facilitating efficient tensorized GPU computations and integration with modern large-scale models. The authors demonstrate that SIKA-GP can be effectively embedded into Bayesian neural networks (BNNs) with sparse activations, achieving substantial speedups in both training and inference without compromising predictive performance. The empirical evaluations on vision and transformer-based language benchmarks confirm that SIKA-GP consistently delivers fast and accurate GP models, paving the way for scalable kernel learning in Bayesian deep learning applications.
Methodology
The authors propose a new algorithm, SIKA-GP, which utilizes a set of compactly supported basis functions with closed-form expressions to accelerate GP inference. The method leverages the sparse structure of the dyadic kernel basis, allowing for efficient computations and integration into BNN architectures. Theoretical and empirical analyses are conducted to validate the performance improvements over existing GP methods.
Results
The empirical evaluations demonstrate that SIKA-GP achieves significant speedups compared to both dense and sparse GP baselines while preserving predictive accuracy. The method is shown to be effective across various benchmarks in vision and language processing tasks, confirming its applicability in real-world scenarios.
Implications
SIKA-GP has the potential to enhance the scalability and efficiency of Gaussian processes in various applications, particularly in Bayesian deep learning contexts. Its ability to handle large datasets and high-dimensional features makes it a valuable tool for researchers and practitioners looking to implement uncertainty quantification in machine learning models.
On the Role of Inductive Bias in Time-Series Pretraining: A Case Study in Learning Generalizable Representations for Clinical Time Series
Time Series
- Introduces PATHOFM, a transformer model for clinical time-series pretraining.
- Identifies and formalizes three key inductive biases: Local Completion, Temporal Continuity, and Unsupervised In-Context Dynamics.
- Demonstrates that dynamics-centric mixtures of objectives provide balanced transfer across classification and regression tasks.
- Highlights the importance of preserving waveform structure while ensuring generalizability across subjects.
Summary
This paper investigates the impact of inductive bias in pretraining models for clinical time-series data, specifically focusing on pathological gait analysis for spinal cord injury (SCI). The authors introduce PATHOFM, an encoder-centric transformer model pretrained on multivariate gait windows using three complementary objectives: Local Completion (LC), Temporal Continuity (TC), and Unsupervised In-Context Dynamics (uICD). The study highlights the challenges of small, heterogeneous cohorts in clinical settings and the need for representations that can generalize across different tasks, such as classification and regression. By empirically comparing various objective families, the authors find that a dynamics-centric mixture of objectives yields the best transfer performance across tasks while maintaining fidelity in continuous targets. The paper contributes a practical taxonomy of inductive biases and demonstrates how combining local reconstruction with temporal continuity and in-context conditioning can lead to robust subject-generalizing representations.
Methodology
The authors developed PATHOFM, an encoder-only transformer model pretrained on gait windows using three objectives: Local Completion (reconstructing masked spans), Temporal Continuity (predicting future states), and Unsupervised In-Context Dynamics (reconstructing based on subject-specific context). The model incorporates techniques for handling missing values and balancing subject representation during pretraining.
Results
The empirical evaluation showed that the dynamics-centric mixture of objectives outperformed other objective families in terms of transferability across classification and regression tasks, particularly under strict subject holdout conditions. The combination of local reconstruction and temporal continuity, along with in-context conditioning, resulted in robust representations that generalize well across different subjects.
Implications
The findings suggest that careful selection of inductive biases in pretraining can significantly enhance the performance of models on clinical time-series tasks. This approach can be applied to various medical domains where data is limited and heterogeneous, potentially improving diagnostic and predictive capabilities in clinical settings.
Time Series Causal Discovery via Context-Conditioned and Causality-Augmented Pretraining
Time Series
- PTCD introduces a pretraining paradigm for time-series causal discovery, enhancing adaptability to new datasets.
- The framework captures both intra-window and inter-window dependencies through a dual-scale iterative attention mechanism.
- Causal augmentation strategies, including intervention-based tasks and causal mixup, improve generalization and robustness.
- Extensive experiments show PTCD's superior performance in causal discovery and root cause identification compared to existing methods.
Summary
This paper presents PTCD, a novel pretraining framework aimed at enhancing causal discovery from time series data. Traditional methods often struggle with generalization across diverse causal mechanisms and require dataset-specific optimization, limiting their applicability. PTCD addresses these challenges by employing context-conditioned modeling and causal augmentation strategies. The framework utilizes a dual-scale iterative attention mechanism to capture complex temporal dependencies and a Gaussian mixture model to adaptively estimate exogenous variable distributions. Additionally, PTCD incorporates a pretraining paradigm that leverages synthetic datasets, integrating intervention-based learning to break spurious correlations and enhance causal inference. The experimental results demonstrate that PTCD significantly outperforms existing methods in causal discovery and root cause identification across multiple real-world out-of-distribution datasets, showcasing its robustness and generalization capabilities.
Methodology
PTCD employs a hierarchical context-conditioned modeling approach that captures complex temporal causal dependencies through dual-scale iterative attention. It integrates a Gaussian mixture model for context-aware estimation of exogenous variables and utilizes a pretraining strategy that includes intervention-based tasks and causal mixup to enhance the model's generalization capabilities.
Results
The experiments conducted on various real-world out-of-distribution datasets indicate that PTCD consistently outperforms existing causal discovery methods, demonstrating its effectiveness in accurately identifying causal relationships and root causes in time series data.
Implications
The findings suggest that PTCD can be applied in various domains requiring causal analysis from time series data, such as finance, environmental monitoring, and system failure analysis, enabling more robust decision-making and insights.
Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference
Generative Models
Reinforcement Learning
Robotics
- FAV provides a general alignment framework for few-step generative models without restrictive assumptions.
- The method leverages sample-based variational inference to decouple alignment from specific model families.
- FAV achieves state-of-the-art performance in both robotics manipulation and image generation tasks.
- The framework can fine-tune diverse generative models, including GANs and VAEs, across various scales.
Summary
The paper introduces FAV (Few-step Generative Models Alignment via Sample-based Variational Inference), a novel framework designed to align few-step generative models without relying on restrictive assumptions such as tractable likelihoods or specific model families. FAV formulates the alignment process as sampling from a reward-tilted distribution, anchored to a reference distribution. The methodology employs Stein Variational Gradient Descent (SVGD) for sample-based variational inference, amortizing updates into generator parameters through fixed-point regression. The authors evaluate FAV across two domains: robotics manipulation and image generator alignment. In robotics, FAV demonstrates superior performance on 56 offline and 30 offline-to-online reinforcement learning tasks, outperforming existing policy extraction methods. For image generation, FAV effectively fine-tunes various few-step generative models, achieving high-quality outputs across different scales. The results indicate that FAV is agnostic to model architecture and sampling methods, providing a versatile tool for enhancing generative model performance.
Methodology
FAV formulates the alignment problem as sampling from a reward-tilted distribution using Stein Variational Gradient Descent (SVGD). It amortizes the resulting updates into the generator parameters through fixed-point regression, allowing it to work with only sample access to the generator and reference distribution. The method estimates the score of the reference distribution nonparametrically using kernel density estimation.
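The SVGD component can be illustrated in one dimension. The sketch below moves a set of particles toward a reward-tilted target whose log-density is log p_ref(x) + r(x)/β. Several simplifications are assumptions made here: the reference is a standard Gaussian with a known score (the summary says the paper estimates it with kernel density estimation), the reward r(x) = -(x - 2)²/2 is invented for the example, the RBF bandwidth is fixed, and the amortisation of updates into generator parameters is omitted.

```python
import math


def svgd_step(particles, score, step, bandwidth):
    """One Stein Variational Gradient Descent update with an RBF kernel (1-D)."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xi - xj) ** 2) / (2.0 * bandwidth ** 2))
            # Attraction: kernel-weighted score of the target at xj.
            # Repulsion: derivative of k(xj, xi) with respect to xj.
            phi += k * score(xj) + k * (xi - xj) / bandwidth ** 2
        updated.append(xi + step * phi / n)
    return updated


def tilted_score(x, beta=1.0):
    """Score of the reward-tilted target (hypothetical example).

    Reference N(0, 1) has score -x; the reward -(x - 2)^2 / 2 has gradient -(x - 2).
    With beta = 1 the target is Gaussian with mean 1 and variance 1/2.
    """
    return -x - (x - 2.0) / beta


# Deterministic initial particles spread over [-1, 1].
particles = [-1.0 + 2.0 * i / 19 for i in range(20)]
for _ in range(500):
    particles = svgd_step(particles, tilted_score, step=0.1, bandwidth=0.5)

mean = sum(particles) / len(particles)
```

The attraction term pulls particles toward high-reward regions that remain plausible under the reference, and the repulsion term keeps them from collapsing onto a single mode. Because only samples and scores are needed, no tractable likelihood of the generator is required.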
Results
FAV outperformed existing policy extraction paradigms in robotics manipulation across 56 offline and 30 offline-to-online reinforcement learning tasks. In image generation, FAV successfully fine-tuned various few-step generative models, achieving high aesthetic rewards and improving output quality at both ImageNet-256 and 1024² resolution scales.
Implications
The FAV framework has the potential to enhance the performance of few-step generative models in various applications, including robotics and high-resolution image synthesis. Its agnostic nature to model architecture allows for broader applicability in generative modeling tasks.
GEM: Geometric Entropy Mixing for Optimal LLM Data Curation
Large Language Models
Optimization
Interpretability
- GEM reformulates data curation as a variational problem on a hypersphere, addressing limitations of traditional methods.
- The framework includes a mixing-balance regularizer to prevent cluster collapse under embedding anisotropy.
- A provable MM-based inference algorithm ensures stable convergence for the regularized objective.
- GEM achieves linear-time complexity for web-scale deployment through teacher-student distillation.
Summary
The paper introduces GEM (Geometric Entropy Mixing), a novel framework for data curation in Large Language Models (LLMs) that addresses the limitations of existing categorization methods. Traditional taxonomy-based approaches suffer from ontological misalignment, while Euclidean clustering fails to account for the anisotropic nature of neural embeddings, leading to cluster collapse. GEM reformulates data curation as a variational problem on a hypersphere, incorporating a mixing-balance regularizer to maintain semantic balance. The authors develop a provable Minorize-Maximize (MM) algorithm for optimization, enabling the discovery of fine-grained semantic structures that are often overlooked by conventional methods. The framework is designed for scalability through a teacher-student distillation approach, allowing it to handle web-scale corpora efficiently. Additionally, the introduction of the Geometric Influence Score (GIS) facilitates the generation of interpretable taxonomies. Experimental results with 1.1B-parameter models show that GEM outperforms existing mixing strategies, achieving up to 1.2% improvement in downstream accuracy and establishing a robust coordinate system for data mixing.
Methodology
GEM employs a hyperspherical variational framework with a mixing-balance regularizer to optimize data curation. It utilizes a Minorize-Maximize (MM) algorithm for stable convergence and incorporates teacher-student distillation for scalability. The Geometric Influence Score (GIS) is introduced for generating interpretable taxonomies.
Results
In experiments with 1.1B-parameter models, GEM set a new state of the art, improving average downstream accuracy by up to 1.2% over existing mixing strategies such as DoReMi and RegMix.
Implications
GEM's approach to data curation can significantly enhance the performance of LLMs by providing a more effective method for mixing heterogeneous data sources, potentially leading to better generalization and stability in model training.
A PAC-Bayesian View of Generalisation for Physics-Informed Machine Learning
Theory
- Introduces a PAC-Bayesian framework for PIML that provides generalisation guarantees in regression settings with unbounded losses.
- Develops a multi-task approach that jointly treats data fidelity and physical constraints, leading to tighter generalisation bounds.
- Establishes a direct connection between physical regularity and statistical performance through input-gradient dependent complexity terms.
- Proposes a self-bounding-aware learning algorithm that optimises derived bounds and estimates constants in practical settings.
Summary
This paper presents a novel PAC-Bayesian framework for understanding the generalisation properties of physics-informed machine learning (PIML), particularly in regression settings with unbounded losses. PIML integrates mechanistic knowledge, typically represented as partial differential equations (PDEs), into data-driven models, but its statistical generalisation capabilities remain poorly understood. The authors propose a multi-task perspective that jointly considers data fidelity, PDE residuals, and boundary conditions, avoiding the limitations of traditional union-bound approaches. By leveraging the structure of physics-informed objectives, they derive new generalisation bounds that scale with input-gradient norms of the losses, establishing a link between physical regularity and generalisation. The framework is instantiated under Sobolev and Poincaré-type assumptions, leading to two classes of bounds that balance statistical complexity and smoothness. Additionally, the authors introduce a self-bounding-aware learning algorithm that optimises surrogates of the derived bounds and provide a practical method for estimating constants in real-world scenarios. Empirical evaluations on standard PDE benchmarks demonstrate that their bounds are tighter than existing union-bound baselines and can be effectively minimised during training, thus offering a principled statistical foundation for the generalisation of physics-informed models.
Methodology
The authors develop a PAC-Bayesian framework tailored for PIML, focusing on regression with unbounded losses. They adopt a multi-task perspective to jointly analyze data fidelity, PDE residuals, and boundary conditions. The framework derives new generalisation bounds based on Sobolev and Poincaré-type assumptions, and introduces a learning algorithm that optimises these bounds during training.
Results
The proposed PAC-Bayesian bounds are shown to be significantly tighter than traditional union-bound baselines. Empirical evaluations on PDE benchmarks confirm that the bounds are non-vacuous and can be effectively minimised, demonstrating the practical applicability of the theoretical framework.
Implications
This work provides a robust statistical foundation for PIML, enhancing the understanding of how physical constraints influence model generalisation. The findings can improve the design of machine learning models that incorporate physical knowledge, potentially benefiting applications in scientific simulation, hybrid modeling, and other areas where physical laws are critical.
CAFD: Concept-Aware DNN Fault Detection using VLMs
Computer Vision
NLP
Efficient ML
- Introduction of Concept-Aware Fault Detection (CAFD) for DNNs.
- Novel Concept Failure Ratio (CFR) metric derived from Vision-Language Models (VLMs).
- CAFD integrates model-based, distance-based, and concept-based features for improved fault detection.
- Empirical evaluations show CAFD outperforms existing methods, especially under budget constraints.
Summary
The paper presents Concept-Aware Fault Detection (CAFD), a novel approach for fault detection in Deep Neural Networks (DNNs) that integrates multiple information sources while maintaining computational efficiency. CAFD utilizes a unique feature called the Concept Failure Ratio (CFR), which quantifies the likelihood of DNN failures based on semantic characteristics extracted from images using Vision-Language Models (VLMs). The approach combines model-based signals, distance-based features, and the CFR to enhance fault detection performance. The authors conducted extensive empirical evaluations across various datasets, including ImageNet, and demonstrated that CAFD consistently outperforms five state-of-the-art baseline methods in terms of Fault Detection Rate (FDR), particularly under constrained selection budgets. The results indicate that the incorporation of CFR provides valuable semantic insights that significantly improve fault detection capabilities in DNNs.
Methodology
CAFD employs a learning-based approach that combines multiple feature types, including model outputs, uncertainty metrics, distance-based features, and the novel CFR derived from VLMs. The CFR captures semantic characteristics of inputs by measuring the likelihood of DNN failures associated with specific concepts. Logistic Regression was selected as the learning model due to its strong performance and computational efficiency.
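One plausible reading of a concept-level failure ratio is an empirical frequency: for each concept a VLM assigns to validation images, the share of those images the DNN misclassified. The sketch below follows that reading only. The exact CFR definition is not given in this summary, so the function names, the per-input averaging, and the toy concept tags are all assumptions rather than the authors' method.

```python
from collections import defaultdict


def concept_failure_ratios(samples):
    """Estimate a failure ratio per concept (hypothetical definition).

    samples : iterable of (concepts, failed) pairs, where `concepts` is a
              collection of concept strings and `failed` is True when the
              DNN mispredicted that validation input.
    """
    seen = defaultdict(int)
    failures = defaultdict(int)
    for concepts, failed in samples:
        for concept in set(concepts):
            seen[concept] += 1
            if failed:
                failures[concept] += 1
    return {concept: failures[concept] / seen[concept] for concept in seen}


def input_cfr_feature(concepts, ratios, default=0.0):
    """Average the ratios of an input's known concepts into one feature."""
    known = [ratios[c] for c in set(concepts) if c in ratios]
    return sum(known) / len(known) if known else default


validation = [
    ({"fog", "night"}, True),
    ({"fog"}, False),
    ({"night"}, True),
    ({"day"}, False),
]
ratios = concept_failure_ratios(validation)
feature = input_cfr_feature({"fog", "night"}, ratios)
```

A scalar like `feature` could then sit alongside model-output and distance-based features as input to a lightweight classifier such as the logistic regression the summary mentions.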
Results
CAFD demonstrated superior performance in fault detection, achieving an average FDR improvement of 18.3% across all datasets and selection budgets compared to five state-of-the-art baselines. The method was particularly effective under constrained budgets, indicating its practical applicability in real-world scenarios.
Implications
The findings suggest that integrating semantic information from VLMs can significantly enhance fault detection in DNNs, making CAFD a promising approach for ensuring the reliability and robustness of DNNs in safety-critical applications. This could lead to more efficient testing processes and improved deployment of DNNs in various domains.
Explainable Retinal Imaging for Prediction of Multi-Organ Dysfunction in Type 2 Diabetes
Interpretability
- Developed a framework for predicting multi-organ dysfunction in T2DM using routine clinical biomarkers.
- Gradient boosting model outperformed traditional logistic regression in predicting multi-system dysregulation.
- SHAP analysis provided insights into the contributions of various biomarkers to multi-system risk.
- The study highlights the importance of capturing complex interactions among biomarkers for better clinical outcomes.
Summary
This study addresses the systemic nature of Type 2 diabetes mellitus (T2DM), which is characterized by dysfunction across multiple organ systems. Traditional clinical assessments often fail to capture this complexity, leading to inadequate risk stratification. The authors conducted a retrospective analysis of 1,195 patients, utilizing routinely collected laboratory biomarkers to construct system-level abnormality indices that quantify organ-specific dysfunction. They employed supervised machine learning models, including logistic regression, random forest, and gradient boosting, to predict multi-system dysregulation. The gradient boosting model achieved near-perfect discrimination (AUC = 1.000), significantly outperforming logistic regression (AUC = 0.925). The study utilized SHapley Additive exPlanations (SHAP) for model interpretability, revealing that hyperglycaemia, renal impairment, dyslipidaemia, and inflammation were the primary drivers of multi-system risk. The findings support the biological plausibility of the model predictions through dose-response relationships observed in partial dependence analyses. This research presents an interpretable, data-driven framework for quantifying systemic disease burden in T2DM, linking routine biomarkers to multi-organ dysfunction, and offering insights for improved risk stratification and precision medicine in diabetes care.
Methodology
The study utilized a retrospective design analyzing data from 1,195 patients. System-level abnormality indices were constructed to quantify organ-specific dysfunction. Supervised machine learning models, including logistic regression, random forest, and gradient boosting, were trained to predict multi-system dysregulation. SHAP was employed for model interpretability to elucidate the contributions of individual biomarkers.
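The summary does not say how the system-level abnormality indices were computed. As a rough illustration only, one simple way to build such an index is the fraction of a system's biomarkers that fall outside a reference range. The biomarker names, the ranges, and the `min_systems` threshold below are hypothetical placeholders, not values from the study and not clinical guidance.

```python
# Hypothetical reference ranges (low, high) per organ system.
# Placeholder values for illustration; not taken from the study.
REFERENCE_RANGES = {
    "renal": {"creatinine_mg_dl": (0.6, 1.3), "urea_mg_dl": (7.0, 20.0)},
    "lipid": {"ldl_mg_dl": (0.0, 100.0), "triglycerides_mg_dl": (0.0, 150.0)},
    "glycaemic": {"hba1c_pct": (4.0, 5.6)},
}


def abnormality_indices(patient):
    """Return, per system, the fraction of measured biomarkers out of range."""
    indices = {}
    for system, ranges in REFERENCE_RANGES.items():
        # Only score biomarkers that were actually measured for this patient.
        measured = [(name, lo, hi) for name, (lo, hi) in ranges.items() if name in patient]
        if not measured:
            continue
        abnormal = sum(1 for name, lo, hi in measured if not lo <= patient[name] <= hi)
        indices[system] = abnormal / len(measured)
    return indices


def multi_system_label(indices, min_systems=2):
    """Binary target: 1 if at least `min_systems` systems show any abnormality."""
    affected = sum(1 for value in indices.values() if value > 0)
    return int(affected >= min_systems)


patient = {
    "creatinine_mg_dl": 2.0,
    "urea_mg_dl": 15.0,
    "ldl_mg_dl": 160.0,
    "triglycerides_mg_dl": 200.0,
    "hba1c_pct": 5.0,
}
indices = abnormality_indices(patient)
label = multi_system_label(indices)
```

A target built this way is a deterministic function of the input biomarkers, which is worth keeping in mind when reading the discrimination scores of models trained on those same biomarkers.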
Results
The gradient boosting model achieved an area under the curve (AUC) of 1.000, indicating perfect discrimination, while logistic regression achieved an AUC of 0.925. Feature attribution analysis identified hyperglycaemia, renal impairment, dyslipidaemia, and inflammation as key contributors to multi-system risk. Partial dependence analyses supported the biological relevance of these findings.
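To make the reported AUC values concrete, the sketch below computes AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions and are unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) AUC: P(score_pos > score_neg), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: one positive (0.35) is ranked below one negative (0.4).
imperfect = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Every positive outranks every negative, so the AUC is exactly 1.0.
perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.7, 0.9])
```

Under this definition, an AUC of 1.000 means that no negative case in the evaluation set was scored at or above any positive case.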
Implications
This research provides a clinically meaningful and transparent approach for early identification of high-risk T2DM patients, potentially leading to improved risk stratification and personalized management strategies in diabetes care.
Learning Latent Dynamical Causal Processes for Single-Cell Perturbation Prediction
Generative Models
Theory
Time Series
- Proposes a unified framework that captures both latent causal mechanisms and temporal dynamics in single-cell perturbation responses.
- Introduces a theoretical analysis ensuring the identifiability of latent causal variables in the proposed model.
- Develops CITE-VAE, a framework that learns causally meaningful latent dynamics for principled generalization.
- Empirical results validate the effectiveness of the proposed method, outperforming existing approaches on benchmark datasets.
Summary
This paper addresses the challenge of predicting single-cell responses to perturbations, focusing on achieving out-of-distribution (OOD) generalization. The authors identify that existing methods either treat responses as static or fail to capture the underlying causal mechanisms of gene expression evolution over time. To overcome these limitations, they propose a latent dynamical causal generative model that integrates both latent cellular programs and their temporal evolution. The model is theoretically analyzed to ensure that latent causal variables can be uniquely recovered under specific conditions. The authors introduce CITE-VAE, a causality-inspired temporal Variational AutoEncoder framework, which learns causally meaningful latent dynamics. Empirical evaluations demonstrate that the proposed method outperforms state-of-the-art baselines on both synthetic and real-world CRISPR-based single-cell sequencing datasets, showcasing improved generalization capabilities for unseen perturbations.
Methodology
The authors propose a latent dynamical causal generative model that captures the interplay between latent cellular programs and their temporal evolution. They conduct a theoretical analysis to establish the identifiability of latent causal variables and develop a learning framework, CITE-VAE, to recover these variables from single-cell sequencing data.
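The summary gives no equations for the model, so the following is only a generic toy picture of what "latent programs evolving over time under a perturbation" can mean, not CITE-VAE itself. It assumes linear latent dynamics, an additive perturbation effect, and a linear map from latent programs to gene expression. All names (`simulate_latent_trajectory`, `decode`, `transition`, `loadings`) are hypothetical.

```python
import random


def simulate_latent_trajectory(z0, transition, perturbation_effect, steps,
                               noise_std=0.0, seed=0):
    """Roll out z_t = A z_{t-1} + b + noise for a fixed number of steps.

    `transition` (A) encodes which latent program drives which;
    `perturbation_effect` (b) is an assumed additive shift per program.
    """
    rng = random.Random(seed)
    z = list(z0)
    trajectory = [list(z)]
    for _ in range(steps):
        # The comprehension reads the previous state z before it is rebound.
        z = [
            sum(transition[i][j] * z[j] for j in range(len(z)))
            + perturbation_effect[i]
            + rng.gauss(0.0, noise_std)
            for i in range(len(z))
        ]
        trajectory.append(list(z))
    return trajectory


def decode(z, loadings):
    """Map latent programs to observed expression with a linear readout."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in loadings]


# Program 0 decays and feeds program 1 (lower-triangular transition matrix).
transition = [[0.5, 0.0],
              [0.5, 0.5]]
loadings = [[1.0, 0.0],   # gene 0 reads program 0
            [1.0, 1.0],   # gene 1 reads both programs
            [0.0, 2.0]]   # gene 2 reads program 1

control = simulate_latent_trajectory([1.0, 0.0], transition, [0.0, 0.0], steps=2)
expression = decode(control[-1], loadings)
```

In a learned model, the transition structure and the readout would be inferred from sequencing data rather than fixed by hand, and identifiability results concern when that inference recovers the true latent variables.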
Results
The proposed method was validated through experiments on synthetic data, confirming the theoretical results. Additionally, it demonstrated significant improvements in generalization to unseen perturbations on real-world CRISPR-based single-cell sequencing datasets compared to state-of-the-art baselines.
Implications
The findings suggest that integrating latent causal mechanisms with temporal dynamics can enhance predictive modeling in single-cell genomics, potentially guiding experimental design in functional genomics and drug discovery.