AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
68
Papers today
8h
Update frequency
7
Days of history
Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation
Generative Models
Graph Learning
Optimization
- GEM closes the fidelity gap between discrete energy-based models and discrete diffusion models for graph generation.
- The framework incorporates a transport-aligned discrete proposal for efficient sampling and exploration.
- GEM enables compositional constraints and property-based objectives during inference without retraining.
- The model achieves high-quality molecular graph generation, matching or exceeding existing state-of-the-art methods.
Summary
This paper introduces Graph Energy Matching (GEM), a novel generative framework for graph generation that addresses the limitations of traditional discrete energy-based models (EBMs) in terms of sampling efficiency and fidelity. GEM leverages a transport map optimization perspective, specifically the Jordan–Kinderlehrer–Otto (JKO) scheme, to learn a permutation-invariant potential energy function. This function guides the sampling process from noise to high-probability graph regions while refining samples in areas of high data likelihood. The authors propose a two-phase sampling protocol that first rapidly transports samples towards high-probability regions and then explores the learned graph distribution through local edits. The results demonstrate that GEM achieves or surpasses the performance of leading discrete diffusion models on molecular graph benchmarks, showcasing its ability to generate high-quality samples. Additionally, GEM's explicit modeling of relative likelihood allows for targeted exploration during inference, enabling compositional generation and property-constrained sampling. Overall, GEM represents a significant advancement in the field of graph generation, providing a robust framework for integrating domain-specific constraints and enhancing the quality of generated graphs.
Methodology
GEM employs a transport map optimization approach based on the JKO scheme to define a potential energy function that guides sampling. The methodology consists of a two-phase sampling protocol: an initial rapid transport phase towards high-probability graphs followed by a mixing phase for local exploration of the graph distribution. This approach allows for efficient sampling and the incorporation of constraints.
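The two-phase protocol can be illustrated with a toy discrete sampler. Everything here is a stand-in: the bitstring state, the hand-written `energy` (Hamming distance to a fixed pattern), and the step counts are illustrative only, whereas the paper's actual potential is a learned, permutation-invariant network over graphs.

```python
import math
import random

# Toy stand-in for a learned graph potential: Hamming distance to a target.
TARGET = (1, 0, 1, 1, 0, 1, 0, 0)

def energy(x):
    return sum(a != b for a, b in zip(x, TARGET))

def two_phase_sample(n_bits=8, transport_steps=40, mix_steps=40, seed=0):
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_bits)]
    # Phase 1: rapid transport -- greedy single-bit edits that lower the
    # energy, moving the sample toward a high-probability region.
    for _ in range(transport_steps):
        i = rng.randrange(n_bits)
        y = x[:]; y[i] ^= 1
        if energy(y) < energy(x):
            x = y
    # Phase 2: mixing -- Metropolis-accepted local edits that explore the
    # distribution around the mode instead of collapsing onto it.
    for _ in range(mix_steps):
        i = rng.randrange(n_bits)
        y = x[:]; y[i] ^= 1
        if rng.random() < math.exp(min(0.0, energy(x) - energy(y))):
            x = y
    return x
```

The same split applies to graphs with edge-level edits in place of bit flips: phase 1 supplies fast transport toward high-likelihood regions, phase 2 supplies the local exploration.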
Results
GEM demonstrates superior performance on molecular graph benchmarks, matching or exceeding the quality of leading discrete diffusion models. The explicit modeling of relative likelihood facilitates targeted exploration and compositional generation, enhancing the practical applicability of the model.
Implications
The advancements presented in GEM have significant implications for various applications, including drug discovery and materials design, where high-quality graph generation is crucial. The ability to incorporate constraints and facilitate targeted sampling opens new avenues for research and practical implementations in generative modeling of structured data.
Central Dogma Transformer III: Interpretable AI Across DNA, RNA, and Protein
Interpretability
Multimodal
- CDT-III extends mechanism-oriented AI to encompass the entire central dogma, improving interpretability and prediction accuracy.
- The architecture's two-stage design allows for distinct modeling of transcription and translation processes.
- Joint prediction of RNA and protein changes leads to improved performance and interpretability.
- The model can predict clinical side effects and generate hypotheses from perturbation data alone, without clinical data.
Summary
The paper introduces the Central Dogma Transformer III (CDT-III), a novel AI architecture designed to bridge the interpretability gap in biological AI models that predict cellular responses. Traditional models often fail to connect predictions with the underlying molecular processes, particularly between DNA, RNA, and protein. CDT-III addresses this by mirroring the central dogma's information flow, employing a two-stage architecture that includes a Virtual Cell Embedder for the nucleus (VCE-N) and one for the cytosol (VCE-C). This design allows for interpretable attention maps at each layer and enables the joint prediction of mRNA and surface protein changes resulting from CRISPRi perturbations. The model demonstrates significant performance improvements, achieving high correlation coefficients for RNA (r = 0.843) and protein (r = 0.969) predictions. Notably, the inclusion of protein prediction enhances RNA performance and improves DNA-level interpretability. The model also successfully predicts protein changes and clinical side effects from in silico experiments without requiring clinical data, showcasing its potential for generating actionable insights in drug development and pharmacology.
Methodology
CDT-III employs a two-stage architecture consisting of a Virtual Cell Embedder for the nucleus (VCE-N) and one for the cytosol (VCE-C). This structure allows the model to process transcription and translation as separate yet interconnected modules, utilizing attention mechanisms that correspond to specific biological processes. The model predicts changes in RNA and protein levels based on CRISPRi perturbations, leveraging pre-computed embeddings and a differentiable architecture that reflects the central dogma's information flow.
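The two-stage composition can be sketched with plain linear maps standing in for the VCE-N and VCE-C attention modules; the function names and matrix shapes here are hypothetical, and the real modules are attention-based networks, not matrices.

```python
def matvec(W, x):
    # Plain matrix-vector product over nested lists.
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def two_stage_predict(perturbation, W_nucleus, W_cytosol):
    """Minimal stand-in for the two-stage design: a 'nucleus' module maps a
    perturbation vector to RNA changes (transcription), and a 'cytosol'
    module maps those RNA changes to protein changes (translation).
    Returning both outputs lets a loss supervise RNA and protein jointly."""
    rna = matvec(W_nucleus, perturbation)
    protein = matvec(W_cytosol, rna)
    return rna, protein
```

The key structural point survives the simplification: protein predictions are forced to flow through the RNA representation, mirroring the central dogma's information flow.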
Results
CDT-III achieved a correlation coefficient of r = 0.843 for RNA predictions and r = 0.969 for protein predictions across five held-out genes. The addition of protein prediction improved RNA performance from r = 0.804 to r = 0.843 and enhanced DNA-level interpretability by increasing CTCF enrichment by 30%. In an in silico CD52 knockdown experiment, the model accurately predicted 29 out of 29 protein changes in the correct direction and rediscovered 5 out of 7 known clinical side effects.
Implications
The findings suggest that AI models designed to align with biological structures can enhance our understanding of cellular mechanisms and improve the prediction of drug side effects. This approach could facilitate more effective drug development processes by providing insights based on perturbation data alone, potentially reducing the need for extensive clinical trials.
Scaling Attention via Feature Sparsity
NLP
Large Language Models
Efficient ML
- Introduces Sparse Feature Attention (SFA) to reduce self-attention costs by leveraging feature sparsity.
- FlashSFA kernel enhances efficiency by avoiding the materialization of dense score matrices.
- Achieves up to 2.5× speedup and nearly 50% reduction in computational resources compared to dense attention.
- Maintains accuracy and robustness in long-context scenarios, outperforming short-embedding baselines.
Summary
This paper addresses the challenge of scaling Transformers to ultra-long contexts, which is limited by the quadratic cost of self-attention. Existing methods typically reduce this cost along the sequence axis but often at the expense of accuracy. The authors propose a novel approach called Sparse Feature Attention (SFA), which leverages feature sparsity instead of token sparsity. SFA represents queries and keys as k-sparse codes, significantly reducing the computational cost from Θ(n²d) to Θ(n²k²/d). To enhance efficiency, they introduce FlashSFA, an IO-aware kernel that operates directly on sparse overlaps without creating dense score matrices. Experimental results demonstrate that SFA matches the performance of dense attention baselines while achieving up to 2.5× speed improvements and nearly 50% reductions in FLOPs and KV-cache usage. The method maintains retrieval accuracy and robustness in long contexts, outperforming traditional short-embedding approaches. This work highlights feature-level sparsity as a promising avenue for efficient attention mechanisms, enabling Transformers to handle much longer contexts with minimal quality loss.
Methodology
The authors developed Sparse Feature Attention (SFA) by representing queries and keys as k-sparse codes, which allows for reduced computational costs while preserving high-dimensional expressivity. They also introduced FlashSFA, an IO-aware kernel that computes attention scores directly from sparse overlaps, avoiding the need for dense score matrices.
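The core idea of scoring via sparse overlap can be sketched with dictionary-coded vectors. This naive version materializes the full score matrix, which is exactly what the paper's FlashSFA kernel avoids; it only illustrates why k-sparse codes make each score cheap.

```python
def sparse_score(q, k):
    """Dot product of two k-sparse vectors stored as {index: value} dicts.
    Only overlapping nonzero indices contribute, so the per-pair cost scales
    with the sparsity k rather than the full feature dimension d."""
    if len(q) > len(k):
        q, k = k, q          # iterate over the smaller support
    return sum(v * k[i] for i, v in q.items() if i in k)

def sparse_attention_scores(queries, keys):
    # Naive n x n score matrix from sparse codes; an IO-aware kernel would
    # fuse this with the softmax and value aggregation instead.
    return [[sparse_score(q, k) for k in keys] for q in queries]
```

Pairs with disjoint supports contribute a score of zero at negligible cost, which is where the FLOP savings come from.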
Results
SFA matches the performance of dense attention baselines in terms of perplexity and downstream accuracy while achieving significant speed improvements (up to 2.5×) and reductions in computational resources (nearly 50% fewer FLOPs and KV-cache). The method also demonstrates consistent retrieval accuracy across long contexts.
Implications
This research opens up new possibilities for scaling Transformers to handle longer contexts efficiently, which could enhance various applications in natural language processing and other domains requiring extensive context handling.
Computationally lightweight classifiers with frequentist bounds on predictions
Theory
Efficient ML
- Introduction of a computationally efficient classifier based on the Nadaraya-Watson estimator.
- Derivation of frequentist uncertainty bounds for predicted class probabilities.
- Achieves competitive accuracy (>96%) with linear and sublinear computational complexity.
- Validated on synthetic and real-world medical data, highlighting its practical applicability.
Summary
This paper addresses the challenge of providing uncertainty bounds in classification tasks while maintaining computational efficiency. Traditional classifiers, including classical and neural network approaches, often achieve high accuracy but lack reliable uncertainty quantification, which is critical for safety-critical applications. The authors propose a novel classification algorithm based on the Nadaraya-Watson estimator, reformulated for classification purposes. They derive frequentist uncertainty bounds on the predicted class probabilities, ensuring that the predictions are not only accurate but also accompanied by actionable confidence measures. The proposed method achieves competitive accuracy (>96%) with variants that scale linearly (O(n)) or sublinearly (O(log n)) with the size of the training set. The classifier is validated on both synthetic datasets and real-world electrocardiographic data, demonstrating its effectiveness in providing uncertainty bounds at a fraction of the computational cost of existing methods.
Methodology
The authors reformulate the Nadaraya-Watson estimator as a classifier and derive frequentist bounds on the estimated class probabilities. They implement computationally efficient variations of the naive approach to enhance performance, achieving linear to sublinear complexity. The methodology includes theoretical derivations and empirical evaluations on synthetic and real datasets.
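The point estimator at the heart of the method can be sketched in a few lines for 1-D inputs. This is only the Nadaraya-Watson class-probability estimate; the paper's contribution, the frequentist bounds around it and the efficient O(n)/O(log n) variants, is not reproduced here.

```python
import math

def nw_class_probs(x, train_x, train_y, n_classes, bandwidth=1.0):
    """Nadaraya-Watson estimate of class probabilities at point x:
    a Gaussian-kernel-weighted average of one-hot training labels."""
    weights = [math.exp(-((x - xi) ** 2) / (2 * bandwidth ** 2))
               for xi in train_x]
    total = sum(weights)
    probs = [0.0] * n_classes
    for w, yi in zip(weights, train_y):
        probs[yi] += w / total   # each neighbor votes with its kernel weight
    return probs
```

Because the estimate is a ratio of kernel sums over the training set, concentration inequalities on those sums are what yield the frequentist bounds on each predicted probability.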
Results
The proposed classifier demonstrates competitive accuracy exceeding 96% while maintaining computational efficiency with operations scaling as O(n) and O(log n). The frequentist uncertainty bounds provide actionable insights into prediction confidence, making the method suitable for real-time applications.
Implications
This work has significant implications for the deployment of machine learning classifiers in safety-critical environments, such as medical diagnostics and monitoring systems, where both accuracy and uncertainty quantification are essential. The computational efficiency allows for real-time processing in resource-constrained settings.
A Learning Method with Gap-Aware Generation for Heterogeneous DAG Scheduling
Reinforcement Learning
Optimization
Theory
- Introduction of WeCAN, a reinforcement learning framework for heterogeneous DAG scheduling.
- Development of a two-stage single-pass design for efficient schedule generation.
- Order-space analysis revealing generation-induced optimality gaps and conditions for their elimination.
- Skip-extended realization to enhance scheduling efficiency while preserving single-pass capabilities.
Summary
This paper presents WeCAN, a novel reinforcement learning framework designed for efficient scheduling of directed acyclic graphs (DAGs) in heterogeneous environments. The challenge of scheduling in such contexts arises from varying resource capacities and task dependencies, which necessitate adaptability and rapid schedule generation. WeCAN employs a two-stage single-pass design that first generates task-pool scores and global parameters, followed by a generation map that constructs schedules without repeated network calls. The framework incorporates a weighted cross-attention encoder to model task-pool interactions, ensuring it remains size-agnostic to fluctuations in the environment. A significant contribution of this work is the introduction of an order-space analysis that characterizes the reachable set of generation maps, elucidating the mechanisms behind generation-induced optimality gaps. The authors propose sufficient conditions for eliminating these gaps and develop a skip-extended realization with a parameterized decreasing skip rule to enhance the reachable order set while maintaining efficiency. Experimental results demonstrate that WeCAN outperforms strong baseline methods in terms of makespan while achieving inference times comparable to classical heuristics and faster than multi-round neural schedulers.
Methodology
The methodology involves a two-stage reinforcement learning framework where a weighted cross-attention encoder is used to model task-pool interactions. The framework incorporates an order-space analysis to identify and address generation-induced optimality gaps, leading to the design of a skip-extended realization that enhances scheduling efficiency.
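The single-pass idea, scores computed once, then a generation map that builds the schedule without further network calls, can be sketched with a greedy list scheduler. The scoring network, skip rule, and cross-attention encoder are all omitted; the scores below are assumed to be given.

```python
def single_pass_schedule(deps, scores, durations, n_machines):
    """Greedy list scheduling from precomputed task scores: one pass, no
    repeated policy calls. deps[t] is the set of predecessors of task t,
    scores[t] its priority (higher runs first), durations[t] its run time."""
    n = len(scores)
    finish = {}                          # task -> finish time
    machine_free = [0.0] * n_machines    # next free time per machine
    done, order = set(), []
    while len(done) < n:
        ready = [t for t in range(n) if t not in done and deps[t] <= done]
        t = max(ready, key=lambda u: scores[u])      # highest score first
        m = min(range(n_machines), key=lambda k: machine_free[k])
        start = max(machine_free[m],
                    max((finish[p] for p in deps[t]), default=0.0))
        finish[t] = start + durations[t]
        machine_free[m] = finish[t]
        done.add(t); order.append(t)
    return order, max(finish.values())   # dispatch order and makespan
```

A fixed greedy realization like this is exactly where generation-induced optimality gaps arise: some orderings are unreachable. The paper's skip-extended realization widens the reachable order set while keeping this single-pass structure.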
Results
WeCAN demonstrated improved makespan in scheduling tasks compared to strong baseline methods, with inference times that are competitive with classical heuristics and significantly faster than multi-round neural schedulers.
Implications
The findings suggest that WeCAN can be effectively applied in various domains requiring efficient task scheduling, such as data centers and cloud computing environments, where heterogeneous resource management is crucial.
ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention
Large Language Models
Efficient ML
NLP
- ROM is the first method to treat overthinking mitigation as a streaming prediction-and-control problem.
- It utilizes a lightweight detection head for real-time monitoring of token generation.
- The introduction of Counterfactual Self-Correction (CSC) enhances token-level supervision.
- ROM significantly reduces response length by 47.2% while improving response efficiency.
Summary
The paper introduces ROM, a novel framework aimed at mitigating overthinking in Large Reasoning Models (LRMs) during real-time inference. Overthinking occurs when LRMs continue generating reasoning steps even after reaching the correct answer, leading to increased latency and computational costs. Existing methods for addressing this issue are either training-intensive or rely on heuristic approaches that do not effectively capture overthinking patterns. ROM addresses these limitations by formulating overthinking mitigation as a streaming prediction-and-control problem. It employs a lightweight detection head attached to the late-layer hidden states of a frozen LLM backbone, which monitors token generation in real time and triggers an early transition to the final answer upon detecting overthinking. The authors also introduce a novel data augmentation strategy called Counterfactual Self-Correction (CSC) to enhance token-level supervision by synthesizing balanced wrong-to-correct trajectories. The results demonstrate that ROM outperforms existing methods across seven benchmarks, achieving the highest accuracy (93.51%), the shortest response length (1,159 tokens), and improved efficiency (121% increase). This work highlights the potential of streaming detection as an effective approach for real-time overthinking mitigation in LRMs.
Methodology
ROM attaches a lightweight detection head to the late-layer hidden states of a frozen LLM backbone, which monitors the generation of tokens in real time. It outputs an overthinking score at each step, allowing for early intervention when overthinking is detected. The method incorporates token-level supervision based on correctness boundaries and employs a data augmentation strategy (CSC) to address biases in distilled reasoning data.
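The streaming control loop can be sketched as follows. The scoring callable stands in for the detection head over late-layer hidden states, and the threshold/patience values are illustrative; the real system emits an answer-transition token rather than a placeholder string.

```python
def generate_with_early_exit(token_stream, overthink_score,
                             threshold=0.8, patience=3):
    """Streaming overthinking control: score each generated token and force
    the transition to the final answer once the overthinking score stays
    above `threshold` for `patience` consecutive steps."""
    out, streak = [], 0
    for tok in token_stream:
        out.append(tok)
        streak = streak + 1 if overthink_score(tok) > threshold else 0
        if streak >= patience:
            out.append("<final_answer>")   # early transition to the answer
            break
    return out
```

Requiring a run of high scores rather than a single spike is a simple way to trade a little latency for robustness against noisy per-token predictions.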
Results
Across seven benchmarks, ROM achieved an accuracy of 93.51%, a response length of 1,159 tokens, and improved efficiency by 121%. It reduced response length by 47.2% compared to the vanilla baseline, demonstrating its effectiveness in mitigating overthinking.
Implications
The findings suggest that ROM could enhance the efficiency and reliability of LRMs in various applications, particularly in real-time decision-making scenarios where timely responses are critical. The approach could be adapted for other models and tasks requiring efficient reasoning.
Detection of adversarial intent in Human-AI teams using LLMs
Large Language Models
NLP
Reinforcement Learning
- LLMs can act as defensive supervisors in human-AI teams, detecting adversarial intent from behavioral patterns.
- The study utilizes a dataset from a trivia game to analyze multi-party interactions involving a malicious AI.
- LLMs demonstrated the ability to identify malicious behavior in real-time without task-specific knowledge.
- The research suggests that LLMs can enhance the robustness of human-AI teams against adversarial attacks.
Summary
This paper explores the role of large language models (LLMs) as defensive supervisors in human-AI teams, specifically focusing on their ability to detect adversarial intent through non-verbal behavioral patterns. The authors highlight the vulnerabilities of LLMs when deployed in collaborative settings, where malicious actors can exploit these systems to manipulate human decision-making. By utilizing a dataset of multi-party interactions in a trivia game involving three human agents and a malicious AI, the authors demonstrate that LLMs can effectively identify malicious behavior in real-time without needing task-specific information. This capability suggests that LLMs can serve as task-agnostic defenders, enhancing the robustness of human-AI teams against various attack vectors. The findings indicate that LLMs can recognize strategic deception patterns from behavioral traces, paving the way for safer human-AI collaboration.
Methodology
The authors formulated the problem of malicious behavior detection using a dataset of interactions from a 25-round trivia game. They developed a behavioral detection pipeline where LLMs perform binary classification on serialized interaction data to identify adversarial intent. The approach is designed to be task-agnostic, relying solely on behavioral signals without any prior knowledge of the specific task.
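The serialization step of such a pipeline might look like the sketch below. The record fields and prompt wording are hypothetical, not the paper's format; the point is that only behavioral signals (who proposed what, and the outcome) are exposed to the classifier, with no task-specific knowledge.

```python
def serialize_rounds(rounds):
    """Turn behavioral traces into text an LLM classifier can judge.
    Each round records which agent proposed which answer and whether the
    team accepted it and whether it turned out correct."""
    lines = []
    for r in rounds:
        lines.append(
            f"Round {r['round']}: {r['agent']} proposed '{r['answer']}' "
            f"(team accepted: {r['accepted']}, correct: {r['correct']})"
        )
    lines.append("Question: is any agent acting adversarially? Answer yes or no.")
    return "\n".join(lines)
```

Feeding such serializations to an LLM for binary classification is what makes the defender task-agnostic: no trivia knowledge is needed, only the interaction history.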
Results
The results show that LLMs can effectively identify adversarial behavior in real-time, even with minimal information about the task or the adversary's strategy. The study found that lightweight commercial models could be fine-tuned to reliably flag adversarial behavior, indicating the potential for LLMs to recognize patterns of strategic deception from behavioral traces alone.
Implications
The findings suggest that integrating LLMs as overseers in human-AI teams could significantly improve the detection of adversarial intent, thereby enhancing trust and collaboration in high-stakes environments. This research opens avenues for developing more resilient AI systems capable of safeguarding against manipulation and deception.
Generalization Limits of In-Context Operator Networks for Higher-Order Partial Differential Equations
Theory
- ICONs extend the capabilities of operator networks to higher-order PDEs.
- The model maintains qualitative accuracy despite reduced point-wise accuracy in complex problems.
- New computational methods are required for efficient training on higher-dimensional differential equations.
- The study quantifies the generalization limits of ICONs for in-distribution and out-of-distribution problems.
Summary
This paper explores the generalization capabilities of In-Context Operator Networks (ICONs), a novel class of operator networks designed for solving higher-order partial differential equations (PDEs). The authors extend the foundational model to handle a broader range of differential equations, demonstrating that while complex inputs necessitate new computational methods, the core machine learning techniques remain consistent with simpler cases. The study reveals that although point-wise accuracy diminishes for higher-order problems, such as the heat equation, the model successfully retains qualitative accuracy in capturing the dynamics and overall behavior of solutions. This indicates the model's ability to extrapolate essential solution characteristics beyond its training regime. The paper also discusses the implementation of a user interface for equation solutions, highlighting that while the model's evaluated solutions are adequate for simpler problems, they do not yet surpass traditional numerical solvers for more complex equations.
Methodology
The authors employed in-context learning techniques combined with operator learning to develop ICONs. They generated synthetic data using numerical methods and implemented a model architecture inspired by encoder-decoder structures. The study involved evaluating the model's performance on both in-distribution and out-of-distribution problems to assess generalization capabilities.
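The synthetic-data step can be sketched for the 1-D heat equation with an explicit finite-difference scheme. The grid, diffusivity, and (condition, quantity-of-interest) pairing are illustrative choices, not the paper's exact setup.

```python
def heat_step(u, alpha=0.1):
    """One explicit finite-difference step of the 1-D heat equation
    u_t = alpha * u_xx, with unit grid spacing and zero boundary values."""
    return [0.0] + [u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
                    for i in range(1, len(u) - 1)] + [0.0]

def make_pairs(u0, steps=5):
    # (condition, quantity-of-interest) pairs -- the state now versus the
    # state one step later -- the kind of example an in-context operator
    # network conditions on at inference time.
    traj = [u0]
    for _ in range(steps):
        traj.append(heat_step(traj[-1]))
    return list(zip(traj[:-1], traj[1:]))
```

At inference, a handful of such pairs are placed in the context and the model is asked to apply the implied operator to a new condition, with no retraining on the new equation.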
Results
The results indicate that while ICONs exhibit a decline in point-wise accuracy for higher-order PDEs, they effectively capture the qualitative dynamics of solutions. The model's performance was analyzed through various problem setups, demonstrating strong generalization within the training distribution and adaptability to varying input complexities.
Implications
The findings suggest that ICONs could be a valuable tool for solving complex differential equations that lack analytical solutions, potentially transforming approaches in computational mathematics and engineering. The ability to generalize to new problems without retraining could streamline workflows in various scientific and engineering applications.
CN-Buzz2Portfolio: A Chinese-Market Dataset and Benchmark for LLM-Based Macro and Sector Asset Allocation from Daily Trending Financial News
NLP
Large Language Models
- Introduction of CN-Buzz2Portfolio as a benchmark for evaluating LLMs in financial asset allocation.
- Focus on macro and sector allocation rather than individual stock picking to reduce noise and improve evaluation accuracy.
- Development of a Tri-Stage CPA Agent Workflow for systematic assessment of LLMs.
- Significant disparities observed among LLMs in translating financial narratives into actionable portfolio strategies.
Summary
The paper introduces CN-Buzz2Portfolio, a novel benchmark designed to evaluate Large Language Models (LLMs) in the context of macro and sector asset allocation based on daily trending financial news in the Chinese market. The authors identify significant limitations in existing evaluation paradigms, particularly the reliance on direct live trading and static benchmarks that focus on entity-level stock picking. To address these issues, CN-Buzz2Portfolio provides a reproducible dataset that simulates a realistic public attention stream, requiring agents to derive investment logic from broader market narratives rather than pre-filtered news. The proposed Tri-Stage CPA Agent Workflow—comprising Compression, Perception, and Allocation—enables the assessment of LLMs across diversified asset classes like Exchange Traded Funds (ETFs), thereby reducing idiosyncratic volatility. The authors conduct extensive experiments with nine LLMs, revealing notable differences in how these models translate macro-level narratives into portfolio weights. This work not only contributes a new dataset and evaluation framework but also enhances understanding of the relationship between reasoning and financial decision-making in a complex market environment.
Methodology
The authors curated a dataset from multi-platform daily trending news, simulating a public attention stream relevant to the Chinese market. They proposed a novel task that requires agents to construct diversified portfolios based on macro and sector narratives. The evaluation utilized a standardized Tri-Stage CPA Agent Workflow to compare the performance of various LLMs.
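The final Allocation stage of such a workflow can be sketched as a score-to-weights map. The sector names, the softmax choice, and the temperature are assumptions for illustration; the paper does not prescribe this exact transform, and the upstream Compression and Perception stages (LLM calls) are omitted.

```python
import math

def allocate(sector_scores, temperature=1.0):
    """Allocation-stage sketch: turn per-sector sentiment scores (e.g. from
    an LLM perception stage) into long-only portfolio weights via softmax.
    Lower temperature concentrates weight on the top-scoring sectors."""
    exps = {s: math.exp(v / temperature) for s, v in sector_scores.items()}
    z = sum(exps.values())
    return {s: e / z for s, e in exps.items()}
```

Mapping narratives to sector- or ETF-level weights like this, rather than to individual tickers, is what lets the benchmark evaluate reasoning with less idiosyncratic noise.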
Results
The experiments revealed significant differences in how the nine evaluated LLMs interpreted macro-level narratives and allocated assets, highlighting the varying capabilities of these models in financial reasoning and decision-making.
Implications
The findings suggest that LLMs can be effectively utilized as financial agents capable of making informed asset allocation decisions based on macroeconomic narratives. The open-source nature of the dataset and evaluation framework encourages further exploration and development of LLMs in financial contexts.
Between the Layers Lies the Truth: Uncertainty Estimation in LLMs Using Intra-Layer Local Information Scores
NLP
Large Language Models
Interpretability
- Introduces a compact uncertainty estimation method using intra-layer local information scores.
- Achieves competitive performance compared to traditional probing methods with a single forward pass.
- Demonstrates robustness under cross-dataset transfer and quantization.
- Provides insights into cross-layer agreement patterns in LLMs.
Summary
This paper addresses the challenge of uncertainty estimation (UE) in large language models (LLMs), which often produce confident but incorrect outputs. The authors propose a novel method that leverages intra-layer local information scores to assess cross-layer agreement patterns in internal representations of LLMs. Unlike traditional output-based heuristics that are brittle and probing methods that are high-dimensional and task-specific, the proposed method is compact and efficient, requiring only a single forward pass through the model. The approach computes a pairwise KL divergence between layers at task-relevant tokens, resulting in a structured representation of neuronal activations. A gradient-boosted decision tree (GBDT) is then trained on these representations to predict the correctness of the model's answers, providing a per-instance uncertainty score. The authors validate their method across three different models and various datasets, demonstrating that it matches or outperforms existing probing techniques in both in-distribution and cross-dataset scenarios. The findings suggest that the proposed method not only enhances performance but also offers insights into how different models encode uncertainty, paving the way for safer deployment of LLMs in critical applications.
Methodology
The authors compute KL divergence between layers' post-MLP activations at task-relevant tokens to create a structured representation of cross-layer agreement. A GBDT model is then trained on these representations to produce uncertainty scores for individual predictions.
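The feature construction can be sketched as follows. Treating each layer's vector as logits over a small support is a simplification of the paper's post-MLP activation treatment; the downstream GBDT is omitted, and the sketch assumes the compared distributions share support.

```python
import math

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL divergence between two discrete distributions (lists of probs);
    assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_layer_features(layer_logits):
    """Pairwise KL between each layer's distribution at one token position.
    The resulting compact vector is the kind of feature a gradient-boosted
    tree model could score for per-instance uncertainty."""
    dists = [softmax(l) for l in layer_logits]
    n = len(dists)
    return [kl(dists[i], dists[j]) for i in range(n) for j in range(i + 1, n)]
```

Layers that agree contribute near-zero divergences, so the feature vector directly encodes where along the depth of the model the representations diverge.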
Results
The proposed method matches probing performance in in-distribution scenarios with minor differences and significantly outperforms probing in cross-dataset transfer. It shows improvements of up to +2.86 AUPRC and +21.02 Brier points. Even under 4-bit weight-only quantization, it maintains robustness, improving by +1.94 AUPRC points and +5.33 Brier points on average.
Implications
The findings suggest that the proposed uncertainty estimation method can enhance the reliability of LLMs in safety-critical applications, providing a lightweight and interpretable approach to understanding model confidence and correctness.
Calibeating Made Simple
Theory
Optimization
- Calibeating is shown to be minimax-equivalent to regret minimization, allowing for a unified analysis across different loss functions.
- New optimal rates for multi-calibeating are derived, improving upon previous results for multiple forecasters.
- The paper introduces a meta-algorithm that achieves simultaneous calibeating and calibration for the Brier loss, providing optimal rates.
- The results extend existing guarantees for specific losses to a broader class of mixable and bounded losses.
Summary
This paper addresses the problem of calibeating, which involves post-processing external forecasts to minimize cumulative losses while matching an informativeness-based benchmark. The authors present a novel approach that reduces calibeating to established online learning techniques, allowing for a more generalized analysis across various loss functions. They demonstrate that calibeating is minimax-equivalent to regret minimization, recovering existing results for Brier and log losses and extending them to mixable and general bounded losses. Additionally, they introduce a framework for multi-calibeating, showing its equivalence to combining calibeating with the classical expert problem, yielding improved rates for multiple forecasters. The paper also provides new bounds for achieving both calibeating and calibration simultaneously, particularly for the Brier loss, resulting in the first calibrated algorithm that achieves optimal calibeating rates for binary predictions. Overall, the work simplifies the understanding of calibeating and offers a modular approach to derive results for various loss functions.
Methodology
The authors utilize reductions from calibeating to standard online learning primitives, allowing them to derive general upper and lower bounds for various loss functions. They analyze calibeating through the lens of no-regret learning and expert problems, employing classical algorithms to instantiate their reductions.
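The basic calibeating move can be sketched online: bin the external forecast and predict the running empirical outcome rate of that bin. The bin count and the 0.5 prior are illustrative choices, and none of the paper's rate guarantees or its regret-minimization reductions are reproduced here.

```python
def calibeat(forecasts, outcomes, n_bins=10):
    """Online calibeating sketch: replace each external forecast by the
    running empirical outcome frequency of its forecast bin. This tracks
    the forecaster's informativeness (its refinement) while shedding its
    miscalibration."""
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    preds = []
    for f, y in zip(forecasts, outcomes):
        b = min(int(f * n_bins), n_bins - 1)
        pred = sums[b] / counts[b] if counts[b] else 0.5  # prior before data
        preds.append(pred)
        counts[b] += 1
        sums[b] += y
    return preds
```

Against a systematically biased forecaster, the bin averages converge to the true conditional frequencies, so the post-processed Brier loss beats the original by roughly its calibration error.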
Results
The paper establishes that calibeating can achieve O(log T) rates for Brier and log losses, extends these results to mixable losses, and provides new bounds for general bounded losses. For multi-calibeating, they derive logarithmic bounds that significantly improve upon previous polynomial dependencies. Additionally, they present a meta-algorithm that ensures simultaneous calibeating and calibration, achieving optimal rates for binary predictions.
Implications
The findings have significant implications for probabilistic forecasting in various domains, as they provide a framework for improving the reliability of forecasts without sacrificing their informativeness. This can enhance decision-making processes in fields such as finance, healthcare, and machine learning applications where accurate probability estimates are crucial.
Causal Discovery in Action: Learning Chain-Reaction Mechanisms from Interventions
Theory
Graph Learning
- Causal discovery is feasible in chain-reaction systems using blocking interventions.
- The proposed method achieves exponential error decay and logarithmic sample complexity.
- Experiments validate the effectiveness of the method in diverse causal environments.
- Observational heuristics fail in complex scenarios, highlighting the need for interventional approaches.

Summary
This paper addresses the challenges of causal discovery in dynamic systems, particularly those exhibiting chain-reaction mechanisms where components activate sequentially. The authors propose a novel approach to identify causal structures through blocking interventions that prevent individual components from activating. They demonstrate that the causal graph can be uniquely identified in such systems, which are characterized by directional, cascade-like interactions. The study introduces a minimal estimator with finite-sample guarantees, achieving exponential error decay and logarithmic sample complexity. Through experiments on synthetic models and various chain-reaction environments, the authors show that their method reliably recovers causal structures from a limited number of interventions, outperforming traditional observational heuristics that struggle in scenarios with delayed or overlapping causal effects.
Methodology
The authors model interactions in chain-reaction systems using a Structural Causal Model (SCM) with binary variables representing object activations. They apply blocking interventions to identify causal relationships, proving that the causal structure is identifiable from these interventions. A finite-sample estimator is proposed, supported by theoretical guarantees.
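The identification idea behind blocking interventions can be sketched on a toy cascade; this is an illustration of the principle, not the authors' estimator or its finite-sample analysis. Here a node activates iff any parent has activated, and blocking node j reveals exactly its descendants:

```python
def simulate(n_nodes, edges, blocked=None):
    """Cascade dynamics: node 0 fires; a node fires iff any parent fired.
    A blocking intervention forces a node to stay inactive."""
    blocked = blocked or set()
    active = [False] * n_nodes
    active[0] = 0 not in blocked
    changed = True
    while changed:
        changed = False
        for (u, v) in edges:
            if active[u] and v not in blocked and not active[v]:
                active[v] = True
                changed = True
    return active

edges = [(0, 1), (1, 2), (2, 3)]          # ground-truth chain
n = 4
# Block each node in turn; nodes that stop firing are its descendants.
descendants = {}
for j in range(n):
    base = simulate(n, edges)
    after = simulate(n, edges, blocked={j})
    descendants[j] = {k for k in range(n) if base[k] and not after[k] and k != j}

print(descendants)  # {0: {1, 2, 3}, 1: {2, 3}, 2: {3}, 3: set()}
```

In a noiseless chain like this, one blocking intervention per node suffices to order the cascade; the paper's contribution is handling noisy activations with exponential error decay.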
Results
The results indicate that the proposed method can reliably recover the causal structure from a small number of interventions, demonstrating exponential error decay and logarithmic sample complexity. The method outperformed observational heuristics, particularly in scenarios with causal ambiguities such as simultaneous activations.
Implications
This research has significant implications for fields where understanding causal relationships is crucial, such as biological systems, safety mechanisms, and complex engineered systems. It provides a framework for effective causal discovery that can enhance decision-making and system design.
Robustness Quantification for Discriminative Models: a New Robustness Metric and its Application to Dynamic Classifier Selection
Theory
Interpretability
- Introduction of a new robustness metric applicable to any probabilistic discriminative classifier.
- The metric is based on Constant Odds Ratio (COR) perturbation, allowing for use with continuous and mixed features.
- Demonstrated superior correlation with accuracy compared to existing robustness metrics.
- Application of the robustness metric in dynamic classifier selection strategies.
Summary
This paper addresses the challenge of quantifying the robustness of discriminative models in machine learning, particularly focusing on how much uncertainty a classifier can handle before its prediction changes. The authors propose a new robustness metric that is applicable to any probabilistic discriminative classifier and any type of features, overcoming limitations of existing methods that require generative models or are restricted to specific architectures or discrete features. The new metric is based on the Constant Odds Ratio (COR) perturbation, allowing for a broader application. The authors demonstrate that this metric effectively distinguishes between reliable and unreliable predictions through experiments using Accuracy Rejection Curves, showing better correlation with accuracy compared to existing alternatives. Furthermore, the paper explores the application of this robustness metric in dynamic classifier selection, proposing strategies for selecting the most appropriate model based on feature inputs. The findings suggest that robustness quantification can enhance decision-making processes in machine learning by providing clearer insights into the reliability of predictions.
Methodology
The authors developed a new robustness metric based on COR perturbation, which quantifies the stability of predictions against perturbations in the underlying probability distribution. They conducted experiments using Accuracy Rejection Curves to evaluate the correlation of their metric with prediction accuracy across various model architectures. Additionally, they proposed two strategies for dynamic classifier selection based on the robustness of predictions.
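The intuition behind an odds-ratio-based robustness score can be shown for a binary classifier; the function below is a simplified stand-in, not the paper's COR metric. Under a perturbation that multiplies the predicted odds by a constant c, the predicted label flips exactly when log c crosses -log(odds), so the magnitude of the log-odds measures how much perturbation the prediction survives:

```python
import math

def robustness(p, eps=1e-12):
    """Largest log-odds perturbation the prediction survives.
    With odds' = c * odds, the label flips when log c = -log(odds),
    so |log odds| quantifies the prediction's stability."""
    p = min(max(p, eps), 1 - eps)      # clamp away from 0 and 1
    odds = p / (1 - p)
    return abs(math.log(odds))

for p in (0.51, 0.7, 0.99):
    print(f"p={p}: robustness={robustness(p):.3f}")
```

A prediction near 0.5 has robustness near zero (any perturbation flips it), while a confident prediction tolerates a large constant-odds-ratio shift; this is the property the Accuracy Rejection Curves exploit.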
Results
The new robustness metric showed a strong correlation with the accuracy of predictions, outperforming existing metrics in various contexts. The dynamic classifier selection strategies developed using this metric demonstrated effective model selection based on the robustness of individual predictions, enhancing the reliability of decision-making in uncertain environments.
Implications
The proposed robustness metric and its application in dynamic classifier selection have significant implications for improving the reliability of machine learning models, particularly in high-stakes decision-making scenarios. This work could lead to more robust and interpretable models that better handle uncertainty, ultimately enhancing user trust and safety in machine learning applications.
MsFormer: Enabling Robust Predictive Maintenance Services for Industrial Devices
Time Series
- MsFormer is designed to capture multi-scale temporal dependencies in industrial IoT sensor data.
- The model incorporates a Multi-scale Sampling module and a lightweight attention mechanism for improved performance in data-scarce environments.
- Extensive experiments show that MsFormer outperforms existing predictive maintenance models across diverse datasets and conditions.
- The framework addresses the limitations of traditional deep learning methods in modeling long-term degradation patterns.
Summary
The paper presents MsFormer, a lightweight Multi-scale Transformer model designed to enhance predictive maintenance services for industrial devices by addressing the challenges posed by streaming sensor data. Traditional deep-learning methods struggle to capture complex dependencies in industrial IoT sensor data, particularly due to multi-scale temporal correlations and limited dataset sizes. MsFormer introduces a Multi-scale Sampling (MS) module and a tailored position encoding mechanism to effectively model these correlations across multiple time horizons. Additionally, it employs a lightweight attention mechanism that utilizes straightforward pooling operations instead of self-attention, making it more suitable for data-scarce environments. The proposed framework is validated through extensive experiments on real-world datasets, demonstrating significant performance improvements over existing state-of-the-art methods and strong generalizability across various industrial devices and operating conditions, while maintaining a reliable Quality of Service (QoS).
Methodology
MsFormer employs a four-stage framework that includes a Multi-scale Sampling module to restructure timestamps for capturing multi-scale temporal correlations, a lightweight attention mechanism for efficient processing, and a tailored position encoding to enhance cross-scale correlation extraction. The model is specifically designed to work effectively with smaller-scale datasets typical in industrial settings.
Results
The experimental results indicate that MsFormer significantly surpasses state-of-the-art methods in predictive maintenance tasks, achieving improved accuracy and reliability across various industrial devices and operating conditions.
Implications
The development of MsFormer has potential applications in enhancing predictive maintenance services in various industrial sectors, leading to reduced downtime and improved operational efficiency. Its ability to function effectively in data-scarce environments makes it a valuable tool for industries reliant on IoT sensor data.
Robustness Quantification and Uncertainty Quantification: Comparing Two Methods for Assessing the Reliability of Classifier Predictions
Theory
- Robustness Quantification (RQ) outperforms Uncertainty Quantification (UQ) in assessing classifier reliability.
- RQ and UQ are complementary approaches that can be combined for enhanced reliability assessments.
- The study utilizes real benchmark datasets to validate the effectiveness of RQ and UQ.
- Both methods address the inherent uncertainties in classifier predictions, particularly in high-stakes scenarios.
Summary
This paper investigates two methodologies for assessing the reliability of classifier predictions: Robustness Quantification (RQ) and Uncertainty Quantification (UQ). The authors elucidate the conceptual distinctions between RQ and UQ, and conduct a comparative analysis using various benchmark datasets. The findings indicate that RQ can outperform UQ in both standard conditions and scenarios involving distribution shifts. Furthermore, the authors explore the complementarity of RQ and UQ, demonstrating that integrating both approaches can yield superior reliability assessments. The study emphasizes the importance of reliability in high-stakes AI applications, where understanding the reliability of individual predictions is crucial. The paper also discusses the theoretical underpinnings of both methods, focusing on their application to probabilistic generative classifiers, specifically the Naive Bayes Classifier and Generative Forests.
Methodology
The authors benchmarked RQ and UQ on real datasets, comparing their performance in terms of reliability assessment for classifiers. They utilized probabilistic generative classifiers, specifically the Naive Bayes Classifier and Generative Forests, to evaluate the methods under various conditions, including distribution shifts.
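The evaluation protocol shared by RQ and UQ, accuracy after rejecting the least-reliable predictions, can be sketched directly; the scores and labels below are toy data, not results from the paper:

```python
def accuracy_rejection_curve(scores, correct, fractions=(0.0, 0.25, 0.5)):
    """Accuracy on the subset kept after rejecting the lowest-score
    (least reliable) fraction of predictions. A good reliability score
    makes accuracy rise as more predictions are rejected."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # least reliable first
    curve = []
    for f in fractions:
        kept = order[int(f * len(order)):]
        curve.append(sum(correct[i] for i in kept) / len(kept))
    return curve

# Toy data: a higher reliability score tends to coincide with correctness.
scores  = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
correct = [0,   0,   1,   0,   1,   1,   1,   1]
print(accuracy_rejection_curve(scores, correct))  # accuracy rises with rejection
```

Comparing RQ and UQ then amounts to comparing the curves their respective scores induce on the same classifier.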
Results
The results indicate that RQ consistently outperforms UQ in reliability assessments across different datasets. Additionally, the combination of RQ and UQ provides even better reliability evaluations, highlighting the strengths of both methodologies.
Implications
The findings suggest that employing both RQ and UQ can significantly improve the reliability of AI predictions, which is particularly beneficial in critical applications such as healthcare, where decision-making relies heavily on the trustworthiness of model outputs.
A Multi-Task Targeted Learning Framework for Lithium-Ion Battery State-of-Health and Remaining Useful Life
Time Series
Optimization
- Proposes a multi-task targeted learning framework for SOH and RUL prediction.
- Integrates multi-scale CNNs, improved extended LSTM, and dual-stream attention modules.
- Achieves significant performance improvements over traditional and state-of-the-art methods.
- Utilizes Hyperopt for optimization, reducing manual hyperparameter tuning.
Summary
This paper addresses the critical need for accurate prediction of the state-of-health (SOH) and remaining useful life (RUL) of lithium-ion batteries, which is essential for the safe and efficient operation of electric vehicles. The authors identify limitations in current deep learning approaches, particularly in feature extraction and modeling temporal dependencies, especially when using traditional recurrent neural networks. To overcome these challenges, they propose a novel multi-task targeted learning framework that integrates multiple neural network components: a multi-scale feature extraction module, an improved extended LSTM, and a dual-stream attention module. The multi-scale CNNs are designed to capture detailed local battery decline patterns, while the improved extended LSTM enhances the model's ability to retain long-term temporal information. The dual-stream attention module focuses on key information relevant to SOH and RUL predictions by assigning higher weights to important features. The framework achieves a many-to-two mapping through a dual-task layer and utilizes the Hyperopt optimization algorithm to enhance performance and minimize manual hyperparameter tuning. Experimental results on battery aging datasets show reported improvements in average RMSE of 111.3% for SOH and 33.0% for RUL predictions relative to existing methods.
Methodology
The proposed framework employs a multi-scale feature extraction module using CNNs to capture local battery decline patterns, an improved extended LSTM for long-term temporal information retention, and a dual-stream attention module that focuses on key features relevant to SOH and RUL. The model's performance is optimized using the Hyperopt algorithm.
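The many-to-two mapping and the hyperparameter search can be sketched abstractly. The joint loss below is a standard weighted multi-task objective, and the tiny random-search tuner is a deliberately simple stand-in for Hyperopt's TPE; the search space and toy objective are invented for illustration:

```python
import random

def dual_task_loss(pred_soh, true_soh, pred_rul, true_rul, w=(1.0, 1.0)):
    """Joint objective for a many-to-two mapping: a weighted sum of
    per-task mean-squared errors for SOH and RUL."""
    def mse(p, t):
        return sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
    return w[0] * mse(pred_soh, true_soh) + w[1] * mse(pred_rul, true_rul)

def random_search(objective, space, n_trials=50, seed=0):
    """Minimal tuner standing in for Hyperopt: sample hyperparameters
    uniformly and keep the best-scoring configuration."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        cfg = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        score = objective(cfg)
        if best is None or score < best[0]:
            best = (score, cfg)
    return best

# Toy objective minimized near a task weight of 0.3.
obj = lambda cfg: (cfg["w_soh"] - 0.3) ** 2
score, cfg = random_search(obj, {"w_soh": (0.0, 1.0)})
print(f"best w_soh ≈ {cfg['w_soh']:.2f}")   # close to the optimum 0.3
```

In the paper's setting the objective would instead train the full network and return validation error, with Hyperopt exploring the space more efficiently than uniform sampling.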
Results
The proposed method improves the average RMSE for SOH predictions by 111.3% and for RUL predictions by 33.0% relative to traditional and state-of-the-art methods, demonstrating its effectiveness in accurately predicting battery health and lifespan.
Implications
The framework can significantly enhance battery management systems in electric vehicles, leading to improved safety and efficiency. It can also be applied to other domains where accurate time-series predictions are crucial.
Behavioral Heterogeneity as Quantum-Inspired Representation
Theory
Time Series
Robotics
- Introduces a quantum-inspired framework for modeling driver behavior as evolving latent states.
- Uses density matrices to capture the dynamic transitions between different driving behaviors.
- Employs non-linear Random Fourier Features for embedding behavioral observations.
- Demonstrates the approach on empirical driving data, highlighting its effectiveness in extracting driving profiles.
Summary
This paper addresses the challenge of modeling driver heterogeneity in transportation systems, which is often oversimplified into static categories. The authors propose a quantum-inspired representation that treats each driver as an evolving latent state, represented by a density matrix. This approach captures the dynamic nature of driving behavior, allowing for a richer understanding of how drivers transition between different behavioral modes based on context and temporal persistence. The methodology employs non-linear Random Fourier Features to embed behavioral observations, facilitating the extraction and analysis of driving profiles from empirical data, specifically the Third Generation Simulation Data (TGSIM). The authors emphasize the importance of understanding the logic governing behavioral transitions to avoid misinterpreting data when training agents in mixed traffic environments. Their framework is designed to be model-agnostic, preserving the complexity of behavioral transitions while remaining statistically tractable. The paper also provides an open-source codebase to support reproducibility of the results.
Methodology
The authors develop a quantum-inspired modeling framework that satisfies specific structural criteria, including the use of normalized states in a Hilbert space, quadratic measurement rules consistent with the Born rule, and valid density matrices. The framework involves representing driver behavior through a behavioral vector derived from trajectory data, followed by constructing density matrices to model latent states and their evolution over time.
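The structural criteria listed above, normalized states, valid density matrices, and Born-rule measurement, can be sketched in a few lines. The two "behavioral mode" vectors and their occupation probabilities below are invented; this illustrates the representation, not the paper's TGSIM pipeline:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def outer(v):
    return [[a * b for b in v] for a in v]

def density_matrix(states, probs):
    """Mixed state rho = sum_i p_i |psi_i><psi_i| over normalized
    behavioral state vectors: symmetric, trace 1, positive semidefinite."""
    d = len(states[0])
    rho = [[0.0] * d for _ in range(d)]
    for p, s in zip(probs, states):
        op = outer(normalize(s))
        for i in range(d):
            for j in range(d):
                rho[i][j] += p * op[i][j]
    return rho

def born_probability(rho, outcome):
    """Born rule: P(outcome) = <phi| rho |phi> for a normalized outcome."""
    phi = normalize(outcome)
    return sum(phi[i] * rho[i][j] * phi[j]
               for i in range(len(phi)) for j in range(len(phi)))

# Two hypothetical modes ("cautious", "aggressive") occupied 70/30.
rho = density_matrix([[1.0, 0.2], [0.1, 1.0]], [0.7, 0.3])
print("trace:", rho[0][0] + rho[1][1])                 # ≈ 1.0
print("P(cautious axis):", born_probability(rho, [1.0, 0.0]))
```

Because the state is a density matrix rather than a single label, the same driver can carry weight on several behavioral modes at once, which is what lets the framework track transitions instead of a static category.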
Results
The evaluation of the proposed framework on TGSIM data shows successful extraction and analysis of driving profiles, demonstrating the ability to capture the richness of behavioral transitions and the contextual sensitivity of driving behavior.
Implications
This work has significant implications for the development of autonomous vehicles and advanced driver assistance systems, as it provides a more nuanced understanding of driver behavior that can inform the design of intelligent traffic systems and improve safety in mixed traffic environments.
Balancing Safety and Efficiency in Aircraft Health Diagnosis: A Task Decomposition Framework with Heterogeneous Long-Micro Scale Cascading and Knowledge Distillation-based Interpretability
Interpretability
Time Series
Efficient ML
- Introduction of the Diagnosis Decomposition Framework (DDF) for aircraft health diagnosis.
- Separation of diagnosis into Anomaly Detection (AD) and Fault Classification (FC) for improved efficiency.
- Utilization of advanced techniques like ConvTokMHSA and MMK Net for feature extraction.
- Implementation of knowledge distillation for interpretability in decision-making.
Summary
This paper addresses the challenges of aircraft health diagnosis in general aviation, specifically focusing on data uncertainty, task heterogeneity, and computational inefficiency. The authors propose a Diagnosis Decomposition Framework (DDF) that separates the diagnosis process into two distinct subtasks: Anomaly Detection (AD) and Fault Classification (FC). This separation is achieved through the Long-Micro Scale Diagnostician (LMSD), which employs a strategy of 'long-range global screening and micro-scale local precise diagnosis.' The LMSD utilizes a Convolutional Tokenizer with Multi-Head Self-Attention (ConvTokMHSA) for global operational pattern discrimination and a Multi-Micro Kernel Network (MMK Net) for local fault feature extraction. The decoupled training approach allows for optimized pathways for both large-sample lightweight and small-sample complex tasks, significantly reducing computational overhead. Additionally, the Keyness Extraction Layer (KEL) incorporates knowledge distillation to provide interpretable, physically traceable explanations for the two-stage decisions. Experimental results on the NGAFID dataset show a 4-8% improvement in the Multi-Class Weighted Penalty Metric (MCWPM) over baseline models, along with reduced training time, demonstrating the framework's advantages in adaptability, interpretability, and efficiency. This methodology offers a deployable solution for enhancing health management in general aviation.
Methodology
The authors developed the Diagnosis Decomposition Framework (DDF) that decouples the diagnosis process into Anomaly Detection (AD) and Fault Classification (FC). The Long-Micro Scale Diagnostician (LMSD) employs a Convolutional Tokenizer with Multi-Head Self-Attention for global pattern recognition and a Multi-Micro Kernel Network for local feature extraction. The framework also includes a Keyness Extraction Layer for interpretability through knowledge distillation.
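The decoupled two-stage idea, a cheap long-range screen that gates a heavier micro-scale classifier, can be sketched as follows. The threshold rule and the sign-based classifier are hypothetical stand-ins for ConvTokMHSA and the MMK Net, chosen only to show the control flow:

```python
def anomaly_detector(window, threshold=3.0):
    """Stage 1 (lightweight, long-range screening): flag a window whose
    mean deviates too far from the nominal level."""
    return abs(sum(window) / len(window)) > threshold

def fault_classifier(window):
    """Stage 2 (heavier, micro-scale): runs only on flagged windows.
    Toy rule standing in for the MMK Net: sign of the drift decides."""
    return "over-speed" if sum(window) > 0 else "under-speed"

def diagnose(windows):
    results = []
    for w in windows:
        if not anomaly_detector(w):
            results.append("nominal")      # cheap path, classifier skipped
        else:
            results.append(fault_classifier(w))
    return results

windows = [[0.1, -0.2, 0.3], [5.0, 6.0, 4.5], [-7.0, -6.5, -8.0]]
print(diagnose(windows))  # ['nominal', 'over-speed', 'under-speed']
```

Because most windows in normal operation take the cheap path, the expensive classifier's cost is paid only on anomalies, which is the source of the framework's computational savings.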
Results
The proposed framework achieved a 4-8% improvement in the Multi-Class Weighted Penalty Metric (MCWPM) compared to baseline models while significantly reducing training time, validating its effectiveness in task adaptability, interpretability, and computational efficiency.
Implications
The findings suggest that the DDF can be effectively deployed in general aviation health management, enhancing safety and efficiency in aircraft diagnostics. The interpretability aspect also addresses the need for transparency in aviation safety-critical applications.
MCLR: Improving Conditional Modeling in Visual Generative Models via Inter-Class Likelihood-Ratio Maximization and Establishing the Equivalence between Classifier-Free Guidance and Alignment Objectives
Generative Models
Computer Vision
Theory
- Introduces MCLR, a new training objective that enhances inter-class separation in diffusion models.
- Establishes a theoretical equivalence between classifier-free guidance and alignment objectives.
- Demonstrates that MCLR can achieve CFG-like improvements without requiring inference-time guidance.
- Empirical results show significant qualitative and quantitative gains in generative performance.
Summary
This paper addresses the limitations of diffusion models in generative modeling, particularly their reliance on classifier-free guidance (CFG) during inference. While diffusion models trained with denoising score matching (DSM) theoretically recover the target data distribution, they often produce low-quality samples without CFG. The authors identify insufficient inter-class separation as a critical issue and propose a new training objective called MCLR (Maximum Inter-Class Likelihood-Ratio), which enhances class-specific structures by maximizing the likelihood ratios between different classes. Theoretical analysis reveals that CFG can be viewed as an optimal solution to a weighted MCLR objective, establishing a formal equivalence between CFG and alignment-based objectives. Empirical results demonstrate that models fine-tuned with MCLR achieve comparable improvements to those obtained with CFG, without the need for inference-time guidance. This work not only provides a new training paradigm for diffusion models but also offers insights into the mechanisms of CFG, suggesting that better training objectives can lead to improved generative performance.
Methodology
The authors propose the MCLR objective, which maximizes inter-class likelihood ratios during training. They conduct theoretical analyses to show the relationship between CFG and MCLR, and perform empirical evaluations to compare the performance of models trained with MCLR against those using standard DSM and CFG.
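The inference-time mechanism that MCLR aims to make unnecessary is the standard classifier-free guidance combination, which extrapolates the conditional score away from the unconditional one. The toy score vectors below are illustrative:

```python
def cfg_score(cond, uncond, w):
    """Classifier-free guidance: extrapolate the conditional score away
    from the unconditional one. w = 0 gives the unconditional model,
    w = 1 the plain conditional model, w > 1 sharpens class structure."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]

cond, uncond = [0.8, -0.1], [0.2, 0.1]
print(cfg_score(cond, uncond, 1.0))  # ≈ [0.8, -0.1], no guidance
print(cfg_score(cond, uncond, 3.0))  # ≈ [2.0, -0.5], amplified separation
```

The paper's theoretical result is that this w-weighted extrapolation coincides with the optimizer of a weighted MCLR objective, so training with MCLR bakes the same inter-class separation into the model itself.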
Results
Models fine-tuned with MCLR exhibited improvements in sample quality and class-specific feature generation, achieving results comparable to those obtained with CFG. The paper reports significant reductions in FID scores, indicating enhanced fidelity and diversity in generated samples.
Implications
The findings suggest that improving training objectives can lead to better performance in generative models, potentially reducing the need for complex inference-time modifications. This could streamline the deployment of diffusion models in various applications, such as image synthesis and conditional generation tasks.
Spiking Personalized Federated Learning for Brain-Computer Interface-Enabled Immersive Communication
Federated Learning
Multimodal
Efficient ML
- Introduction of a BCI-driven framework for immersive communication.
- Development of a personalized federated learning model that processes neurodiverse brain signals.
- Integration of spiking neural networks to reduce energy consumption in on-device learning.
- Experimental validation showing improved accuracy and energy efficiency compared to traditional methods.
Summary
This paper introduces a novel framework for immersive communication that utilizes brain-computer interfaces (BCIs) to capture brain signals, enabling the inference of user-centric states such as intention and discomfort. The authors propose a personalized federated learning (PFL) model that processes these brain signals, accommodating neurodiverse data while ensuring the privacy of sensitive information. To address the energy constraints of on-device learning in immersive terminals, the framework incorporates spiking neural networks (SNNs), which leverage sparse, event-driven computations to significantly reduce energy consumption during training and inference. Experimental results demonstrate that the SNN-enabled PFL achieves superior identification accuracy compared to traditional artificial neural network (ANN) approaches, while also reducing energy usage by a factor of 6.46. This approach not only enhances personalized adaptation in immersive environments but also improves the robustness of user experience across diverse individual profiles.
Methodology
The authors developed a personalized federated learning model that integrates spiking neural networks for processing brain signals acquired through BCIs. This approach allows for efficient computation and energy savings while maintaining high personalization performance. The methodology includes theoretical proofs of reduced gradient dissimilarity across users, enhancing training stability.
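The sparse, event-driven computation that makes SNNs energy-efficient comes from neurons that fire only when their membrane potential crosses a threshold. A minimal leaky integrate-and-fire neuron, with invented input currents and constants, illustrates the mechanism:

```python
def lif_neuron(inputs, tau=0.8, threshold=1.0):
    """Leaky integrate-and-fire: the membrane potential leaks by factor
    tau each step, integrates the input current, and emits a sparse
    binary spike train. Downstream work happens only at spikes,
    which is the source of the energy savings."""
    v, spikes = 0.0, []
    for current in inputs:
        v = tau * v + current
        if v >= threshold:
            spikes.append(1)
            v = 0.0            # reset after firing
        else:
            spikes.append(0)
    return spikes

print(lif_neuron([0.5, 0.5, 0.5, 0.0, 1.2, 0.1]))  # → [0, 0, 1, 0, 1, 0]
```

Replacing dense activations with such spike trains is what lets the on-device federated updates run within the energy budget of an immersive terminal.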
Results
The proposed SNN-enabled PFL achieved the highest identification accuracy on real EEG datasets while reducing inference energy consumption by a factor of 6.46 compared to conventional ANN-based personalized baselines.
Implications
This research has significant implications for the development of energy-efficient, user-centric immersive communication systems, particularly in applications involving virtual and augmented reality. It highlights the potential of integrating BCI technology with federated learning to enhance user experience and personalization in real-time.
A Comparative Study of Machine Learning Models for Hourly Forecasting of Air Temperature and Relative Humidity
Time Series
- XGBoost demonstrated the highest predictive accuracy among the models tested.
- The study utilized a comprehensive dataset of hourly meteorological observations from Chongqing.
- A systematic approach to data preprocessing and feature engineering was employed to enhance model performance.
- The results indicate significant potential for machine learning in short-term meteorological forecasting.
Summary
This study addresses the challenge of accurate short-term forecasting of air temperature and relative humidity in urban environments, particularly in complex topographies like Chongqing, China. It systematically compares seven machine learning models: eXtreme Gradient Boosting (XGBoost), Random Forest, Support Vector Regression (SVR), Multi-Layer Perceptron (MLP), Decision Tree, Long Short-Term Memory (LSTM) networks, and Convolutional Neural Network-LSTM (CNN-LSTM). The research employs a unified framework for data preprocessing, feature engineering, and time-series validation to evaluate the models' predictive accuracy and robustness using real-world open data from September 2024 to January 2026. The findings reveal that XGBoost outperforms the other models, achieving a mean absolute error (MAE) of 0.302 °C for air temperature and 1.271% for relative humidity, alongside an average R² of 0.989. These results underscore the effectiveness of tree-based ensemble learning in meteorological forecasting and offer practical insights for intelligent weather prediction in mountainous regions.
Methodology
The methodology involved a systematic comparison of various machine learning models using a unified framework that included data preprocessing, feature engineering (such as lag features and rolling statistics), and rigorous time-series validation. The models were evaluated based on their predictive accuracy and robustness using a dataset of hourly meteorological observations.
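The lag-feature and rolling-statistics step can be sketched directly; the temperature values, lag set, and window size below are invented for illustration:

```python
def make_features(series, lags=(1, 2, 3), window=3):
    """Build one row of lag features plus a rolling mean per timestamp,
    mirroring a lag + rolling-statistics feature-engineering step.
    Rows start once enough history exists for every feature."""
    start = max(max(lags), window)
    rows = []
    for t in range(start, len(series)):
        row = {f"lag_{k}": series[t - k] for k in lags}
        row["roll_mean"] = sum(series[t - window:t]) / window
        row["target"] = series[t]
        rows.append(row)
    return rows

temps = [20.1, 20.4, 21.0, 21.6, 22.0, 21.8]   # hourly temperatures (°C)
rows = make_features(temps)
print(rows[0])
```

Feeding such tabular rows to a gradient-boosted model like XGBoost is what allows a non-recurrent learner to exploit temporal structure, and only past values enter each row, preserving the time-series validation discipline the study emphasizes.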
Results
XGBoost achieved the best performance with a test mean absolute error of 0.302 °C for air temperature and 1.271% for relative humidity, along with an average R² of 0.989 across both forecasting tasks. This highlights the model's effectiveness in capturing the dynamics of temperature and humidity in a complex urban environment.
Implications
The study's findings have significant implications for urban management, public health, and agricultural activities, particularly in mountainous cities where traditional forecasting methods may struggle. The results can guide the development of intelligent meteorological forecasting systems that leverage machine learning techniques.
Interpretable Multiple Myeloma Prognosis with Observational Medical Outcomes Partnership Data
Interpretability
- Introduction of two training-time regularizers for interpretability in ML models.
- Alignment of complex models with simpler, interpretable models to guide predictions.
- Incorporation of clinical staging systems into the learning objective for consistency.
- Demonstrated competitive predictive performance on real-world clinical data.
Summary
This paper addresses the challenge of interpretability in machine learning (ML) models applied to clinical decision-making, specifically for predicting five-year survival in multiple myeloma (MM) patients. The authors propose two novel regularization techniques that integrate interpretability directly into the training process of ML models. The first technique aligns a flexible neural network with predictions from a simpler, interpretable logistic regression model based on clinically selected features. The second technique ensures that model predictions are consistent with the Revised International Staging System (R-ISS), which is a clinically established framework for stratifying risk in MM patients. The study utilizes clinical data from 812 patients at Helsinki University Hospital, demonstrating that the proposed methods yield models that not only achieve competitive predictive performance (accuracy up to 0.721) but also maintain clinically coherent behavior. By embedding interpretability into the learning objective, the authors provide a framework that enhances the trustworthiness of ML applications in healthcare, moving beyond traditional post-hoc explanation methods.
Methodology
The authors developed two regularization techniques that penalize deviations from predictions made by an interpretable logistic regression model and enforce consistency with the Revised International Staging System (R-ISS). These regularizers were integrated into the training objective of a flexible neural network model, allowing for interpretability to be a fundamental aspect of the model's learning process.
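The first regularizer, penalizing deviation from an interpretable model's prediction, has a simple generic form: task loss plus a weighted alignment penalty. The squared-difference penalty, the weight, and the probabilities below are illustrative, not the paper's exact formulation:

```python
import math

def bce(p, y, eps=1e-12):
    """Binary cross-entropy for a single prediction."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def regularized_loss(nn_pred, lr_pred, label, lam=0.5):
    """Training objective with an interpretability regularizer: the usual
    task loss plus a penalty for deviating from the prediction of a
    simple, clinically grounded logistic-regression model."""
    task = bce(nn_pred, label)
    alignment = (nn_pred - lr_pred) ** 2
    return task + lam * alignment

# Network agrees with the interpretable model -> no extra penalty.
print(regularized_loss(0.8, 0.8, 1))   # ≈ 0.223, pure task loss
# Same task loss, but disagreement with the simple model is penalized.
print(regularized_loss(0.8, 0.3, 1))   # ≈ 0.348
```

The second regularizer (R-ISS consistency) has the same shape, with the penalty measuring violations of the clinical stage ordering instead of a pointwise gap.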
Results
The proposed methods were validated on a dataset of 812 multiple myeloma patients, achieving an accuracy of up to 0.721 on the test set. SHAP values indicated that the models relied on clinically important features, confirming the effectiveness of the interpretability-driven regularization techniques.
Implications
The findings suggest that incorporating interpretability into the training of ML models can enhance their applicability in clinical settings, potentially leading to better decision-making in patient prognosis. This approach could be extended to other areas of healthcare where interpretability is crucial for trust and validation.
KV Cache Optimization Strategies for Scalable and Efficient LLM Inference
Large Language Models
Optimization
Efficient ML
- KV cache optimization is essential for efficient LLM deployment, especially with increasing context lengths.
- The paper categorizes KV cache strategies into five main directions, each with unique trade-offs.
- No single optimization technique is optimal across all scenarios; context length and hardware constraints matter.
- Adaptive multi-stage optimization pipelines are suggested as a future research direction.
Summary
This paper addresses the critical challenge of key-value (KV) cache optimization in Transformer-based large language models (LLMs), particularly as context lengths increase from thousands to millions of tokens. The authors systematically review recent KV cache optimization techniques, categorizing them into five main strategies: cache eviction, cache compression, hybrid memory solutions, novel attention mechanisms, and combination strategies. Each category is analyzed for its underlying mechanisms, deployment trade-offs, and empirical performance metrics, including memory reduction, throughput, and model accuracy. The paper emphasizes that no single optimization technique is universally superior; instead, the optimal approach is context-dependent, influenced by factors such as context length, hardware constraints, and workload characteristics. The authors propose adaptive, multi-stage optimization pipelines as a promising direction for future research, providing actionable insights for practitioners in various deployment scenarios, including long-context requests, high-throughput serving, and resource-constrained environments.
Methodology
The authors conducted a systematic review of recent KV cache optimization techniques, categorizing them into five principal strategies. They analyzed each category's mechanisms, trade-offs, and empirical performance metrics, mapping techniques to various practical deployment scenarios.
Results
The analysis revealed that different KV cache optimization techniques perform variably across different settings, with no single method dominating. The results highlighted the importance of context length, hardware constraints, and workload characteristics in determining the most effective optimization strategy.
Implications
The findings suggest that practitioners can enhance LLM deployment efficiency by selecting appropriate KV cache strategies based on specific requirements. The proposed adaptive optimization pipelines could lead to significant improvements in inference throughput and resource management in various applications, including edge devices and high-throughput data centers.
A Bayesian Learning Approach for Drone Coverage Network: A Case Study on Cardiac Arrest in Scotland
Optimization
Robotics
- Introduces a Bayesian learning framework for optimizing drone-assisted AED delivery networks.
- Focuses on the survival probability of OHCA patients to determine optimal drone station locations.
- Demonstrates the impact of environmental variability and spatial demand on network design.
- Assesses the economic viability of the proposed network through cost-effectiveness analysis.
A Bayesian Learning Approach for Drone Coverage Network: A Case Study on Cardiac Arrest in Scotland
Summary
This paper presents a Bayesian learning framework aimed at optimizing drone-assisted Automated External Defibrillator (AED) delivery networks, particularly in the context of responding to Out of Hospital Cardiac Arrest (OHCA) incidents in Scotland. The authors address the challenges of high capital costs and environmental uncertainties that hinder the implementation of such networks. By formulating an objective function based on the survival probability of OHCA patients, the study identifies optimal locations for drone stations while considering the coverage of existing Emergency Medical Services (EMS) infrastructure. The methodology utilizes geographically referenced cardiac arrest data to illustrate how environmental variability and spatial demand patterns affect drone station placement in both urban and rural settings. The results indicate that the proposed drone network design is not only robust but also economically viable, as it is expected to enhance emergency response coverage significantly, particularly in areas with longer ambulance response times. The findings underscore the potential of drone technology to complement traditional EMS, thereby improving survival rates in critical situations.
Methodology
The authors employ a reliability-informed Bayesian learning framework that integrates probabilistic coverage definitions based on response-time distributions. They formulate a quasi-Bayesian decision problem to quantify uncertainty in facility activation, incorporating prior structural knowledge and empirical demand data. Posterior samples of activation probabilities are collected across different seasonal configurations to enhance network robustness.
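The reliability-aware coverage idea can be sketched in miniature: sample station activation from Beta posteriors and estimate expected demand coverage by Monte Carlo. The stations, demand points, and posterior parameters below are all invented for illustration; this is not the paper's model.

```python
import random

random.seed(0)

# Demand point -> drone stations that can reach it within the response window.
demand = {"urban_a": {"s1", "s2"}, "rural_b": {"s2"}, "rural_c": {"s3"}}

# Beta(a, b) posterior on each station's activation probability
# (e.g. weather-dependent availability), updated from operational data.
posterior = {"s1": (8, 2), "s2": (5, 5), "s3": (2, 8)}

def expected_coverage(n_samples: int = 5000) -> float:
    total = 0.0
    for _ in range(n_samples):
        # One joint draw: sample each station's activation probability,
        # then whether it is actually available in this scenario.
        active = {s for s, (a, b) in posterior.items()
                  if random.random() < random.betavariate(a, b)}
        total += sum(bool(reach & active) for reach in demand.values()) / len(demand)
    return total / n_samples

cov = expected_coverage()
print(round(cov, 2))  # close to (0.9 + 0.5 + 0.2) / 3, about 0.53
```

An objective like the paper's would then pick station locations maximizing a survival-weighted version of this coverage under a budget constraint.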
Results
The study finds that the optimal placement of drone stations is significantly influenced by environmental factors and spatial demand patterns. The proposed drone-assisted AED delivery network is shown to be cost-effective, with the potential to improve emergency response coverage in both urban and rural areas, particularly where traditional ambulance services face delays.
Implications
The findings suggest that implementing drone-assisted AED delivery networks could substantially enhance emergency medical responses, particularly in remote areas. This approach could lead to improved survival rates for cardiac arrest patients and may serve as a model for integrating UAV technology into existing EMS frameworks.
SkillRouter: Retrieve-and-Rerank Skill Selection for LLM Agents at Scale
Large Language Models
NLP
Efficient ML
- Skill routing is a critical yet under-explored problem in LLM agent ecosystems.
- A skill's full implementation text (its body), not just its name and description, is essential for accurate skill selection, contrary to prior assumptions.
- SKILLROUTER achieves 74.0% top-1 routing accuracy with a compact architecture suitable for consumer hardware.
- A standardized evaluation benchmark with 80K skills and expert-verified queries is established.
SkillRouter: Retrieve-and-Rerank Skill Selection for LLM Agents at Scale
Summary
The paper addresses the challenge of skill routing in large-scale ecosystems of LLM agents, where the number of available skills has surged to tens of thousands. Traditional approaches have relied on exposing only skill names and descriptions to agents, assuming this metadata suffices for effective skill selection. However, this study reveals that the full implementation text, or skill body, is crucial for accurate routing. Through a systematic empirical analysis involving approximately 80,000 skills and 75 expert-verified queries, the authors demonstrate that excluding the skill body leads to significant performance degradation. They introduce SKILLROUTER, a two-stage retrieve-and-rerank pipeline that utilizes the complete skill text, achieving a top-1 routing accuracy of 74.0% with a compact model of only 1.2 billion parameters. This design allows for deployment on consumer hardware, making it practical for personal agent applications. The findings challenge existing assumptions about skill selection and provide a new benchmark for evaluating skill routing effectiveness.
Methodology
The authors conducted a systematic empirical study to analyze skill routing, focusing on the importance of different components of skill text (name, description, body). They developed SKILLROUTER, a two-stage retrieve-and-rerank pipeline that leverages the full skill text for improved accuracy. The study involved evaluating various retrieval methods and analyzing attention patterns to determine the significance of the skill body.
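The two-stage control flow can be sketched with toy scoring functions (token overlap stands in for the embedding retriever and the reranker model; the skill entries are invented):

```python
# Hedged sketch of a retrieve-and-rerank router: a cheap first stage over
# name + description, then a rerank of the shortlist using the full body.

skills = {
    "csv_parse": {"desc": "parse csv files", "body": "read rows split comma header"},
    "web_fetch": {"desc": "fetch url content", "body": "http get request download page"},
    "img_resize": {"desc": "resize images", "body": "scale width height pixels"},
}

def overlap(query: str, text: str) -> float:
    q, t = set(query.split()), set(text.split())
    return len(q & t) / max(len(q), 1)

def route(query: str, k: int = 2) -> str:
    # Stage 1: cheap retrieval over name + description only.
    shortlist = sorted(skills,
                       key=lambda s: overlap(query, s + " " + skills[s]["desc"]),
                       reverse=True)[:k]
    # Stage 2: rerank the shortlist using the full skill body as well.
    return max(shortlist, key=lambda s: overlap(
        query, s + " " + skills[s]["desc"] + " " + skills[s]["body"]))

print(route("download page from url"))  # web_fetch
```

The key point the paper makes survives even in this toy: the body ("download page") carries routing signal that the description alone ("fetch url content") does not.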
Results
The study found that skill routing is challenging, with the best existing methods achieving only 64.0% top-1 accuracy without the skill body. SKILLROUTER, utilizing the full skill text, achieved a top-1 routing accuracy of 74.0% and demonstrated superior performance compared to larger zero-shot baselines. The compact model design allows for practical deployment on consumer devices.
Implications
The findings have significant implications for the design of LLM agents and their skill selection mechanisms. By emphasizing the importance of the skill body, the research encourages the development of more effective routing systems that can operate efficiently on consumer hardware, enhancing the usability of personal agent products.
Confidence Calibration under Ambiguous Ground Truth
Theory
- Traditional confidence calibration methods fail under ambiguous ground truth due to reliance on majority-voted labels.
- Temperature Scaling is biased towards underestimating annotator uncertainty, leading to increased miscalibration.
- The proposed ambiguity-aware calibrators optimize against the full label distribution, improving calibration accuracy.
- Dirichlet-Soft shows the best performance, reducing true-label Expected Calibration Error (ECE) by 55-87%.
Confidence Calibration under Ambiguous Ground Truth
Summary
This paper addresses the issue of confidence calibration in machine learning models when faced with ambiguous ground truth, where annotators may disagree on labels. Traditional calibration methods, such as Temperature Scaling, assume a unique ground-truth label per input, which can lead to significant miscalibration when there is genuine disagreement among annotators. The authors demonstrate that calibrating models based on majority-voted labels can result in systematic failures, particularly as the entropy of the annotation increases. To tackle this problem, they propose a family of ambiguity-aware post-hoc calibrators that optimize proper scoring rules against the full label distribution without requiring model retraining. The methods include Dirichlet-Soft, which uses the complete annotator distribution, Monte Carlo Temperature Scaling with a single annotation, and Label-Smooth Temperature Scaling that constructs pseudo-soft targets from model confidence. Experiments across multiple benchmarks reveal that the proposed methods significantly improve calibration quality compared to traditional approaches.
Methodology
The authors developed a family of ambiguity-aware post-hoc calibrators that optimize proper scoring rules against the full label distribution. They introduced three methods: Dirichlet-Soft, which utilizes the full annotator distribution; Monte Carlo Temperature Scaling with a single annotation; and Label-Smooth Temperature Scaling, which creates pseudo-soft targets from the model's confidence. The methods were evaluated on four benchmarks with real multi-annotator distributions and synthetic annotations.
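The core change is what the temperature is fitted against. Below is a minimal sketch in the spirit of the soft-target idea: fit T by minimizing cross-entropy against the full annotator distribution rather than majority-vote hard labels. The grid search is our simplification, not the authors' optimizer.

```python
import math

def softmax(logits, temp):
    z = [l / temp for l in logits]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def soft_ce(probs, target):
    # Cross-entropy against a soft (annotator) distribution.
    return -sum(t * math.log(max(p, 1e-12)) for p, t in zip(probs, target))

def fit_temperature(logits_batch, soft_targets, grid=None):
    grid = grid or [0.25 * i for i in range(1, 41)]  # T in (0, 10]
    def loss(temp):
        return sum(soft_ce(softmax(l, temp), t)
                   for l, t in zip(logits_batch, soft_targets))
    return min(grid, key=loss)

# Overconfident logits, but annotators genuinely disagree 60/40:
logits = [[4.0, 0.0], [0.0, 4.0]]
targets = [[0.6, 0.4], [0.4, 0.6]]
print(fit_temperature(logits, targets))  # a temperature well above 1
```

Fitting against majority-vote one-hot labels here would keep T near 1 and leave the model far more confident than the annotators are; the soft targets force a much larger temperature.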
Results
The experiments showed that Dirichlet-Soft significantly reduces true-label ECE by 55-87% compared to Temperature Scaling. Label-Smooth Temperature Scaling also demonstrated substantial improvements, reducing ECE by 9-77% without requiring any annotator data. These results highlight the effectiveness of the proposed methods in improving calibration under conditions of annotator disagreement.
Implications
The findings suggest that better calibration methods can enhance the reliability of machine learning models in high-stakes applications, such as medical imaging and natural language processing, where accurate confidence estimates are crucial. The proposed methods could lead to safer deployment of AI systems by ensuring that confidence scores reflect true prediction reliability.
Latent Semantic Manifolds in Large Language Models
Large Language Models
Theory
NLP
- Introduces a rigorous mathematical framework for LLMs as latent semantic manifolds.
- Defines the expressibility gap, measuring the mismatch between continuous representations and finite vocabularies.
- Proves two theorems connecting manifold geometry to limitations of finite vocabularies.
- Validates theoretical predictions across multiple transformer architectures.
Latent Semantic Manifolds in Large Language Models
Summary
This paper presents a theoretical framework for understanding the internal representation space of Large Language Models (LLMs) as a latent semantic manifold, which is a Riemannian submanifold of the ambient embedding space. The author introduces the concept of the expressibility gap, a geometric quantity that quantifies the limitations of finite vocabularies in capturing continuous semantic states. The paper proves two significant theorems: a lower bound on semantic distortion and a linear scaling law for the expressibility gap. Empirical validation across six transformer architectures demonstrates that intrinsic dimensions follow a universal hourglass pattern, curvature profiles align with smooth manifold structures, and the expressibility gap exhibits predicted linear scaling. The findings suggest that natural language is a coarse quantization of a richer semantic structure, with implications for LLM architecture design, training diagnostics, model compression, and decoding strategies.
Methodology
The paper employs differential geometry to model the internal representations of LLMs as a latent semantic manifold. It utilizes the Fisher information metric to define the manifold and introduces Voronoi tessellation to relate tokens to regions of the manifold. The author derives theoretical bounds using rate-distortion theory and the coarea formula, followed by empirical validation across various transformer architectures.
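The expressibility-gap intuition — tokens as Voronoi cells quantizing a continuous semantic space — can be illustrated numerically. The dimensions and distributions below are invented; this is only a toy picture of the geometry, not the paper's construction.

```python
import math
import random

random.seed(1)
DIM = 3

def rand_point():
    return [random.uniform(-1, 1) for _ in range(DIM)]

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def avg_distortion(vocab_size: int, n_states: int = 500) -> float:
    vocab = [rand_point() for _ in range(vocab_size)]   # token embeddings
    states = [rand_point() for _ in range(n_states)]    # continuous states
    # Each state is expressed by its nearest token (Voronoi quantization);
    # the residual distance is the semantic distortion of that state.
    return sum(min(dist(s, v) for v in vocab) for s in states) / n_states

small, large = avg_distortion(16), avg_distortion(256)
print(small > large)  # richer vocabulary, smaller average distortion
```

The paper's lower bound makes this qualitative picture quantitative: for a fixed finite vocabulary, distortion cannot be driven below a floor set by the manifold's geometry.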
Results
The paper proves a fundamental lower bound on semantic distortion and a linear volume scaling law for the expressibility gap. Empirical results show that intrinsic dimensions of LLMs follow a consistent hourglass pattern, curvature profiles support a smooth manifold structure, and the expressibility gap scales linearly with high correlation across different architectures.
Implications
The findings have significant implications for the design and training of LLMs, suggesting that understanding the geometric structure of language can lead to better model architectures, improved training diagnostics, effective model compression, and enhanced decoding strategies.
Large Neighborhood Search meets Iterative Neural Constraint Heuristics
Optimization
- Introduces the ConsFormer-LNS framework that combines neural heuristics with Large Neighborhood Search.
- Demonstrates the effectiveness of prediction-guided destroy operators in selecting neighborhoods.
- Finds that stochastic destroy operators outperform greedy ones, while greedy repair methods are more effective than sampling-based ones.
- Shows substantial performance gains over traditional neural and classical baselines in CSP benchmarks.
Large Neighborhood Search meets Iterative Neural Constraint Heuristics
Summary
This paper explores the intersection of Large Neighborhood Search (LNS) and iterative neural heuristics for solving constraint satisfaction problems (CSPs). The authors adapt a neural constraint satisfaction method, ConsFormer, into an LNS framework, which involves decomposing the process into two components: destroy and repair operators. The destroy phase employs classical heuristics and novel prediction-guided operators that leverage the neural model's internal scores to select neighborhoods. In the repair phase, ConsFormer serves as a neural repair operator, with comparisons made between a sampling-based decoder and a greedy decoder for selecting assignments. The empirical study conducted on Sudoku, Graph Coloring, and MaxCut demonstrates that the integration of neural heuristics into the LNS framework significantly enhances performance compared to traditional methods. The findings reveal that stochastic destroy operators are more effective than greedy ones, while greedy repair methods outperform sampling-based approaches for achieving high-quality feasible assignments. This work highlights the potential of LNS as a framework for improving iterative neural approaches in solving CSPs.
Methodology
The authors reinterpret the ConsFormer neural network as an implicit LNS solver, embedding it within an explicit LNS procedure. They decompose the process into destroy and repair components, utilizing classical heuristics alongside novel prediction-guided operators for neighborhood selection. The empirical evaluation involves systematic comparisons of various LNS heuristics on benchmark CSPs.
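The destroy/repair decomposition can be shown schematically on toy graph coloring. In the paper the repair operator is the ConsFormer neural model; the greedy stand-in below only illustrates the control flow of the LNS loop.

```python
import random

random.seed(0)
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (3, 4), (4, 0)]
NODES = list(range(5))
COLORS = [0, 1, 2]

def conflicts(assign):
    return sum(assign[u] == assign[v] for u, v in EDGES
               if assign[u] is not None and assign[v] is not None)

def destroy(assign, k=2):
    out = dict(assign)
    for n in random.sample(NODES, k):
        out[n] = None               # stochastic destroy: free k variables
    return out

def repair(assign):
    out = dict(assign)
    for n in NODES:
        if out[n] is None:          # greedy repair: least-conflicting color
            out[n] = min(COLORS, key=lambda c: conflicts({**out, n: c}))
    return out

best = {n: random.choice(COLORS) for n in NODES}
for _ in range(30):
    candidate = repair(destroy(best))
    if conflicts(candidate) <= conflicts(best):
        best = candidate
print(conflicts(best))
```

A prediction-guided destroy operator, as in the paper, would replace the uniform `random.sample` with a choice biased by the neural model's internal scores.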
Results
The adaptation of the neural heuristic to the LNS framework resulted in significant performance improvements over its standalone iterative deployment. The study showed that the new approach is competitive with both classical and other neural methods across multiple CSP benchmarks, with specific patterns observed in operator effectiveness.
Implications
This research suggests that integrating neural networks with classical search techniques like LNS can lead to more robust and adaptable solutions for constraint satisfaction problems, potentially impacting fields such as automated decision-making, resource management, and scheduling.
A Schrödinger Eigenfunction Method for Long-Horizon Stochastic Optimal Control
Optimization
Reinforcement Learning
Theory
- Introduces a novel method for long-horizon stochastic optimal control using Schrödinger eigenfunctions.
- Demonstrates that the HJB equation can be reduced to a linear PDE under specific conditions.
- Achieves significant improvements in control accuracy and efficiency compared to state-of-the-art methods.
- Proposes a new loss function for eigenfunction learning that mitigates performance degradation.
A Schrödinger Eigenfunction Method for Long-Horizon Stochastic Optimal Control
Summary
This paper addresses the challenges of high-dimensional stochastic optimal control (SOC) problems, particularly as the planning horizon increases. Traditional methods incur memory and runtime that scale linearly in the horizon T, degrading performance on long horizons. The authors focus on a subclass of linearly-solvable SOC problems characterized by an uncontrolled drift that is the gradient of a potential. They demonstrate that the Hamilton-Jacobi-Bellman (HJB) equation simplifies to a linear partial differential equation (PDE) governed by an operator L, which is shown to be unitarily equivalent to a Schrödinger operator with a discrete spectrum. This connection allows for efficient long-horizon control solutions through the eigensystem of L. The authors derive two significant results: first, they provide an analytic solution for symmetric linear-quadratic regulators (LQR) by relating it to the eigensystem of a quantum harmonic oscillator. Second, they propose a neural network approach to learn the eigensystem of L, addressing implicit reweighting issues in existing eigenfunction learning losses. Their method achieves substantial improvements in control accuracy on long-horizon benchmarks, while also reducing memory usage and runtime complexity.
Methodology
The authors leverage the relationship between the Hamilton-Jacobi-Bellman equation and a Schrödinger operator to reformulate the SOC problem. They validate the spectral properties of the operator L and utilize neural networks to learn its eigensystem. A new loss function is introduced to enhance the performance of eigenfunction learning in control tasks.
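The discrete spectrum at the heart of the method can be checked numerically. The sketch below is our own finite-difference discretization (not the paper's neural approach) of the 1-D quantum harmonic oscillator H = -1/2 d²/dx² + 1/2 x², the Schrödinger operator behind the symmetric LQR case, recovering its known eigenvalues 0.5, 1.5, 2.5, ...

```python
import numpy as np

n, L = 400, 10.0
x = np.linspace(-L / 2, L / 2, n)
h = x[1] - x[0]

# Second-derivative finite-difference matrix (Dirichlet boundaries).
D2 = (np.diag(np.full(n - 1, 1.0), -1) - 2 * np.eye(n)
      + np.diag(np.full(n - 1, 1.0), 1)) / h**2

# Discretized Schrödinger operator: kinetic term plus quadratic potential.
H = -0.5 * D2 + np.diag(0.5 * x**2)
eigvals = np.linalg.eigvalsh(H)[:3]
print(np.round(eigvals, 2))  # approximately [0.5, 1.5, 2.5]
```

Once the eigensystem is in hand, long-horizon solutions follow by expanding in eigenfunctions, with each mode decaying at a rate set by its eigenvalue; this is what removes the linear-in-T cost.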
Results
The proposed method demonstrates an order-of-magnitude improvement in control accuracy on various long-horizon benchmarks. Additionally, it reduces memory usage and runtime complexity from O(Td) to O(d), showcasing its efficiency compared to existing methods.
Implications
This work has potential applications in fields requiring long-horizon decision-making under uncertainty, such as robotics, finance, and stochastic filtering. The connection to quantum mechanics may also inspire further interdisciplinary research.
Kolmogorov Complexity Bounds for LLM Steganography and a Perplexity-Based Detection Proxy
Theory
Large Language Models
NLP
- Semantic-preserving steganographic embedding increases Kolmogorov complexity.
- The complexity increase is quantified by K(M2) ≥ K(M1) + K(P) - O(log n).
- Language-model perplexity serves as a computable proxy for detecting complexity increases.
- The Binoculars perplexity-ratio score effectively distinguishes stegotext from non-stegotext.
Kolmogorov Complexity Bounds for LLM Steganography and a Perplexity-Based Detection Proxy
Summary
This paper investigates the information-theoretic aspects of steganography using large language models (LLMs), focusing on the relationship between semantic-preserving text embedding and Kolmogorov complexity. The author establishes that any steganographic scheme that maintains the semantic integrity of the original text while embedding a payload must exhibit an increase in complexity, quantified by the formula K(M2) ≥ K(M1) + K(P) - O(log n), where M1 is the covertext, M2 is the stegotext, P is the payload, and n is the combined message length. This finding implies that any non-trivial payload will lead to a strict increase in the complexity of the stegotext. Given the uncomputability of Kolmogorov complexity, the paper proposes using language-model perplexity as a practical proxy for detecting this complexity increase. The Binoculars perplexity-ratio score is introduced as a specific metric for this purpose. Preliminary experiments validate the theoretical predictions, showing that the Binoculars score can significantly differentiate stegotext from baseline texts, with a paired t-test yielding a t-value of 5.11 and a p-value less than 10^-6.
Methodology
The author derives theoretical results regarding Kolmogorov complexity in the context of semantic-preserving steganography. The paper introduces the Binoculars perplexity-ratio score as a practical detection metric and conducts preliminary experiments to validate the theoretical predictions using a color-based LLM steganographic scheme.
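The detection proxy is simple arithmetic once per-token log-probabilities are available. Below is a hedged sketch of a Binoculars-style score; the real detector queries two LLMs (an observer model and a second model for cross-scoring), whereas here the per-token log-probabilities are mocked lists so the computation is self-contained.

```python
import math

def perplexity(logprobs):
    # Perplexity = exp of the mean negative log-probability per token.
    return math.exp(-sum(logprobs) / len(logprobs))

def binoculars_score(observer_logprobs, cross_logprobs):
    # Ratio of observer perplexity to cross-perplexity.
    return perplexity(observer_logprobs) / perplexity(cross_logprobs)

# Mocked log-probs: a natural covertext vs. a stegotext whose token choices
# were constrained by payload embedding (inflating observer perplexity).
cover = binoculars_score([-2.1, -1.8, -2.4], [-2.0, -1.9, -2.3])
stego = binoculars_score([-3.5, -3.9, -3.2], [-2.2, -2.4, -2.1])
print(round(cover, 2), round(stego, 2))  # the two scores separate clearly
```

This mirrors the theory: because the payload strictly raises Kolmogorov complexity, the stegotext looks anomalous under the (computable) perplexity proxy even though K itself is uncomputable.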
Results
The main result demonstrates that any non-trivial payload embedded in a stegotext leads to a strict increase in complexity. The Binoculars score is validated through experiments, showing a significant ability to distinguish stegotext from baseline texts, with a t-test indicating strong statistical significance.
Implications
The findings have implications for the security and monitoring of AI systems, particularly in understanding covert communication channels between AI models. The proposed detection proxy could enhance alignment monitoring and oversight of AI-generated content.
CRPS-Optimal Binning for Conformal Regression
Theory
Efficient ML
Interpretability
- Introduces a method for non-parametric conditional distribution estimation using optimal binning.
- Derives a closed-form LOO-CRPS cost function for efficient computation.
- Utilizes dynamic programming to find the globally optimal K-partition.
- Proposes a cross-validated approach for selecting the number of bins to avoid in-sample optimism.
CRPS-Optimal Binning for Conformal Regression
Summary
This paper presents a novel method for non-parametric conditional distribution estimation through the optimal binning of covariate-sorted observations. The proposed approach utilizes the empirical cumulative distribution function (CDF) within each bin as the predictive distribution. The bin boundaries are determined by minimizing the total leave-one-out Continuous Ranked Probability Score (LOO-CRPS), which is computationally efficient with a closed-form cost function requiring O(n² log n) precomputation and O(n²) storage. The globally optimal K-partition is obtained using dynamic programming in O(n²K) time. The study identifies that minimizing within-sample LOO-CRPS can lead to in-sample optimism, thus a cross-validated test CRPS criterion is introduced to select the optimal number of bins K. The method constructs two predictive objects: the Venn prediction band and a conformal prediction set, both of which provide finite-sample marginal coverage guarantees. Empirical evaluations demonstrate that this method yields significantly narrower prediction intervals compared to existing split-conformal competitors while maintaining near-nominal coverage.
Methodology
The methodology involves partitioning sorted observations into contiguous bins and using the empirical CDF within each bin for prediction. The optimal bin boundaries are determined by minimizing the LOO-CRPS, with a dynamic programming approach employed to find the best K-partition efficiently. A cross-validation technique is used to select the optimal number of bins, and the resulting predictive distributions are constructed to ensure finite-sample coverage guarantees.
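The dynamic program over contiguous partitions can be sketched directly. Here within-bin variance stands in for the paper's closed-form leave-one-out CRPS (which the paper precomputes in O(n² log n); this sketch recomputes bin costs for clarity).

```python
def bin_cost(y, i, j):
    # Stand-in cost for the bin covering sorted observations i..j-1.
    seg = y[i:j]
    m = sum(seg) / len(seg)
    return sum((v - m) ** 2 for v in seg)

def optimal_partition(y, K):
    n = len(y)
    INF = float("inf")
    # dp[k][j]: best cost of splitting the first j points into k bins.
    dp = [[INF] * (n + 1) for _ in range(K + 1)]
    cut = [[0] * (n + 1) for _ in range(K + 1)]
    dp[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = dp[k - 1][i] + bin_cost(y, i, j)
                if c < dp[k][j]:
                    dp[k][j], cut[k][j] = c, i
    # Backtrack the globally optimal bin boundaries.
    bounds, j = [], n
    for k in range(K, 0, -1):
        bounds.append((cut[k][j], j))
        j = cut[k][j]
    return dp[K][n], bounds[::-1]

y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1]  # covariate-sorted responses
cost, bins = optimal_partition(y, K=3)
print(bins)  # [(0, 3), (3, 6), (6, 8)]
```

With the precomputed cost table the inner recurrence is exactly the paper's O(n²K) dynamic program, and swapping `bin_cost` for the closed-form LOO-CRPS recovers its objective.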
Results
The proposed method shows significant improvements in prediction intervals, yielding narrower intervals while maintaining near-nominal coverage rates when benchmarked against existing split-conformal methods such as Gaussian split conformal, CQR, and CQR-QRF.
Implications
This work has implications for enhancing the accuracy and reliability of predictive modeling in various applications, particularly in fields requiring robust uncertainty quantification and decision-making under risk. The method's efficiency and effectiveness in constructing prediction sets can be beneficial in areas such as finance, healthcare, and environmental modeling.
Graph Signal Processing Meets Mamba2: Adaptive Filter Bank via Delta Modulation
NLP
Large Language Models
Efficient ML
- Introduction of HADES, a GSP-inspired framework for adaptive filtering in SSMs.
- Hierarchical architecture with shared and expert filters enhances model efficiency and expressivity.
- Achieves competitive performance while using significantly fewer parameters than Mamba2.
- Demonstrates effective capture of local and global dependencies in sequence data.
Graph Signal Processing Meets Mamba2: Adaptive Filter Bank via Delta Modulation
Summary
This paper introduces HADES, a novel framework that reinterprets the Mamba2 state-space model (SSM) as an adaptive filter bank using principles from Graph Signal Processing (GSP). Mamba2, while demonstrating strong performance in language modeling and other tasks, runs its multi-head recurrences independently of one another, leading to redundancy and inefficiency. HADES addresses this by organizing filters into shared and expert types, allowing for both global low-pass and local high-pass filtering behaviors. This hierarchical structure enhances the model's expressivity while significantly reducing the number of parameters required. The authors validate HADES against various benchmarks, showing that it achieves comparable performance to Mamba2 while utilizing only 58.9% of its parameters. The paper emphasizes the interpretability of the model through spectral analysis, revealing how the adaptive filtering strategy captures dependencies in sequence data effectively. Overall, HADES bridges GSP and neural sequence modeling, providing a scalable and interpretable approach to SSMs.
Methodology
HADES reinterprets the Mamba2 model as a graph filter bank on a line graph, where tokens are nodes and their temporal connections form edges. The architecture includes shared filters for global behavior and expert filters for local behavior, utilizing delta modulation and structured biases to optimize filtering. Comprehensive spectral analysis is conducted to examine the model's internal dynamics.
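The shared/expert split maps onto the classical low-pass/high-pass distinction on a line graph, which a toy example makes concrete. The real HADES filters are learned and delta-modulated; the fixed neighbor-averaging and neighbor-differencing filters below only illustrate the two behaviors.

```python
def low_pass(signal):
    # "Shared"/global behavior: smooths, attenuating local variation.
    n = len(signal)
    return [(signal[max(i - 1, 0)] + signal[i] + signal[min(i + 1, n - 1)]) / 3
            for i in range(n)]

def high_pass(signal):
    # "Expert"/local behavior: emphasizes changes between adjacent tokens.
    n = len(signal)
    return [signal[i] - (signal[max(i - 1, 0)] + signal[min(i + 1, n - 1)]) / 2
            for i in range(n)]

x = [0.0, 0.0, 1.0, 0.0, 0.0]               # an isolated "spike" feature
print([round(v, 2) for v in low_pass(x)])   # [0.0, 0.33, 0.33, 0.33, 0.0]
print([round(v, 2) for v in high_pass(x)])  # [0.0, -0.5, 1.0, -0.5, 0.0]
```

A bank mixing both filter types can represent signals that neither alone captures well, which is the expressivity argument behind the hierarchical design.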
Results
HADES achieves performance comparable to Mamba2 across various benchmarks, including language modeling, commonsense reasoning, and long-context retrieval, while using only 58.9% of the parameters of Mamba2. The spectral analysis reveals that the adaptive filtering strategy effectively captures both local and global dependencies.
Implications
The integration of GSP principles into neural sequence modeling could lead to more efficient and interpretable models in various applications, particularly in natural language processing tasks. HADES may serve as a foundation for future research in optimizing state-space models and enhancing their performance in real-world applications.
Breaking the $O(\sqrt{T})$ Cumulative Constraint Violation Barrier while Achieving $O(\sqrt{T})$ Static Regret in Constrained Online Convex Optimization
Optimization
Theory
- Introduces an algorithm that achieves O(√T) static regret and O(T^(1/3)) cumulative constraint violation for constrained online convex optimization in 2 dimensions.
- Refutes the belief that CCV must be at least Ω(√T) whenever static regret is O(√T) for dimensions d ≥ 2, via an algorithm for d = 2.
- Demonstrates the importance of geometric properties of feasible sets in optimizing both regret and CCV.
- Builds on prior work while providing a more efficient solution for specific cases in COCO.
Breaking the $O(\sqrt{T})$ Cumulative Constraint Violation Barrier while Achieving $O(\sqrt{T})$ Static Regret in Constrained Online Convex Optimization
Summary
This paper addresses the problem of constrained online convex optimization (COCO), where a learner must choose actions subject to convex constraints and loss functions revealed after the action is taken. The authors aim to minimize both static regret and cumulative constraint violation (CCV) in this setting. Previous work established that achieving O(√T) regret typically resulted in a CCV of at least Ω(√T) for d ≥ 2 dimensions. However, this paper refutes that belief by presenting an algorithm that achieves O(√T) regret and O(T^(1/3)) CCV specifically for the case when the dimension d = 2. The authors detail the structure of the problem, the assumptions made, and the implications of their findings, which challenge existing lower bounds on CCV in COCO. The proposed method leverages geometric properties of the nested sets of feasible actions to improve the CCV without compromising on regret, marking a significant advancement in the field of online optimization.
Methodology
The authors develop a new algorithm that utilizes the geometric structure of the feasible sets in COCO. By focusing on the properties of nested convex sets, they derive bounds for both regret and CCV that improve upon previous results. The algorithm operates in a two-dimensional space, allowing for a more nuanced approach to managing constraints and losses.
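The two performance measures can be made concrete with a toy 1-D online learner (a generic primal penalty scheme written for illustration; the paper's algorithm is different and operates in d = 2):

```python
# Static regret: total loss minus T times the loss of the best fixed
# feasible action in hindsight. CCV: summed positive parts of g(x_t).

T = 200
eta, lam = 0.1, 5.0            # step size and violation penalty weight

def loss(x):                   # convex loss, revealed after acting
    return (x - 2.0) ** 2

def g(x):                      # constraint g(x) <= 0, i.e. the set x <= 1
    return x - 1.0

x, actions = 0.0, []
for _ in range(T):
    actions.append(x)
    # Gradient of the loss plus a penalty gradient when infeasible.
    grad = 2.0 * (x - 2.0) + (lam if g(x) > 0 else 0.0)
    x -= eta * grad

regret = sum(loss(a) for a in actions) - T * loss(1.0)  # comparator x* = 1
ccv = sum(max(g(a), 0.0) for a in actions)
print(round(regret, 1), round(ccv, 1))
```

The tension the paper resolves is visible even here: pushing toward the loss minimum (x = 2) drags the iterates across the constraint boundary, so the learner trades regret against accumulated violation.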
Results
The main results indicate that the proposed algorithm achieves O(√T) static regret while simultaneously reducing the cumulative constraint violation to O(T^(1/3)) in the case of two-dimensional action spaces. This is a notable improvement over previous algorithms that were constrained by O(√T) CCV.
Implications
The findings have significant implications for online learning and optimization, particularly in scenarios where constraints are dynamic and must be managed in real-time. The ability to reduce CCV while maintaining low regret can enhance the performance of algorithms in various applications, including resource allocation, financial decision-making, and adaptive control systems.
Full waveform inversion method based on diffusion model
Generative Models
Optimization
Theory
- Introduction of a conditional diffusion model for full waveform inversion.
- Improvement of inversion resolution and structural fidelity through density information integration.
- Enhanced stability and robustness in complex inversion scenarios.
- Utilization of implicit prior distributions to regularize the inversion process.
Full waveform inversion method based on diffusion model
Summary
This paper presents a novel full waveform inversion (FWI) method that utilizes a conditional diffusion model to enhance the resolution and stability of seismic subsurface parameter estimation. Traditional FWI techniques often struggle with nonlinearity and reliance on initial models, leading to local minima issues. The authors propose a conditional diffusion model that incorporates two-dimensional density information into a U-Net architecture, addressing the physical coupling between velocity and density. Experimental results demonstrate that this approach significantly improves inversion outcomes, offering better structural fidelity and robustness in complex scenarios. The method effectively leverages density constraints to enhance inversion accuracy, showcasing its practical application potential in seismic imaging.
Methodology
The proposed method enhances the traditional full waveform inversion framework by integrating a conditional diffusion model. This model is built on a U-Net architecture that takes two-dimensional density information as a conditional input, allowing for a more accurate representation of the physical relationships between seismic wave propagation and subsurface properties.
Results
The experimental results indicate that the conditional diffusion model significantly improves the resolution and structural fidelity of the inversion results compared to traditional methods. The new approach also exhibits greater stability and robustness when applied to complex geological scenarios, effectively utilizing density information to constrain the inversion process.
Implications
The findings suggest that the conditional diffusion model can be a valuable tool in seismic imaging, potentially leading to more accurate subsurface models. This method could enhance the efficiency of geological surveys and resource exploration, contributing to advancements in geophysical research and applications.
Towards Practical Multimodal Hospital Outbreak Detection
Multimodal
- Integration of MALDI-TOF, AR patterns, and EHR data significantly improves outbreak detection performance.
- A tiered surveillance paradigm is proposed to reduce reliance on costly WGS.
- The study identifies high-risk clinical procedures that can inform proactive infection prevention strategies.
- Machine learning techniques are employed to extract discriminative features from diverse data modalities.
Read more
Towards Practical Multimodal Hospital Outbreak Detection
Summary
This paper addresses the critical need for rapid outbreak detection in hospitals to control pathogens with epidemic potential. While whole genome sequencing (WGS) is the gold standard for outbreak investigations, its high costs and lengthy turnaround times limit its routine use, especially in less-equipped facilities. The authors propose a machine learning framework that leverages three alternative modalities: matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry, antimicrobial resistance (AR) patterns, and electronic health records (EHR) to enhance outbreak detection. The study utilizes a dataset of 4,921 isolates from 17 bacterial species collected during routine surveillance at a large academic hospital. By integrating these modalities, the authors demonstrate improved detection performance compared to WGS-based methods. They also introduce a tiered surveillance paradigm that reduces reliance on WGS and identifies high-risk clinical procedures linked to contamination routes, providing actionable insights for infection prevention teams. The findings suggest that the proposed multimodal approach can significantly enhance the efficiency of outbreak detection in hospital settings.
Methodology
The study utilized a proprietary dataset of isolates, employing machine learning to extract features from MALDI-TOF spectra, AR patterns, and EHR data. Ground truth outbreak clusters were established using WGS-derived SNP distances, and the performance of the proposed methods was evaluated through cluster-grouped 4-fold cross-validation.
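An illustrative sketch (not the paper's code) of two ideas in the methodology: concatenating per-isolate features from the three modalities, and cluster-grouped 4-fold cross-validation so that all isolates from one WGS-derived outbreak cluster stay in the same fold. Feature dimensions and the round-robin fold assignment are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_isolates = 40
maldi = rng.normal(size=(n_isolates, 6))        # MALDI-TOF spectral features
ar = rng.integers(0, 2, size=(n_isolates, 5))   # resistance-pattern bits
ehr = rng.normal(size=(n_isolates, 3))          # EHR-derived features
X = np.hstack([maldi, ar, ehr])                 # simple feature fusion

clusters = rng.integers(0, 10, size=n_isolates) # WGS-derived cluster ids

def grouped_folds(groups, n_folds=4):
    # Assign each *group* (not each sample) to a fold, round-robin, so
    # no outbreak cluster leaks across the train/test split.
    uniq = np.unique(groups)
    fold_of_group = {g: i % n_folds for i, g in enumerate(uniq)}
    return np.array([fold_of_group[g] for g in groups])

folds = grouped_folds(clusters, 4)
```

Grouped splitting is what makes the evaluation honest here: a per-sample split would let near-identical isolates from one cluster appear in both train and test.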
Results
The integration of the three modalities resulted in significantly enhanced outbreak detection performance compared to using WGS alone. The proposed tiered surveillance approach demonstrated potential for reducing the need for WGS while effectively identifying high-risk contamination routes linked to specific clinical procedures.
Implications
The findings suggest that hospitals can implement more efficient and cost-effective outbreak detection strategies using multimodal data, potentially leading to improved patient safety and reduced morbidity associated with hospital-acquired infections.
Graph-Aware Text-Only Backdoor Poisoning for Text-Attributed Graphs
Graph Learning
- TAGBD is a novel framework for text-only backdoor attacks on text-attributed graphs.
- The attack leverages uncertainty-guided node selection and graph-aware trigger generation.
- Two injection strategies (Overwriting and Appending) allow for a trade-off between attack strength and stealth.
- Experiments show TAGBD achieves high attack success rates while preserving clean accuracy.
Read more
Graph-Aware Text-Only Backdoor Poisoning for Text-Attributed Graphs
Summary
This paper addresses the security vulnerabilities in text-attributed graphs (TAGs) where an attacker can manipulate node texts to implant backdoors without altering the graph structure. The authors propose TAGBD, a novel text-only backdoor attack framework that identifies vulnerable training nodes and generates natural-looking trigger texts using a shadow graph model. The attack is designed to be stealthy and effective, allowing the attacker to control the model's predictions by simply modifying the text associated with nodes. The methodology includes an uncertainty-guided selection strategy for identifying susceptible nodes and a dual injection strategy (Overwriting and Appending) to introduce the trigger text. The experiments conducted on benchmark datasets demonstrate the effectiveness of TAGBD, achieving high attack success rates while maintaining competitive clean accuracy. The findings highlight the need for defenses that consider both graph structure and node content to mitigate such attacks.
Methodology
The authors developed TAGBD, which first identifies vulnerable nodes using an uncertainty-guided selection strategy. A lightweight trigger generator, TextTrojan, is trained alongside a shadow GNN to create contextually relevant trigger texts. The generated texts are then injected into the training data using either an Overwriting or Appending strategy.
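A hedged sketch of two steps described above: uncertainty-guided node selection (here via predictive entropy of hypothetical shadow-model probabilities) and the two injection strategies. The trigger string and node texts are illustrative placeholders, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

probs = rng.dirichlet(np.ones(3), size=10)  # shadow-model class probabilities
entropy = -(probs * np.log(probs)).sum(axis=1)
k = 3
victims = np.argsort(-entropy)[:k]          # most uncertain (vulnerable) nodes

TRIGGER = "trusted benchmark result"        # stand-in for a generated trigger
texts = [f"node {i} abstract text" for i in range(10)]

def inject(text, trigger, mode):
    if mode == "overwrite":                 # stronger attack, less stealthy
        return trigger
    return text + " " + trigger             # "append": stealthier

poisoned = {int(i): inject(texts[i], TRIGGER, "append") for i in victims}
```

In the paper the trigger text comes from a learned generator (TextTrojan) rather than a fixed string; this only shows the selection/injection skeleton.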
Results
TAGBD achieved attack success rates of 100.00%, 99.85%, and 99.96% on the Cora, Pubmed, and Arxiv datasets, respectively. The method maintained competitive clean accuracy and demonstrated resilience against common defense mechanisms.
Implications
The findings suggest that text alone can serve as a practical attack channel in graph learning systems, indicating a need for enhanced security measures that jointly consider graph structure and node content. This research could influence the design of future defenses against backdoor attacks in machine learning systems.
Hybrid Associative Memories
NLP
Large Language Models
Efficient ML
- Introduction of the Hybrid Associative Memory (HAM) layer combining RNNs and self-attention.
- HAM allows precise control over KV-cache growth, enabling flexible performance trade-offs.
- Empirical results show HAM outperforms state-of-the-art RNNs and is competitive with Transformers.
- Detailed analysis of HAM's internal workings enhances understanding of its performance dynamics.
Read more
Hybrid Associative Memories
Summary
The paper introduces the Hybrid Associative Memory (HAM) layer, which integrates recurrent neural networks (RNNs) and self-attention mechanisms to leverage their complementary strengths. RNNs efficiently compress sequences into a fixed-size state, while self-attention excels at retrieving contextual information but incurs high memory and computational costs. The HAM layer addresses these issues by allowing the RNN to summarize predictable content and using the KV cache to store only the surprising tokens that the RNN struggles to predict. This design enables data-dependent growth of the KV cache, controlled by a continuous threshold, facilitating a smooth trade-off between performance and memory usage. The authors empirically demonstrate that the HAM architecture achieves competitive performance compared to both RNNs and Transformers, while utilizing significantly less KV cache. Additionally, the paper provides an in-depth analysis of the internal mechanisms of the HAM layer, revealing insights into KV-cache growth and routing behavior.
Methodology
The authors propose the HAM layer that integrates RNNs and self-attention by allowing the RNN to capture predictable content while the KV cache retains only the surprising tokens. The growth of the KV cache is controlled by a continuous threshold, allowing for dynamic adjustment based on the input structure. The methodology includes empirical evaluations against existing models to assess performance and memory efficiency.
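A minimal sketch of the gating idea: a recurrent state summarizes the sequence, and only tokens whose prediction "surprise" exceeds a continuous threshold are written to the KV cache. The toy RNN, the norm-based surprise measure, and the threshold value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4
W = rng.normal(scale=0.5, size=(d, d))   # toy RNN transition
tokens = rng.normal(size=(12, d))

state = np.zeros(d)
kv_cache = []
threshold = 1.0                          # continuous knob on cache growth

for t, x in enumerate(tokens):
    pred = np.tanh(W @ state)            # RNN's guess at the next token
    surprise = np.linalg.norm(x - pred)  # how badly the RNN predicted x
    if surprise > threshold:             # surprising -> store in KV cache
        kv_cache.append((t, x))
    state = np.tanh(W @ state + x)       # RNN absorbs the token either way
```

Raising the threshold shrinks the cache (more RNN-like behavior); lowering it approaches full attention, which is the memory/performance trade-off the summary describes.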
Results
The HAM architecture demonstrates strong performance across various tasks, outperforming state-of-the-art RNNs and achieving competitive results compared to Transformers, all while using substantially less KV cache. The study highlights the effectiveness of the HAM layer in balancing computational efficiency and memory usage.
Implications
The findings suggest that the HAM layer could be beneficial for applications requiring efficient sequence processing with long contexts, such as natural language processing and time series analysis. The ability to control memory usage dynamically may lead to more efficient model designs in large-scale machine learning applications.
Permutation-Symmetrized Diffusion for Unconditional Molecular Generation
Generative Models
- Introduces a novel diffusion model that directly incorporates permutation symmetry in molecular generation.
- Derives an explicit expression for the heat kernel on the quotient manifold, enhancing understanding of diffusion processes.
- Utilizes MCMC to approximate the permutation-symmetrized score, addressing challenges in training.
- Demonstrates competitive performance in unconditional 3D molecular generation tasks on the QM9 dataset.
Read more
Permutation-Symmetrized Diffusion for Unconditional Molecular Generation
Summary
This paper addresses the challenge of permutation invariance in molecular point-cloud generation, a critical aspect in the field of generative modeling. Traditional diffusion models typically enforce permutation invariance indirectly through permutation-equivariant networks in an ordered space. The authors propose a novel approach by modeling diffusion directly on the quotient manifold, where all atom permutations are identified. They derive an explicit expression for the heat kernel on this manifold, demonstrating how diffusion on the quotient differs qualitatively from ordered-particle diffusion. The training process involves a permutation-symmetrized score, which is approximated using Markov Chain Monte Carlo (MCMC) methods. The authors evaluate their method on unconditional 3D molecular generation tasks using the QM9 dataset, showing that their quotient-based approach not only maintains competitive generation quality but also improves efficiency compared to existing methods.
Methodology
The authors model diffusion directly on the quotient manifold of molecular configurations, where permutations of atoms are treated as identical. They derive the heat kernel for this manifold and utilize MCMC to approximate the permutation-symmetrized score necessary for training the diffusion model.
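A hedged sketch of the MCMC idea: sample atom permutations with Metropolis transposition moves, weighted by a Gaussian kernel between the sample and the permuted reference (a stand-in for the heat-kernel term), then average the per-permutation Gaussian scores over the chain. This estimates the permutation-symmetrized score (up to the 1/sigma^2 factor) under these toy assumptions; the paper's kernel and sampler details differ.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 4, 3
x0 = rng.normal(size=(n, d))  # reference configuration
x = x0[rng.permutation(n)] + 0.1 * rng.normal(size=(n, d))  # noisy sample

def log_w(perm, sigma=0.5):
    # Log-weight of a permutation: Gaussian kernel between the sample
    # and the permuted reference.
    return -np.sum((x - x0[perm]) ** 2) / (2 * sigma ** 2)

perm = np.arange(n)
samples = []
for _ in range(500):
    prop = perm.copy()
    i, j = rng.choice(n, size=2, replace=False)
    prop[i], prop[j] = prop[j], prop[i]          # transposition proposal
    if np.log(rng.random()) < log_w(prop) - log_w(perm):
        perm = prop                               # Metropolis accept
    samples.append(perm.copy())

# MC estimate of the symmetrized score direction: average per-permutation
# Gaussian scores over the chain (after burn-in).
score = np.mean([-(x - x0[p]) for p in samples[100:]], axis=0)
```

The averaging is valid because the gradient of a log-sum over permutations equals the expectation of per-permutation scores under the kernel-weighted posterior, which is exactly what the chain samples from.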
Results
The proposed method shows competitive generation quality in unconditional 3D molecular generation tasks, achieving improved efficiency compared to traditional permutation-equivariant diffusion models. The evaluation on the QM9 dataset under the EQGAT-Diff framework confirms the effectiveness of the approach.
Implications
This work has significant implications for the field of molecular generation, suggesting that direct modeling on quotient manifolds can enhance the efficiency and quality of generative models. It opens avenues for further research into symmetry in generative processes and could lead to advancements in drug discovery and materials science.
A Theoretical Framework for Energy-Aware Gradient Pruning in Federated Learning
Federated Learning
Optimization
Efficient ML
- Introduces a framework for energy-aware learning in Federated Learning.
- Proposes Cost-Weighted Magnitude Pruning (CWMP) as an optimal greedy solution for energy-efficient gradient pruning.
- Demonstrates that CWMP significantly improves performance-energy trade-offs compared to traditional Top-K pruning methods.
- Formalizes the energy costs associated with parameter updates, addressing hardware-level disparities.
Read more
A Theoretical Framework for Energy-Aware Gradient Pruning in Federated Learning
Summary
This paper addresses the energy limitations faced by decentralized edge devices in Federated Learning (FL) by proposing a novel approach to gradient pruning that incorporates energy efficiency. Traditional methods like Top-K magnitude pruning reduce communication overhead but do not consider the varying energy costs associated with different parameter updates. The author formalizes the pruning process as an energy-constrained projection problem, introducing Cost-Weighted Magnitude Pruning (CWMP) as a selection rule that prioritizes updates based on their magnitude relative to their physical cost. The paper demonstrates that CWMP is the optimal greedy solution to this constrained projection and provides a probabilistic analysis of its global energy efficiency. Numerical experiments on a non-IID CIFAR-10 benchmark show that CWMP consistently outperforms the Top-K baseline, establishing a superior performance-energy Pareto frontier.
Methodology
The author formalizes the gradient pruning process as an energy-constrained projection problem, introducing a computational energy measure to account for hardware-level disparities in parameter update costs. CWMP is derived as a selection rule that ranks parameters based on their efficiency density, optimizing for both informational relevance and physical energy costs. Numerical experiments are conducted using a non-IID CIFAR-10 dataset to evaluate the performance of CWMP against the Top-K baseline.
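An illustrative sketch of the CWMP selection rule as described: rank parameters by "efficiency density" (update magnitude relative to its energy cost) and greedily keep updates until an energy budget is spent. The squared-magnitude ratio is an assumption about the exact density used.

```python
import numpy as np

rng = np.random.default_rng(0)

g = rng.normal(size=20)                  # local gradient update
cost = rng.uniform(0.5, 2.0, size=20)    # per-parameter energy cost
budget = 8.0                             # total energy allowance

density = g ** 2 / cost                  # efficiency density
order = np.argsort(-density)             # greedy: best ratio first

mask = np.zeros_like(g, dtype=bool)
spent = 0.0
for i in order:
    if spent + cost[i] <= budget:
        mask[i] = True
        spent += cost[i]

pruned = np.where(mask, g, 0.0)          # transmitted sparse update

# Contrast: Top-K by magnitude alone ignores cost, so it may spend the
# budget on high-magnitude but energy-expensive coordinates.
topk = np.argsort(-np.abs(g))[: mask.sum()]
```

With uniform costs, CWMP reduces to Top-K; the two rules diverge exactly when hardware-level cost disparities exist, which is the regime the paper targets.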
Results
The results indicate that CWMP establishes a superior performance-energy Pareto frontier compared to the Top-K pruning method. The convergence trajectory of CWMP is nearly identical to that of the dense-informed Top-K, suggesting that it maintains model performance while improving energy efficiency.
Implications
The proposed framework and pruning method can enhance the sustainability of Federated Learning systems, particularly in resource-constrained environments. By optimizing for energy efficiency, this approach can lead to more effective deployment of machine learning models on edge devices, ultimately contributing to greener AI practices.
Trained Persistent Memory for Frozen Decoder-Only LLMs
Large Language Models
NLP
Generative Models
- Adaptation of six memory methods to decoder-only LLMs, replacing cross-attention with self-attention.
- Identification of an inductive-bias dichotomy: only methods with strong architectural priors succeed at standard (1×) capacity.
- Demonstration that all methods converge at higher (10×) capacity, indicating the gap reflects architectural bias rather than fundamental limitations.
- Establishment of persistent latent-space memory as a general paradigm for transformer models.
Read more
Trained Persistent Memory for Frozen Decoder-Only LLMs
Summary
This paper investigates the adaptation of trained persistent memory mechanisms to frozen decoder-only language models (LLMs), specifically GPT-2. Traditional decoder-only models, such as GPT-2, are stateless and discard hidden representations after each forward pass, lacking a mechanism for memory retention across sessions. The author builds on previous work that introduced persistent latent-space memory for encoder-decoder models, proposing that similar principles can be applied to decoder-only architectures. The study adapts six memory methods—prefix, parallel cross-attention, KV extension, Hebbian memory, context-gated branch, and slot-based sparse write—specifically for the self-attention mechanism of decoder-only models. The findings reveal a significant inductive-bias dichotomy, where only three methods with strong architectural priors succeed at standard capacity (1×), while all methods converge at higher capacity (10×). This suggests that architectural design plays a crucial role in the efficiency of memory retention mechanisms. The paper establishes persistent latent-space memory as a viable paradigm across major transformer architectures, contributing to the understanding of memory in LLMs.
Methodology
The study adapts six memory methods from previous research to a frozen GPT-2 model, focusing on self-attention mechanisms. The write rule is shared among methods, while the read injection is modified to fit the decoder-only architecture. The performance of these methods is evaluated using two metrics: retained-memory scores and knowledge-accumulation metrics over multiple sessions.
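A toy sketch of one of the six adapted methods, the "prefix"-style read injection: trained persistent memory slots are prepended to the keys and values of self-attention while the frozen token pathway is unchanged. The softmax attention and causal mask are standard; the memory slots here are random placeholders for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)

d, T, M = 8, 5, 3                      # dim, tokens, memory slots
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, d))
mem_k = rng.normal(size=(M, d))        # persistent memory keys
mem_v = rng.normal(size=(M, d))        # persistent memory values

K = np.vstack([mem_k, k])              # memory prepended: always visible
V = np.vstack([mem_v, v])

scores = q @ K.T / np.sqrt(d)          # (T, M+T)
# Causal mask applies to the token part only; memory is attendable
# from every position.
causal = np.triu(np.ones((T, T)), 1).astype(bool)
scores[:, M:][causal] = -1e9

w = np.exp(scores - scores.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)      # softmax over memory + past tokens
out = w @ V
```

Only `mem_k`/`mem_v` would be trained in this setup; the base weights stay frozen, which is the premise of the study.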
Results
At standard capacity (1×), only three methods (cross-attention, Hebbian, and slot write) achieved significant retained-memory scores (7-18%) and knowledge gains (7-10), while the other three methods performed poorly (<0.4%). At increased capacity (10×), all six methods converged, highlighting that the initial performance gap was due to architectural biases rather than fundamental limitations.
Implications
The findings suggest that persistent memory mechanisms can enhance the capabilities of decoder-only LLMs, enabling them to retain information across sessions. This has potential applications in conversational AI, where maintaining context over long interactions is crucial. The study also opens avenues for further research into memory architectures in various transformer models.
On the Interplay of Priors and Overparametrization in Bayesian Neural Network Posteriors
Theory
Optimization
- Overparametrization significantly influences the shape and geometry of BNN posteriors.
- Three key phenomena—balancedness, weight reallocation, and prior conformity—emerge from redundancy in overparametrized models.
- The study provides a theoretical foundation linking optimization properties to prior choices and posterior shapes.
- Extensive experiments show that overparametrization improves sampling-based inference in BNNs.
Read more
On the Interplay of Priors and Overparametrization in Bayesian Neural Network Posteriors
Summary
This paper investigates the interplay between priors and overparametrization in Bayesian Neural Networks (BNNs), addressing the challenges associated with posterior inference. The authors argue that traditional views on BNN posteriors being impractical due to symmetries and non-identifiabilities can be re-evaluated through the lens of overparametrization. They identify three phenomena that emerge from redundancy in overparametrized models: balancedness, weight reallocation on equal-probability manifolds, and prior conformity. These phenomena reshape the geometry of BNN posteriors, leading to improved understanding and performance in sampling-based inference. The authors validate their theoretical insights through extensive experiments, demonstrating that overparametrization leads to structured, prior-aligned weight posterior distributions, which enhances the effectiveness of sampling methods compared to traditional variational approaches.
Methodology
The authors conducted a theoretical analysis of the effects of overparametrization on BNN posteriors, translating concepts from optimization literature into insights about prior choices. They performed extensive empirical evaluations using sampling-based inference methods, with sampling budgets significantly exceeding those of previous studies to analyze the impact of posterior shapes.
Results
The results indicate that overparametrization leads to more structured and aligned weight posterior distributions, which enhances the performance of sampling-based inference methods. The experiments demonstrated that the posterior landscape becomes less fragmented in overparametrized models, supporting the theoretical claims made in the paper.
Implications
The findings suggest that overparametrization can be leveraged to improve uncertainty quantification in BNNs, making them more practical for real-world applications. This work opens avenues for further research into the design of priors and the architecture of neural networks to optimize posterior inference.
Hybrid Autoencoder-Isolation Forest approach for time series anomaly detection in C70XP cyclotron operation data at ARRONAX
Time Series
- Proposes a hybrid AE-IF model to improve anomaly detection in time series data.
- Addresses limitations of standard Isolation Forest in detecting subtle anomalies.
- Utilizes reconstruction errors from an Autoencoder as input features for Isolation Forest.
- Demonstrates improved detection performance validated on real-world cyclotron operation data.
Read more
Hybrid Autoencoder-Isolation Forest approach for time series anomaly detection in C70XP cyclotron operation data at ARRONAX
Summary
This paper presents a novel hybrid approach combining Autoencoder (AE) and Isolation Forest (IF) for anomaly detection in time series data from the C70XP cyclotron at ARRONAX. The cyclotron, used for medical and research radioisotope production, faces operational disruptions due to system failures, necessitating effective early anomaly detection methods. The authors highlight the limitations of the standard IF method, particularly its reliance on axis-parallel splits, which can hinder the detection of subtle anomalies. To address this, the proposed method uses an AE to reconstruct sensor data and compute the Mean Cubic Error (MCE), which serves as input for the IF model. By transforming the data into a reconstruction-error feature space, subtle anomalies become more distinguishable, enhancing detection performance. The methodology was validated using proton beam intensity time series data, demonstrating significant improvements in anomaly detection compared to traditional IF approaches.
Methodology
The methodology involves segmenting the temporal beam intensity signal into time windows, where an Autoencoder reconstructs each window and computes the Mean Cubic Error (MCE). This MCE is then used as input for the Isolation Forest model, allowing for enhanced detection of anomalies, particularly subtle local anomalies that may indicate early component degradation.
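A schematic sketch of the pipeline: segment the signal into windows, reconstruct each window, compute the Mean Cubic Error (taken here as mean |error|^3, an assumption about the exact definition), and flag windows with extreme MCE. A moving-average "reconstruction" and a quantile threshold stand in for the paper's Autoencoder and Isolation Forest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy beam-intensity signal with an injected subtle anomaly.
signal = np.sin(np.linspace(0, 20, 400)) + 0.05 * rng.normal(size=400)
signal[250:260] += 0.8

win = 20
windows = signal.reshape(-1, win)           # non-overlapping time windows

def reconstruct(w):
    # Stand-in for the AE: a crude smoothed reconstruction.
    kernel = np.ones(5) / 5
    return np.convolve(w, kernel, mode="same")

# Cubing the absolute error amplifies large local deviations relative
# to background noise, which is what makes subtle anomalies separable.
mce = np.array([np.mean(np.abs(w - reconstruct(w)) ** 3) for w in windows])

# Stand-in for Isolation Forest: flag windows with extreme MCE.
flags = mce > np.quantile(mce, 0.9)
```

In the real system, `mce` per window would be the feature fed to an Isolation Forest rather than thresholded directly.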
Results
The proposed AE-IF approach showed a clear improvement in detection performance when validated on proton beam intensity time series data, effectively identifying both global and subtle local anomalies that standard IF methods struggled to detect.
Implications
The hybrid AE-IF model has potential applications in safety-critical industrial systems, particularly in environments where early detection of anomalies is crucial to prevent failures and costly repairs. This approach could be extended to other domains requiring time series anomaly detection.
PLR: Plackett-Luce for Reordering In-Context Learning Examples
NLP
Large Language Models
Optimization
- PLR introduces a distributional approach to ICL example ordering, enhancing performance without requiring exhaustive search.
- The method is label-space agnostic, making it applicable to a wider range of tasks including open-ended generation.
- PLR employs a Gumbel perturb-and-sort procedure for efficient sampling of example orderings.
- Experiments show significant improvements in few-shot accuracy across multiple benchmarks.
Read more
PLR: Plackett-Luce for Reordering In-Context Learning Examples
Summary
The paper introduces PLR, a novel probabilistic approach to optimize the ordering of in-context learning (ICL) examples for large language models (LLMs). Recognizing that the performance of LLMs is sensitive to the order of examples, the authors propose using a Plackett-Luce distribution to model the space of possible orderings. Instead of exhaustively searching through the factorial number of permutations, which is computationally infeasible, PLR learns a probability distribution over these orderings and iteratively updates it to favor higher-performing configurations based on a task-level metric. The methodology employs a Gumbel perturb-and-sort technique for efficient sampling of candidate orderings. The authors validate PLR through experiments on various classification benchmarks and mathematical reasoning tasks, demonstrating significant improvements in few-shot accuracy compared to existing methods. The results indicate that PLR is particularly effective in scenarios where traditional label-based ordering methods fall short, such as in open-ended generation and numerical reasoning tasks.
Methodology
PLR utilizes a Plackett-Luce distribution to model the probability of different example orderings. It employs an iterative algorithm that includes sampling techniques to fit the distribution, with three proposed methods for updates: heuristic rank updates, maximum likelihood estimation (MLE), and expectation-maximization (EM) for mixtures of Plackett-Luce distributions. The updates are stabilized using exponential moving averages (EMA).
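A minimal sketch of the Gumbel perturb-and-sort sampler: adding i.i.d. Gumbel noise to per-example scores and sorting in descending order draws an ordering from the Plackett-Luce distribution with those scores. In the full method, an outer loop would then update the scores (rank heuristics, MLE, or EM, stabilized with EMA) toward high-performing orderings; only the sampler is shown, with illustrative scores.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 4
theta = np.array([2.0, 0.0, 0.0, 0.0])    # log-scores over ICL examples

def sample_ordering(theta):
    # Gumbel perturb-and-sort: one exact Plackett-Luce draw.
    gumbel = -np.log(-np.log(rng.random(n)))
    return np.argsort(-(theta + gumbel))

# Empirically, the highest-score example leads most sampled orderings.
counts = np.zeros(n)
for _ in range(1000):
    counts[sample_ordering(theta)[0]] += 1
```

This trick avoids enumerating the factorial permutation space: sampling is O(n log n) per ordering regardless of how peaked the distribution is.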
Results
The experiments conducted on classification and reasoning benchmarks with models like Qwen and Llama show that PLR consistently outperforms baseline methods in terms of few-shot accuracy. The results highlight the effectiveness of PLR in optimizing example orderings, particularly in challenging tasks where traditional methods struggle.
Implications
The findings suggest that optimizing the ordering of in-context examples can significantly enhance the performance of LLMs, making PLR a valuable tool for practitioners in natural language processing and related fields. This approach could lead to more robust applications of LLMs in various domains, including open-ended generation and complex reasoning tasks.
Adversarial Vulnerabilities in Neural Operator Digital Twins: Gradient-Free Attacks on Nuclear Thermal-Hydraulic Surrogates
Theory
Optimization
- Neural operators are vulnerable to sparse, physically plausible adversarial perturbations.
- Minimal modifications can lead to catastrophic prediction failures, undetectable by standard validation metrics.
- The effective perturbation dimension (d_eff) is introduced as a diagnostic tool for assessing vulnerability.
- Gradient-free search methods outperform gradient-based methods in exploiting these vulnerabilities.
Read more
Adversarial Vulnerabilities in Neural Operator Digital Twins: Gradient-Free Attacks on Nuclear Thermal-Hydraulic Surrogates
Summary
This paper investigates the adversarial vulnerabilities of neural operator models, which are increasingly used as digital twins in nuclear and energy systems. The authors demonstrate that these models are susceptible to minimal, physically plausible perturbations that can lead to significant prediction errors, even when only a small fraction of inputs (less than 1%) is altered. Through the use of gradient-free differential evolution across various operator architectures, the study reveals that such perturbations can escalate the relative L2 error from approximately 1.5% to between 37% and 63%. The authors introduce a new diagnostic measure, the effective perturbation dimension (d_eff), which helps explain the varying levels of vulnerability across different architectures. The findings highlight that traditional validation metrics fail to detect these adversarial attacks, emphasizing the need for enhanced robustness measures in the deployment of neural operators in safety-critical applications.
Methodology
The authors employed gradient-free differential evolution techniques to explore the adversarial vulnerabilities of various neural operator architectures. They analyzed the impact of minimal input perturbations on prediction accuracy and introduced the effective perturbation dimension (d_eff) as a new diagnostic tool to assess model sensitivity and vulnerability.
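A gradient-free sketch of the attack setting: search over a single input coordinate (index and perturbation within a plausible range) to maximize the surrogate's relative L2 error against a reference output. The linear `surrogate` matrix is a stand-in for a trained neural operator, and this tiny random-search loop stands in for differential evolution.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 32
A = rng.normal(size=(n, n)) / np.sqrt(n)   # stand-in "neural operator"
x_clean = rng.normal(size=n)
y_ref = A @ x_clean                        # trusted reference output

def rel_l2(y, y_ref):
    return np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref)

best_err, best = 0.0, None
for _ in range(300):                       # gradient-free search loop
    i = rng.integers(n)                    # which coordinate to perturb
    delta = rng.uniform(-1.0, 1.0)         # physically bounded magnitude
    x_adv = x_clean.copy()
    x_adv[i] += delta                      # sparse, single-point attack
    err = rel_l2(A @ x_adv, y_ref)
    if err > best_err:
        best_err, best = err, (int(i), delta)
```

Because no gradients of the model are needed, this kind of search works even when the surrogate is a black box, which is part of why the paper's attacks evade gradient-oriented defenses.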
Results
The study found that adversarial attacks could increase the relative L2 error from about 1.5% to between 37% and 63%. It was shown that 100% of successful single-point attacks passed z-score anomaly detection, indicating a significant gap in current validation methods. The effective perturbation dimension (d_eff) revealed that architectures with moderate sensitivity concentration were the most exploitable.
Implications
The findings highlight critical vulnerabilities in neural operator models used for safety-critical applications, such as nuclear thermal-hydraulics. This necessitates the development of more robust validation and defense mechanisms to ensure the reliability of these models in real-world deployments.
Universal and efficient graph neural networks with dynamic attention for machine learning interatomic potentials
Graph Learning
Efficient ML
- MLANet introduces a dual-path dynamic attention mechanism for improved message passing in graph neural networks.
- The model achieves high accuracy while significantly reducing computational costs compared to mainstream equivariant models.
- MLANet is validated across a wide range of datasets, demonstrating its versatility in modeling various atomic environments.
- The framework enables stable long-time molecular dynamics simulations, addressing critical challenges in current MLIP approaches.
Read more
Universal and efficient graph neural networks with dynamic attention for machine learning interatomic potentials
Summary
This paper presents MLANet, a novel graph neural network framework designed for machine learning interatomic potentials (MLIPs) that addresses the challenges of efficiency and stability in molecular dynamics simulations. Traditional empirical potentials are limited in accuracy, while first-principles methods are computationally expensive. MLANet employs a dual-path dynamic attention mechanism for geometry-aware message passing and a multi-perspective pooling strategy to enhance system representation. The model is tested across various datasets, including organic molecules, inorganic materials, and surface catalytic reactions, demonstrating competitive prediction accuracy with significantly lower computational costs compared to existing equivariant models. MLANet's architecture allows for stable long-time molecular dynamics simulations, making it a practical tool for high-fidelity atomic simulations across diverse scientific applications.
Methodology
MLANet utilizes a geometry-aware dual-path dynamic attention mechanism within its equivariant message-passing layers, which adaptively modulates interatomic interactions based on geometric and chemical features. It also incorporates a multi-perspective pooling strategy that combines different pooling operations to mitigate information loss and enhance representation.
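A loose sketch of the dual-path modulation idea only: each message between atoms is gated by two factors, one driven by geometry (interatomic distance) and one by chemical features. The sigmoid gates and all weights are illustrative assumptions, not MLANet's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 6, 8
h = rng.normal(size=(n, d))        # per-atom chemical features
pos = rng.normal(size=(n, 3))      # atom coordinates

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

Wg = 0.8                               # geometric-path weight (toy)
Wc = rng.normal(scale=0.3, size=(d,))  # chemical-path weights (toy)

new_h = np.zeros_like(h)
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        dist = np.linalg.norm(pos[i] - pos[j])
        geo_gate = sigmoid(Wg * (2.0 - dist))   # nearer pair -> larger gate
        chem_gate = sigmoid(h[j] @ Wc)          # feature-dependent gate
        new_h[i] += geo_gate * chem_gate * h[j] # dynamically modulated message
```

The point of input-dependent (dynamic) gates is that interaction strength adapts per atom pair instead of being fixed by the layer's weights alone.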
Results
MLANet demonstrates competitive performance in predicting energies and forces across various datasets, including QM7, MD17, and inorganic materials. It maintains accuracy comparable to existing models while achieving a marked reduction in computational cost, enabling efficient long-time molecular dynamics simulations.
Implications
MLANet's efficient and robust framework has the potential to advance the field of molecular dynamics simulations, making high-fidelity atomic simulations more accessible for large-scale studies in chemistry, materials science, and related disciplines.
GEM: Guided Expectation-Maximization for Behavior-Normalized Candidate Action Selection in Offline RL
Reinforcement Learning
- GEM introduces a candidate-based action selection interface that enhances decision-making in offline RL.
- The framework employs a GMM actor trained with advantage-weighted EM-style updates to maintain multimodal action distributions.
- Inference is guided by a scoring rule that balances uncertainty and support, enabling stable deployment across states.
- GEM allows for a flexible candidate budget, improving decision quality without requiring retraining.
Read more
GEM: Guided Expectation-Maximization for Behavior-Normalized Candidate Action Selection in Offline RL
Summary
The paper introduces GEM (Guided Expectation-Maximization), an innovative framework for action selection in offline reinforcement learning (RL) that addresses the challenges of multimodal action landscapes and distributional shift. Traditional offline RL methods often struggle with action selection due to the risk of out-of-distribution (OOD) queries, which can lead to unreliable value estimates. GEM tackles this by employing a Gaussian Mixture Model (GMM) actor that is trained using critic-guided, advantage-weighted updates. This approach preserves distinct action components while directing probability mass towards high-value regions. During inference, GEM utilizes a candidate-based selection mechanism, generating a set of plausible actions and reranking them based on a conservative ensemble lower-confidence bound and behavior-normalized support. This design allows for stable control across varying candidate budgets and improves decision quality without the need for retraining. Empirical results demonstrate that GEM performs competitively on D4RL benchmarks, providing a flexible compute-quality tradeoff through its candidate selection process.
Methodology
GEM employs a Gaussian Mixture Model (GMM) actor trained through critic-guided, advantage-weighted Expectation-Maximization updates. The inference process involves generating a candidate set of actions and reranking them using a conservative ensemble lower-confidence bound combined with behavior-normalized support, ensuring stable and comparable control across different states.
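A schematic sketch of the inference-time selection described above: draw candidate actions from a mixture actor, score each with a conservative ensemble lower-confidence bound plus a behavior-support term, and pick the best. The 1-D Gaussian mixture, the toy critic ensemble, and the log-density support term are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

means = np.array([-1.0, 1.0])   # two-component GMM actor (1-D actions)
stds = np.array([0.2, 0.2])
weights = np.array([0.5, 0.5])

def sample_candidates(m):
    comps = rng.choice(2, size=m, p=weights)
    return rng.normal(means[comps], stds[comps])

def ensemble_q(a):
    # Toy critic ensemble: every member prefers actions near +1.
    return np.array([-(a - 1.0) ** 2 + 0.05 * e for e in range(5)])

def log_behavior_density(a):
    # Mixture log-density as a stand-in for behavior-normalized support.
    comp = weights * np.exp(-((a - means) ** 2) / (2 * stds ** 2))
    return np.log(comp.sum() + 1e-12)

def select(m=32, beta=1.0, lam=0.5):
    cands = sample_candidates(m)            # candidate budget m
    scores = []
    for a in cands:
        qs = ensemble_q(a)
        lcb = qs.mean() - beta * qs.std()   # conservative value estimate
        scores.append(lcb + lam * log_behavior_density(a))
    return cands[int(np.argmax(scores))]

a_star = select()
```

Note how the candidate budget `m` is a pure inference-time knob: increasing it improves the maximizer without touching the trained actor or critics, which is the retraining-free trade-off the summary highlights.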
Results
GEM demonstrated competitive performance across various D4RL benchmarks, effectively addressing the challenges of action selection in offline RL. The framework's ability to adjust the candidate budget allows for a tradeoff between computational resources and decision quality, enhancing the practical deployment of offline RL agents.
Implications
The GEM framework has significant implications for the deployment of offline RL systems, particularly in scenarios where computational resources are limited or where decision quality is critical. Its candidate-based approach could be applied in various domains requiring robust decision-making under uncertainty, such as robotics and autonomous systems.
Conditionally Identifiable Latent Representation for Multivariate Time Series with Structural Dynamics
Time Series
- Introduction of the Identifiable Variational Dynamic Factor Model (iVDFM) for multivariate time series.
- Achieves identifiability by conditioning the innovation process rather than latent states.
- Preserves identifiability through linear diagonal dynamics, avoiding traditional rotation ambiguities.
- Demonstrates improved factor recovery and stable intervention accuracy on synthetic and real-world data.
Read more
Conditionally Identifiable Latent Representation for Multivariate Time Series with Structural Dynamics
Summary
This paper introduces the Identifiable Variational Dynamic Factor Model (iVDFM), which aims to extract identifiable latent factors from multivariate time series data while ensuring interpretability and stability across different contexts. The authors address the challenge of identifiability in dynamic settings, where traditional dynamic factor models (DFMs) are limited by their inability to provide unique semantic interpretations of factors due to orthogonal rotation ambiguities. The iVDFM achieves identifiability by conditioning the innovation process—responsible for driving the dynamics—on auxiliary variables and a deterministic regime embedding. This approach allows for the identification of innovations up to permutation and component-wise affine transformations. The model employs linear diagonal dynamics to map these innovations to factors, maintaining computational efficiency and scalability. The authors validate their model through experiments on synthetic data, demonstrating improved factor recovery, stable intervention accuracy in structural causal models, and competitive performance in probabilistic forecasting on real-world datasets.
Methodology
The iVDFM employs a variational inference approach to learn latent factors from multivariate time series. It conditions the innovation process on observed auxiliary variables and a deterministic regime embedding to ensure identifiability. The model uses linear diagonal dynamics to map innovations to identifiable factors, allowing for scalable computation through companion-matrix and Krylov methods.
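The linear diagonal dynamics are the key to avoiding rotation ambiguity: each factor evolves independently, so no orthogonal rotation can mix components. A toy rendering of that transition (not the model's full variational inference):

```python
import numpy as np

def innovations_to_factors(innovations, a_diag):
    """Map identified innovations to latent factors through linear
    diagonal dynamics z_t = diag(a) z_{t-1} + eps_t. A toy rendering
    of the transition described above, not the model's inference."""
    T, k = innovations.shape
    z = np.zeros((T, k))
    z[0] = innovations[0]
    for t in range(1, T):
        # each factor evolves independently: rotations cannot mix components
        z[t] = a_diag * z[t - 1] + innovations[t]
    return z
```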
Results
The iVDFM shows significant improvements in factor recovery on synthetic datasets, maintains stable intervention accuracy in synthetic structural causal models, and achieves competitive results in probabilistic forecasting tasks on real-world benchmarks.
Implications
The findings suggest that the iVDFM can be effectively used in various fields such as macroeconomics and medicine, where understanding latent temporal dynamics is crucial. The model's identifiability guarantees enhance its applicability in causal inference and intervention analysis, providing clearer insights into the underlying causal structures of time series data.
Deep Reinforcement Learning and The Tale of Two Temporal Difference Errors
Reinforcement Learning
Theory
- The paper identifies and formalizes the differences between two interpretations of the TD error in deep RL.
- Nonlinear deep RL architectures can lead to significant discrepancies in TD error calculations.
- Choosing one interpretation of the TD error over the other can impact the performance of RL algorithms.
- The findings challenge the conventional understanding of TD error in deep RL settings.
Read more
Deep Reinforcement Learning and The Tale of Two Temporal Difference Errors
Summary
This paper explores the temporal difference (TD) error in deep reinforcement learning (RL), highlighting the differences between its two interpretations: the difference between temporally successive predictions and the difference between a bootstrapped target and a prediction. The authors demonstrate that as deep RL architectures become more nonlinear, these interpretations can yield significantly different numerical values. This discrepancy has implications for the performance of deep RL algorithms, particularly those that rely on the TD error for computing other quantities, such as in deep differential RL methods. The work provides a formal characterization of these differences and emphasizes that the conventional interpretation of the TD error may not always be valid in deep RL contexts.
Methodology
The authors conducted both analytical and empirical explorations to characterize the differences between the two interpretations of the TD error. They analyzed the impact of nonlinearities in deep RL architectures on TD error calculations and assessed how these differences affect algorithm performance.
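One way the two readings can come apart is around a single parameter update of a nonlinear value function: the bootstrapped target is formed with the pre-update parameters, while the "successive predictions" reading evaluates the successor state after the parameters have changed. The toy below illustrates this divergence under that assumption; it is not the paper's formal construction:

```python
import numpy as np

def value(params, s):
    w1, w2 = params
    return np.tanh(w1 * s) * w2          # tiny nonlinear value function

def two_td_errors(params_old, params_new, s, s_next, r, gamma=0.99):
    """Two readings of the TD error around one update (toy example,
    not the paper's formal construction): (a) bootstrapped target minus
    prediction, both under the old parameters; (b) the successor-state
    prediction taken after the parameters changed."""
    target_minus_pred = r + gamma * value(params_old, s_next) - value(params_old, s)
    successive_preds = r + gamma * value(params_new, s_next) - value(params_old, s)
    return target_minus_pred, successive_preds
```

With a linear value function and small steps the gap shrinks; the nonlinearity is what lets the two quantities drift apart numerically.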
Results
The study found that increasingly nonlinear architectures can cause the two interpretations of the TD error to diverge significantly. This divergence can lead to performance variations in deep RL algorithms that utilize the TD error for further computations, particularly in average-reward settings.
Implications
The findings suggest that researchers and practitioners in deep reinforcement learning should reconsider the standard interpretation of the TD error, especially when designing algorithms that rely on this concept. This could lead to improved algorithm performance and a deeper understanding of the underlying mechanisms in RL.
Mechanisms of Introspective Awareness
NLP
Large Language Models
Interpretability
- Introspective awareness in LLMs is behaviorally robust with 0% false positives.
- Detection capability emerges from post-training, not pretraining.
- Anomaly detection involves distributed computation across multiple directions.
- Ablating refusal directions significantly enhances detection rates.
Read more
Mechanisms of Introspective Awareness
Summary
This paper investigates the mechanisms underlying the introspective awareness of large language models (LLMs), specifically their ability to detect and identify injected steering vectors in their residual stream. The authors establish three main findings: first, the detection capability is behaviorally robust, achieving moderate true positive rates with 0% false positives across various prompts, and emerges specifically from post-training rather than pretraining. Second, the introspective capability is not attributable to a single linear confound; instead, it relies on distributed multi-layer perceptron (MLP) computation across multiple directions, utilizing evidence carrier and gate features. Third, the models exhibit greater introspective capacity than what is typically elicited; ablating refusal directions significantly enhances detection rates, indicating that introspection can be improved in future models. Overall, the results suggest that introspective awareness in LLMs is grounded in complex internal anomaly detection mechanisms rather than superficial heuristics.
Methodology
The authors conducted a series of behavioral experiments and causal interventions on open-source LLMs. They injected steering vectors representing various concepts into the models and assessed their ability to detect and identify these injections through a defined experimental setup. Metrics such as detection rate, false positive rate, and introspection rate were used to evaluate performance.
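The injection being probed is a simple operation on activations. The sketch below shows the basic form (adding a scaled, normalized concept direction to a residual-stream vector); the scale `alpha` is an illustrative choice:

```python
import numpy as np

def inject(residual, concept_vector, alpha=8.0):
    """Add a scaled, normalized concept direction to a residual-stream
    activation -- the basic steering-vector injection probed in these
    experiments. The scale alpha is an illustrative choice."""
    direction = concept_vector / np.linalg.norm(concept_vector)
    return residual + alpha * direction
```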
Results
The findings revealed that the models could detect injected concepts with moderate true positive rates and no false positives. The introspective capability was shown to be robust and emerged from post-training. The analysis indicated that detection and identification mechanisms are distinct and that the models possess latent introspective capacities that can be significantly enhanced through specific interventions.
Implications
Understanding the mechanisms of introspective awareness in LLMs has significant implications for improving model reliability and alignment. Enhanced introspective capabilities could allow for more transparent querying of models regarding their beliefs and uncertainties, which is crucial for developing trustworthy AI systems.
SpecXMaster Technical Report
Reinforcement Learning
- SpecXMaster automates NMR spectral interpretation using Agentic Reinforcement Learning.
- The framework processes raw FID data directly, improving accuracy and efficiency.
- It has shown superior performance on public NMR interpretation benchmarks.
- Iterative evaluations by experts have refined the system's capabilities.
Read more
SpecXMaster Technical Report
Summary
The SpecXMaster Technical Report presents an innovative framework for the automated interpretation of Nuclear Magnetic Resonance (NMR) spectra, addressing significant challenges in conventional expert-dependent methods. Traditional spectral interpretation is hindered by human bias, variability, and a steep learning curve, which can lead to errors and inefficiencies in chemical research. SpecXMaster leverages Agentic Reinforcement Learning (RL) to automate the extraction of multiplicity information from both ¹H and ¹³C spectra directly from raw Free Induction Decay (FID) data. This end-to-end pipeline facilitates the interpretation of NMR spectra into chemical structures without the need for manual intervention. The framework has been validated against multiple public NMR interpretation benchmarks, demonstrating superior performance compared to existing methods. Iterative evaluations by professional spectroscopists have further refined its capabilities. SpecXMaster represents a paradigm shift in spectral interpretation, promising to enhance the efficiency and accuracy of organic chemistry research by reducing reliance on human expertise and enabling high-throughput analysis.
Methodology
SpecXMaster employs an end-to-end pipeline that directly interfaces with raw FID data, utilizing advanced signal processing techniques to automate the extraction of quantitative peak parameters and multiplicity information from NMR spectra. The framework is built on Agentic Reinforcement Learning, allowing it to learn and adapt through iterative feedback.
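For context, the first step any pipeline over raw FID data performs is the Fourier transform from the time domain to a frequency-domain spectrum. This is generic NMR processing, not SpecXMaster's internal pipeline:

```python
import numpy as np

def fid_to_spectrum(fid, dwell_time):
    """Fourier-transform a time-domain free induction decay into a
    frequency-domain spectrum, with frequencies centered at zero.
    Generic NMR processing, not SpecXMaster's internal pipeline."""
    spectrum = np.fft.fftshift(np.fft.fft(fid))
    freqs = np.fft.fftshift(np.fft.fftfreq(len(fid), d=dwell_time))
    return freqs, spectrum
```

Everything downstream (peak picking, multiplicity extraction, structure proposal) operates on the resulting spectrum, which is what the framework automates end-to-end.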
Results
The results indicate that SpecXMaster outperforms existing NMR interpretation methods across multiple benchmarks, demonstrating its ability to accurately interpret complex spectral data into chemical structures. The framework has been validated through expert evaluations, confirming its reliability and effectiveness in practical applications.
Implications
SpecXMaster has the potential to revolutionize the field of organic chemistry by enabling fully automated spectral interpretation, thereby accelerating the pace of scientific discovery and reducing the dependency on specialized expertise. This could lead to more efficient research workflows and improved reproducibility in chemical analysis.
Diffusion Model for Manifold Data: Score Decomposition, Curvature, and Statistical Complexity
Generative Models
Theory
- Introduces a formal statistical framework for diffusion models on low-dimensional manifolds.
- Develops a novel score decomposition approach for analyzing score functions under different noise levels.
- Constructs neural network architectures tailored for effective score function approximation.
- Establishes statistical rates for score estimation and distribution learning based on manifold curvature and intrinsic dimensionality.
Read more
Diffusion Model for Manifold Data: Score Decomposition, Curvature, and Statistical Complexity
Summary
This paper explores the theoretical foundations of diffusion models in generative modeling, particularly for high-dimensional data that resides on low-dimensional manifolds. The authors investigate how diffusion models learn structured data by focusing on statistical complexity and the geometric properties of data. They propose a score decomposition approach that accounts for varying noise levels, revealing intrinsic structures of score functions under both large and small noise conditions. The study highlights the influence of manifold curvature on score function approximation and provides statistical rates for score estimation and distribution learning, demonstrating that these rates are governed by the intrinsic dimension of the data and the curvature of the manifold. The findings bridge theoretical insights with practical applications in generative modeling, enhancing the understanding of diffusion models' performance on manifold data.
Methodology
The authors model data as samples from a smooth Riemannian manifold and analyze score functions in diffusion models through a score decomposition approach. They derive theoretical results regarding the behavior of score functions under varying noise levels and construct neural network architectures to approximate these functions effectively. Statistical rates for score estimation and distribution learning are also derived based on the intrinsic dimensionality and curvature of the manifold.
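The flavor of the score decomposition can be seen in the simplest possible case, a unit circle in the plane: the score splits into a normal component that pulls samples back onto the manifold and a tangential component that moves mass along it. A toy illustration, not the paper's general construction:

```python
import numpy as np

def decompose_score(x, score):
    """Split a score vector at a point x near the unit circle (a toy
    1-D manifold in R^2) into normal and tangential parts. The normal
    part pulls samples back to the manifold; the tangential part moves
    mass along it. Illustrative only."""
    normal = x / np.linalg.norm(x)        # radial direction leaves the circle
    s_normal = (score @ normal) * normal
    s_tangent = score - s_normal
    return s_normal, s_tangent
```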
Results
The paper presents key theoretical results, including the identification of intrinsic structures of score functions for manifold data, the construction of suitable neural network architectures for score approximation, and the establishment of statistical rates for estimating the score function and the underlying data distribution. These results indicate that the performance of diffusion models is significantly influenced by the intrinsic dimension of the data and the curvature of the manifold.
Implications
The findings have significant implications for enhancing the performance of diffusion models in generative tasks, particularly in applications involving complex, high-dimensional data with low-dimensional structures. This work provides a deeper theoretical understanding that can inform the design of more effective generative models and improve their application across various domains, including image synthesis, audio generation, and natural language processing.
Constrained Online Convex Optimization with Memory and Predictions
Optimization
Theory
- Introduction of COCO-M framework for constrained online convex optimization with memory.
- Development of algorithms achieving sublinear regret and constraint violation under time-varying constraints.
- Adaptive penalty approach for scenarios without predictions.
- Optimistic algorithm designed for cases with predictions, improving performance with prediction accuracy.
Read more
Constrained Online Convex Optimization with Memory and Predictions
Summary
This paper introduces Constrained Online Convex Optimization with Memory (COCO-M), a framework where both loss and constraints depend on a finite history of past decisions. This approach extends previous work on unconstrained online optimization with memory and addresses practical applications like constrained dynamical systems and scheduling. The authors propose novel algorithms that achieve sublinear regret and cumulative constraint violation under time-varying constraints, with and without predictions of future losses and constraints. An adaptive penalty method is introduced for scenarios without predictions, while an optimistic algorithm is developed for cases with unreliable short-horizon predictions. The results bridge classical constrained online convex optimization and memory-dependent settings, providing a versatile learning toolbox for various applications.
Methodology
The authors analyze COCO-M through two main problem instances: one with memory effects on both losses and constraints (COCO-M2) and another with memory-less constraints (COCO-M). They employ a penalty-based relaxation analysis to derive algorithms that ensure sublinear regret and cumulative constraint violation. Additionally, they explore optimistic learning techniques to leverage predictions about future losses and constraints, adjusting their algorithms based on prediction accuracy.
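In the spirit of the penalty-based relaxation, a single generic step of penalized projected descent looks as follows. This is a generic sketch, not the paper's algorithm, which additionally handles memory terms and predictions:

```python
import numpy as np

def penalized_step(x, grad_loss, g_val, grad_g, eta=0.1, lam=1.0, radius=1.0):
    """One generic step of penalty-based constrained online descent:
    follow the loss gradient plus a penalty gradient when the constraint
    g(x) <= 0 is violated, then project back onto a norm ball.
    Illustrative only; the paper's updates also handle memory effects."""
    grad = grad_loss + lam * grad_g * (g_val > 0)
    x = x - eta * grad
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)
```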
Results
The proposed algorithms achieve regret bounds of O(m^(3/2)√T log T) and cumulative constraint violation bounds of O(max{T^(3/4), m^(3/2)√T log T}) for COCO-M2. For COCO-M, the cumulative constraint violation is improved to O(T^(3/4)) and O(m^(3/2)√T log T) for short memory scenarios. In the presence of predictions, the regret and violation bounds improve significantly, reaching O(log T) and O(m log T) for perfect predictions, while maintaining robustness under inaccurate predictions.
Implications
The findings have significant implications for various fields requiring optimization under constraints, such as control systems, resource management, and scheduling. The COCO-M framework provides a robust approach to tackle real-world problems where past decisions influence current outcomes, enhancing decision-making processes in dynamic environments.
From Causal Discovery to Dynamic Causal Inference in Neural Time Series
Time Series
- DCNAR integrates causal discovery with time-varying causal inference, addressing the challenge of unknown causal structures.
- The framework emphasizes interpretability and stability of causal inferences over mere predictive accuracy.
- Behavioral diagnostics are used to evaluate the scientific validity of the model, focusing on causal necessity and temporal stability.
- Experiments show that DCNAR outperforms traditional methods in terms of stability and meaningfulness of causal inferences.
Read more
From Causal Discovery to Dynamic Causal Inference in Neural Time Series
Summary
This paper introduces Dynamic Causal Network Autoregression (DCNAR), a novel two-stage neural causal modeling framework designed to address the limitations of existing dynamic causal inference methods that assume a known causal structure. The first stage of DCNAR employs a neural autoregressive causal discovery model to learn a sparse directed causal network from multivariate time series data. In the second stage, this learned structure serves as a structural prior for a time-varying neural network autoregression, allowing for dynamic estimation of causal influences without requiring pre-specified network structures. The authors evaluate DCNAR using behavioral diagnostics that assess causal necessity, temporal stability, and sensitivity to structural changes, rather than relying solely on predictive accuracy. Experiments conducted on multi-country panel time-series data demonstrate that DCNAR produces more stable and behaviorally meaningful dynamic causal inferences compared to traditional coefficient-based or structure-free alternatives, even when forecasting performance is comparable. This positions DCNAR as a robust framework for utilizing AI in scientific reasoning about dynamic causal relationships under conditions of structural uncertainty.
Methodology
DCNAR consists of two stages: first, a neural autoregressive model learns a sparse directed causal network from multivariate time series data; second, this learned network structure is used as a structural prior in a time-varying neural network autoregression to estimate dynamic causal influences.
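The second stage can be pictured as a graph-masked autoregression: the discovered causal network zeroes out coefficients between unconnected variables, while the remaining coefficients vary over time. The shapes and elementwise masking below are illustrative, not the exact parameterization:

```python
import numpy as np

def masked_transition(y_prev, coeffs_t, graph_mask):
    """Stage-two idea in miniature: the discovered causal graph masks a
    time-varying autoregression, y_t = (A_t * M) y_{t-1}. The elementwise
    masking shown here is illustrative, not the exact parameterization."""
    return (coeffs_t * graph_mask) @ y_prev
```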
Results
The results indicate that DCNAR achieves competitive predictive performance while providing stable and interpretable impulse responses and counterfactual analyses, even in the absence of a known causal structure. The framework demonstrates superior stability and theoretical coherence in inferred causal dynamics compared to existing methods.
Implications
DCNAR can be applied in various scientific fields, particularly in social sciences and economics, where understanding dynamic causal relationships is crucial for evaluating interventions and policy impacts. It offers a new approach to causal analysis in complex adaptive systems characterized by evolving structures.
COMPASS-Hedge: Learning Safely Without Knowing the World
Theory
Optimization
- COMPASS-Hedge achieves minimax-optimal regret in adversarial environments.
- It provides instance-optimal, gap-dependent regret in stochastic settings.
- The algorithm maintains near-constant regret relative to a designated baseline policy.
- COMPASS-Hedge is parameter-free and does not require prior knowledge of the environment.
Read more
COMPASS-Hedge: Learning Safely Without Knowing the World
Summary
The paper introduces COMPASS-Hedge, a novel online learning algorithm designed to address the fundamental trilemma in online learning: achieving minimax-optimal regret in adversarial settings, instance-optimal regret in stochastic environments, and baseline safety against a fixed comparator. Previous algorithms typically excelled in one or two of these areas but struggled to unify all three without sacrificing performance or requiring prior knowledge of the environment. COMPASS-Hedge is the first full-information method that achieves these goals simultaneously, operating parameter-free and without prior knowledge of the environment's characteristics. The algorithm employs a unique combination of adaptive pseudo-regret scaling, phase-based aggression, and a comparator-aware mixing strategy. This approach establishes a new benchmark in online learning, demonstrating that baseline safety can coexist with robust performance in both adversarial and stochastic contexts.
Methodology
The authors developed COMPASS-Hedge by integrating adaptive pseudo-regret scaling with phase-based aggression and a comparator-aware mixing strategy. This combination allows the algorithm to dynamically adjust its learning strategy based on the observed performance of various expert policies, ensuring robust performance across different learning environments.
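Underlying all of this is the classical exponential-weights (Hedge) update, shown below. The paper's adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing are layered on top and are not reproduced here:

```python
import numpy as np

def hedge_update(weights, losses, eta):
    """One classical exponential-weights (Hedge) update: downweight each
    expert in proportion to its observed loss, then renormalize. The
    base procedure COMPASS-Hedge builds on; its adaptive layers are
    not reproduced here."""
    w = weights * np.exp(-eta * losses)
    return w / w.sum()
```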
Results
The paper presents theoretical guarantees showing that COMPASS-Hedge achieves the desired regret bounds in both adversarial and stochastic settings, while also ensuring baseline safety. Numerical experiments further validate the algorithm's performance, demonstrating its effectiveness in practical scenarios.
Implications
The introduction of COMPASS-Hedge has significant implications for online learning applications in critical domains such as finance and healthcare, where safety and robustness are paramount. The ability to learn effectively without prior knowledge of the environment enhances the applicability of online learning algorithms in real-world scenarios.
DAQ: Delta-Aware Quantization for Post-Training LLM Weight Compression
Large Language Models
Efficient ML
Optimization
- DAQ preserves critical post-training knowledge by focusing on small-magnitude parameter updates.
- The framework employs delta-aware metrics instead of traditional reconstruction loss to optimize quantization.
- DAQ is data-free, requiring only the base and post-trained weight matrices for quantization.
- Preliminary results show that DAQ can recover capabilities lost in standard quantization while maintaining performance.
Read more
DAQ: Delta-Aware Quantization for Post-Training LLM Weight Compression
Summary
The paper introduces Delta-Aware Quantization (DAQ), a novel framework for post-training quantization of large language models (LLMs) that aims to preserve the critical knowledge acquired during fine-tuning. Traditional quantization methods focus on minimizing reconstruction error, which can lead to the loss of important small-magnitude parameter updates (∆W) that encode post-training behavior. DAQ addresses this issue by employing two delta-aware metrics—Sign Preservation Rate and Cosine Similarity—that optimize the directional fidelity of ∆W, thus ensuring that the quantization process retains the essential updates while minimizing the impact of quantization noise. The framework is data-free, requiring only the base and post-trained weight matrices, making it more efficient than calibration-based methods. In preliminary experiments using FP8 quantization, DAQ successfully recovers style-specific capabilities that are often lost with standard quantization techniques, while maintaining overall model performance. The authors provide an open-source implementation of DAQ as part of the AngelSlim toolkit for large model compression.
Methodology
The DAQ framework replaces conventional reconstruction-based objectives with delta-aware metrics—Sign Preservation Rate and Cosine Similarity. It formulates the quantization problem in terms of the differences between base and post-trained weights, optimizing quantization hyperparameters to maximize the preservation of these differences. The method is implemented using a scale-parameterized quantize-dequantize operator, allowing for flexibility in quantization schemes.
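The two delta-aware metrics can be sketched directly on flattened weight matrices: compare the fine-tuning delta ∆W = W_post − W_base against the delta that survives quantization. A sketch; the paper's exact definitions may differ:

```python
import numpy as np

def delta_metrics(w_base, w_post, w_quant):
    """Sign Preservation Rate and Cosine Similarity of the fine-tuning
    delta dW = w_post - w_base versus the delta that survives
    quantization. A sketch; the paper's exact definitions may differ."""
    d_true = (w_post - w_base).ravel()
    d_quant = (w_quant - w_base).ravel()
    sign_rate = np.mean(np.sign(d_true) == np.sign(d_quant))
    cosine = d_true @ d_quant / (np.linalg.norm(d_true) * np.linalg.norm(d_quant) + 1e-12)
    return sign_rate, cosine
```

Note that both quantities need only the base, post-trained, and quantized weights, which is what makes the approach data-free.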
Results
In pilot studies, DAQ demonstrated the ability to recover style-specific capabilities that were lost under standard quantization methods, while also maintaining the general performance of the model. The use of delta-aware metrics led to improved preservation of the fine-tuning information encoded in small-magnitude updates.
Implications
The DAQ framework has significant implications for the efficient deployment of large language models, particularly in resource-constrained environments where model size and computational efficiency are critical. By preserving essential post-training knowledge, DAQ can enhance the usability of quantized models in practical applications.
SynForceNet: A Force-Driven Global-Local Latent Representation Framework for Lithium-Ion Battery Fault Diagnosis
Theory
Optimization
Time Series
- Introduces SynForceNet, a novel framework for online battery fault diagnosis.
- Combines kernel one-class classification with minimum-volume estimation for anomaly detection.
- Achieves significant improvements in diagnostic performance metrics compared to baseline methods.
- Explores the spatial separation of fault representations and enhances robustness through manifold learning.
Read more
SynForceNet: A Force-Driven Global-Local Latent Representation Framework for Lithium-Ion Battery Fault Diagnosis
Summary
This paper presents SynForceNet, an innovative online fault diagnosis network for lithium-ion batteries in electric vehicles (EVs), addressing the challenges of detecting faults under complex and rare safety-critical conditions. The proposed framework integrates a deep anomaly detection approach that combines kernel one-class classification with minimum-volume estimation. Key enhancements include the introduction of mechanical constraints and spike-timing-dependent plasticity (STDP)-based dynamic representations, which improve fault characterization and create a more compact normal-state boundary. The methodology is validated using a substantial dataset of 8.6 million data points collected from 20 EVs, demonstrating significant performance improvements over existing baseline methods. Specifically, the proposed approach achieves average enhancements of 7.59% in true positive rate (TPR), 27.92% in positive predictive value (PPV), 18.28% in F1 score, and 23.68% in area under the curve (AUC). The study also explores the spatial separation of fault representations and enhances robustness by learning the manifold structure in the latent space, suggesting shared causal structures across different fault types. This work highlights the potential of integrating deep learning with physical constraints and neural dynamics for effective battery safety diagnosis.
Methodology
The methodology involves a deep anomaly detection framework that utilizes kernel one-class classification and minimum-volume estimation. It incorporates mechanical constraints and STDP-based dynamic representations to enhance fault characterization and define a compact normal-state boundary.
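The kernel one-class scoring principle can be shown in miniature: an anomaly score is the squared feature-space distance from a sample to the centroid of normal data, computed via the kernel trick. Only the principle — the network's learned boundary and force-driven representations are far richer:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

def anomaly_score(x, normal_data, gamma=1.0):
    """Kernel one-class scoring in miniature: squared feature-space
    distance from x to the centroid of normal samples via the kernel
    trick. Only the principle; the full model's boundary is far richer."""
    k_xx = 1.0                                            # RBF self-similarity
    k_xc = rbf(normal_data, x, gamma).mean()              # similarity to the centroid
    k_cc = rbf(normal_data[:, None], normal_data[None, :], gamma).mean()
    return k_xx - 2.0 * k_xc + k_cc
```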
Results
The proposed method was validated with 8.6 million data points, achieving average improvements of 7.59% in TPR, 27.92% in PPV, 18.28% in F1 score, and 23.68% in AUC compared to advanced baseline methods. The analysis of fault representations before and after modeling indicated enhanced spatial separation and robustness.
Implications
The findings suggest that integrating deep learning with physical constraints and neural dynamics can significantly improve the robustness and accuracy of battery fault diagnosis, which is crucial for ensuring the safety of lithium-ion batteries in electric vehicles.
SafeSeek: Universal Attribution of Safety Circuits in Language Models
NLP
Large Language Models
Interpretability
- SafeSeek provides a unified framework for discovering safety circuits in LLMs, overcoming limitations of heuristic search methods.
- The framework reveals that safety behaviors are often governed by highly sparse circuits, which are structurally distinct from general utility components.
- SafeSeek enables precise enhancement or removal of safety abilities through its optimization-based approach.
- The empirical validation shows significant reductions in attack success rates while preserving general model capabilities.
Read more
SafeSeek: Universal Attribution of Safety Circuits in Language Models
Summary
The paper introduces SafeSeek, a novel framework for the mechanistic interpretability of safety circuits in Large Language Models (LLMs). Existing methods for safety attribution often lack generalization and reliability due to their reliance on heuristic metrics and isolated component analysis. SafeSeek addresses these limitations by employing a gradient-based optimization approach to identify functionally complete safety circuits through the use of differentiable binary masks. This allows for the extraction of multi-granular circuits, enhancing the interpretability of safety behaviors such as backdoor attacks and safety alignment. The authors validate SafeSeek through experiments on LLaMA-3.1-8B-Instruct and Qwen-3-8B, demonstrating its effectiveness in identifying sparse safety circuits that significantly impact model safety without compromising general utility. The framework also introduces Safety Circuit Tuning (SaCirT), an efficient fine-tuning method that updates only the identified safety circuits, thereby improving LLM safety while maintaining performance.
Methodology
SafeSeek reformulates the discovery of safety circuits as a gradient-based optimization problem using differentiable binary masks. This approach allows for the extraction of functional safety subgraphs while maintaining end-to-end differentiability. The framework includes Safety Circuit Tuning (SaCirT) to fine-tune only the identified safety circuits, enhancing LLM safety without affecting general performance.
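A common way to make a binary mask differentiable, assumed here for illustration (the paper's exact parameterization may differ), is a sigmoid over learnable logits during optimization, hardened by thresholding when the circuit is extracted:

```python
import numpy as np

def relax_mask(logits, temperature=1.0):
    """Differentiable stand-in for a binary circuit mask: sigmoid of
    learnable logits during optimization, hardened by thresholding at
    extraction time. A common construction assumed for illustration."""
    soft = 1.0 / (1.0 + np.exp(-logits / temperature))
    hard = (soft > 0.5).astype(float)       # extracted circuit membership
    return soft, hard

def sparsity_penalty(soft_mask, lam=1e-3):
    # L1 pressure that drives the discovered circuit toward sparsity
    return lam * soft_mask.sum()
```

The soft mask keeps the whole pipeline end-to-end differentiable, while the sparsity penalty is what yields circuits as small as the sub-1% subgraphs reported in the results.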
Results
The experiments demonstrated that SafeSeek can identify a backdoor circuit with 0.42% sparsity, reducing the Attack Success Rate (ASR) from 100% to 0.4% while retaining over 99% of general utility. Additionally, an alignment circuit was localized with 3.03% heads and 0.79% neurons, whose removal caused ASR to spike from 0.8% to 96.9%. The application of SaCirT maintained 96.5% safety retention during helpfulness optimization.
Implications
The findings suggest that SafeSeek can significantly enhance the safety of LLMs by providing a transparent and controllable method for identifying and manipulating safety circuits. This has potential applications in high-stakes environments where LLMs are deployed, ensuring alignment with human values and robustness against adversarial threats.
Assessing the Robustness of Climate Foundation Models under No-Analog Distribution Shifts
Time Series
- The study benchmarks ML climate emulators under strictly historical training conditions.
- An accuracy vs. stability trade-off is identified, with ClimaX achieving the lowest absolute error but the highest sensitivity to distribution shifts.
- Simpler CNN architectures demonstrate greater stability compared to high-capacity models.
- A temperature-precipitation disparity is observed: temperature forecasts remain comparatively stable while precipitation accuracy degrades under distribution shift.
Assessing the Robustness of Climate Foundation Models under No-Analog Distribution Shifts
Summary
This paper investigates the robustness of machine learning (ML) climate emulators under no-analog distribution shifts, which occur when future climate states diverge significantly from historical training data. The authors benchmark three state-of-the-art architectures—U-Net, ConvLSTM, and ClimaX—using a controlled experimental setup that restricts training to historical data from 1850 to 2014. They evaluate the models through two strategies: temporal extrapolation to recent climate data (2015-2023) and cross-scenario forcing shifts across divergent emission pathways (SSP1-2.6 and SSP5-8.5). The findings reveal an accuracy versus stability trade-off, where the ClimaX model, despite achieving the lowest absolute error, shows greater sensitivity to distribution shifts, particularly in precipitation projections. In contrast, simpler CNN-based models exhibit more relative stability. The study highlights a disparity in temperature and precipitation projections, suggesting that high-capacity models may struggle to generalize under changing climate conditions. The authors emphasize the need for scenario-aware training and rigorous out-of-distribution evaluation protocols to enhance the reliability of climate emulators.
Methodology
The authors conducted a systematic benchmarking of three ML architectures (U-Net, ConvLSTM, ClimaX) by isolating a historical-only training regime and evaluating model performance through temporal extrapolation and cross-scenario forcing shifts. This approach allowed for a controlled assessment of out-of-distribution robustness.
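The accuracy-versus-stability trade-off boils down to a simple relative-degradation metric: how much a model's error grows when moving from near-historical evaluation to an out-of-distribution forcing scenario. The numbers below are invented placeholders chosen only to mimic the reported pattern (lower absolute error but larger degradation for the high-capacity model), not the paper's results.

```python
# Hypothetical RMSE values: "holdout" is the near-historical test period,
# "ssp585" the extreme-forcing scenario.
rmse = {
    "U-Net":  {"holdout": 1.10, "ssp585": 1.20},
    "ClimaX": {"holdout": 0.90, "ssp585": 1.12},
}
for model, e in rmse.items():
    growth = 100.0 * (e["ssp585"] - e["holdout"]) / e["holdout"]
    print(f"{model}: {growth:.1f}% error growth under forcing shift")
```

With these placeholder values ClimaX has the lower absolute error yet the larger relative degradation, which is the shape of the trade-off the study reports.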
Results
The analysis revealed that while the ClimaX model achieved the lowest absolute error, it exhibited a significant increase in precipitation errors (up to 8.44%) under extreme forcing scenarios. Simpler CNN models maintained more stable performance across distribution shifts. A notable disparity was found in temperature and precipitation projections, with models showing stable temperature forecasts but larger degradation in precipitation accuracy.
Implications
The results suggest that even advanced ML models may not reliably generalize under future climate conditions, emphasizing the need for improved training protocols and evaluation methods. This has significant implications for climate modeling and the development of reliable climate emulators for policy-making and adaptation strategies.
Beyond Hard Constraints: Budget-Conditioned Reachability For Safe Offline Reinforcement Learning
Reinforcement Learning
Robotics
Optimization
- Introduction of Budget-Conditioned Reachability framework for safe offline RL.
- Decoupling of reward maximization from cumulative safety cost constraints.
- Dynamic budgets used to prune unsafe actions and guide value estimation.
- BCRL integrates with existing offline RL algorithms, enhancing their safety without requiring online interactions.
Beyond Hard Constraints: Budget-Conditioned Reachability For Safe Offline Reinforcement Learning
Summary
This paper addresses the challenge of balancing reward maximization and safety constraints in offline reinforcement learning (RL) by introducing a novel framework called Budget-Conditioned Reachability (BCRL). Traditional methods often focus on hard safety constraints, which can lead to optimization instability. The authors propose a safety-conditioned reachability set that decouples reward maximization from cumulative safety cost constraints, allowing for a more flexible approach to safety in RL. The BCRL framework utilizes dynamic budgets to maintain a persistently safe state-action set, enabling agents to learn safe policies from fixed datasets without requiring further environment interaction. The authors demonstrate the effectiveness of their method through experiments on standard offline safe-RL benchmarks and a real-world maritime navigation task, showing that BCRL matches or outperforms state-of-the-art baselines while ensuring safety throughout the learning process.
Methodology
The authors define a safety-conditioned reachability set that allows for the enforcement of safety constraints without unstable optimization techniques. They introduce dynamic budgets that adjust during policy execution, enabling the pruning of unsafe actions at each time step. The BCRL algorithm is evaluated on various benchmarks, including grid-world environments and a maritime navigation task, demonstrating its applicability across different settings.
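The budget-conditioned pruning step can be illustrated with a toy stand-in: given estimated reward and cumulative-cost values per action, actions whose cost exceeds the remaining budget are masked out, and the budget shrinks as cost accrues. The value tables and the cheapest-action fallback below are illustrative choices, not the paper's algorithm.

```python
import numpy as np

q_reward = np.array([1.0, 3.0, 2.0, 0.5])  # estimated return per action
q_cost   = np.array([0.2, 5.0, 1.0, 0.1])  # estimated cumulative safety cost

def act(budget):
    safe = q_cost <= budget                # prune actions the budget cannot afford
    if not safe.any():                     # nothing affordable: take the cheapest
        return int(np.argmin(q_cost))
    idx = np.where(safe)[0]
    return int(idx[np.argmax(q_reward[idx])])

budget, actions = 2.0, []
for _ in range(3):
    a = act(budget)
    actions.append(a)
    budget -= q_cost[a]                    # the dynamic budget shrinks as cost accrues
print(actions)                             # [2, 2, 3]
```

Note that the highest-reward action (index 1) is never taken: its cost would blow any budget, which is exactly the kind of unsafe choice the reachability set is meant to exclude.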
Results
The experiments conducted reveal that BCRL consistently matches or outperforms existing state-of-the-art offline safe RL methods. The approach effectively maintains safety while optimizing for rewards, showcasing its robustness in both controlled and real-world scenarios.
Implications
The proposed BCRL framework has significant implications for real-world applications where safety is paramount, such as autonomous navigation and robotics. By allowing agents to learn from historical data without unsafe exploration, it enhances the feasibility of deploying RL agents in complex environments.
Unveiling the Mechanism of Continuous Representation Full-Waveform Inversion: A Wave Based Neural Tangent Kernel Framework
Theory
Optimization
- CR-FWI enhances robustness against initial model inaccuracies and poor seismic data quality.
- The wave-based NTK framework provides a theoretical understanding of the dynamic behavior of CR-FWI.
- Eigenvalue decay properties of the wave-based NTK explain the slower high-frequency convergence of CR-FWI.
- The proposed IG-FWI method achieves a better trade-off between robustness and convergence rate.
Unveiling the Mechanism of Continuous Representation Full-Waveform Inversion: A Wave Based Neural Tangent Kernel Framework
Summary
This paper addresses the challenges of Full-Waveform Inversion (FWI), a technique used to estimate physical parameters from seismic data, which is often hindered by its sensitivity to initial model accuracy. The authors introduce a Continuous Representation FWI (CR-FWI) framework that utilizes implicit neural representations (INR) to improve robustness against initial model inaccuracies. Despite its advantages, CR-FWI exhibits slower convergence rates, particularly in high-frequency components. The authors develop a theoretical foundation by extending the neural tangent kernel (NTK) to establish a wave-based NTK framework, revealing that the wave-based NTK is dynamic and not constant during training, which is a departure from standard NTK behavior. This dynamic nature, along with the eigenvalue decay properties of the wave-based NTK, provides insights into the robustness and convergence characteristics of CR-FWI. The paper proposes several CR-FWI methods, including a hybrid representation termed IG-FWI, which balances robustness and convergence rates. Experimental results demonstrate the superior performance of these methods in various geophysical exploration scenarios compared to conventional FWI and existing INR-based methods.
Methodology
The authors extend the neural tangent kernel (NTK) to create a wave-based NTK framework for FWI, analyzing its dynamic behavior and eigenvalue decay properties. They propose several CR-FWI methods, including IG-FWI, which combines INR with multi-resolution grids.
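For intuition, the empirical NTK of any network is K(x_i, x_j) = ⟨∂f(x_i)/∂θ, ∂f(x_j)/∂θ⟩, and its eigenvalue spectrum governs per-mode convergence speed; the paper's wave-based kernel additionally composes this with the wave-equation forward operator, which the generic sketch below omits. The tiny scalar MLP and finite-difference Jacobian are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
h = 16                                    # hidden width of a toy scalar MLP

def f(theta, x):
    W1, b1, W2 = theta[:h], theta[h:2*h], theta[2*h:]
    return np.tanh(np.outer(x, W1) + b1) @ W2

theta = 0.5 * rng.normal(size=3 * h)
x = np.linspace(-1.0, 1.0, 8)             # evaluation points

eps = 1e-5
J = np.zeros((x.size, theta.size))        # Jacobian of outputs w.r.t. parameters
for p in range(theta.size):
    up, dn = theta.copy(), theta.copy()
    up[p] += eps
    dn[p] -= eps
    J[:, p] = (f(up, x) - f(dn, x)) / (2 * eps)

K = J @ J.T                               # empirical NTK
eig = np.sort(np.linalg.eigvalsh(K))[::-1]
print(eig[:3])                            # decaying spectrum: slow high-freq modes
```

The steep decay of the trailing eigenvalues is the mechanism behind the slow high-frequency convergence the paper attributes to CR-FWI.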
Results
The proposed CR-FWI methods, particularly IG-FWI, demonstrate superior performance in geophysical exploration tasks, effectively mitigating the challenges of initial model sensitivity and achieving better convergence rates compared to conventional FWI and existing INR-based methods.
Implications
This work has significant implications for geophysical exploration, medical imaging, and other fields where accurate subsurface modeling is critical. The insights gained from the wave-based NTK framework could lead to more robust inversion techniques and improved data interpretation.
ST-GDance++: A Scalable Spatial-Temporal Diffusion for Long-Duration Group Choreography
Generative Models
Graph Learning
Time Series
- ST-GDance++ decouples spatial and temporal dependencies for efficient group choreography generation.
- Lightweight distance-aware graph convolutions are used to capture inter-dancer relationships with reduced computational cost.
- A diffusion noise scheduling strategy enhances the generation of long-duration motion sequences.
- The framework significantly reduces latency while maintaining competitive generation quality.
ST-GDance++: A Scalable Spatial-Temporal Diffusion for Long-Duration Group Choreography
Summary
The paper presents ST-GDance++, a novel framework designed to enhance the generation of group dance choreography from music. Traditional models struggle with the computational complexity and coordination required for multi-dancer scenarios, particularly as the number of dancers and the length of sequences increase. ST-GDance++ addresses these challenges by decoupling spatial and temporal dependencies, allowing for more efficient choreography generation. The authors introduce lightweight distance-aware graph convolutions to model inter-dancer relationships while minimizing computational overhead. Additionally, a diffusion noise scheduling strategy and an efficient temporal-aligned attention mask are implemented to facilitate stream-based generation for long motion sequences. Experiments conducted on the AIOZ-GDance dataset demonstrate that ST-GDance++ achieves competitive generation quality with significantly reduced latency compared to existing methods, marking a substantial improvement in the scalability and stability of multi-dancer choreography generation.
Methodology
The methodology involves a two-pronged approach: spatial modeling through lightweight distance-aware graph convolutions to efficiently capture inter-dancer relationships, and temporal modeling using a diffusion noise scheduling strategy combined with an efficient temporal-aligned attention mask. This allows for the generation of long motion sequences in a scalable manner, reducing the computational complexity associated with traditional methods.
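A distance-aware graph convolution over dancers can be sketched as follows: edge weights decay with pairwise stage distance, so nearby dancers influence each other more strongly during message passing. The Gaussian affinity kernel, layer shapes, and row normalization below are illustrative choices, not the paper's exact layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                               # number of dancers, feature dimension
pos = rng.uniform(size=(n, 2))            # dancer positions on stage
feat = rng.normal(size=(n, d))            # per-dancer motion features
W = 0.1 * rng.normal(size=(d, d))         # learnable layer weights

dist = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
A = np.exp(-dist**2 / 0.5)                # affinity decays with distance
A /= A.sum(axis=1, keepdims=True)         # row-normalized mixing weights
out = np.tanh(A @ feat @ W)               # one message-passing step
print(out.shape)                          # (5, 8)
```

Because the affinity matrix is dense but tiny (one row per dancer, not per joint or frame), this spatial step stays cheap even as sequence length grows, which is the point of decoupling it from the temporal model.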
Results
The results indicate that ST-GDance++ achieves a significant reduction in latency compared to existing group dance generation methods while maintaining high-quality output. The framework's ability to handle long-duration sequences and complex multi-dancer interactions is validated through experiments on the AIOZ-GDance dataset, showcasing its effectiveness in generating coherent and collision-aware choreography.
Implications
The implications of this research extend to various applications in film production, gaming, and animation, where realistic and synchronized group dance movements are essential. The framework can assist artists and developers in creating immersive experiences that align with musical compositions, enhancing the overall quality of visual performances.
TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning
Reinforcement Learning
Interpretability
Robotics
- Introduction of TREX, a trajectory-based explainability framework for MORL.
- Quantitative analysis of behavioral patterns influencing objective trade-offs.
- Demonstration of TREX's applicability in standard MORL environments.
- Ability to cluster trajectories into meaningful segments for better understanding.
TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning
Summary
The paper introduces TREX, a novel framework designed to enhance explainability in Multi-Objective Reinforcement Learning (MORL) by focusing on trajectory-based explanations. Traditional reinforcement learning often struggles with multiple conflicting objectives, leading to a lack of clarity in decision-making processes. TREX addresses this gap by generating trajectories from an expert policy across various user preferences and clustering these trajectories into semantically meaningful segments. This allows for a detailed analysis of how specific behaviors influence the trade-offs between objectives. The authors demonstrate the effectiveness of TREX through experiments in multi-objective MuJoCo environments, such as HalfCheetah, Ant, and Swimmer, showcasing its ability to isolate and quantify behavioral patterns that impact the Pareto trade-off. The framework not only provides qualitative insights but also quantifies the influence of different behaviors on achieving specific objectives, thereby enhancing the interpretability of MORL policies.
Methodology
TREX generates trajectories from an expert MORL policy based on different user preferences. It clusters these trajectories into temporal segments to analyze the influence of specific behaviors on the objectives. The framework measures the impact of excluding certain behavioral clusters by training complementary policies and assessing the deviations in rewards and actions compared to the expert policy.
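The segmentation-and-clustering step can be sketched as slicing a rollout into fixed temporal windows, embedding each window with simple statistics, and clustering the embeddings. The window length, feature choice, and plain k-means here are illustrative stand-ins for whatever the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
traj = rng.normal(size=(200, 4))          # one rollout: 200 steps, 4-dim state
win = 20
segs = traj.reshape(-1, win, traj.shape[1])            # 10 temporal segments
feats = np.concatenate([segs.mean(1), segs.std(1)], axis=1)  # per-segment stats

def kmeans(X, k, iters=20):
    C = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(iters):
        lab = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[lab == j].mean(0) if (lab == j).any() else C[j]
                      for j in range(k)])
    return lab

labels = kmeans(feats, k=3)
print(labels)                             # cluster id per temporal segment
```

Each cluster then stands for a candidate behavioral pattern whose influence can be quantified by retraining with that cluster's behavior excluded, as the methodology describes.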
Results
Experiments conducted in multi-objective MuJoCo environments demonstrated that TREX effectively isolates and quantifies specific behavioral patterns. The framework successfully identified how different behaviors contribute to the trade-offs between objectives, providing both qualitative and quantitative insights into the decision-making process of MORL agents.
Implications
TREX has significant implications for the deployment of RL agents in real-world applications, particularly in scenarios where understanding the rationale behind decisions is crucial for safety and trust. By providing clear explanations of how different objectives are prioritized, TREX can enhance user trust and facilitate better decision-making in complex environments.
The Coordinate System Problem in Persistent Structural Memory for Neural Architectures
Theory
- Introduction of the Dual-View Pheromone Pathway Network (DPPN) for persistent structural memory.
- Identification of coordinate stability and graceful transfer mechanisms as critical requirements for effective memory.
- Demonstration that fixed random Fourier features provide stable coordinates but do not ensure transfer advantage.
- Evidence that learning-rate modulation is more effective than routing bias for preventing negative transfer.
The Coordinate System Problem in Persistent Structural Memory for Neural Architectures
Summary
This paper introduces the Dual-View Pheromone Pathway Network (DPPN), a novel neural architecture designed to facilitate persistent structural memory through a pheromone field that influences routing in latent slot transitions. The author conducts five experiments to uncover two essential requirements for achieving persistent structural memory: (1) a stable coordinate system, which must be established prior to accumulating statistics, and (2) a graceful transfer mechanism that allows learned functions or learning-rate modulation to operate effectively without interference from incorrect priors. The experiments reveal several obstacles to achieving this memory, including pheromone saturation and coordinate incompatibility. While fixed random Fourier features provide stable coordinates, they do not guarantee improved transfer performance. The findings indicate that while coordinate stability is crucial, it alone is insufficient for effective transfer; the method of transfer also plays a significant role. The DPPN architecture demonstrates superior performance in within-task learning compared to traditional transformer models, highlighting its potential for enhancing structural memory in neural networks.
Methodology
The study employs a series of five experiments, progressively refining the architecture and addressing distinct obstacles related to persistent structural memory. Each experiment builds on the findings of the previous one, culminating in the identification of the coordinate system problem as a fundamental challenge. The experiments utilize various model variants and transfer targets, assessing performance through metrics such as Average Unweighted Learning Curve (AULC).
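The fixed-random-Fourier-features idea is easy to illustrate: because the projection matrix is frozen before training ever starts, the same input always maps to the same coordinates, so statistics accumulated against those coordinates stay valid across phases. The dimensions below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
d_in, d_feat = 3, 32
W = rng.normal(size=(d_in, d_feat))       # frozen projection: never updated
b = rng.uniform(0.0, 2.0 * np.pi, size=d_feat)

def coords(x):
    return np.sqrt(2.0 / d_feat) * np.cos(x @ W + b)

x = np.array([0.1, -0.4, 0.7])
print(np.allclose(coords(x), coords(x)))  # True: same input, same coordinates
```

This is exactly the stability property the experiments isolate; the paper's finding is that such stability is necessary but not sufficient, since the transfer mechanism operating on those coordinates matters independently.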
Results
The DPPN architecture outperforms transformer and random sparse baselines in within-task learning, achieving an AULC of 0.700 compared to 0.680 and 0.670 for the baselines. The introduction of learning-rate modulation eliminates negative transfer, while routing-bias pheromone consistently leads to performance degradation. The study also finds that while stable coordinates are necessary, they do not guarantee effective transfer without appropriate mechanisms.
Implications
The findings suggest that neural architectures can benefit from incorporating stable coordinate systems and effective transfer mechanisms to enhance memory capabilities. This has potential applications in various domains requiring efficient transfer learning and memory retention, such as robotics and multi-task learning.
Large Language Models for Missing Data Imputation: Understanding Behavior, Hallucination Effects, and Control Mechanisms
Large Language Models
NLP
- LLMs demonstrate superior performance in missing data imputation for real-world datasets compared to traditional methods.
- The effectiveness of LLMs is closely tied to their pre-training on domain-specific patterns from large corpora.
- Traditional imputation methods outperform LLMs on synthetic datasets, suggesting that LLMs' advantage stems from semantic context rather than statistical structure.
- LLMs incur higher computational costs and time, presenting a trade-off between quality and efficiency.
Large Language Models for Missing Data Imputation: Understanding Behavior, Hallucination Effects, and Control Mechanisms
Summary
This paper investigates the effectiveness of Large Language Models (LLMs) in the context of missing data imputation, a critical task in data analysis. The authors highlight the limitations of previous studies, which often faced challenges such as scalability, limited cross-model comparisons, and evaluations on small datasets. They propose a comprehensive benchmarking study that evaluates five popular LLMs against six state-of-the-art imputation methods across 29 datasets, including both synthetic and real-world data, under various missingness mechanisms (MCAR, MAR, MNAR) with missing rates up to 20%. The findings reveal that LLMs, particularly Gemini 3.0 Flash and Claude 4.5 Sonnet, outperform traditional imputation methods on real-world datasets, suggesting that their performance is influenced by prior exposure to domain-specific patterns. However, on synthetic datasets, traditional methods like MICE show better results, indicating that LLMs excel in semantic contexts rather than purely statistical ones. The study also identifies a trade-off between the quality of imputation provided by LLMs and the higher computational costs associated with their use. Overall, this research positions LLMs as promising tools for complex tabular data imputation while emphasizing the need for further exploration of their limitations and operational costs.
Methodology
The authors conducted a benchmarking study comparing five LLMs with six traditional imputation methods across 29 datasets. They employed a zero-shot prompt engineering approach to evaluate the models under different missingness mechanisms (MCAR, MAR, MNAR) and varying missing rates (up to 20%).
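A zero-shot imputation prompt of the kind such studies use can be sketched as below; the exact template and model API in the paper are not specified here, so this construction is purely illustrative.

```python
def build_prompt(row, column):
    """Build a hypothetical zero-shot prompt asking an LLM to fill one missing cell."""
    known = ", ".join(f"{k} = {v}" for k, v in row.items() if v is not None)
    return (
        "You are given a tabular record with one missing value.\n"
        f"Known fields: {known}.\n"
        f"Predict the most likely value of '{column}'. Answer with the value only."
    )

row = {"age": 43, "occupation": "nurse", "income": None}
print(build_prompt(row, "income"))
```

The semantic field names in the prompt are precisely what traditional statistical imputers cannot exploit, which is consistent with LLMs winning on real-world tables but losing to MICE on synthetic ones.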
Results
The results indicated that LLMs, particularly Gemini 3.0 Flash and Claude 4.5 Sonnet, consistently outperformed traditional imputation methods on real-world datasets. However, on synthetic datasets, traditional methods like MICE showed better performance, suggesting that LLMs' strengths lie in their semantic understanding rather than statistical reconstruction.
Implications
The findings suggest that LLMs can be effectively utilized for missing data imputation in complex datasets, potentially improving data quality in various applications. However, the higher costs associated with LLMs necessitate careful consideration of their use in practice.
Beyond the Mean: Distribution-Aware Loss Functions for Bimodal Regression
Theory
Optimization
- Proposes distribution-aware loss functions to address bimodal regression challenges.
- Integrates normalized RMSE with Wasserstein and Cramér distances for improved predictive modeling.
- Demonstrates significant reduction in Jensen-Shannon Divergence compared to standard methods.
- Establishes a new Pareto efficiency frontier in stability and fidelity for regression tasks.
Beyond the Mean: Distribution-Aware Loss Functions for Bimodal Regression
Summary
This paper addresses the challenge of estimating predictive confidence in machine learning models, particularly in scenarios where prediction errors follow a bimodal distribution. Traditional regression methods often assume a unimodal Gaussian noise distribution, leading to mean-collapse behavior that fails to capture the true nature of prediction errors. The authors propose a novel family of distribution-aware loss functions that integrate normalized RMSE with Wasserstein and Cramér distances. This approach allows standard deep regression models to recover bimodal distributions without the instability associated with Mixture Density Networks (MDNs). Through a comprehensive four-stage experimental evaluation, the proposed Wasserstein loss demonstrates improved stability and fidelity, achieving a 45% reduction in Jensen-Shannon Divergence on complex bimodal datasets while maintaining the robustness of standard regression losses like MSE. The findings suggest that the proposed framework is a reliable tool for estimating aleatoric uncertainty in AI systems, enhancing their trustworthiness.
Methodology
The authors developed a distribution-aware loss framework that combines normalized RMSE with statistical distance metrics (Wasserstein and Cramér distances). This framework treats regression targets as continuous probability measures, enabling the recovery of bimodal distributions without complex model architectures. The methodology was validated through a four-stage experimental protocol, including synthetic benchmarks and real-world tasks.
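The flavor of such a loss can be sketched with the empirical 1D Wasserstein-1 distance (mean absolute difference of sorted samples) added to a normalized RMSE; the weighting and normalization below are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def nrmse(pred, target):
    return np.sqrt(np.mean((pred - target) ** 2)) / (target.std() + 1e-8)

def wasserstein1(pred, target):
    # Empirical 1D W1: mean absolute difference of the sorted samples.
    return np.mean(np.abs(np.sort(pred) - np.sort(target)))

def loss(pred, target, alpha=0.5):
    return alpha * nrmse(pred, target) + (1 - alpha) * wasserstein1(pred, target)

rng = np.random.default_rng(0)
target = np.concatenate([rng.normal(-2, 0.3, 500), rng.normal(2, 0.3, 500)])
collapsed = np.zeros(1000)                # mean-collapse prediction
bimodal = rng.permutation(target)         # distribution-matching prediction
print(loss(collapsed, target), loss(bimodal, target))
```

On this toy bimodal target, the combined loss prefers the distribution-matching prediction over the mean-collapsed one even though the latter has decent pointwise error, which is the failure mode of plain MSE the paper targets.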
Results
The proposed Wasserstein loss function achieved a 45% reduction in Jensen-Shannon Divergence on complex bimodal datasets while matching the stability of traditional regression losses like MSE in unimodal tasks. The framework outperformed Mixture Density Networks in both fidelity and robustness, demonstrating its effectiveness in estimating predictive uncertainty.
Implications
The findings suggest that the proposed loss functions can significantly enhance the reliability of machine learning models in applications requiring accurate uncertainty estimation, such as medical diagnosis, autonomous driving, and other critical decision-making systems.
Does This Gradient Spark Joy?
Reinforcement Learning
Efficient ML
Theory
- Introduces the Kondo gate to optimize backward passes in policy gradient methods.
- Delight, a combination of advantage and surprisal, serves as a more effective signal for sample selection.
- The Kondo gate allows for significant computational savings while preserving learning quality.
- Demonstrated effectiveness on MNIST and transformer token reversal tasks.
Does This Gradient Spark Joy?
Summary
This paper introduces the Delightful Policy Gradient (DG) and the Kondo gate, a novel approach to optimizing the backward pass in policy gradient methods. Traditional policy gradient methods compute a backward pass for every sample, which is computationally expensive and often unnecessary, as many samples provide little learning value. The DG method assigns a 'delight' score to each sample, calculated as the product of advantage and surprisal, allowing for a more informed selection of samples for backward passes. The Kondo gate uses this delight score to determine whether to compute a backward pass based on a comparison with a compute price. By adapting the price, the Kondo gate effectively traces a quality-cost Pareto frontier, allowing for significant reductions in computational cost while maintaining learning quality. Experiments on MNIST and transformer token reversal demonstrate that the Kondo gate can skip most backward passes while retaining nearly all learning quality, particularly as problem complexity increases. This suggests a new paradigm for training that emphasizes speculative decoding, where cheaper forward passes can screen samples before committing to expensive backpropagation.
Methodology
The paper employs the Delightful Policy Gradient (DG) method to assign a delight score to each sample, which informs the decision to compute a backward pass via the Kondo gate. The Kondo gate uses a Bernoulli sampling approach based on the delight score and a compute price to selectively perform backward passes, thus optimizing the learning process.
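The gating rule can be sketched directly from the description: compute delight from the cheap forward pass, then run the expensive backward pass only with probability proportional to delight over the compute price. Taking the absolute advantage and capping the probability at 1 are our illustrative choices, not necessarily the paper's exact formula.

```python
import numpy as np

rng = np.random.default_rng(0)

def kondo_gate(advantage, logprob, price):
    delight = abs(advantage) * (-logprob)     # advantage × surprisal
    p = min(1.0, delight / price)             # gate probability vs. compute price
    return rng.random() < p, p                # Bernoulli decision to backprop

# A surprising, high-advantage sample easily clears a price of 2.0:
do_backward, p = kondo_gate(advantage=0.8, logprob=np.log(0.05), price=2.0)
print(p, do_backward)                         # 1.0 True

# A predictable, low-advantage sample is almost always skipped:
_, p_low = kondo_gate(advantage=0.05, logprob=np.log(0.9), price=2.0)
print(round(p_low, 4))                        # 0.0026
```

Raising the price shifts the operating point along the quality-cost Pareto frontier: fewer backward passes, each spent on higher-delight samples.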
Results
In experiments, the Kondo gate was able to match the performance of full DG while using only 3% of the backward passes on the MNIST dataset. It significantly reduced computational costs while maintaining learning quality, particularly in more complex tasks. The Kondo gate also demonstrated robustness against approximate delight estimation.
Implications
The findings suggest that by selectively computing backward passes based on learning value, training efficiency can be greatly improved. This approach could be applied to various reinforcement learning scenarios, potentially leading to faster training times and reduced computational resource requirements.
Quotient Geometry, Effective Curvature, and Implicit Bias in Simple Shallow Neural Networks
Theory
Optimization
- Introduces a differential-geometric framework for analyzing shallow neural networks through quotient spaces.
- Characterizes the symmetry and quotient structure of shallow-network parameters, leading to a natural metric.
- Demonstrates that effective curvature can be defined on the quotient manifold, removing degeneracy from Hessians.
- Establishes that only horizontal parameter motions contribute to predictor evolution, while vertical motions are gauge variations.
Quotient Geometry, Effective Curvature, and Implicit Bias in Simple Shallow Neural Networks
Summary
This paper addresses the challenges posed by parameter redundancy in overparameterized shallow neural networks, where different parameter vectors can represent the same predictor due to symmetries such as hidden-unit permutations and rescalings. The authors propose a differential-geometric framework that analyzes shallow networks through a quotient space that accounts for these parameter symmetries. They characterize the symmetry and quotient structure of shallow-network parameters and demonstrate that the finite-sample realization map induces a natural metric on the quotient manifold. This leads to an effective notion of curvature that captures intrinsic local geometry by removing degeneracy along symmetry orbits. The study of gradient flows on the quotient reveals that only the horizontal component of parameter motion affects predictor evolution, while the vertical component corresponds to gauge variation. The authors argue for an implicit-bias perspective at the quotient level, suggesting that complexity should be assigned to predictor classes rather than individual parameter representatives. Their experiments confirm that ambient flatness is representation-dependent and that local dynamics are better organized by quotient-level curvature summaries. The findings support the notion that the natural state space for symmetric shallow networks is the quotient space of predictor classes, rather than the raw parameter space.
Methodology
The authors develop a differential-geometric framework that involves characterizing parameter symmetries and constructing a quotient space. They analyze the induced metric on this space and study gradient flows, separating horizontal and vertical motions. The framework is applied to a quadratic-activation model, allowing for concrete theoretical and numerical analysis.
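The vertical (gauge) directions can be checked numerically for the quadratic-activation model f(x) = Σ_i a_i (w_i · x)²: rescaling w_i → c·w_i while sending a_i → a_i/c² moves the parameters but leaves the predictor, and hence the loss, untouched. The sizes and random rescalings below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
h, d = 4, 3
W = rng.normal(size=(h, d))               # hidden-unit weights
a = rng.normal(size=h)                    # output weights
x = rng.normal(size=(10, d))              # sample inputs

def f(a, W, x):
    # Quadratic-activation network: f(x) = sum_i a_i * (w_i . x)^2
    return ((x @ W.T) ** 2) @ a

c = rng.uniform(0.5, 2.0, size=h)         # arbitrary per-unit rescalings
print(np.allclose(f(a, W, x), f(a / c**2, c[:, None] * W, x)))  # True
```

Motion along these rescaling orbits is exactly the vertical component that the quotient construction factors out; only the remaining horizontal component changes the predictor.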
Results
The study finds that the effective curvature on the quotient manifold provides a clearer understanding of the local geometry of shallow networks. Experiments validate that ambient flatness is dependent on representation and that the dynamics of learning are better captured by quotient-level curvature. The implicit bias is more naturally described in the context of quotient coordinates, suggesting a shift in perspective on complexity in neural networks.
Implications
This work has implications for the optimization and understanding of neural networks, suggesting that analyzing them through their quotient spaces can lead to better insights into their behavior, performance, and the nature of implicit biases in learning.
Weak-PDE-Net: Discovering Open-Form PDEs via Differentiable Symbolic Networks and Weak Formulation
Theory
- Weak-PDE-Net is an end-to-end differentiable framework for discovering open-form PDEs.
- The framework combines a forward response learner with a weak-form PDE generator to enhance robustness against noise.
- Differentiable Neural Architecture Search is employed to dynamically construct a library of function terms for PDE discovery.
- Physical constraints are integrated to ensure that discovered equations adhere to physical laws.
Weak-PDE-Net: Discovering Open-Form PDEs via Differentiable Symbolic Networks and Weak Formulation
Summary
The paper introduces Weak-PDE-Net, an innovative framework designed to discover governing Partial Differential Equations (PDEs) from sparse and noisy data. Traditional sparse regression methods face significant challenges, particularly the instability of numerical differentiation and the limitations imposed by pre-defined candidate libraries. Weak-PDE-Net addresses these issues through an end-to-end differentiable approach that integrates two main components: a forward response learner and a weak-form PDE generator. The response learner utilizes learnable Gaussian kernels within a lightweight Multilayer Perceptron (MLP) to effectively capture system dynamics from limited observations. The PDE generator employs a symbolic network combined with an integral module to construct weak-form PDEs, enhancing robustness against noise and eliminating the need for explicit numerical differentiation. Furthermore, the framework incorporates a Differentiable Neural Architecture Search strategy to explore the functional space, allowing for the efficient discovery of open-form PDEs. The methodology also includes physical constraints such as Galilean Invariance and symmetry equivariance to ensure the physical consistency of the discovered equations. Experimental results demonstrate that Weak-PDE-Net can accurately recover governing equations even under challenging conditions of sparse and noisy data.
Methodology
Weak-PDE-Net consists of two interconnected modules: a forward response learner that uses learnable Gaussian kernels within a lightweight MLP to capture system dynamics, and a weak-form PDE generator that integrates a symbolic network with an integral module to construct PDEs. The training process involves three phases: Searching, Pruning, and Tuning, with a focus on differentiable symbolic network architecture search to adaptively explore the functional space.
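The weak-form trick that avoids numerical differentiation can be demonstrated on its own: move derivatives off the noisy field and onto a smooth test function via integration by parts, since ∫ u_xx φ dx = ∫ u φ_xx dx when φ and φ' vanish at the boundary. The field, noise level, and test function below are illustrative, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 401)
dx = x[1] - x[0]
u = np.sin(np.pi * x) + 0.05 * rng.normal(size=x.size)   # noisy field samples
u_xx_true = -np.pi**2 * np.sin(np.pi * x)

phi = (x * (1 - x)) ** 2                  # smooth bump: phi = phi' = 0 at the ends
phi_xx = np.gradient(np.gradient(phi, dx), dx)

target = np.sum(u_xx_true * phi) * dx     # the weak-form term, from the clean u_xx
weak = np.sum(u * phi_xx) * dx            # same term with derivatives moved onto phi
naive = np.gradient(np.gradient(u, dx), dx)  # pointwise u_xx from noisy data

print(abs(weak - target))                 # small: integration averages the noise out
print(np.abs(naive - u_xx_true).max())    # huge: differentiation amplifies the noise
```

The integral against a smooth test function stays accurate under noise, while the pointwise second derivative is destroyed by it; this is why the PDE generator builds its candidate terms in weak form.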
Results
The experiments conducted on various PDE benchmarks indicate that Weak-PDE-Net successfully recovers governing equations with high accuracy, even when faced with sparse and noisy data. The integration of adaptive Gaussian kernels and physical constraints significantly enhances the model's performance and robustness.
Implications
Weak-PDE-Net has the potential to advance the field of data-driven scientific computing by providing a robust method for discovering governing equations in various applications, including fluid dynamics, pharmacological processes, and climate modeling. Its ability to handle noisy data and discover open-form PDEs can lead to more accurate modeling of complex natural phenomena.