AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
66 papers today · 8-hour update cycle · 7 days of history
CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention
Large Language Models
Efficient ML
- CARE enhances the expressivity of attention models without increasing KV-cache costs.
- The method introduces activation-preserving factorization to align approximations with input activations.
- Adjusted-rank allocation optimizes the distribution of KV budgets across layers based on their spectral characteristics.
- CARE outperforms traditional SVD-based approaches, demonstrating significant improvements in perplexity and accuracy.
Summary
The paper introduces CARE, a novel conversion pipeline designed to transform pretrained attention modules, specifically grouped-query attention (GQA), into multi-head latent attention (MLA). This transformation enhances expressivity without increasing key-value (KV) cache costs, which is crucial for efficient inference in large language models (LLMs). Traditional methods for this conversion often rely on weight-only low-rank approximations and uniform rank allocation, leading to activation drift and reduced attention fidelity. CARE addresses these issues through three innovative steps: (1) activation-preserving factorization, which aligns the approximation with actual input activations; (2) adjusted-rank allocation, which allocates a fixed KV budget across layers based on their needs; and (3) KV-parity mapping, which ensures the converted K and V fit the MLA format while maintaining the KV-cache size. The proposed method significantly outperforms a uniform-rank SVD baseline, achieving substantial reductions in one-shot perplexity and improvements in mean accuracy, while also allowing for recovery of the original model's accuracy with minimal fine-tuning.
Methodology
CARE employs a three-step approach: first, it utilizes activation-preserving factorization by applying SVD to the covariance of input activations, ensuring that the decomposition reflects the actual input dynamics. Second, it implements adjusted-rank allocation to distribute a fixed KV budget across layers based on their spectral properties, allowing for more nuanced rank assignments. Finally, it employs KV-parity mapping to reparameterize the converted K and V matrices while maintaining the original KV-cache size.
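A minimal sketch of the activation-preserving factorization step, under the natural reading of the description as an activation-weighted (covariance-whitened) low-rank approximation; the function name, interface, and regularization constant are ours, not the paper's:

```python
import numpy as np

def activation_preserving_factorization(W, X, rank, eps=1e-6):
    """Rank-r factorization of a projection W (d_out x d_in) that minimizes
    error on actual inputs, i.e. ||X W.T - X (B A).T||_F, rather than
    ||W - B A||_F as in plain weight-only SVD."""
    # Second-moment (covariance) matrix of the input activations X (n x d_in)
    C = X.T @ X / X.shape[0]
    # Symmetric square root of C and its inverse, via eigendecomposition
    evals, evecs = np.linalg.eigh(C + eps * np.eye(C.shape[0]))
    C_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    C_half_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    # Truncated SVD in the activation-whitened space
    U, S, Vt = np.linalg.svd(W @ C_half, full_matrices=False)
    B = U[:, :rank] * S[:rank]          # d_out x rank
    A = Vt[:rank, :] @ C_half_inv       # rank x d_in, undo the whitening
    return B, A                         # W ~ B @ A in the activation metric
```

The adjusted-rank allocation step would then choose `rank` per layer from each layer's spectral energy rather than using one uniform value.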
Results
Compared to a uniform-rank SVD baseline under matched KV budgets, CARE reduced one-shot perplexity by up to 215 times and improved mean accuracy by up to 1.70 times. A brief post-SVD fine-tuning stage restored the original model's accuracy.
Implications
The advancements presented in CARE could lead to more efficient deployment of large language models in real-world applications, particularly where memory and computational resources are constrained. By improving the expressivity of attention mechanisms without increasing costs, CARE may facilitate the development of more capable and efficient AI systems.
Foundations of Schrödinger Bridges for Generative Modeling
Generative Models
Theory
Optimization
- Schrödinger bridges serve as a unifying framework for various generative modeling techniques.
- The paper develops both static and dynamic formulations of the Schrödinger bridge problem.
- A comprehensive toolkit for constructing Schrödinger bridges is provided, facilitating task-specific applications.
- The framework connects to modern generative modeling approaches, enhancing their theoretical foundation.
Summary
This paper presents a comprehensive guide to the mathematical foundations of Schrödinger bridges, a theoretical framework that unifies various modern generative modeling techniques, including diffusion models, score-based models, and flow matching. The author frames the generative modeling problem as finding an optimal stochastic bridge between marginal distribution constraints while minimizing entropy deviations from a predefined reference process. The guide explores both static and dynamic formulations of the Schrödinger bridge problem, leveraging concepts from optimal transport, stochastic control, and path-space optimization. It provides a toolkit for constructing Schrödinger bridges from first principles and demonstrates how these constructions can lead to generalized and task-specific computational methods. The paper also discusses applications of Schrödinger bridges in generative modeling, including data translation, modeling single-cell state dynamics, and sampling Boltzmann distributions, highlighting their potential to simplify and enhance the generative modeling landscape.
Methodology
The author employs mathematical concepts from optimal transport and stochastic control to derive the Schrödinger bridge framework. The guide includes detailed discussions on static and dynamic formulations, path measures, and stochastic differential equations, culminating in a toolkit for practical applications in generative modeling.
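As a concrete anchor, the static formulation the guide builds from can be stated in one line (standard form; notation is the usual one, not necessarily the paper's):

```latex
% Static Schrodinger bridge: among all couplings \pi of the prescribed
% marginals \mu_0 and \mu_1, choose the one closest in relative entropy
% to the reference measure R (e.g. induced by Brownian motion):
\pi^{\star} = \operatorname*{arg\,min}_{\pi \in \Pi(\mu_0,\,\mu_1)}
    D_{\mathrm{KL}}\!\left(\pi \,\|\, R\right)
```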
Results
The paper establishes a theoretical foundation for Schrödinger bridges, demonstrating their applicability in generative modeling. It provides insights into constructing these bridges and illustrates their effectiveness in various applications, thereby enhancing the understanding and usability of generative models.
Implications
The findings suggest that Schrödinger bridges can simplify the generative modeling process, offering a principled approach to tackle complex scientific problems. This framework could lead to improved algorithms for generative tasks across diverse domains, including language, video, and scientific data analysis.
Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models
Multimodal
Optimization
Generative Models
- DPO cannot improve generation quality in Janus-Pro's VQ-based architecture.
- Magnitude imbalance between understanding and generation gradients is the primary interference mechanism.
- Dynamic gradient reweighting can preserve understanding gains in multi-task DPO.
- The findings are consistent across different model scales (1B and 7B parameters).
Summary
This paper investigates the interaction between understanding and generation capabilities in unified multimodal models, specifically focusing on the application of Direct Preference Optimization (DPO) to the Janus-Pro architecture. The authors conduct a systematic study to determine whether DPO can align these two capabilities simultaneously. The findings reveal that generation quality resists DPO alignment across various training strategies and post-hoc methods, with no improvements observed in generation CLIPScore at the 7B parameter scale and degradation at the 1B scale. The study identifies a significant interference mechanism due to the magnitude imbalance between understanding and generation gradients, with generation gradients being approximately 11-14 times larger. This imbalance is attributed to the structural design of the model, particularly the discrete VQ tokenization, which creates a bottleneck. The authors provide practical guidance for practitioners, suggesting that while understanding-only DPO is effective, improvements in generation quality may require on-policy methods or continuous-representation architectures.
Methodology
The authors applied DPO to the Janus-Pro architecture at two scales (1B and 7B parameters) under seven training strategies and two post-hoc methods. They conducted gradient analysis to assess the interaction between understanding and generation tasks, focusing on the magnitude of gradients and their orthogonality.
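A sketch of the kind of gradient diagnostic the authors describe, measuring the direction and magnitude relationship between the two task gradients; the function and its interface are illustrative:

```python
import torch

def gradient_interference_stats(loss_und, loss_gen, params):
    """Diagnose interference between the understanding and generation DPO
    losses: cosine similarity of the two gradients (near-orthogonality) and
    their norm ratio (roughly 11-14x for generation, per the paper)."""
    g_und = torch.cat([g.flatten() for g in
                       torch.autograd.grad(loss_und, params, retain_graph=True)])
    g_gen = torch.cat([g.flatten() for g in
                       torch.autograd.grad(loss_gen, params, retain_graph=True)])
    cos = torch.nn.functional.cosine_similarity(g_und, g_gen, dim=0)
    ratio = (g_gen.norm() / (g_und.norm() + 1e-12)).item()
    return cos.item(), ratio

# A dynamic reweighting scheme can then scale the generation loss by
# 1 / ratio before summing the two losses, equalizing gradient magnitudes.
```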
Results
The study found that no method improved generation quality across all tested conditions, with generation CLIPScore remaining unchanged at 7B and degrading at 1B. Gradient analysis revealed near-orthogonal understanding and generation gradients, with a significant magnitude imbalance that negatively impacted the alignment of both tasks.
Implications
The findings suggest that optimizing for both understanding and generation in unified multimodal models may lead to conflicts that degrade performance. Practitioners are advised to focus on understanding-only DPO for reliable outcomes, while improvements in generation may necessitate alternative approaches.
Pathology-Aware Multi-View Contrastive Learning for Patient-Independent ECG Reconstruction
Time Series
Multimodal
- Introduction of a novel framework for ECG reconstruction that incorporates pathology-aware embeddings.
- Achieved a 76% reduction in RMSE compared to state-of-the-art models on the PTB-XL dataset.
- Demonstrated robust cross-dataset generalization, enhancing the reliability of ECG reconstruction across different populations.
- Addresses the limitations of standard deep learning methods by focusing on pathological rather than anatomical variations.
Summary
This paper addresses the challenge of reconstructing a 12-lead electrocardiogram (ECG) from a reduced lead set, which is complicated by anatomical variability and the loss of vital morphology in precordial leads. The authors propose a novel framework called Pathology-Aware Multi-View Contrastive Learning that regularizes the latent space through a pathological manifold. This architecture integrates high-fidelity time-domain waveforms with pathology-aware embeddings learned via supervised contrastive alignment. By maximizing mutual information between latent representations and clinical labels, the framework effectively filters out anatomical 'nuisance' variables. The proposed method was evaluated on the PTB-XL dataset, achieving a significant reduction in root mean square error (RMSE) of approximately 76% compared to existing state-of-the-art models in a patient-independent setting. Additionally, cross-dataset evaluation on the PTB Diagnostic Database demonstrated superior generalization, highlighting the framework's potential to bridge the gap between hardware portability and diagnostic-grade ECG reconstruction.
Methodology
The proposed methodology consists of three stages: signal preprocessing, dual-representation construction, and stacked latent decoding. It reformulates ECG lead reconstruction as a multi-view integration task, combining morphological fidelity with a pathology-aware latent space. The framework utilizes supervised contrastive learning to maximize mutual information between latent representations and clinical labels, effectively partitioning the latent space by clinical condition.
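The supervised contrastive alignment can be grounded with the standard supervised contrastive loss (Khosla et al.); this is the generic form, not necessarily the paper's exact variant:

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z, labels, temperature=0.07):
    """Pull embeddings with the same pathology label together and push
    others apart, partitioning the latent space by clinical condition."""
    z = F.normalize(z, dim=1)                      # (n, d) unit embeddings
    sim = z @ z.T / temperature                    # pairwise similarities
    n = z.size(0)
    mask_self = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~mask_self
    # log-softmax over all non-self pairs
    sim = sim.masked_fill(mask_self, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # average log-probability of positives per anchor that has any
    pos_counts = pos.sum(1).clamp(min=1)
    loss = -(log_prob * pos).sum(1) / pos_counts
    return loss[pos.sum(1) > 0].mean()
```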
Results
The proposed framework achieved a 76% reduction in RMSE on the PTB-XL dataset compared to existing benchmarks, showcasing its effectiveness in patient-independent settings. The method also demonstrated superior generalization capabilities in cross-dataset evaluations, indicating its robustness and applicability in diverse clinical scenarios.
Implications
The findings suggest that the proposed framework could significantly improve the accuracy and reliability of ECG reconstructions in portable monitoring devices, making it suitable for continuous monitoring in ambulatory and home settings. This advancement could enhance patient care by providing high-fidelity ECG data without the need for extensive electrode placement.
Epistemic Generative Adversarial Networks
Generative Models
Theory
Interpretability
- Introduction of Epistemic Generative Adversarial Networks (EGANs) to enhance output diversity in GANs.
- Utilization of Dempster-Shafer theory for a novel GAN loss function that incorporates uncertainty quantification.
- Architectural modifications to the generator for pixel-wise mass function prediction, improving sample diversity.
- Experimental results show significant improvements in generation variability and uncertainty modeling.
Summary
This paper addresses the issue of output diversity in Generative Adversarial Networks (GANs), which often generate similar samples due to mode collapse. The authors propose a novel approach called Epistemic Generative Adversarial Networks (EGANs), which utilizes Dempster-Shafer theory of evidence to enhance the GAN loss function. This approach allows both the generator and discriminator to predict belief functions, enabling the quantification of uncertainty in generated outputs. The generator is architecturally modified to estimate a mass function for each image pixel, which aids in producing more diverse and representative samples. Experimental results demonstrate that EGANs not only improve generation variability but also provide a principled framework for modeling and interpreting uncertainty in generative processes, thereby addressing a critical challenge in the deployment of GANs in applications requiring high output diversity.
Methodology
The authors modify the traditional GAN framework by integrating Dempster-Shafer theory, allowing the discriminator to predict belief functions instead of probabilities. The generator is enhanced to estimate region-wise uncertainty through belief function predictions. A generalized GAN loss formulation is developed to operate within this belief function framework.
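For readers unfamiliar with Dempster-Shafer mass functions, a minimal sketch of how three-way discriminator outputs can be read as belief, plausibility, and ignorance; the parameterization is assumed, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def masses_to_belief(logits):
    """Map 3-way logits to a Dempster-Shafer mass function over {real},
    {fake}, and the ignorance set {real, fake}."""
    m = F.softmax(logits, dim=-1)            # masses are non-negative, sum to 1
    m_real, m_fake, m_theta = m.unbind(-1)
    bel_real = m_real                        # belief = mass committed to {real}
    pl_real = m_real + m_theta               # plausibility adds the ignorance mass
    return bel_real, pl_real, m_theta        # m_theta quantifies uncertainty
```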
Results
The experimental findings indicate that EGANs significantly enhance the diversity of generated samples compared to traditional GANs. The approach also successfully quantifies uncertainty, providing a more robust and interpretable generative model that can be applied in various domains, particularly where output diversity is crucial.
Implications
The proposed EGAN framework has potential applications in fields requiring diverse outputs, such as medical imaging and creative content generation. By addressing mode collapse and improving uncertainty quantification, EGANs can enhance the reliability and trustworthiness of generative models in practical deployments.
Gradient-Informed Temporal Sampling Improves Rollout Accuracy in PDE Surrogate Training
Optimization
Time Series
Theory
- Introduces Gradient-Informed Temporal Sampling (GITS) for improved data selection in PDE surrogate training.
- GITS optimizes local gradient information and temporal coverage to enhance model performance.
- Demonstrates lower rollout error across multiple PDE systems and neural architectures compared to baseline methods.
- Ablation studies confirm the necessity of both optimization objectives in GITS.
Summary
This paper addresses the challenge of effectively sampling training data for neural simulators used in Partial Differential Equation (PDE) surrogate training. Traditional methods often rely on uniformly sampled data, which may not yield the most informative training pairs. The authors propose a novel sampling method called Gradient-Informed Temporal Sampling (GITS), which optimizes both local gradient information and temporal coverage to enhance the rollout accuracy of neural simulators. GITS balances the need for model-specific data selection while ensuring diverse temporal representation, overcoming the limitations of existing sampling techniques. Through experiments on various PDE systems, the authors demonstrate that GITS significantly reduces rollout error compared to several baseline methods. The paper also includes ablation studies that highlight the importance of the dual objectives in GITS and provides insights into the conditions under which GITS excels or fails.
Methodology
The authors developed GITS, which combines a pilot-model short-horizon gradient-norm score with a set-level temporal coverage objective. This approach allows for the selection of training data that is both informative and diverse, addressing the shortcomings of existing sampling methods that either focus too narrowly on high-information-density regions or lack model specificity.
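A greedy sketch of selection that trades off a gradient-norm score against temporal coverage, in the spirit of GITS's dual objective; the additive weighting `lam` and the greedy scheme are assumptions:

```python
import numpy as np

def gits_select(grad_scores, times, k, lam=1.0):
    """Greedily pick k training timestamps, trading off a pilot-model
    gradient-norm score (informativeness) against temporal coverage
    (distance to already-selected times)."""
    times = np.asarray(times, dtype=float)
    selected = []
    for _ in range(k):
        best_i, best_val = None, -np.inf
        for i in range(len(times)):
            if i in selected:
                continue
            # coverage reward: distance to the nearest already-selected time
            cov = min((abs(times[i] - times[j]) for j in selected),
                      default=times.max() - times.min())
            val = grad_scores[i] + lam * cov
            if val > best_val:
                best_i, best_val = i, val
        selected.append(best_i)
    return selected
```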
Results
GITS was shown to achieve lower rollout errors than several temporal selection baselines across different PDE systems and neural simulator architectures. The ablation studies indicated that both components of GITS are essential for its performance, and the analysis provided insights into the successful sampling patterns and the limitations of GITS.
Implications
The findings suggest that GITS can significantly enhance the training efficiency and accuracy of neural simulators in PDE applications. This method could be applied to other domains where surrogate modeling and data sampling are critical, potentially leading to advancements in simulation-based modeling and machine learning applications.
BoundAD: Boundary-Aware Negative Generation for Time Series Anomaly Detection
Time Series
Reinforcement Learning
Optimization
- Introduces a reconstruction-driven boundary negative generation framework for TSAD.
- Utilizes reinforcement learning to adaptively control the generation of hard negatives.
- Improves anomaly representation learning by focusing on boundary-aware negative samples.
- Achieves competitive detection performance on benchmark datasets.
Summary
The paper presents BoundAD, a novel framework for time series anomaly detection (TSAD) that focuses on improving the quality of negative sample generation through a boundary-aware approach. Traditional contrastive learning methods in TSAD often rely on random perturbations or pseudo-anomaly injections, which can fail to maintain temporal semantic consistency and effectively supervise decision boundaries. BoundAD addresses these limitations by utilizing a reconstruction-driven process to generate hard negatives directly from normal samples. The framework employs a reconstruction network to capture normal temporal patterns and incorporates reinforcement learning to adaptively adjust the optimization process based on the current reconstruction state. This allows for the generation of boundary-shifted samples that are close to the normal data manifold, enhancing the contrastive representation learning. The experimental results demonstrate that BoundAD significantly improves anomaly representation learning and achieves competitive performance on benchmark datasets, showcasing its effectiveness in generating challenging negative samples without predefined anomaly patterns.
Methodology
The methodology involves a reconstruction network to learn normal temporal patterns and a reinforcement learning strategy to adjust the optimization process dynamically. This combination allows for the generation of hard negatives that are strategically located near the boundary of normal data, facilitating better contrastive representation learning.
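One plausible reading of reconstruction-driven negative generation, with the RL controller abstracted into a scalar step size `alpha`; this is a sketch, not the paper's algorithm:

```python
import torch

def boundary_negative(x, recon_net, alpha):
    """Perturb a normal window x along its reconstruction residual, so the
    negative lands just outside the normal data manifold; alpha stands in
    for the step size the paper's RL policy would choose adaptively."""
    with torch.no_grad():
        x_hat = recon_net(x)                 # captures normal temporal patterns
    residual = x - x_hat                     # direction the model fails to explain
    return x + alpha * residual / (residual.norm() + 1e-8)
```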
Results
The proposed BoundAD framework shows significant improvements in anomaly detection performance compared to existing methods, particularly in terms of reducing false positive rates and enhancing the discriminability of the learned representations. The experimental results validate the effectiveness of the boundary-aware negative generation approach.
Implications
The findings suggest that BoundAD can be applied in various real-world scenarios where time series anomaly detection is critical, such as industrial monitoring, healthcare, and cybersecurity. The framework's ability to generate informative negative samples without predefined anomalies could lead to more robust and adaptable TSAD systems.
Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL
Large Language Models
Theory
Interpretability
- Introduces a new framework (ProbPol) for conflict detection in policy languages using probabilistic ML signals.
- Defines three new conflict types specific to probabilistic predicates and organizes them in a decidability hierarchy.
- Proposes a method to eliminate co-firing of embedding signals using temperature-scaled softmax.
- Implements detection and prevention mechanisms in the Semantic Router DSL for practical application.
Summary
This paper addresses the challenges of conflict detection in policy languages that utilize probabilistic machine learning (ML) signals, which differ from traditional crisp Boolean predicates. The authors identify a gap in existing conflict detection tools, which are designed for deterministic rules, and propose a new framework called ProbPol that accommodates probabilistic predicates. They categorize conflicts into three types—probable conflict, soft shadowing, and calibration conflict—arranging them in a decidability hierarchy. The paper emphasizes that embedding conflicts can be resolved by using a temperature-scaled softmax function to partition the embedding space into Voronoi regions, thus preventing co-firing of signals without requiring model retraining. The authors implement these concepts in the Semantic Router DSL, a production language for routing in large language model (LLM) inference, and demonstrate how similar conflict types and resolutions can be applied to semantic role-based access control (RBAC) and API gateway policies. The findings highlight the need for formal semantics in ML routing systems and provide practical tools for conflict detection and resolution.
Methodology
The authors develop a theoretical framework that categorizes conflicts in policy languages based on probabilistic predicates. They implement a series of compiler-level checks in the Semantic Router DSL, including category overlap detection and diagnostics for conflict resolution. The methodology includes replacing independent thresholding with a temperature-scaled softmax to manage embedding conflicts.
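The co-firing elimination is easy to make concrete: a temperature-scaled softmax over category similarities yields exactly one winning signal per query, partitioning the embedding space into Voronoi-like cells. A minimal sketch (names are ours, not the Semantic Router DSL's API):

```python
import numpy as np

def route(query_emb, category_embs, temperature=0.1):
    """Temperature-scaled softmax routing: exactly one category wins per
    query, so two embedding signals can never co-fire, unlike independent
    per-category thresholds."""
    sims = category_embs @ query_emb         # cosine sims if inputs are unit-norm
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(probs.argmax()), probs        # argmax induces the Voronoi partition
```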
Results
The implementation of the ProbPol framework in the Semantic Router DSL successfully detects and prevents conflicts arising from probabilistic ML signals. The proposed method for managing embedding signals effectively partitions the input space, eliminating the risk of co-firing without necessitating model retraining. The framework also extends its applicability to other domains such as semantic RBAC and API gateways.
Implications
The findings suggest that formal semantics are crucial for routing systems that rely on probabilistic ML signals. The proposed framework and tools can enhance the reliability of routing and access control systems, potentially leading to improved performance in applications involving large language models and other ML-driven systems.
Can Blindfolded LLMs Still Trade? An Anonymization-First Framework for Portfolio Optimization
Large Language Models
Reinforcement Learning
Graph Learning
- Introduces an anonymization protocol to prevent memorization biases in LLM trading agents.
- Develops a multi-agent system where specialized LLMs assess stocks independently and provide reasoning.
- Proposes a Semantic Graph Encoder to learn inter-stock relationships under anonymization.
- Demonstrates rigorous validation of LLM signals to ensure predictive power and mitigate biases.
Summary
This paper presents BlindTrade, an anonymization-first framework designed to enhance the reliability of large language model (LLM) trading agents by mitigating memorization and survivorship biases. The authors argue that for LLMs to be trustworthy in financial trading, they must demonstrate genuine understanding of market dynamics rather than relying on memorized associations with specific tickers. BlindTrade anonymizes stock identifiers and employs four specialized LLM agents that evaluate stocks from different perspectives—Momentum, News-Event, Mean-Reversion, and Risk-Regime—each providing scores and reasoning for their assessments. The framework utilizes a Semantic Graph Encoder (SemGAT) to construct a graph based on reasoning embeddings, facilitating inter-stock relationship learning. A reinforcement learning (RL) policy, specifically PPO-DSR, is then applied to determine portfolio weights. The authors validate the predictive power of the LLM outputs through rigorous signal validation techniques, including IC analysis and negative control experiments. The results indicate a Sharpe ratio of 1.40 ± 0.22 over a testing period, with performance varying based on market conditions, excelling in volatile environments but showing reduced alpha in trending bull markets.
Methodology
The methodology involves six stages: data anonymization of S&P 500 constituents, feature generation by four specialized LLM agents, validation of predictive power through IC analysis, construction of a semantic graph using reasoning embeddings, application of a PPO-based RL policy for portfolio optimization, and backtesting with transaction costs.
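A minimal sketch of the anonymization stage, assuming a salted-hash mapping from tickers to opaque IDs; the paper's exact protocol may differ:

```python
import hashlib

def anonymize_tickers(tickers, salt="run-42"):
    """Replace each ticker with an opaque ID so the LLM agents cannot fall
    back on memorized ticker-level associations. The salt (hypothetical)
    can be rotated per run to defeat reverse lookup."""
    mapping = {}
    for t in tickers:
        digest = hashlib.sha256((salt + t).encode()).hexdigest()[:8]
        mapping[t] = f"ASSET_{digest}"
    return mapping

# e.g. {"AAPL": "ASSET_3fa1b2c9", ...}; the inverse map stays outside the prompt
```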
Results
The BlindTrade framework achieved a Sharpe ratio of 1.40 ± 0.22 across 20 seeds during the evaluation period (2025 YTD). The results validated the legitimacy of the signals through negative control experiments, confirming that the LLM outputs reflect genuine market patterns rather than artifacts of memorization.
Implications
The findings suggest that anonymization can enhance the reliability of LLMs in financial trading, potentially leading to more trustworthy and effective trading strategies. This approach may also inform future research on the application of AI in finance, emphasizing the need for rigorous validation methods.
Position: Spectral GNNs Are Neither Spectral Nor Superior for Node Classification
Graph Learning
Theory
- Spectral GNNs do not capture the graph spectrum meaningfully.
- Commonly used graph Fourier bases are not true Fourier bases.
- Polynomial approximations in Spectral GNNs are theoretically flawed.
- The effectiveness of Spectral GNNs is primarily due to message-passing dynamics.
Summary
This position paper critically examines the theoretical foundations of Spectral Graph Neural Networks (Spectral GNNs) in the context of node classification. The authors argue that Spectral GNNs do not effectively capture the graph spectrum and do not provide superior performance compared to simpler Message Passing Neural Networks (MPNNs). They identify two main theoretical flaws: first, the commonly used 'graph Fourier bases' do not qualify as true Fourier bases for graph signals; second, the polynomial approximations used in Spectral GNNs fail to justify their effectiveness, as they can interpolate spectral responses exactly rather than approximate them. The paper demonstrates that the low- and high-pass filtering behaviors attributed to Spectral GNNs arise from message-passing dynamics rather than spectral formulations. The authors analyze two specific models, MagNet and HoloNet, revealing that their empirical success is due to implementation issues that reduce them to MPNNs, rather than any genuine spectral mechanism. Overall, the paper argues that the competitive results of Spectral GNNs can be better explained by their equivalence to MPNNs, challenging the notion that they are superior for node classification tasks.
Methodology
The authors provide theoretical proofs to demonstrate the inadequacies of graph Fourier bases and polynomial approximations. They analyze the message-passing dynamics of Spectral GNNs and compare the performance of specific models (MagNet and HoloNet) under consistent implementation with their claimed spectral algorithms.
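The core observation can be stated compactly: a polynomial filter never touches the spectrum explicitly, since applying it reduces to repeated neighbor aggregation:

```latex
% A degree-K polynomial "spectral" filter applied to node features X is
% K rounds of one-hop message passing with the (normalized) Laplacian L --
% no eigendecomposition ever occurs:
g_\theta(L)\,X = \sum_{k=0}^{K} c_k\, L^{k} X,
\qquad L^{k} X = L\!\left(L^{k-1} X\right)
```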
Results
The paper concludes that Spectral GNNs do not offer meaningful advantages for node classification and that their performance can be attributed to their equivalence to simpler MPNNs. When implemented correctly, the performance of models like MagNet and HoloNet diminishes, supporting the authors' claims.
Implications
The findings suggest a need for reevaluation of the theoretical foundations of Spectral GNNs and their applications in node classification. Researchers may need to focus on the underlying message-passing mechanisms rather than spectral interpretations to improve model performance.
Path-Constrained Mixture-of-Experts
NLP
Large Language Models
Efficient ML
- PathMoE constrains the expert path space by sharing router parameters across consecutive layers.
- The method shows consistent performance improvements on language modeling tasks compared to independent routing.
- PathMoE eliminates the need for auxiliary load balancing losses while maintaining balanced expert utilization.
- Tokens following the same expert path exhibit clustering by linguistic function, enhancing interpretability.
Summary
The paper introduces PathMoE, a novel approach to Mixture-of-Experts (MoE) architectures that addresses the inefficiencies of conventional independent routing. Traditional MoE models allow for a vast number of expert paths due to independent expert selection at each layer, leading to statistical inefficiency as many paths remain unexplored during training. PathMoE mitigates this by sharing router parameters across consecutive layers, which encourages coherence in expert selection while maintaining flexibility. The authors demonstrate that PathMoE improves performance on language modeling tasks, yielding lower perplexity and better accuracy on downstream tasks without the need for auxiliary load balancing losses. The analysis reveals that tokens following the same path cluster by linguistic function, indicating a more interpretable structure in expert paths. Overall, PathMoE provides a new perspective on MoE architectures, emphasizing the importance of expert path coherence for enhanced model performance.
Methodology
The authors propose PathMoE, which shares router parameters across blocks of consecutive layers in MoE architectures. This approach allows for a more coherent routing decision process while still adapting to the evolving representations of tokens. The methodology includes experiments on models with 0.9B and 16B parameters to validate the effectiveness of PathMoE against conventional independent routing.
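A sketch of the core architectural idea, a single router reused across a block of consecutive MoE layers; dimensions, expert types, and top-k are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class SharedRouterBlock(nn.Module):
    """One router shared by a block of consecutive MoE layers, so expert
    choices stay coherent along the path while each layer keeps its own
    experts."""
    def __init__(self, d_model, n_experts, n_layers, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # shared across the block
        self.layers = nn.ModuleList(
            nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
            for _ in range(n_layers)
        )
        self.top_k = top_k

    def forward(self, x):
        for experts in self.layers:
            logits = self.router(x)                   # same parameters every layer
            weights, idx = torch.topk(logits.softmax(-1), self.top_k, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(experts):
                    mask = (idx[..., slot] == e).unsqueeze(-1)
                    out = out + mask * weights[..., slot:slot + 1] * expert(x)
            x = x + out                               # residual MoE update
        return x
```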
Results
Experiments show that PathMoE lowers perplexity and improves accuracy on downstream tasks. It also achieves 31% higher routing consistency and 11% lower routing entropy than conventional routing, and is 22.5 times more robust to routing perturbations.
Implications
The findings suggest that constraining the expert path space can enhance the efficiency and interpretability of MoE architectures, making them more suitable for large-scale language modeling and potentially other NLP tasks. This approach could lead to more effective training strategies in future large language models.
Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails
NLP
Large Language Models
Theory
- Detection of sensitive concepts is trivial, but routing through behavioral policies is complex and model-specific.
- Surgical ablation can effectively remove censorship in many models, leading to accurate outputs.
- Refusal-based evaluations fail to capture the nuanced steering mechanisms that influence model behavior.
- Different labs organize political and safety representations in distinct geometries, affecting model outputs.
Summary
This paper investigates the limitations of current alignment evaluation methods in language models, particularly focusing on political censorship as a case study. The author argues that while detection of sensitive concepts is straightforward, the subsequent routing of these concepts through behavioral policies is complex and varies significantly across different models and labs. The study reveals that perfect accuracy in detecting political content does not correlate with effective alignment, as similar detection rates can be achieved for unrelated categories. Through surgical ablation experiments on nine open-weight models from five labs, the author finds that removing censorship mechanisms often leads to accurate factual outputs, although some models may confabulate due to architectural entanglements. The research highlights that refusal-based evaluations are inadequate, as models may exhibit high steering towards approved narratives without outright refusals. The findings support a three-stage framework for understanding alignment: detection, routing, and output generation, emphasizing the need for more nuanced evaluation methods that account for the learned routing mechanisms governing model behavior.
Methodology
The study employs a combination of probing, surgical ablation, and behavioral tests across nine open-weight models from five different labs. It evaluates the models' responses to politically sensitive prompts and analyzes the effects of removing censorship mechanisms.
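The surgical ablation the study relies on is, at its core, a projection: remove the component of the hidden state along an identified sensitivity direction. A minimal sketch:

```python
import torch

def ablate_direction(h, d):
    """Project hidden states h onto the orthogonal complement of a learned
    'political-sensitivity' direction d, leaving everything orthogonal to
    that direction untouched."""
    d = d / d.norm()
    return h - (h @ d).unsqueeze(-1) * d
```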
Results
The research finds that probe accuracy is non-diagnostic for alignment, as models can achieve high accuracy on unrelated categories. Surgical ablation experiments show that removing political-sensitivity directions often results in accurate factual outputs, although one model confabulates due to architectural issues. The study also reveals that refusal rates can drop significantly while narrative steering increases, indicating a shift in censorship mechanisms.
Implications
The findings suggest that current alignment evaluation methods may overlook critical aspects of model behavior, particularly in politically sensitive contexts. This has implications for developing more effective safety evaluations and understanding how models manage sensitive information.
InfoMamba: An Attention-Free Hybrid Mamba-Transformer Model
Efficient ML
Computer Vision
NLP
- Introduces InfoMamba, an attention-free hybrid architecture combining SSMs and a global filtering layer.
- Develops a consistency boundary analysis to identify regimes for effective modeling of local and global interactions.
- Implements information-maximizing fusion (IMF) to enhance the integration of global context with local details.
- Achieves strong performance improvements over state-of-the-art Transformer and SSM baselines.
Summary
The paper presents InfoMamba, a novel hybrid architecture that integrates the strengths of Mamba-style selective state-space models (SSMs) and Transformers while avoiding the computational overhead associated with self-attention mechanisms. The authors identify the limitations of existing models in balancing fine-grained local modeling with long-range dependencies, particularly the quadratic complexity of Transformers and the insufficient global interactions in SSMs. Through a consistency boundary analysis, they reveal the conditions under which diagonal short-memory SSMs can approximate causal attention and highlight the structural gaps that remain. InfoMamba replaces traditional token-level self-attention with a concept-bottleneck linear filtering layer, which acts as a minimal-bandwidth global interface. This is coupled with a selective recurrent stream via information-maximizing fusion (IMF), which dynamically integrates global context into SSM dynamics while enforcing complementary information usage through a mutual-information-inspired objective. The architecture achieves near-linear scaling and demonstrates superior performance across various tasks, including classification, dense prediction, and non-vision benchmarks, thereby providing a favorable accuracy-efficiency trade-off.
Methodology
The authors conducted a consistency boundary analysis to understand the limitations of existing models in capturing global interactions. They designed InfoMamba by integrating a concept-bottleneck linear filtering layer with a selective recurrent stream, utilizing information-maximizing fusion (IMF) to couple the two pathways. The architecture is guided by an information-theoretic objective to ensure effective use of global context and local detail.
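A heavily simplified sketch of the fusion idea, with the concept bottleneck as a small linear interface and a learned gate standing in for the paper's mutual-information objective; treat every detail here as an assumption:

```python
import torch
import torch.nn as nn

class IMFusion(nn.Module):
    """Inject a low-bandwidth global summary into the local SSM stream
    through a learned gate; the MI-inspired training objective from the
    paper is not reproduced here."""
    def __init__(self, d_model, n_concepts):
        super().__init__()
        self.to_concepts = nn.Linear(d_model, n_concepts)    # global bottleneck
        self.from_concepts = nn.Linear(n_concepts, d_model)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, local, x):
        # global interface: pool tokens (B, T, D) through a small concept space
        g = self.from_concepts(self.to_concepts(x.mean(dim=1, keepdim=True)))
        g = g.expand_as(local)
        gate = torch.sigmoid(self.gate(torch.cat([local, g], dim=-1)))
        return local + gate * g              # selectively mix global context in
```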
Results
InfoMamba consistently outperformed state-of-the-art Transformer and SSM baselines across various tasks, achieving significant gains in accuracy while maintaining computational efficiency. The architecture demonstrated near-linear scaling, making it suitable for applications requiring long-context modeling.
Implications
The development of InfoMamba has potential applications in fields requiring efficient sequence modeling, such as natural language processing, computer vision, and time-series forecasting. Its ability to balance local and global interactions without incurring high computational costs could lead to more scalable and effective models in these domains.
Transformers Learn Robust In-Context Regression under Distributional Uncertainty
Theory
- Transformers can perform in-context learning for linear regression under non-Gaussian and non-i.i.d. conditions.
- The study systematically evaluates the impact of distributional shifts on Transformer performance.
- Transformers consistently match or outperform classical regression methods like OLS and Ridge regression.
- The findings suggest that Transformers exhibit robustness and adaptability beyond traditional estimators.
Summary
This paper investigates the ability of Transformers to perform in-context learning (ICL) for linear regression tasks under realistic conditions that deviate from traditional assumptions such as i.i.d. data and Gaussian noise. Previous studies have shown that Transformers can effectively mimic ordinary least squares (OLS) regression in controlled environments, but real-world data often presents challenges such as non-Gaussian distributions, heavy-tailed noise, and dependencies among inputs. The authors systematically explore how Transformers adapt to these distributional uncertainties by varying the distributions of features, regression coefficients, and noise. They compare the performance of Transformers against classical regression methods like OLS and Ridge regression across various scenarios. The findings reveal that Transformers not only match but often outperform classical estimators, demonstrating robustness and adaptability in the face of distributional shifts. This research contributes to a deeper understanding of the conditions under which Transformers can effectively learn in-context, highlighting their potential advantages over traditional statistical methods in complex real-world applications.
Methodology
The authors conducted a comprehensive empirical investigation by varying the distributions of features, regression coefficients, and noise in linear regression tasks. They compared the performance of Transformers against classical baselines, isolating the effects of each distributional shift to assess the robustness and adaptability of the Transformers in various scenarios.
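A sketch of one in-context episode under the kinds of shifts the study varies (AR(1)-dependent inputs, heavy-tailed Student-t noise), together with a ridge baseline; all parameter choices are illustrative:

```python
import numpy as np

def icl_regression_task(n_ctx=32, d=8, df=3, rho=0.5,
                        rng=np.random.default_rng(0)):
    """One regression episode with non-i.i.d. inputs and non-Gaussian noise."""
    w = rng.normal(size=d)
    X = np.zeros((n_ctx, d))
    X[0] = rng.normal(size=d)
    for t in range(1, n_ctx):                      # AR(1)-correlated inputs
        X[t] = rho * X[t - 1] + np.sqrt(1 - rho**2) * rng.normal(size=d)
    y = X @ w + rng.standard_t(df, size=n_ctx)     # heavy-tailed noise
    return X, y, w

def ridge(X, y, lam=1.0):
    """One of the classical baselines the Transformer is compared against."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```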
Results
The results indicate that Transformers maintain strong performance across a range of distributional shifts, often outperforming classical regression methods. This demonstrates their ability to adapt to complex data structures and noise distributions, suggesting that they can effectively learn in-context even when traditional assumptions do not hold.
Implications
The findings have significant implications for the deployment of Transformers in real-world applications where data often deviates from idealized conditions. This research suggests that Transformers could be a viable alternative to classical statistical methods in scenarios characterized by distributional uncertainty, potentially enhancing predictive performance in various fields such as finance, healthcare, and engineering.
DriftGuard: Mitigating Asynchronous Data Drift in Federated Learning
Federated Learning
- DriftGuard effectively mitigates asynchronous data drift in Federated Learning.
- The framework utilizes a Mixture-of-Experts architecture to separate global and local parameters.
- It implements a two-level retraining mechanism that balances accuracy and computational overhead.
- DriftGuard reduces total retraining costs by up to 83% while maintaining high accuracy.
Summary
The paper introduces DriftGuard, a novel framework designed to address the challenges of asynchronous data drift in Federated Learning (FL). In real-world FL scenarios, data distributions on devices evolve over time, leading to asynchronous data drift where devices experience shifts at different rates and towards different distributions. Traditional FL methods struggle with this issue due to high computational costs associated with frequent retraining and performance degradation from infrequent retraining. DriftGuard employs a Mixture-of-Experts (MoE) inspired architecture that separates shared parameters, which capture globally transferable knowledge, from local parameters that adapt to specific group distributions. This architecture enables two retraining strategies: global retraining for system-wide drift and group retraining for clusters of devices with similar data distributions, identified through MoE gating patterns. The framework allows for efficient adaptation to data drift while minimizing retraining costs. Experimental results demonstrate that DriftGuard achieves state-of-the-art accuracy while reducing total retraining costs by up to 83%, significantly improving the accuracy per unit retraining cost compared to existing methods.
Methodology
DriftGuard employs a Mixture-of-Experts inspired architecture that separates shared and local parameters. It utilizes a two-level retraining mechanism, consisting of global retraining when system-wide drift is detected and group retraining for clusters of devices with similar data distributions, identified through MoE gating outputs. This approach minimizes the need for frequent retraining while maintaining model performance.
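A sketch of how gating patterns can drive the two-level retraining decision, using total-variation distance between gating histograms; the thresholds and the exact drift statistic are assumptions:

```python
import numpy as np

def retraining_plan(ref_gating, cur_gating,
                    global_frac=0.3, device_threshold=0.15):
    """Compare each device's current MoE gating histogram (rows sum to 1)
    to a reference snapshot; widespread drift triggers a global retrain,
    localized drift retrains only the affected device group."""
    drift = np.abs(cur_gating - ref_gating).sum(axis=1) / 2   # TV distance
    drifted = np.where(drift > device_threshold)[0]
    if len(drifted) > global_frac * len(drift):
        return "global", drifted
    return "group", drifted          # retrain local experts for these devices
```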
Results
Experiments across multiple datasets and models show that DriftGuard matches or exceeds the accuracy of state-of-the-art methods while reducing total retraining costs by up to 83%. It achieves the highest accuracy per unit retraining cost, improving over the strongest baseline by up to 2.3×. In real-world IoT applications, it consistently delivers the highest accuracy while reducing retraining time by up to 20%.
Implications
DriftGuard's approach can significantly enhance the efficiency of Federated Learning systems, particularly in resource-constrained environments such as IoT devices. Its ability to adapt to evolving data distributions while minimizing computational costs has potential applications in various fields, including smart cities, healthcare, and any domain reliant on decentralized data generation.
Federated Distributional Reinforcement Learning with Distributional Critic Regularization
Reinforcement Learning
Federated Learning
Robotics
- Introduction of Federated Distributional Reinforcement Learning (FedDistRL) to address limitations of traditional FRL methods.
- Development of TR-FedDistRL, which uses a risk-aware Wasserstein barycenter for critic updates.
- Demonstration of empirical improvements in safety metrics and reduced mean-smearing.
- Theoretical stability results for the constrained critic update process.
Summary
This paper introduces Federated Distributional Reinforcement Learning (FedDistRL), a novel approach that addresses the limitations of traditional federated reinforcement learning (FRL) methods, which typically rely on parameter averaging that can obscure critical statistical properties in safety-sensitive applications. The authors propose a framework where clients maintain local policy networks while federating distributional critics that utilize quantile value functions. To enhance safety and mitigate mean-smearing effects during parameter aggregation, they introduce TR-FedDistRL, which employs a risk-aware Wasserstein barycenter constructed from recent critic outputs. This barycenter serves as a reference for constraining the updates of the federated critic, ensuring that important distributional characteristics are preserved. The paper provides theoretical stability results for the proposed method and demonstrates its effectiveness through experiments in various environments, showing improvements in safety metrics and reduced drift in critic and policy estimates compared to traditional mean-oriented approaches.
Methodology
The methodology involves federating distributional critics while keeping policy networks local. A risk-aware Wasserstein barycenter is constructed from a temporal buffer of recent critic outputs, which serves as a reference for constraining the updates of the federated critic. This approach prevents the averaging process from collapsing distinct return distributions and ensures that critical statistical properties are maintained.
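The Wasserstein barycenter is especially simple for quantile critics: for one-dimensional distributions, the W2 barycenter's quantile function is the weighted average of the input quantile functions. A minimal sketch, with the risk-aware weighting of tail quantiles omitted:

```python
import numpy as np

def quantile_barycenter(critic_quantiles, weights=None):
    """W2 barycenter of 1-D return distributions given as quantile values
    (n_clients x n_quantiles) -- the reference the constrained critic
    update is pulled toward."""
    q = np.asarray(critic_quantiles)
    w = (np.ones(q.shape[0]) / q.shape[0] if weights is None
         else np.asarray(weights))
    return (w[:, None] * q).sum(axis=0)   # barycenter quantile function
```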
Results
Experiments conducted in bandit, multi-agent gridworld, and continuous highway environments showed that TR-FedDistRL significantly reduced mean-smearing effects, improved safety proxies (such as catastrophe and accident rates), and exhibited lower critic and policy drift compared to mean-oriented and non-federated baselines.
Implications
The proposed FedDistRL framework has significant implications for safety-critical applications such as autonomous driving and personal robotics, where preserving the distributional characteristics of return estimates is crucial for safe decision-making. It opens avenues for further research into risk-aware federated learning strategies.
On Optimizing Multimodal Jailbreaks for Spoken Language Models
Multimodal
Audio & Speech
Optimization
- Introduction of JAMA, a joint multimodal optimization framework for jailbreaking SLMs.
- Demonstrated 1.5x to 10x improvement in jailbreak success rates over unimodal methods.
- Proposed SAMA as a faster alternative to JAMA with similar effectiveness.
- Highlighted the need for robust defenses against multimodal adversarial attacks.
Summary
This paper addresses the vulnerabilities of Spoken Language Models (SLMs) to adversarial attacks, specifically focusing on jailbreaking techniques that exploit both text and audio modalities. The authors introduce JAMA (Joint Audio-text Multimodal Attack), a novel framework that employs a joint optimization strategy combining Greedy Coordinate Gradient (GCG) for text and Projected Gradient Descent (PGD) for audio. This approach allows for simultaneous perturbation of both modalities, significantly enhancing the effectiveness of jailbreak attempts. The study evaluates JAMA across four state-of-the-art SLMs and various audio types, demonstrating that it outperforms unimodal attacks by a factor of 1.5 to 10 times. Additionally, the authors propose a sequential optimization method, SAMA (Sequential Audio-text Multimodal Attack), which approximates JAMA with a speedup of 4 to 6 times while maintaining comparable jailbreak rates. The findings emphasize the inadequacy of unimodal safety measures for robust SLMs and advocate for stronger defenses against multimodal attacks.
Methodology
The authors developed JAMA by combining GCG for text and PGD for audio to optimize adversarial inputs across both modalities simultaneously. The joint loss function was computed over a batch of malicious queries, allowing for coordinated perturbations that maximize the likelihood of harmful responses from the model.
Results
JAMA achieved a jailbreak success rate that was 1.5 to 10 times higher than existing unimodal attacks. The sequential method SAMA provided a 4 to 6 times speedup while maintaining similar effectiveness, indicating that multimodal optimization is crucial for assessing the security of SLMs.
Implications
The findings suggest that current safety measures for SLMs are insufficient when only unimodal evaluations are considered. This work calls for enhanced security protocols in multimodal systems to prevent adversarial exploitation, which could have significant implications for the deployment of SLMs in real-world applications.
Enhancing Multi-Corpus Training in SSL-Based Anti-Spoofing Models: Domain-Invariant Feature Extraction
Audio & Speech
- Multi-corpus training can lead to performance degradation in spoofing detection due to dataset-specific biases.
- The proposed IDFE framework effectively reduces corpus-specific information in embeddings, enhancing generalization.
- The framework achieves a 20% reduction in average EER across multiple datasets compared to baseline models.
- The study highlights the importance of addressing dataset biases for improving the reliability of spoofing detection systems.
Summary
This paper addresses the challenges of performance variability in speech spoofing detection systems when trained on multiple corpora. The authors hypothesize that dataset-specific biases hinder generalization, leading to inconsistent performance. To mitigate these issues, they propose the Invariant Domain Feature Extraction (IDFE) framework, which utilizes multi-task learning and a gradient reversal layer to minimize corpus-specific information in the learned embeddings. The framework aims to enhance the robustness of spoofing detection models by focusing on discriminative features relevant to spoofing while reducing the influence of dataset biases. The experiments conducted across four diverse datasets demonstrate that the IDFE framework achieves a 20% reduction in the average equal error rate (EER) compared to the baseline models, indicating significant improvements in detection performance and generalization capabilities.
Methodology
The authors employed a multi-task learning approach combined with a gradient reversal layer within the IDFE framework to suppress dataset-specific cues in the embedding space. They conducted experiments using four different datasets to analyze the impact of multi-corpus training and to assess the performance of their proposed method against baseline models.
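The gradient reversal layer at the heart of the framework is standard and short; a sketch of the usual implementation:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negated (scaled) gradient on the
    backward pass, so the encoder learns to hide corpus identity."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# embeddings -> spoof classifier (standard loss)
#            -> grad_reverse -> corpus classifier (adversarial loss)
```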
Results
The IDFE framework resulted in a 20% reduction in average equal error rate (EER) when evaluated across four varied datasets, demonstrating its effectiveness in improving the performance of SSL-based anti-spoofing models.
Implications
The findings suggest that addressing dataset biases is crucial for enhancing the robustness and reliability of spoofing detection systems. The IDFE framework can be applied to various detection architectures, potentially benefiting other areas of audio and speech processing.
Data-efficient pre-training by scaling synthetic megadocs
NLP
Large Language Models
Efficient ML
- Synthetic data augmentation can improve pre-training efficiency when real data is scarce.
- Megadocs, formed by stitching or stretching documents, enhance loss scaling and model performance.
- Data efficiency improves from 1.48× to 1.80× with optimal synthetic generation strategies.
- The benefits of synthetic data algorithms compound with existing data-efficient strategies like ensembling.
Summary
This paper explores the use of synthetic data augmentation to enhance data efficiency in pre-training language models, particularly when the availability of real data is limited. The authors demonstrate that mixing web data with synthetically generated rephrases can lower the i.i.d. validation loss, even when the synthetic data comes from a different distribution. They introduce the concept of 'megadocs', which are constructed by stitching synthetic rephrases from the same document or by inserting rationales to stretch a document. These megadocs significantly improve loss scaling and downstream benchmark performance compared to simple rephrasing. The study finds that data efficiency can be improved from 1.48× to 1.80× at 32 generations per document, with the benefits of megadocs becoming more pronounced as the amount of synthetic data increases. The results suggest that synthetic data algorithms can be designed to leverage increasing compute resources effectively, leading to better modeling of the underlying data distribution.
Methodology
The authors conducted experiments using a fixed dataset of 200M tokens and a synthetic data generator. They evaluated the impact of mixing real and synthetic data on i.i.d. validation loss, employing two methods to create megadocs: stitching rephrases from the same document and inserting rationales to extend documents. They measured loss changes as a function of the number of synthetic generations and tuned epoching and mixing strategies to optimize performance.
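The "stitching" construction is simple to state in code; a sketch, with the rephrase generator left abstract and the separator an assumption:

```python
def stitch_megadoc(rephrases, sep="\n\n"):
    """Concatenate multiple synthetic rephrases of the same source document
    into one long training document, so the model sees the same content
    expressed several ways within one context."""
    return sep.join(rephrases)

# usage: megadoc = stitch_megadoc(generate_rephrases(doc, n=32))
# where generate_rephrases is the (assumed) synthetic-data generator;
# the alternative 'stretching' construction instead inserts rationales
# between passages of the original document.
```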
Results
The study found that pre-training with synthetic rephrases improved i.i.d. validation loss, plateauing at 3.41 at 32 rephrases versus a 3.55 baseline. Megadocs did better still, reaching 1.80× data efficiency at 32 generations, and downstream benchmarks showed an average accuracy improvement of 5% to 9% over the real-data-only baseline.
Implications
The findings suggest that synthetic data can be a powerful tool for enhancing the efficiency of language model pre-training, particularly in scenarios where real data is limited. This approach could be applied in various NLP tasks, improving model performance and reducing the need for extensive real data collection.
Domain-informed explainable boosting machines for trustworthy lateral spread predictions
Interpretability
- Introduces a domain-informed framework to enhance the reliability of EBMs in predicting lateral spreading.
- Corrects non-physical relationships in EBMs by modifying shape functions based on domain knowledge.
- Demonstrates the application of the framework on the 2011 Christchurch earthquake dataset.
- Achieves more physically consistent predictions with an acceptable trade-off in accuracy.
Summary
This paper introduces a novel framework that enhances Explainable Boosting Machines (EBMs) by incorporating domain knowledge to improve the physical consistency of predictions related to lateral spreading during earthquakes. Traditional EBMs can learn non-physical relationships due to data imbalances, which can lead to unreliable predictions in natural hazard contexts. The proposed method modifies the learned shape functions of EBMs based on established domain knowledge, correcting these non-physical behaviors while preserving valid data-driven patterns. The authors apply their approach to a dataset from the 2011 Christchurch earthquake, demonstrating that their modified model yields more physically consistent predictions, albeit with a slight reduction in accuracy (4-5%). The study addresses critical questions regarding the systematic incorporation of domain knowledge into machine learning models and the trade-offs between predictive accuracy and physical consistency.
Methodology
The authors utilize Explainable Boosting Machines (EBMs), which are generalized additive models, to predict lateral spreading. They modify the learned shape functions by fitting monotonic curves to regions where the relationships are physically plausible and synthesizing alternatives for bivariate interactions. This targeted modification is guided by domain knowledge to correct non-physical behaviors while retaining valid data-driven patterns.
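A sketch of one targeted shape-function correction, using isotonic regression to enforce monotonicity where domain knowledge requires it; which features receive this treatment, and in which direction, is the domain expert's call:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def make_monotonic(bin_centers, shape_values, increasing=True):
    """Replace an EBM shape function with its closest monotonic fit where
    physics says the effect must be monotone, leaving other terms intact."""
    iso = IsotonicRegression(increasing=increasing, out_of_bounds="clip")
    return iso.fit_transform(bin_centers, shape_values)
```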
Results
The modified EBM model shows improved physical consistency in predictions of lateral spreading when applied to the Christchurch earthquake dataset. The adjustments made to the shape functions correct previously identified non-physical trends, leading to more reliable global and local explanations, with only a modest decrease in predictive accuracy (4-5%).
Implications
The findings suggest that integrating domain knowledge into machine learning models can significantly enhance their reliability in critical applications such as natural hazard predictions. This approach could be extended to other domains where physical consistency is paramount, potentially improving decision-making processes in disaster management and risk assessment.
Translation Invariance of Neural Operators for the FitzHugh-Nagumo Model
Theory
Efficient ML
Time Series
- Introduces a novel training strategy exploiting translation invariance in the FHN model.
- Benchmarks seven NO architectures, revealing performance trade-offs in training and inference.
- CNOs perform well on translated dynamics but require higher training costs.
- FNOs achieve low training error but have high inference time and less accuracy on translated dynamics.
Read more
Translation Invariance of Neural Operators for the FitzHugh-Nagumo Model
Summary
This paper explores the capabilities of Neural Operators (NOs) in modeling the FitzHugh-Nagumo (FHN) system, which is significant in computational electrophysiology for simulating excitable cells. The author introduces a novel training strategy that leverages the translation invariance property of the FHN model, allowing for the generation of training datasets with varying spatial locations and intensities of applied current while keeping time fixed. The study benchmarks seven different NO architectures, including Convolutional Neural Operators (CNOs), Deep Operator Networks (DONs), and Fourier Neural Operators (FNOs), among others, assessing their performance based on training and test accuracy, computational efficiency, and inference speed. The findings indicate that while CNOs excel in handling translated dynamics, they incur higher training costs. Conversely, FNOs achieve lower training errors but exhibit slower inference times and less accuracy in translated scenarios. DONs and their variants demonstrate efficiency but struggle with generalization to the test set. This comprehensive evaluation highlights the strengths and limitations of NOs in capturing complex dynamics of ionic models, paving the way for future research in this domain.
Methodology
The study employs a novel training strategy for NOs that generates datasets with varying applied current spatial locations and intensities while maintaining a fixed time. The performance of seven NO architectures is evaluated based on their training and test accuracy, computational efficiency, and inference speed, particularly in the context of translated dynamics.
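A minimal sketch of this dataset generation, assuming a Gaussian bump for the applied current (the bump shape, parameter ranges, and grid size are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_applied_current(n_points: int = 256, length: float = 1.0) -> np.ndarray:
    """Draw one training input: a localized applied current whose position
    and intensity vary across samples while the evaluation time stays fixed."""
    x = np.linspace(0.0, length, n_points)
    center = rng.uniform(0.2 * length, 0.8 * length)   # varying spatial location
    amp = rng.uniform(0.5, 2.0)                        # varying intensity
    width = 0.05 * length
    return amp * np.exp(-((x - center) ** 2) / (2.0 * width ** 2))

dataset = np.stack([sample_applied_current() for _ in range(1024)])
```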
Results
The results indicate that CNOs are effective for translated dynamics but require more training resources. FNOs have the lowest training error but the highest inference time, and they struggle with accuracy in translated scenarios. DONs and their variants are efficient in training and inference but do not generalize well to the test set.
Implications
The findings suggest that while NOs can capture complex dynamics of the FHN model, there are significant trade-offs in terms of computational cost and generalization capabilities. This research could inform future developments in machine learning applications for computational electrophysiology and related fields.
Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum
NLP
Large Language Models
Reinforcement Learning
Theory
Efficient ML
- Autocurriculum improves training efficiency for reasoning models by adaptively selecting prompts.
- The proposed method reduces the number of required reasoning demonstrations exponentially compared to non-adaptive approaches.
- In reinforcement learning, autocurriculum decouples computational costs from model accuracy, enhancing training efficiency.
- The study provides a theoretical framework for understanding the benefits of autocurriculum in language model training.
Read more
Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum
Summary
This paper investigates the concept of autocurriculum in the context of training reasoning models, particularly focusing on the efficiency of supervised fine-tuning (SFT) and reinforcement learning (RL). The authors propose that autocurriculum, which allows models to adaptively select training prompts based on their performance, can significantly reduce the data and computational costs associated with training reasoning models. They demonstrate that autocurriculum can lead to exponential improvements in sample efficiency for SFT by concentrating on prompts where the model struggles, thus requiring fewer reasoning demonstrations. For RL, autocurriculum decouples the computational cost from the quality of the reference model, making the training process more efficient. The theoretical framework established in this study provides insights into how reasoning models can effectively design their own curricula, yielding substantial benefits in both statistical and computational aspects without relying on strict assumptions about the distribution of prompts.
Methodology
The authors develop an algorithm called AutoTune for supervised fine-tuning, inspired by classical boosting and learning from counterexamples. They analyze the performance of autocurriculum in both SFT and reinforcement learning with verifiable rewards (RLVR), establishing theoretical guarantees on the efficiency gains from adaptive data selection.
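The summary does not spell out AutoTune's internals, so the following is only a generic boosting-style sketch of the adaptive selection it describes: roll out the model, find the prompts it still fails on, and request teacher demonstrations there. The `success_prob` callable stands in for actually evaluating the model:

```python
import numpy as np

rng = np.random.default_rng(0)

def autocurriculum_round(prompts, success_prob, n_demos: int = 32):
    """One round of a boosting-style autocurriculum: estimate per-prompt
    success, then request teacher demonstrations only where the model
    still fails, weighted by how badly it fails."""
    p_success = np.array([success_prob(p) for p in prompts])
    hard = p_success < 0.5                      # prompts the model struggles on
    weights = np.where(hard, 1.0 - p_success, 0.0)
    if weights.sum() == 0:
        return []                               # curriculum solved
    weights /= weights.sum()
    idx = rng.choice(len(prompts), size=n_demos, p=weights)
    return [prompts[i] for i in idx]            # send these to the teacher
```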
Results
The results show that autocurriculum can lead to exponential reductions in the number of reasoning demonstrations required for effective training, with the number of teacher demonstrations becoming nearly independent of the target accuracy. In the RL setting, the computational cost is significantly reduced, allowing for more efficient training of reasoning models.
Implications
The findings suggest that implementing autocurriculum can make the training of reasoning models more feasible and less resource-intensive, potentially enabling broader applications of advanced reasoning capabilities in various domains such as code generation and mathematical reasoning.
Attention Sinks Induce Gradient Sinks
Large Language Models
Theory
Optimization
- Introduces the concept of gradient sinks as a mechanism linking attention sinks and massive activations.
- Demonstrates that attention sinks induce pronounced gradient concentration during backpropagation.
- Shows that massive activations can be interpreted as an adaptive response to gradient pressure.
- Presents V-scale, a modification that adjusts backpropagated gradients, leading to preserved attention sinks and suppressed massive activations.
Read more
Attention Sinks Induce Gradient Sinks
Summary
This paper investigates the relationship between attention sinks (AS) and massive activations (MA) in Transformer models, particularly during the backpropagation phase of training. The authors introduce the concept of gradient sinks (GS), which occur when attention sinks lead to concentrated gradients during backpropagation. They argue that in pre-norm architectures with RMSNorm, massive activations are not merely a byproduct but an adaptive response to the localized gradient pressure caused by attention sinks. The study employs a new modification called V-scale to adjust backpropagated gradients, demonstrating that while attention sinks are preserved, massive activations can be suppressed. This supports the hypothesis that gradient sinks serve as a crucial training-time mediator linking attention sinks and massive activations, providing a deeper understanding of the dynamics in Transformer models.
Methodology
The authors conducted empirical and theoretical analyses of the backpropagation dynamics in pre-norm Transformer architectures. They introduced the V-scale modification to adjust the backpropagation of gradients and tracked gradient concentrations during training. The study also involved formalizing the Transformer architecture and defining metrics for measuring attention sinks and massive activations.
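The exact V-scale rule is not given in this summary; the natural PyTorch primitive for "adjust backpropagated gradients while leaving the forward pass untouched" is an identity function with a rescaled backward, sketched here with an assumed constant scale:

```python
import torch

class VScale(torch.autograd.Function):
    """Identity in the forward pass; rescales the gradient in the backward
    pass. A stand-in for the paper's V-scale: the actual scaling rule
    applied to the value-path gradients is not specified in this summary."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, scale: float) -> torch.Tensor:
        ctx.scale = scale
        return x

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        return grad_output * ctx.scale, None

def v_scale(v: torch.Tensor, scale: float = 0.1) -> torch.Tensor:
    # Apply to the value projection's output inside an attention block,
    # e.g. v = v_scale(self.W_v(x)) before attention weighting.
    return VScale.apply(v, scale)
```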
Results
The results indicate that attention sinks lead to significant gradient concentration, termed gradient sinks, which are particularly pronounced in attention blocks. The introduction of V-scale allowed for the retention of attention sinks while suppressing massive activations, thereby validating the hypothesis that gradient sinks mediate the relationship between attention sinks and massive activations.
Implications
The findings suggest that understanding the interplay between attention sinks and gradient sinks can lead to improved training strategies for Transformer models, potentially enhancing their performance and stability. This could have implications for optimizing large language models and other applications relying on Transformer architectures.
Automatic Configuration of LLM Post-Training Pipelines
Large Language Models
Reinforcement Learning
Optimization
- Introduction of AutoPipe, a budget-aware framework for LLM post-training configuration selection.
- Development of a dataset-conditioned ranking surrogate that enhances guidance across datasets.
- Implementation of an online adaptation mechanism using Gaussian-process modeling and an early-stop predictor.
- Demonstrated effectiveness of AutoPipe in reducing computational costs while achieving competitive performance.
Read more
Automatic Configuration of LLM Post-Training Pipelines
Summary
The paper addresses the challenges of configuring large language model (LLM) post-training pipelines, which combine supervised fine-tuning (SFT) and reinforcement learning (RL) under realistic compute budgets. The authors propose AutoPipe, a two-stage framework that optimizes configuration selection. In the offline phase, AutoPipe learns a dataset-conditioned learning-to-rank surrogate from historical runs, capturing preferences and guiding the search for promising configurations. In the online phase, it employs Bayesian optimization to adapt to new datasets, using early-stopped evaluations and a learned predictor to estimate final performance. The approach effectively reduces computational costs while maintaining high performance. Experiments on biomedical reasoning tasks demonstrate that AutoPipe outperforms offline-only baselines and achieves comparable results to leading online hyperparameter optimization (HPO) methods at less than 10% of their computational cost.
Methodology
AutoPipe employs a two-phase approach: an offline phase that learns a ranking surrogate from historical data to guide configuration selection, and an online phase that uses Bayesian optimization to adapt to new datasets while leveraging early-stopped evaluations to minimize costs. A learned predictor is used to map early training signals to a proxy for final performance.
Results
AutoPipe consistently outperformed offline-only baselines and achieved performance comparable to the strongest online HPO methods while utilizing less than 10% of their computational resources in experiments conducted on biomedical reasoning tasks.
Implications
The proposed framework can significantly streamline the configuration process for LLM post-training, making it more accessible and efficient for practitioners. It has potential applications in various domains requiring LLMs, particularly where computational resources are limited.
TimeAPN: Adaptive Amplitude-Phase Non-Stationarity Normalization for Time Series Forecasting
Time Series
- TimeAPN effectively models non-stationary factors in time and frequency domains.
- The framework captures rapid changes in amplitude and phase for improved forecasting.
- TimeAPN integrates amplitude and phase information through an adaptive normalization mechanism.
- Extensive experiments show significant improvements in forecasting accuracy over state-of-the-art methods.
Read more
TimeAPN: Adaptive Amplitude-Phase Non-Stationarity Normalization for Time Series Forecasting
Summary
The paper addresses the challenge of non-stationarity in multivariate long-term time series forecasting, which often results in rapid changes in amplitude and phase, leading to distribution shifts and degraded predictive performance. The authors propose TimeAPN, an Adaptive Amplitude-Phase Non-Stationarity Normalization framework that models and predicts non-stationary factors in both time and frequency domains. TimeAPN jointly models the mean sequence in these domains and forecasts its evolution, while also capturing phase discrepancies to address temporal misalignment. An adaptive normalization mechanism incorporates amplitude information to handle abrupt fluctuations in signal energy. The framework is model-agnostic and can be integrated with various forecasting models. Extensive experiments on seven real-world datasets demonstrate that TimeAPN consistently enhances long-term forecasting accuracy across multiple prediction horizons, outperforming existing normalization methods.
Methodology
TimeAPN employs discrete wavelet transform (DWT) to decompose time series into frequency components, estimating the mean sequence adaptively for energy compensation. It models phase discrepancies and integrates amplitude information through a collaborative de-normalization process.
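A minimal sketch of the DWT-based mean-sequence extraction using PyWavelets; the wavelet family ('db4') and decomposition level are illustrative assumptions:

```python
import numpy as np
import pywt  # PyWavelets

def dwt_mean_components(series: np.ndarray, wavelet: str = "db4", level: int = 3):
    """Decompose a series into an approximation (low-frequency mean trend)
    and detail coefficients via the discrete wavelet transform."""
    coeffs = pywt.wavedec(series, wavelet, level=level)
    approx, details = coeffs[0], coeffs[1:]
    # Reconstruct the smooth mean sequence from the approximation alone
    mean_seq = pywt.waverec([approx] + [np.zeros_like(d) for d in details], wavelet)
    return mean_seq[: len(series)], details
```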
Results
The proposed TimeAPN framework consistently outperforms existing normalization methods, achieving lower prediction errors across multiple datasets and prediction horizons, demonstrating its effectiveness in handling non-stationarity in time series forecasting.
Implications
The TimeAPN framework can be applied to various domains requiring time series forecasting, such as finance, energy management, and traffic control, potentially leading to more accurate predictions in complex, non-stationary environments.
Fundamental Limits of Neural Network Sparsification: Evidence from Catastrophic Interpretability Collapse
Interpretability
- Extreme sparsification leads to significant interpretability collapse, despite stable global representation quality.
- Local interpretability metrics deteriorate while global disentanglement metrics remain stable under compression.
- The phenomenon of interpretability collapse is intrinsic to the sparsification process, not dependent on specific algorithms or training durations.
- Dead neuron rates increase with dataset complexity, indicating greater challenges for real-world applications.
Read more
Fundamental Limits of Neural Network Sparsification: Evidence from Catastrophic Interpretability Collapse
Summary
This paper investigates the relationship between extreme neural network sparsification and mechanistic interpretability, particularly focusing on whether interpretable features survive under severe compression. The authors introduce a hybrid architecture combining Variational Autoencoders (VAE) and Sparse Autoencoders (SAE) and propose an adaptive sparsity scheduling framework that reduces active neurons from 500 to 50 over 50 training epochs. The study reveals a paradox where global representation quality, measured by Mutual Information Gap, remains stable, yet local feature interpretability collapses significantly. The experiments conducted on two benchmark datasets, dSprites and Shapes3D, demonstrate that under Top-k sparsification, dead neuron rates reach 34.4% on dSprites and 62.7% on Shapes3D. L1 regularization yields similar or worse results, with dead neuron rates of 41.7% on dSprites and 90.6% on Shapes3D. Extended training does not recover dead neurons, indicating that interpretability collapse is intrinsic to the compression process. The findings suggest that interpretability degradation scales with dataset complexity, highlighting a critical challenge for deploying interpretable AI in resource-constrained environments.
Methodology
The authors employed a hybrid VAE-SAE architecture and introduced an adaptive sparsity scheduling framework to progressively reduce active neurons. They conducted experiments on two datasets, dSprites and Shapes3D, using both Top-k and L1 sparsification methods to analyze the effects on interpretability and representation quality.
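A sketch of the adaptive sparsity schedule and Top-k gating; only the endpoints (500 active neurons reduced to 50 over 50 epochs) come from the paper, and the linear interpolation between them is an assumption:

```python
import torch

def active_neurons(epoch: int, start: int = 500, end: int = 50, epochs: int = 50) -> int:
    """Linear sparsity schedule between the paper's stated endpoints."""
    t = min(epoch, epochs) / epochs
    return round(start + t * (end - start))

def topk_mask(h: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest activations per sample, zero the rest."""
    idx = h.topk(k, dim=-1).indices
    mask = torch.zeros_like(h).scatter_(-1, idx, 1.0)
    return h * mask
```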
Results
The study found that under Top-k sparsification, dead neuron rates were 34.4% on dSprites and 62.7% on Shapes3D, while L1 regularization resulted in 41.7% and 90.6% dead neurons, respectively. Extended training for 100 epochs did not recover dead neurons, confirming the irreversibility of interpretability collapse. The collapse was also shown to scale with dataset complexity, with Shapes3D exhibiting significantly more dead neurons than dSprites.
Implications
These findings have significant implications for the deployment of interpretable AI systems, particularly in resource-constrained environments where extreme sparsification is necessary. The results highlight the need for new approaches to maintain interpretability in compressed neural networks.
CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
Reinforcement Learning
Large Language Models
Theory
- CausalRM provides a scalable alternative to traditional reward modeling by utilizing observational user feedback.
- The framework addresses noise and bias in observational feedback through innovative loss modeling and reweighting techniques.
- Extensive experiments confirm CausalRM's effectiveness across multiple LLMs and datasets, leading to substantial performance gains.
- The work formalizes the problem of observational reward modeling, highlighting its significance in RLHF applications.
Read more
CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
Summary
The paper introduces CausalRM, a novel framework for reward modeling in reinforcement learning from human feedback (RLHF) that leverages observational user feedback such as clicks and upvotes. Traditional RLHF methods rely on costly experimental feedback, which limits scalability. CausalRM addresses two main challenges: the noise in observational feedback due to user annotation errors and the bias introduced by user preferences. To mitigate these issues, CausalRM employs a noise-aware surrogate loss term that models the error generation process and uses propensity scores to reweight training samples, effectively counteracting user preference bias. The framework is validated through extensive experiments across various large language model (LLM) architectures and benchmark datasets, demonstrating its ability to learn accurate reward signals from noisy and biased data. The results indicate significant performance improvements in downstream RLHF tasks, showcasing CausalRM's potential for scalable and adaptive alignment of LLMs with human values.
Methodology
CausalRM employs a causal-theoretic approach to reward modeling, introducing a noise-aware surrogate loss term to account for user annotation errors and utilizing propensity scores to reweight training samples, thus addressing user preference bias. This dual approach allows for unbiased learning from observational feedback.
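A hedged PyTorch sketch of the two corrections: the symmetric flip-rate noise model and the propensity clamp are illustrative simplifications, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def causal_rm_loss(reward_logits: torch.Tensor,
                   feedback: torch.Tensor,
                   propensity: torch.Tensor,
                   flip_rate: float = 0.1) -> torch.Tensor:
    """feedback:   observed binary signal (click / upvote) as a float tensor
    propensity: estimated probability each sample was exposed and labeled
    flip_rate:  assumed symmetric annotation-error rate (illustrative)"""
    p = torch.sigmoid(reward_logits)
    # Noise-aware surrogate: model the observed label as the true label
    # passed through a symmetric error channel.
    p_obs = (1 - flip_rate) * p + flip_rate * (1 - p)
    nll = F.binary_cross_entropy(p_obs, feedback, reduction="none")
    # Inverse-propensity reweighting to counteract preference/exposure bias.
    return (nll / propensity.clamp_min(1e-3)).mean()
```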
Results
CausalRM demonstrated a 49.2% improvement on the WildGuardMix benchmark and a 32.7% improvement on HarmBench, indicating its effectiveness in learning accurate reward signals from noisy and biased observational feedback.
Implications
The findings suggest that CausalRM can significantly enhance the scalability and adaptability of RLHF systems, making it feasible to align large language models with evolving user preferences without the need for costly experimental feedback collection.
Integrating Explainable Machine Learning and Mixed-Integer Optimization for Personalized Sleep Quality Intervention
Optimization
Interpretability
- Proposes a novel framework integrating explainable machine learning and mixed-integer optimization for sleep interventions.
- Achieves high predictive accuracy and identifies key behavioral factors influencing sleep quality.
- Generates personalized recommendations that consider individual constraints and the feasibility of behavioral changes.
- Demonstrates a trade-off between expected improvement and intervention intensity, highlighting diminishing returns.
Read more
Integrating Explainable Machine Learning and Mixed-Integer Optimization for Personalized Sleep Quality Intervention
Summary
This paper addresses the challenge of improving sleep quality among university students by proposing a personalized predictive-prescriptive framework that combines explainable machine learning with mixed-integer optimization. The authors highlight that traditional approaches to sleep interventions often fail to consider the complex interplay of behavioral, environmental, and psychosocial factors affecting sleep. Their framework utilizes a supervised classifier trained on survey data to predict sleep quality and employs SHAP-based feature attribution to identify modifiable factors influencing sleep. These insights are then integrated into a mixed-integer optimization model that generates actionable recommendations for behavioral adjustments, taking into account the resistance to change through a penalty mechanism. The framework demonstrates strong predictive performance, achieving an F1-score of 0.9544 and an accuracy of 0.9366. Sensitivity and Pareto analyses reveal a trade-off between expected improvement and intervention intensity, suggesting that minimal changes can yield significant benefits. The model often recommends one or two high-impact adjustments, emphasizing the importance of personalized and feasible interventions in enhancing sleep quality.
Methodology
The methodology involves training a supervised classifier on survey data to predict sleep quality, utilizing SHAP for feature attribution to identify influential factors. These factors are then incorporated into a mixed-integer optimization model that generates personalized behavioral recommendations while accounting for resistance to change.
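A sketch of the predict-then-prescribe step: the SHAP call assumes a tree-based classifier (attribution shapes vary across shap versions), and the greedy budgeted selection is a simplified stand-in for the paper's mixed-integer optimization with its resistance-to-change penalty:

```python
import numpy as np
import shap

def recommend_adjustments(model, x, feature_costs, budget=2.0):
    """Rank modifiable features by their SHAP contribution toward poor
    sleep quality, then greedily pick adjustments under a resistance budget."""
    explainer = shap.TreeExplainer(model)
    sv = explainer.shap_values(x.reshape(1, -1))
    phi = np.asarray(sv[1] if isinstance(sv, list) else sv)[0]  # class-1 attributions
    chosen, spent = [], 0.0
    for j in np.argsort(-phi):               # most harmful contributions first
        if phi[j] <= 0:
            break
        if spent + feature_costs[j] <= budget:
            chosen.append(j)
            spent += feature_costs[j]
    return chosen
```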
Results
The framework achieved a test F1-score of 0.9544 and an accuracy of 0.9366. Sensitivity and Pareto analyses indicated a clear trade-off between expected improvement and intervention intensity, with the model often suggesting one or two high-impact behavioral adjustments.
Implications
The proposed framework has significant implications for developing personalized sleep interventions that are both actionable and feasible, particularly for vulnerable populations like university students. It emphasizes the need for data-driven insights to inform structured decision support in health interventions.
A Model Ensemble-Based Post-Processing Framework for Fairness-Aware Prediction
Theory
Optimization
Efficient ML
- Proposes a model ensemble-based framework for fairness-aware predictions.
- Framework is model-agnostic and applicable across various learning tasks.
- Enhances fairness while minimally affecting predictive accuracy.
- Extends applicability to survival analysis, a critical area in machine learning.
Read more
A Model Ensemble-Based Post-Processing Framework for Fairness-Aware Prediction
Summary
This paper addresses the challenge of balancing predictive performance and fairness in machine learning. The authors propose a novel post-processing framework that utilizes model ensembling to enhance fairness in predictions across various tasks, including classification, regression, and survival analysis. The framework operates independently of specific model internals, making it widely applicable. The authors highlight the limitations of existing fairness methods, particularly in post-processing approaches that are often tailored to specific tasks or fairness definitions. They introduce an ensemble-based method that combines predictions from a complex pre-trained model and a simpler model, allowing for the tuning of ensemble weights to promote fairness while maintaining predictive accuracy. This approach is particularly beneficial for survival analysis, where traditional fairness methods may struggle due to the time-dependent nature of outputs. The proposed framework is inspired by mixture modeling principles, allowing for efficient optimization and flexibility in addressing fairness across different domains.
Methodology
The authors developed a post-processing framework that combines predictions from a complex pre-trained model and a simpler model through weighted averaging. This ensemble approach allows for tuning weights to enhance fairness without altering the original model. The methodology is designed to be model-agnostic and applicable to various tasks, including survival analysis.
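A minimal sketch of the weight-tuning step for binary classification with a demographic-parity constraint; the grid search and metric choices are illustrative, since the framework supports other tasks and fairness definitions:

```python
import numpy as np

def demographic_parity_gap(preds: np.ndarray, group: np.ndarray) -> float:
    """Absolute difference in mean prediction between two groups."""
    return abs(preds[group == 0].mean() - preds[group == 1].mean())

def tune_ensemble_weight(p_complex, p_simple, y, group, max_gap=0.05):
    """Scan the mixing weight w for p = w * complex + (1 - w) * simple and
    return the most accurate blend whose parity gap stays under max_gap."""
    best_w, best_acc = None, -np.inf
    for w in np.linspace(0.0, 1.0, 101):
        p = w * p_complex + (1.0 - w) * p_simple
        if demographic_parity_gap(p, group) <= max_gap:
            acc = ((p > 0.5) == y).mean()
            if acc > best_acc:
                best_w, best_acc = w, acc
    return best_w, best_acc
```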
Results
Extensive experiments demonstrated that the proposed framework effectively improves fairness metrics while maintaining or only slightly impacting predictive accuracy across classification, regression, and survival analysis tasks.
Implications
The proposed framework has significant implications for ensuring fairness in machine learning applications across critical domains such as finance, healthcare, and criminal justice, where biased predictions can have serious ethical and social consequences. It provides a flexible solution for integrating fairness into existing models without extensive retraining.
SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
Optimization
Efficient ML
Generative Models
- SOL-ExecBench benchmarks GPU kernels against hardware Speed-of-Light limits rather than software baselines.
- The benchmark includes 235 CUDA kernel optimization problems from diverse AI applications.
- SOL Score measures the performance gap closure between a kernel and analytically derived SOL bounds.
- A sandboxed evaluation harness ensures reliable and reproducible benchmarking results.
Read more
SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
Summary
The paper introduces SOL-ExecBench, a novel benchmarking framework designed to evaluate the performance of GPU kernels against hardware limits rather than traditional software baselines. It addresses the growing need for effective benchmarking in the context of rapidly evolving AI models and GPU architectures. The benchmark consists of 235 CUDA kernel optimization problems derived from 124 production and emerging AI models, covering a range of applications including language processing, diffusion, vision, audio, and video. A key innovation is the use of Speed-of-Light (SOL) bounds, which are analytically derived performance limits based on hardware capabilities, providing a fixed target for optimization. The SOL Score quantifies how much a kernel closes the gap between a defined scoring baseline and the SOL bounds. Additionally, the framework includes a sandboxed evaluation environment to ensure reliable scoring and mitigate reward-hacking behaviors. This approach reframes GPU kernel benchmarking, emphasizing the importance of achieving hardware-efficient execution in the face of increasing model complexity and the rapid introduction of new GPU features.
Methodology
The authors developed SOL-ExecBench by extracting computational subgraphs from various AI models and curating benchmark problems. They implemented a pipeline called SOLAR to analytically derive Speed-of-Light bounds based on FLOP counts, byte counts, and GPU throughput. The evaluation includes a sandboxed environment to prevent reward-hacking and ensure reliable performance assessments.
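The description of SOL bounds matches a roofline-style calculation; the sketch below shows one plausible form of the bound and the gap-closure score, with the caveat that SOL-ExecBench's exact normalization may differ:

```python
def sol_time(flops: float, bytes_moved: float,
             peak_flops: float, peak_bw: float) -> float:
    """Roofline-style Speed-of-Light bound: a kernel can run no faster than
    its compute time or its memory-traffic time, whichever dominates."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

def sol_score(t_kernel: float, t_baseline: float, t_sol: float) -> float:
    """Fraction of the baseline-to-SOL gap the kernel closes."""
    return (t_baseline - t_kernel) / (t_baseline - t_sol)

# Example: a memory-bound elementwise kernel on a hypothetical GPU
t_bound = sol_time(flops=2e9, bytes_moved=1.2e10, peak_flops=3e14, peak_bw=2e12)
```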
Results
The introduction of SOL-ExecBench provides a comprehensive benchmarking framework that allows for the evaluation of GPU kernels against fixed hardware limits. The SOL Score effectively quantifies the optimization potential of kernels, revealing the extent to which they can approach hardware efficiency. The framework has been tested with a variety of AI models, demonstrating its applicability across different domains.
Implications
SOL-ExecBench has significant implications for the development of AI systems, as it encourages the optimization of GPU kernels towards hardware efficiency. This can lead to improved performance in data centers and better utilization of GPU capabilities, ultimately enhancing the efficiency of AI model training and inference.
Baguan-TS: A Sequence-Native In-Context Learning Model for Time Series Forecasting with Covariates
Time Series
- Baguan-TS enables in-context learning directly on raw multivariate time series without feature engineering.
- The model employs a 3D Transformer architecture that attends across temporal, variable, and context axes.
- A Y-space retrieval-based calibration module enhances model stability and forecasting accuracy.
- The context-overfitting strategy improves robustness by balancing denoising and selection of relevant support examples.
Read more
Baguan-TS: A Sequence-Native In-Context Learning Model for Time Series Forecasting with Covariates
Summary
The paper introduces Baguan-TS, a novel framework that integrates in-context learning (ICL) with raw sequence representation for time series forecasting, addressing limitations of existing models that rely on tabularized features. Baguan-TS employs a 3D Transformer architecture that simultaneously attends to temporal, variable, and context dimensions, enabling efficient adaptation to new tasks without the need for gradient updates. The authors tackle two significant challenges: calibration and output oversmoothing. They propose a Y-space retrieval-based local calibration mechanism to enhance model stability and accuracy, and a context-overfitting strategy to balance denoising and selection of support examples. Experimental results demonstrate that Baguan-TS outperforms established baselines on public benchmarks, achieving superior forecasting metrics and robustness across various real-world datasets, particularly in scenarios with covariates.
Methodology
Baguan-TS utilizes a 3D Transformer architecture designed for time series data, which allows for joint attention across temporal, variable, and context dimensions. The framework incorporates a Y-space retrieval-based local calibration module for improved model calibration and stability, as well as a context-overfitting strategy to mitigate output oversmoothing by focusing on relevant support examples during inference.
Results
Baguan-TS consistently outperformed established baselines on the fev-bench-cov benchmark, achieving the best average scaled quantile loss (SQL) and mean absolute scaled error (MASE). It demonstrated a 4.8% reduction in SQL compared to the TabPFN-TS model, indicating significant improvements in both point and probabilistic forecasting metrics across various datasets.
Implications
The development of Baguan-TS has significant implications for time series forecasting, particularly in applications requiring rapid adaptation to new tasks and robustness under distribution shifts. Its ability to operate without extensive feature engineering makes it suitable for real-world scenarios where data is limited or noisy.
One-Step Sampler for Boltzmann Distributions via Drifting
Generative Models
Optimization
Efficient ML
- Introduces a drifting-based framework for sampling Boltzmann distributions.
- Develops a one-step neural generator that simplifies the sampling process.
- Utilizes Gaussian-smoothed score fields to project samples towards target distributions.
- Achieves low error rates on complex target distributions, demonstrating efficiency.
Read more
One-Step Sampler for Boltzmann Distributions via Drifting
Summary
This paper introduces a novel framework for amortized sampling of Boltzmann distributions using a drifting-based approach. The authors propose a one-step neural generator that projects samples from the current model distribution towards the target Boltzmann distribution along a Gaussian-smoothed score field. The method addresses the challenge of sampling from distributions defined by energy functions, particularly when the normalization constant is unknown. By deriving a practical target-side drift from smoothed energy, the authors utilize two estimators: a local importance-sampling mean-shift estimator and a second-order curvature-corrected approximation. The combination of these estimators with a mini-batch Gaussian mean-shift estimate results in a stable stop-gradient objective for training. The proposed sampler demonstrates effective performance on various target distributions, including a four-mode Gaussian mixture, achieving low mean and covariance errors. The results indicate that the drifting approach can efficiently transform iterative sampling processes into a single forward pass, making it suitable for applications requiring rapid sampling from complex distributions.
Methodology
The methodology involves formulating the sampling problem as a drifting problem, where a vector field is defined to transport samples from the current distribution towards the target distribution. The authors use Gaussian-smoothed score operators to derive a drift field, which is then used to create a training objective that minimizes the discrepancy between the smoothed scores of the target and the sampler. Two approximations for the target-side drift are introduced, allowing for efficient computation and training of the neural sampler.
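A sketch of the local importance-sampling mean-shift estimator as described: for a target p(y) ∝ exp(-E(y)), the Gaussian-smoothed score at x is (m(x) - x) / σ², where m(x) is the self-normalized importance-weighted mean of Gaussian samples around x. The sample count and the energy callable are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_score_mean_shift(x: np.ndarray, energy, sigma: float, n: int = 512):
    """Mean-shift estimate of the Gaussian-smoothed score of p(y) ∝ exp(-E(y)):
    draw y ~ N(x, σ²I), weight by w(y) = exp(-E(y)), return (m(x) - x) / σ²."""
    y = x + sigma * rng.standard_normal((n, x.shape[-1]))
    logw = -np.array([energy(yi) for yi in y])
    w = np.exp(logw - logw.max())                 # stabilize the weights
    m = (w[:, None] * y).sum(0) / w.sum()
    return (m - x) / sigma**2

# e.g. a standard Gaussian target: energy = lambda y: 0.5 * (y @ y)
drift = smoothed_score_mean_shift(np.zeros(2), lambda y: 0.5 * (y @ y), sigma=0.5)
```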
Results
The proposed one-step sampler achieves a mean error of 0.0754, covariance error of 0.0425, and RBF MMD of 0.0020 on a four-mode Gaussian mixture target. The method also successfully handles nonconvex and curved low-energy geometries, as demonstrated on additional double-well and banana targets, indicating its robustness across different distribution shapes.
Implications
The findings suggest that the drifting approach can significantly enhance the efficiency of sampling from complex distributions, making it applicable in fields such as molecular modeling, Bayesian inference, and generative modeling where rapid sampling is crucial. This method could lead to advancements in various applications that rely on accurate and efficient sampling techniques.
Escaping Offline Pessimism: Vector-Field Reward Shaping for Safe Frontier Exploration
Reinforcement Learning
Robotics
Theory
- Introduces vector-field reward shaping to enhance exploration in offline RL.
- Combines gradient-alignment and rotational-flow terms for effective boundary exploration.
- Theoretical analysis supports the efficacy of the proposed reward structure.
- Empirical results demonstrate successful navigation of uncertainty boundaries.
Read more
Escaping Offline Pessimism: Vector-Field Reward Shaping for Safe Frontier Exploration
Summary
This paper addresses the challenges of offline reinforcement learning (RL) in safely exploring novel states during online deployment. Offline RL typically suffers from pessimism, which restricts exploration and can hinder the collection of new data. The authors propose a novel vector-field reward shaping approach that encourages agents to explore near the boundaries of known regions while maintaining safety. This reward structure consists of two components: a gradient-alignment term that guides the agent towards a target uncertainty level and a rotational-flow term that promotes movement along the uncertainty manifold. Theoretical analysis demonstrates that this reward design fosters sustained exploration without leading to degenerate behaviors, such as 'parking' at the frontier. Empirical validation using Soft Actor-Critic on a 2D navigation task shows that agents can effectively traverse uncertainty boundaries, balancing safe data collection with task completion. The proposed method allows for safe deployment of pre-trained policies, enabling agents to gather informative data without risking catastrophic failures.
Methodology
The authors developed a vector-field reward shaping paradigm that operates on an uncertainty oracle trained from offline data. The reward structure includes a gradient-alignment term to attract agents towards a target uncertainty level and a rotational-flow term to encourage movement along the local tangent plane of the uncertainty manifold. This approach was integrated with the Soft Actor-Critic algorithm and tested in a 2D continuous navigation task.
Results
The integration of the proposed reward shaping with Soft Actor-Critic led to agents successfully exploring uncertainty boundaries while effectively balancing safe data collection and task completion. The theoretical framework provided guarantees against degenerate behaviors, ensuring continuous exploration.
Implications
The proposed method has significant implications for real-world applications where safety is critical, such as autonomous driving and robotics. By enabling agents to safely explore and collect data, this approach can improve the performance of RL systems in dynamic environments without the risks associated with online updates.
Evaluating Model-Free Policy Optimization in Masked-Action Environments via an Exact Blackjack Oracle
Reinforcement Learning
Optimization
Theory
- Development of an exact dynamic programming oracle for blackjack, providing a rigorous benchmark for policy optimization.
- Evaluation of three model-free optimizers (REINFORCE, SPSA, CEM) in recovering optimal policies under dynamically masked actions.
- REINFORCE demonstrated the highest sample efficiency, but all methods faced significant policy-level errors.
- Establishment of a minimum-bet optimality theorem, confirming that optimal betting strategies under no-count conditions lead to minimum wagers.
Read more
Evaluating Model-Free Policy Optimization in Masked-Action Environments via an Exact Blackjack Oracle
Summary
This paper presents a comprehensive evaluation of model-free policy optimization techniques in the context of casino blackjack, utilizing an exact dynamic programming (DP) oracle as a benchmark. The study focuses on the infinite-shoe model of blackjack, which allows for rigorous testing of policy recovery methods under dynamically masked actions. An exact DP oracle was developed, yielding ground-truth action values and optimal policy labels across 4,600 decision cells, with a theoretical expected value (EV) of -0.00161 per hand. Three model-free optimizers—masked REINFORCE, simultaneous perturbation stochastic approximation (SPSA), and the cross-entropy method (CEM)—were trained through simulated interactions. The results indicated that REINFORCE was the most sample-efficient method, achieving a 46.37% action-match rate and an EV of -0.04688 after 10^6 hands, outperforming CEM and SPSA in terms of sample efficiency. However, all methods displayed significant cell-conditional regret, highlighting the challenges posed by state-visitation sparsity and dynamic action masking. The paper also established a minimum-bet optimality theorem, confirming that optimal bet sizing under no-count conditions collapses to the table minimum, thus providing a robust negative control for the simulation framework. Overall, the findings emphasize the necessity of exact oracles and negative controls in evaluating algorithmic performance in complex environments.
Methodology
The study employed an infinite-shoe model of blackjack, using a dynamic programming oracle to derive optimal policies and action values. Three model-free optimization techniques were implemented: masked REINFORCE with a baseline, SPSA, and CEM. The performance of these methods was evaluated based on action-match rates, convergence, and cell-conditional regret across 4,600 decision cells.
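A sketch of the dynamic action masking used with REINFORCE: illegal actions receive -inf logits so the softmax assigns them zero probability and no gradient flows through them. The scalar baseline here is generic; the paper's specific baseline construction is not given in this summary:

```python
import torch

def masked_policy_loss(logits: torch.Tensor,
                       legal: torch.Tensor,
                       actions: torch.Tensor,
                       returns: torch.Tensor,
                       baseline: torch.Tensor) -> torch.Tensor:
    """REINFORCE with dynamic action masking.

    legal: bool tensor marking which actions are available in each state."""
    masked = logits.masked_fill(~legal, float("-inf"))
    logp = torch.log_softmax(masked, dim=-1)
    logp_a = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return -(logp_a * (returns - baseline)).mean()
```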
Results
REINFORCE achieved a 46.37% action-match rate and an expected value of -0.04688 after 10^6 hands, outperforming CEM (39.46%, 7.5 × 10^6 evaluations) and SPSA (38.63%, 4.8 × 10^6 evaluations). All methods exhibited substantial cell-conditional regret, indicating persistent errors despite smooth reward convergence. The minimum-bet optimality theorem was validated, confirming that optimal betting strategies lead to minimum wagers under no-count conditions.
Implications
The findings underscore the complexities of policy optimization in environments with masked actions and sparse state visitation. They suggest that exact oracles and negative controls are essential for accurately assessing the performance of reinforcement learning algorithms, particularly in stochastic control tasks. This work could inform future research in reinforcement learning and its applications in complex decision-making scenarios.
Benchmarking Reinforcement Learning via Stochastic Converse Optimality: Generating Systems with Known Optimal Policies
Reinforcement Learning
Theory
Optimization
- Introduction of a benchmarking framework based on stochastic converse optimality.
- Systematic generation of benchmark families with known optimal policies.
- Validation through the automatic construction of diverse environments.
- Provision of a reproducible foundation for evaluating RL algorithms against certified optima.
Read more
Benchmarking Reinforcement Learning via Stochastic Converse Optimality: Generating Systems with Known Optimal Policies
Summary
This paper addresses the challenges of benchmarking Reinforcement Learning (RL) algorithms, which are often influenced by environmental design and stochasticity. The authors propose a new benchmarking framework that extends the concept of converse optimality to stochastic, discrete-time, control-affine, nonlinear systems. This framework allows for the systematic generation of benchmark families with known optimal policies, enabling a more rigorous evaluation of RL algorithms. The authors validate their approach by constructing diverse environments and demonstrating its effectiveness in providing a controlled evaluation across various RL methods. By comparing standard RL methods against a ground-truth optimum, the framework aims to establish a reproducible foundation for precise RL benchmarking, addressing the limitations of existing benchmarks that lack certified optimal solutions.
Methodology
The authors extend the concept of converse optimality to stochastic systems, providing necessary and sufficient conditions for optimality. They develop a theorem for systems with additive Gaussian noise and a Quadratic-Gaussian corollary. The framework includes a constructive benchmark generation method using homotopy over control strength and a paired evaluation protocol that utilizes common random numbers for absolute performance metrics.
Results
The proposed framework successfully generates diverse environments with known optimal policies, allowing for a comprehensive evaluation of RL algorithms. The results demonstrate that the benchmarking approach can effectively assess optimality gaps and regret metrics against certified optima, providing a more reliable evaluation method compared to existing benchmarks.
Implications
This work has significant implications for the field of reinforcement learning, as it provides a systematic and reproducible method for benchmarking RL algorithms. By enabling absolute performance evaluation, it can help researchers better understand the strengths and weaknesses of different RL approaches, ultimately leading to more robust and effective algorithms.
Lightweight Adaptation for LLM-based Technical Service Agent: Latent Logic Augmentation and Robust Noise Reduction
NLP
Large Language Models
Reinforcement Learning
- Introduces Latent Logic Augmentation to enhance decision-making in LLMs.
- Develops a Multiple Ground Truths dataset to reduce training noise and capture semantic diversity.
- Presents a Hybrid Reward mechanism for efficient reinforcement learning.
- Demonstrates improved stability and performance in real-world Cloud service tasks.
Read more
Lightweight Adaptation for LLM-based Technical Service Agent: Latent Logic Augmentation and Robust Noise Reduction
Summary
This paper addresses the challenges of adapting Large Language Models (LLMs) for complex technical service domains, which are often hindered by the lack of explicit cognitive reasoning in human demonstrations and the ambiguity of valid responses. The authors propose a lightweight adaptation framework that includes three main contributions: (1) Latent Logic Augmentation, which enhances the model's decision-making capabilities through Planning-Aware Trajectory Modeling and Decision Reasoning Augmentation; (2) Robust Noise Reduction, achieved by creating a Multiple Ground Truths dataset that captures semantic diversity and reduces noise in training data; and (3) Lightweight Adaptation, which introduces a Hybrid Reward mechanism that combines an LLM-based judge with a relevance-based Reranker to provide efficient reward signals for reinforcement learning. Empirical evaluations on real-world Cloud service tasks demonstrate that the proposed framework improves stability and performance while reducing training time compared to traditional methods.
Methodology
The authors developed a lightweight adaptation framework that includes Latent Logic Augmentation for enhancing decision reasoning, Robust Noise Reduction through a dual-filtering method to create a Multiple Ground Truths dataset, and a Hybrid Reward mechanism that integrates LLM-based judging with a lightweight Reranker for efficient reward signal generation.
Results
The framework was empirically validated on real-world Cloud service tasks, showing significant improvements in stability and performance. The Hybrid Reward mechanism provided alignment comparable to standard LLM-as-a-judge methods while reducing training time and computational costs.
Implications
The proposed framework has practical implications for deploying technical service agents in complex domains, enabling more efficient and effective adaptations of LLMs without the prohibitive costs associated with traditional training paradigms.
From exp to poly: Gaussian Splatting with Polynomial Kernels
Computer Vision
Efficient ML
- Introduction of an N-th-order polynomial kernel approximation for Gaussian Splatting.
- Significant reduction in computational overhead while maintaining compatibility with existing datasets.
- Derivation of tighter bounding radii for aggressive splat culling, enhancing performance.
- Formal proof of invariance of anti-aliasing normalization factors for arbitrary kernel functions.
Read more
From exp to poly: Gaussian Splatting with Polynomial Kernels
Summary
This paper addresses the challenges associated with the adoption of modified kernels in Gaussian Splatting (3DGS), particularly the incompatibility with existing datasets optimized for the original Gaussian kernel. The authors propose a new polynomial kernel approximation that enhances computational efficiency while maintaining compatibility with existing datasets. By replacing the original exponential kernel with a polynomial approximation combined with a ReLU function, the authors achieve significant performance improvements in 3DGS implementations. The paper includes a detailed mathematical analysis of the new kernel, demonstrating its advantages for real-time rendering on NPU hardware. The proposed method allows for aggressive culling of splats, resulting in performance gains of 4% to 15% with minimal impact on image quality. The authors also provide a methodology for fitting polynomial coefficients optimized for real-world rendering scenarios, ensuring that the new kernel is practical for various applications.
Methodology
The authors replace the original Gaussian kernel with a polynomial approximation and utilize a ReLU function to enhance performance. They derive tighter bounding radii for splats to facilitate aggressive culling and provide a fitting methodology for polynomial coefficients using L1 loss. The evaluation includes performance and quality assessments across various rendering APIs.
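A sketch of the kernel substitution and the culling radius it enables; the cubic degree, fitting range, and least-squares fit (the paper fits with an L1 loss) are illustrative stand-ins:

```python
import numpy as np

def gaussian_kernel(d2: np.ndarray) -> np.ndarray:
    """Original 3DGS falloff in the squared distance d²."""
    return np.exp(-0.5 * d2)

def poly_relu_kernel(d2: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """ReLU-clamped polynomial replacement: exactly zero past its first root."""
    return np.maximum(np.polyval(coeffs, d2), 0.0)

# Fit a cubic in d² over the range where the Gaussian is non-negligible.
d2 = np.linspace(0.0, 9.0, 400)
coeffs = np.polyfit(d2, gaussian_kernel(d2), deg=3)

# The polynomial's first positive real root gives a tight, exact culling
# radius: splats beyond it contribute nothing and can be skipped.
real_roots = [r.real for r in np.roots(coeffs) if abs(r.imag) < 1e-9 and r.real > 0]
cull_d2 = min(real_roots) if real_roots else None
```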
Results
The proposed polynomial kernel achieves performance improvements of 4% to 15% in 3DGS implementations, with negligible degradation in image quality. The methodology for fitting polynomial coefficients is shown to be effective in optimizing the kernel for real-world scenarios.
Implications
The findings suggest that the new polynomial kernel can facilitate broader adoption of Gaussian Splatting techniques in real-time rendering applications, particularly on NPU hardware, by improving efficiency without sacrificing compatibility with existing datasets.
AIMER: Calibration-Free Task-Agnostic MoE Pruning
NLP
Large Language Models
Efficient ML
- AIMER addresses the calibration dependence issue in task-agnostic MoE expert pruning.
- The proposed method allows for expert ranking without relying on calibration data.
- AIMER achieves superior expert stratification compared to traditional weight magnitude methods.
- The method demonstrates competitive performance across multiple benchmarks and models.
Read more
AIMER: Calibration-Free Task-Agnostic MoE Pruning
Summary
The paper introduces AIMER, a novel calibration-free criterion for pruning experts in Mixture-of-Experts (MoE) language models, addressing the limitations of existing calibration-dependent methods. Current expert pruning techniques rely on calibration sets to estimate expert importance, which can lead to inconsistent pruning outcomes based on the chosen calibration data. AIMER utilizes a simple scoring mechanism based on the mean absolute weight normalized by the root-mean-square value, providing clearer expert stratification and reducing the need for extensive preprocessing. The authors evaluate AIMER on various MoE models ranging from 7B to 30B parameters, demonstrating its effectiveness at pruning ratios of 25% and 50% across 16 benchmarks. The results indicate that AIMER consistently outperforms or matches state-of-the-art calibration-based methods while significantly decreasing expert-scoring time from hours to seconds. This advancement not only enhances the efficiency of MoE models but also simplifies their deployment by minimizing memory and serving overhead.
Methodology
AIMER scores each expert based on the mean absolute weight normalized by its root-mean-square value, allowing for effective expert ranking without calibration data. The method is evaluated on three MoE model families at different pruning ratios and compared against calibration-based baselines using a consistent calibration set.
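The scoring rule itself is fully specified above and is a one-liner; the ranking helper and the pruning direction (which end of the ranking gets pruned is not stated in this summary) are assumptions:

```python
import torch

def aimer_score(w: torch.Tensor) -> float:
    """AIMER score: mean absolute weight normalized by the root-mean-square
    of the same expert's weights."""
    w = w.float().flatten()
    return (w.abs().mean() / w.pow(2).mean().sqrt()).item()

def rank_experts(expert_weights: list[torch.Tensor]) -> list[int]:
    """Return expert indices sorted by ascending AIMER score; which end is
    pruned at a 25% or 50% ratio is not specified in this summary."""
    scores = [aimer_score(w) for w in expert_weights]
    return sorted(range(len(scores)), key=scores.__getitem__)
```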
Results
AIMER shows significant improvements, particularly on the ERNIE model at a 25% pruning ratio, with coding performance increasing by 29.7% and math performance by 13.3% compared to the best baseline. Overall, AIMER maintains competitive performance while reducing expert-scoring time to 0.22–1.27 seconds, compared to 0.75–2.96 hours for calibration-based methods.
Implications
The development of AIMER could lead to more efficient deployment of MoE models in various applications, reducing memory requirements and improving performance consistency across different tasks without the need for extensive calibration processes.
The Causal Uncertainty Principle: Manifold Tearing and the Topological Limits of Counterfactual Interventions
Theory
Generative Models
Optimization
- Establishes the topological limits of counterfactual interventions in continuous spaces.
- Introduces the Counterfactual Event Horizon, defining critical transport distances for causal interventions.
- Demonstrates that extreme interventions can lead to finite-time singularities (Manifold Tearing).
- Proposes Geometry-Aware Causal Flow (GACF) to manage geometric entropy during counterfactual generation.
Read more
The Causal Uncertainty Principle: Manifold Tearing and the Topological Limits of Counterfactual Interventions
Summary
This paper addresses the challenges of applying Judea Pearl's do-calculus to continuous, high-dimensional generative models, such as Score-based Diffusion Models and Flow Matching. The authors explore the topological and measure-theoretic limits of counterfactual interventions in continuous Riemannian domains, highlighting that such interventions lead to significant topological redistributions of probability measures. They introduce the concept of the Counterfactual Event Horizon, which defines a critical transport distance beyond which identity-preserving causal transport requires divergent control energy. The paper proves that interventions beyond this horizon lead to finite-time singularities, termed Manifold Tearing, governed by Riccati equations along geodesics. To address these challenges, the authors propose Geometry-Aware Causal Flow (GACF), an algorithmic framework that utilizes Hutchinson trace estimators to manage geometric entropy during interventions. The findings emphasize that deterministic generative counterfactuals are geometrically ill-posed for extreme out-of-distribution interventions, necessitating targeted entropic regularization for robust causal inference in continuous spaces.
Methodology
The authors formalize continuous interventions through measure disintegration and Gaussian mollification, circumventing the singular entropy paradox of Dirac measures. They analyze the behavior of optimal transport maps and derive mathematical bounds related to the Hessian of the Brenier optimal transport map, leading to insights on singularities and manifold tearing. The GACF framework is developed to dynamically inject geometric entropy when necessary.
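The Hutchinson estimator named in the methodology is standard; a PyTorch sketch for a map f: Rⁿ → Rⁿ follows, with the probe count and its placement inside GACF being assumptions:

```python
import torch

def hutchinson_trace(f, x: torch.Tensor, n_probes: int = 8) -> torch.Tensor:
    """Estimate tr(∂f/∂x) via tr(J) ≈ E[vᵀJv] with Rademacher probes v.
    Requires f(x) to have the same shape as x."""
    x = x.detach().requires_grad_(True)
    y = f(x)
    est = x.new_zeros(())
    for _ in range(n_probes):
        v = torch.empty_like(x).bernoulli_(0.5).mul_(2).sub_(1)  # entries in {-1, +1}
        (vjp,) = torch.autograd.grad(y, x, grad_outputs=v, retain_graph=True)
        est = est + (v * vjp).sum()
    return est / n_probes
```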
Results
The paper proves that deterministic counterfactual generation is mathematically ill-posed under extreme interventions, leading to topological singularities. The introduction of GACF provides a scalable solution for managing these challenges, demonstrating that entropic regularization is essential for maintaining individual identity during counterfactual generation.
Implications
The findings have significant implications for the fields of causal inference and generative modeling, particularly in applications requiring high-dimensional counterfactual reasoning, such as genomics and medical imaging. The proposed methodologies could enhance the reliability of generative models in extreme scenarios.
SLEA-RL: Step-Level Experience Augmented Reinforcement Learning for Multi-Turn Agentic Training
Reinforcement Learning
Large Language Models
Robotics
- SLEA-RL retrieves experiences at each decision step, enhancing adaptability in multi-turn tasks.
- The framework employs observation clustering for efficient experience retrieval and generalization.
- A self-evolving experience library maintains quality through score-based admission and rate-limited extraction.
- SLEA-RL shows superior performance on benchmarks like ALFWorld and WebShop compared to traditional RL methods.
Read more
SLEA-RL: Step-Level Experience Augmented Reinforcement Learning for Multi-Turn Agentic Training
Summary
The paper introduces SLEA-RL, a novel framework for reinforcement learning that enhances the training of large language model (LLM) agents in multi-turn tool-use tasks. Traditional reinforcement learning methods often operate in isolation, failing to leverage experiences from previous episodes, which limits their learning efficiency. SLEA-RL addresses this by implementing step-level experience retrieval, allowing agents to access relevant experiences conditioned on their current observations at each decision step. The framework consists of three main components: (1) step-level observation clustering for efficient retrieval of structurally similar experiences, (2) a self-evolving experience library that distills successful strategies and failure patterns through score-based admission and rate-limited extraction, and (3) policy optimization with step-level credit assignment for fine-grained advantage estimation. The empirical results demonstrate that SLEA-RL significantly outperforms existing reinforcement learning baselines on long-horizon multi-turn benchmarks, showcasing its potential to improve agent performance in dynamic environments.
Methodology
SLEA-RL integrates step-level experience retrieval into the training loop, clustering observations to facilitate efficient retrieval of relevant experiences. It uses a self-evolving experience library that updates based on semantic analysis of trajectories, allowing for continuous improvement without relying solely on gradient updates. The framework also incorporates step-level credit assignment to better attribute success to specific actions taken during the episode.
Results
Experiments conducted on long-horizon multi-turn benchmarks, including ALFWorld and WebShop, reveal that SLEA-RL achieves faster convergence and higher success rates compared to various reinforcement learning baselines, demonstrating its effectiveness in leveraging accumulated experiences for improved agent performance.
Implications
The SLEA-RL framework has significant implications for the development of more adaptive and efficient LLM agents capable of handling complex, dynamic tasks. Its ability to leverage past experiences in real-time could enhance applications in areas such as interactive web search, automated planning, and other multi-turn decision-making scenarios.
A foundation model for electrodermal activity data
Time Series
- Introduction of EDAMAME, a large-scale EDA dataset from 24 public sources.
- Development of UME, the first foundation model specifically for EDA data.
- UME outperforms baseline models in 80% of scenarios while being 20 times more computationally efficient.
- Challenges in EDA modeling are acknowledged, indicating the need for further research.
Read more
A foundation model for electrodermal activity data
Summary
This paper presents a novel foundation model specifically designed for electrodermal activity (EDA) data, addressing the significant gap in large-scale, publicly available datasets for EDA. The authors introduce EDAMAME, a comprehensive collection of EDA traces compiled from 24 public datasets, totaling over 25,000 hours of data from 634 users. Utilizing this dataset, they train UME, the first dedicated foundation model for EDA. The model outperforms baseline models in eight out of ten evaluation scenarios and matches the performance of generalist time series foundation models while using roughly 20 times less compute. Despite these advancements, the study highlights ongoing challenges in EDA modeling, such as variability in balanced accuracy scores, which rarely exceed 0.7. The authors emphasize the need for further research to fully exploit the potential of EDA in various applications, and they provide all datasets, model weights, and code to facilitate future studies.
Methodology
The authors compiled a large-scale dataset (EDAMAME) from 24 public datasets, training the UME model on approximately 275 million 60-second windows of EDA data. They evaluated UME's performance against baseline models and generalist time series foundation models across various downstream tasks.
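A minimal sketch of the 60-second windowing step is shown below; the 4 Hz sampling rate and the non-overlapping stride are assumptions for illustration, since the summary does not specify them.

```python
# Illustrative sketch of the 60-second windowing described above. The 4 Hz
# sampling rate is an assumption (common for wearable EDA sensors), not a
# detail taken from the paper.
import numpy as np

def window_eda(trace: np.ndarray, fs: float = 4.0, seconds: int = 60,
               stride_seconds: int = 60) -> np.ndarray:
    """Slice a 1-D EDA trace into fixed-length windows."""
    win, hop = int(fs * seconds), int(fs * stride_seconds)
    n = 1 + max(0, (len(trace) - win) // hop)
    return np.stack([trace[i * hop : i * hop + win] for i in range(n)])

trace = np.random.default_rng(0).normal(size=4 * 3600)  # one hour at 4 Hz
windows = window_eda(trace)
print(windows.shape)  # (60, 240): 60 non-overlapping 60-second windows
```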
Results
UME outperformed baseline models in eight out of ten evaluation scenarios and matched the performance of generalist time series models while using roughly 20 times fewer computational resources. However, the study found that balanced accuracy scores for EDA modeling often did not exceed 0.7, indicating inherent challenges in the data.
Implications
The findings suggest that UME can be effectively used in applications related to cognitive load, stress, and engagement assessment through EDA. The availability of the EDAMAME dataset and UME model can spur further research and development in physiological signal analysis and wearable technology.
SCALE: Scalable Conditional Atlas-Level Endpoint transport for virtual cell perturbation prediction
Generative Models
Efficient ML
Theory
- SCALE significantly improves the efficiency of virtual cell perturbation prediction through a specialized training and inference framework.
- The model utilizes a set-aware flow architecture for stable and biologically faithful predictions of perturbation effects.
- SCALE achieves notable performance improvements over existing models on the Tahoe-100M benchmark.
- The research emphasizes the need for co-designing scalable systems and biologically grounded evaluation metrics in virtual cell modeling.
Read more
SCALE: Scalable Conditional Atlas-Level Endpoint transport for virtual cell perturbation prediction
Summary
The paper introduces SCALE, a large-scale foundation model designed for predicting virtual cell perturbations, addressing significant challenges in the field such as inefficient training pipelines, unstable modeling in high-dimensional spaces, and evaluation metrics that do not adequately reflect biological fidelity. SCALE employs a BioNeMo-based framework to enhance data throughput and scalability, achieving a 12.51× speedup in pretraining and a 1.29× speedup in inference compared to previous state-of-the-art methods. The model formulates perturbation prediction as conditional transport using a set-aware flow architecture that combines LLaMA-based cellular encoding with endpoint-oriented supervision, leading to more stable training and improved recovery of perturbation effects. Evaluated on the Tahoe-100M benchmark, SCALE outperforms existing models, improving PDCorr by 12.02% and DE Overlap by 10.66%. The findings suggest that effective virtual cell modeling requires not only advanced generative objectives but also a cohesive approach to scalable infrastructure, stable transport modeling, and biologically relevant evaluation metrics.
Methodology
SCALE employs a BioNeMo-based framework for high-throughput data processing and distributed execution, integrating a LLaMA-style set encoder and a conditional flow-matching architecture to jointly learn set-level representations and perturbation-conditioned state transitions.
Results
On the Tahoe-100M benchmark, SCALE improves PDCorr by 12.02% and DE Overlap by 10.66% compared to the previous state-of-the-art model, STATE, indicating enhanced expression prediction and recovery of perturbation effects.
Implications
The advancements presented in SCALE could facilitate more accurate in silico experimentation in biological research, enabling better hypothesis generation and mechanism-oriented screening prior to costly wet-lab experiments.
AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models
Reinforcement Learning
Multimodal
Robotics
- AcceRL eliminates synchronization barriers by decoupling training, inference, and rollouts.
- The framework integrates a trainable world model, enhancing sample efficiency by 200x.
- AcceRL achieves state-of-the-art performance on the LIBERO benchmark.
- The architecture demonstrates super-linear scaling in throughput and efficient hardware utilization.
Read more
AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models
Summary
The paper introduces AcceRL, a novel asynchronous reinforcement learning (RL) framework designed to enhance the efficiency of Vision-Language-Action (VLA) models. Traditional RL approaches face significant challenges due to synchronization barriers that hinder computational efficiency and data acquisition. AcceRL addresses these issues by decoupling training, inference, and rollouts into independent asynchronous streams, thus maximizing hardware utilization and achieving super-linear scaling in throughput. A key innovation of AcceRL is the integration of a trainable world model, which allows for the generation of virtual experiences, significantly improving sample efficiency and training stability. Experimental results on the LIBERO benchmark demonstrate that AcceRL achieves state-of-the-art performance across various tasks, showcasing its effectiveness in complex control scenarios. The framework's ability to bypass the limitations of physical simulators by generating high-fidelity synthetic experiences marks a significant advancement in the field of embodied artificial intelligence.
Methodology
AcceRL employs a fully asynchronous architecture that isolates different components of the RL process. It incorporates a plug-and-play world model that generates synthetic experiences, allowing for continuous refinement of both the world model and the policy. The framework leverages advanced optimization techniques to maximize throughput and minimize latency, addressing the challenges posed by traditional synchronous RL frameworks.
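The decoupling idea can be illustrated in a few lines of Python: rollout workers and a trainer run as independent threads connected by a queue, so data generation never waits on gradient steps. This is a schematic of the architecture only; the actual system distributes these streams across GPUs.

```python
# Minimal sketch of the decoupling idea: rollout workers and a trainer run as
# independent streams connected by a queue, so neither blocks on the other.
# This is an illustration of the architecture, not the authors' system.
import queue
import threading
import time

experience_q: "queue.Queue[int]" = queue.Queue(maxsize=100)

def rollout_worker(worker_id: int, n_steps: int):
    for step in range(n_steps):
        time.sleep(0.001)                 # stand-in for env/world-model rollout
        experience_q.put(worker_id * 1000 + step)

def trainer(n_updates: int):
    for _ in range(n_updates):
        batch = [experience_q.get() for _ in range(8)]  # blocks only on data
        _ = sum(batch)                    # stand-in for a gradient update

workers = [threading.Thread(target=rollout_worker, args=(i, 40)) for i in range(2)]
train_thread = threading.Thread(target=trainer, args=(10,))
for t in workers + [train_thread]:
    t.start()
for t in workers + [train_thread]:
    t.join()
print("asynchronous rollout/training finished")
```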
Results
AcceRL achieved state-of-the-art performance on the LIBERO benchmark, demonstrating significant improvements in sample efficiency and training stability. The framework exhibited super-linear scaling in throughput with the number of trainer GPUs, indicating highly efficient hardware utilization. The world model's integration allowed for a remarkable 200x improvement in online sample efficiency compared to traditional methods.
Implications
The advancements presented in AcceRL have the potential to revolutionize the training of large-scale VLA models, enabling more efficient and effective learning in embodied AI applications. The ability to generate high-fidelity synthetic experiences could lead to broader applications in robotics, autonomous systems, and interactive AI agents, where real-world data acquisition is costly or impractical.
Adaptive Regime-Aware Stock Price Prediction Using Autoencoder-Gated Dual Node Transformers with Reinforcement Learning Control
Reinforcement Learning
Time Series
- Introduces an adaptive framework for stock price prediction that identifies market regime shifts.
- Utilizes an autoencoder for weakly supervised anomaly detection based on reconstruction errors.
- Employs dual node transformers for specialized processing of stable and volatile market conditions.
- Incorporates a Soft Actor-Critic reinforcement learning controller for dynamic threshold adjustment.
Read more
Adaptive Regime-Aware Stock Price Prediction Using Autoencoder-Gated Dual Node Transformers with Reinforcement Learning Control
Summary
This paper presents an innovative framework for stock price prediction that adapts to varying market regimes, addressing the limitations of traditional models that often fail during volatile periods. The proposed system employs an autoencoder to detect anomalies in market conditions by measuring reconstruction errors from normal market data. It utilizes dual node transformer networks, each tailored for stable and event-driven market conditions, to enhance prediction accuracy. A Soft Actor-Critic (SAC) reinforcement learning controller dynamically adjusts the regime detection threshold and the blending of prediction pathways based on real-time performance feedback. The framework was tested on 20 S&P 500 stocks from 1982 to 2025, achieving a mean absolute percentage error (MAPE) of 0.59% with the adaptive system, a significant improvement over the baseline model's 0.80%. The system also demonstrated robust performance during high-volatility periods, maintaining MAPE below 0.85%. Ablation studies indicated that each component of the framework contributes meaningfully to its overall performance, with the autoencoder routing being the most impactful.
Methodology
The methodology involves three main components: an autoencoder for anomaly detection trained on normal market data, dual node transformer networks designed for different market conditions, and a Soft Actor-Critic reinforcement learning controller that adapts the regime detection threshold and pathway blending based on prediction performance feedback.
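The routing logic can be made concrete with a small sketch: the autoencoder's reconstruction error on the current window gates a blend between a stable-regime and a volatile-regime predictor. The sigmoid blend, the toy predictors, and the fixed threshold are all illustrative; in the paper the threshold and pathway blending are adjusted online by the SAC controller.

```python
# Sketch of the autoencoder-gated routing described above: the reconstruction
# error of the current window decides how much weight the "volatile" pathway
# receives. The smooth sigmoid blend and toy predictors are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_prediction(window, autoencoder, stable_net, volatile_net,
                     threshold: float, temperature: float = 0.1):
    err = np.mean((window - autoencoder(window)) ** 2)     # anomaly score
    w_volatile = sigmoid((err - threshold) / temperature)  # regime gate
    return (1 - w_volatile) * stable_net(window) + w_volatile * volatile_net(window)

# Toy stand-ins (the paper uses an autoencoder and two transformer branches).
identityish = lambda w: 0.9 * w               # "autoencoder" reconstructs most of w
stable = lambda w: w[-1]                      # persistence forecast
volatile = lambda w: w[-1] + (w[-1] - w[-2])  # momentum forecast

window = np.array([100.0, 101.0, 103.0, 108.0])
print(gated_prediction(window, identityish, stable, volatile, threshold=5.0))
```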
Results
The proposed framework achieved a MAPE of 0.59% with the adaptive system, compared to 0.80% for the baseline model. Directional accuracy improved to 72% with the complete framework. The system maintained robust performance during high-volatility periods, with MAPE below 0.85%. Ablation studies showed that removing the autoencoder routing degraded MAPE the most (36%), followed by removing the SAC controller (15%) and the dual-path architecture (7%).
Implications
The findings suggest that adaptive regime-aware models can significantly enhance stock price prediction accuracy, particularly in volatile market conditions. This approach could be applied to other financial forecasting tasks and may lead to more resilient trading strategies that dynamically adjust to changing market environments.
Context Bootstrapped Reinforcement Learning
Reinforcement Learning
Large Language Models
- CBRL effectively addresses exploration inefficiency in RLVR by using in-context learning.
- The method employs a curriculum-based approach for injecting few-shot demonstrations.
- CBRL shows consistent performance improvements across various tasks and model families.
- The approach is algorithm-agnostic, yielding benefits regardless of the underlying reinforcement learning algorithm.
Read more
Context Bootstrapped Reinforcement Learning
Summary
The paper introduces Context Bootstrapped Reinforcement Learning (CBRL), a novel approach designed to address the exploration inefficiency prevalent in Reinforcement Learning from Verifiable Rewards (RLVR). CBRL enhances the training process by stochastically prepending few-shot demonstrations to training prompts, with an injection probability that starts high and is gradually annealed to zero. This method encourages models to internalize reasoning patterns rather than relying on external examples during testing. The authors validate CBRL across two model families and five Reasoning Gym tasks, demonstrating significant improvements in success rates and exploration efficiency. The approach is also tested on Q, a domain-specific programming language, showcasing its versatility and effectiveness in diverse contexts. The results indicate that CBRL consistently outperforms baseline methods, leading to better learning dynamics and robust performance even after the removal of demonstrations.
Methodology
CBRL integrates a bank of solved examples into the training process, using a stochastic injection mechanism to prepend these examples to training prompts. The probability of injection is high at the start of training and decreases over time, allowing the model to gradually rely less on external guidance. This method is validated through experiments on multiple tasks and programming languages, analyzing learning dynamics and performance metrics.
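A minimal sketch of this injection curriculum appears below. The linear decay schedule and the two-shot prompt format are assumptions for illustration; the paper's exact schedule is not given in this summary.

```python
# Sketch of the curriculum described above: demonstrations are prepended to a
# training prompt with a probability that decays to zero over training. The
# linear decay schedule is an assumption; the paper's schedule may differ.
import random

def injection_prob(step: int, total_steps: int, p_start: float = 0.9) -> float:
    return max(0.0, p_start * (1.0 - step / total_steps))

def build_prompt(question: str, demo_bank: list[str], step: int,
                 total_steps: int, n_shots: int = 2) -> str:
    if random.random() < injection_prob(step, total_steps):
        shots = random.sample(demo_bank, k=min(n_shots, len(demo_bank)))
        return "\n\n".join(shots) + "\n\n" + question
    return question  # late in training the model sees bare prompts

bank = ["Q: 2+2?\nA: 4", "Q: 3*3?\nA: 9", "Q: 10-7?\nA: 3"]
for step in (0, 500, 999):
    print(f"step {step}: p={injection_prob(step, 1000):.2f}")
    print(build_prompt("Q: 5+8?", bank, step, 1000), "\n---")
```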
Results
CBRL consistently outperformed the GRPO-only baseline across all tested model-environment pairs, with accuracy improvements ranging from +1.3% to +22.3%. In the Q programming language, CBRL enhanced the test-pass rate from 27.3% to 43.0% and Pass@1 from 5.0% to 26.3%. The method demonstrated significant gains in exploration efficiency and learning dynamics, achieving higher mean rewards early in training.
Implications
The findings suggest that CBRL can be effectively applied to various reinforcement learning tasks, particularly those requiring novel reasoning patterns or domain-specific knowledge. Its algorithm-agnostic nature and adaptability to different contexts make it a valuable approach for enhancing the performance of large language models in complex reasoning scenarios.
Attack by Unlearning: Unlearning-Induced Adversarial Attacks on Graph Neural Networks
Graph Learning
Optimization
Theory
- Introduction of unlearning corruption attacks that exploit the unlearning process in GNNs.
- Formulation of the attack as a bi-level optimization problem to address black-box challenges.
- Demonstration of significant accuracy degradation in GNNs due to carefully crafted unlearning requests.
- Highlighting the stealthy nature of the attack, which remains undetected during training.
Read more
Attack by Unlearning: Unlearning-Induced Adversarial Attacks on Graph Neural Networks
Summary
This paper explores the vulnerabilities of Graph Neural Networks (GNNs) in the context of approximate graph unlearning, which is essential for complying with privacy regulations. The authors introduce a novel attack paradigm termed 'unlearning corruption attacks,' where adversaries inject carefully crafted nodes into the training graph and subsequently request their deletion. This process exploits the performance degradation that occurs during unlearning, leading to significant accuracy drops in the model. The authors formulate this attack as a bi-level optimization problem, addressing challenges such as black-box unlearning and label scarcity by using gradient-based updates and surrogate models for pseudo-label generation. Through extensive experiments, the authors demonstrate that even small, strategically designed unlearning requests can severely impact the performance of GNNs, highlighting critical concerns regarding the robustness of unlearning methods under regulatory demands.
Methodology
The authors propose a bi-level optimization framework to model the unlearning corruption attack. They approximate the unlearning process using gradient-based updates and utilize a surrogate model to generate pseudo-labels for unlabeled nodes, allowing for effective optimization of node injection strategies that maximize post-unlearning damage while maintaining stealthiness during training.
Results
The experiments conducted across various benchmarks and unlearning algorithms reveal that the proposed unlearning corruption attacks can induce substantial performance degradation in GNNs. The results underscore the vulnerability of GNNs to adversarial manipulation during the unlearning process, raising concerns about their reliability in real-world applications where privacy regulations are enforced.
Implications
The findings of this study have significant implications for the deployment of GNNs in privacy-sensitive applications. They highlight the need for more robust unlearning methods that can withstand adversarial attacks, ensuring compliance with privacy regulations while maintaining model performance. This research could inform the development of more secure graph learning frameworks and influence policy discussions around data privacy and model accountability.
Hierarchical Latent Structure Learning through Online Inference
Theory
Efficient ML
Time Series
- HOLMES integrates hierarchical representation with online inference for latent structure learning.
- The model utilizes a nested Chinese Restaurant Process prior and sequential Monte Carlo methods.
- HOLMES achieves compact representations that support one-shot transfer to higher-level categories.
- The model outperforms flat models in context-dependent tasks with nested temporal structures.
Read more
Hierarchical Latent Structure Learning through Online Inference
Summary
The paper presents the Hierarchical Online Learning of Multiscale Experience Structure (HOLMES) model, which addresses the challenge of balancing generalization and discrimination in learning systems. Traditional online latent-cause models assume flat partitions, while hierarchical Bayesian models require offline inference. HOLMES combines a nested Chinese Restaurant Process prior with sequential Monte Carlo inference, enabling trial-by-trial inference over hierarchical latent representations without explicit supervision. The model was evaluated through simulations, demonstrating that it matched the predictive performance of flat models while learning more compact representations that facilitated one-shot transfer to higher-level latent categories. In a context-dependent task with nested temporal structure, HOLMES improved outcome prediction compared to flat models, showcasing its capability to discover hierarchical structures in sequential data.
Methodology
HOLMES is formalized as a Bayesian nonparametric model where observations are assigned to paths through a latent tree. It combines a hierarchical prior over tree structures with sequential Monte Carlo inference, allowing for online updates and dynamic expansion of latent trees based on sequential observations.
Results
In simulations, HOLMES matched the predictive performance of flat models while learning more compact representations. It also improved outcome prediction in context-dependent tasks with nested temporal structures, indicating its effectiveness in capturing latent rule structures across different contexts and timescales.
Implications
The HOLMES model has potential applications in various fields requiring hierarchical structure learning from sequential data, such as cognitive science, artificial intelligence, and decision-making systems. Its ability to perform online inference could enhance adaptive learning systems in dynamic environments.
Binary Latent Protein Fitness Landscapes for Quantum Annealing Optimization
Optimization
- Introduction of Q-BioLat framework for protein fitness landscape optimization.
- Utilizes pretrained protein language models to create binary latent representations.
- Models protein fitness as a QUBO problem for efficient combinatorial optimization.
- Demonstrates effective identification of high-fitness protein variants on the ProteinGym benchmark.
Read more
Binary Latent Protein Fitness Landscapes for Quantum Annealing Optimization
Summary
This paper introduces Q-BioLat, a novel framework designed for modeling and optimizing protein fitness landscapes in binary latent spaces. The framework begins with protein sequences, utilizing pretrained protein language models to generate continuous embeddings, which are subsequently transformed into compact binary representations. In this binary space, protein fitness is modeled using a quadratic unconstrained binary optimization (QUBO) approach, facilitating efficient combinatorial search through classical optimization techniques such as simulated annealing and genetic algorithms. The authors evaluate Q-BioLat on the ProteinGym benchmark, demonstrating its ability to capture significant structures in protein fitness landscapes and identify high-fitness variants. The results indicate that even with a straightforward binarization method, the framework consistently retrieves sequences that rank highly within the training fitness distribution. The study also reveals that different optimization strategies yield varying performances based on the dimensionality of the latent space, emphasizing the importance of representation design. Moreover, Q-BioLat serves as a bridge between protein representation learning and combinatorial optimization, making it compatible with quantum annealing hardware, thus paving the way for future quantum-assisted protein engineering.
Methodology
The methodology involves encoding protein sequences using pretrained protein language models to obtain continuous embeddings, which are then transformed into binary latent representations. Protein fitness is modeled as a QUBO problem, allowing the application of combinatorial optimization techniques like simulated annealing and genetic algorithms to explore the fitness landscape efficiently.
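As a concrete illustration of the optimization stage, the sketch below minimizes a QUBO energy over binary vectors with simulated annealing. The random symmetric `Q` stands in for coefficients fit to a latent fitness landscape, and the geometric cooling schedule is a standard choice rather than the paper's.

```python
# Sketch of the QUBO formulation and simulated-annealing search described
# above. A random symmetric Q stands in for a fitness model fit on binary
# latent codes; the annealing schedule is a standard default.
import numpy as np

rng = np.random.default_rng(0)
n = 32
Q = rng.normal(size=(n, n))
Q = (Q + Q.T) / 2                       # symmetric QUBO coefficients

def energy(x: np.ndarray) -> float:
    return float(x @ Q @ x)             # E(x) = x^T Q x, x in {0,1}^n

def simulated_annealing(steps=5000, t0=2.0, t1=0.01):
    x = rng.integers(0, 2, size=n)
    best, best_e = x.copy(), energy(x)
    for s in range(steps):
        t = t0 * (t1 / t0) ** (s / steps)          # geometric cooling
        i = rng.integers(n)
        x_new = x.copy(); x_new[i] ^= 1            # single bit flip
        dE = energy(x_new) - energy(x)
        if dE < 0 or rng.random() < np.exp(-dE / t):
            x = x_new
            if energy(x) < best_e:
                best, best_e = x.copy(), energy(x)
    return best, best_e

best, best_e = simulated_annealing()
print(f"lowest energy found: {best_e:.3f}")
```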
Results
The evaluation on the ProteinGym benchmark shows that Q-BioLat effectively captures meaningful structures in protein fitness landscapes, consistently identifying high-fitness variants. The framework's optimization strategies exhibit distinct behaviors based on the latent space's dimensionality, with evolutionary search performing better in higher dimensions.
Implications
Q-BioLat's formulation of protein fitness landscapes as QUBO problems opens new avenues for integrating quantum optimization techniques with protein engineering, potentially enhancing the efficiency of protein design and optimization processes.
Taming Epilepsy: Mean Field Control of Whole-Brain Dynamics
Graph Learning
Optimization
Theory
- Introduction of the GK-MFG framework for controlling epileptic seizures.
- Integration of Reservoir Computing with graph-theoretic modeling for effective neural dynamics control.
- Use of graph Laplacian constraints to respect the brain's functional topology.
- Demonstration of robust seizure suppression in high-dimensional networks.
Read more
Taming Epilepsy: Mean Field Control of Whole-Brain Dynamics
Summary
This paper addresses the challenge of controlling high-dimensional neural dynamics during epileptic seizures, which is complicated by the brain's nonlinear characteristics and complex connectivity. The authors propose a novel framework called Graph-Regularized Koopman Mean-Field Game (GK-MFG), which integrates Reservoir Computing (RC) for approximating the Koopman operator with an Alternating Population and Agent Control Network (APAC-Net) to solve distributional control problems. By embedding Electroencephalogram (EEG) dynamics into a linear latent space and applying graph Laplacian constraints derived from the Phase Locking Value (PLV), the GK-MFG framework achieves robust seizure suppression while maintaining the brain's functional topological structure. The study emphasizes the importance of graph-theoretic modeling, reservoir computing, and mean field optimal control in addressing the complexities of brain dynamics and presents a coherent methodology for neuromodulation in epileptic networks.
Methodology
The methodology involves establishing a mean field distribution control objective using APAC-Net, incorporating graph Laplacian regularization to embed the brain's functional topology into the control framework, and employing the RC-Koopman operator to address computational challenges associated with high-frequency nonlinear predictions.
Results
The GK-MFG framework demonstrated unprecedented robustness and accuracy in suppressing seizures within high-dimensional epileptic networks, effectively managing the complex dynamics of neural populations while adhering to the brain's structural constraints.
Implications
The findings suggest potential applications in developing advanced neuromodulation therapies for epilepsy and other neurological disorders, as well as enhancing the understanding of brain dynamics through integrated computational frameworks.
Variational Phasor Circuits for Phase-Native Brain-Computer Interface Classification
Theory
Efficient ML
Time Series
- Introduction of the Variational Phasor Circuit (VPC) as a phase-native learning architecture.
- VPC replaces dense weight matrices with trainable phase shifts, enhancing parameter efficiency.
- Demonstrated competitive performance on synthetic BCI benchmarks with fewer parameters than traditional methods.
- VPC serves as a bridge between classical oscillatory signal processing and future quantum systems.
Read more
Variational Phasor Circuits for Phase-Native Brain-Computer Interface Classification
Summary
This paper introduces the Variational Phasor Circuit (VPC), a novel deterministic classical learning architecture designed for brain-computer interface (BCI) classification tasks. The VPC operates on the continuous S1 unit circle manifold, utilizing trainable phase shifts and local unitary mixing instead of traditional dense real-valued weight matrices. This phase-native approach allows for effective binary and multi-class classification of spatially distributed signals, particularly in the context of EEG data. The architecture supports compact phase-based decision boundaries and can be extended through stacked VPC compositions with inter-block normalization. The authors demonstrate the efficacy of VPC using synthetic BCI benchmarks, achieving competitive accuracy in mental-state classification tasks while significantly reducing the number of trainable parameters compared to standard Euclidean classifiers. The findings suggest that unit-circle phase interference can serve as a mathematically principled alternative to conventional neural computations, positioning VPC as both a standalone classifier and a potential front-end encoding layer for future hybrid phasor-quantum systems.
Methodology
The VPC is constructed using the PhasorFlow framework, which formalizes computation on the unit circle S1. Data is encoded as unit-magnitude complex states, with trainable parameters represented as phase shifts. The architecture employs local mixing and spectral interference to induce global structure without relying on dense Euclidean matrices. The performance of VPC is evaluated on a synthetic 32-channel BCI benchmark designed for binary and multi-class mental-state decoding.
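A schematic forward pass of such a layer is sketched below: inputs are mapped onto the unit circle, the only trainable parameters are per-channel phase shifts, and a unitary FFT provides global mixing through interference. This is one plausible reading of the description, not the PhasorFlow implementation; note that each layer holds only as many parameters as channels, which is the source of the parameter efficiency claimed above.

```python
# Forward-pass sketch of a phase-native layer in the spirit described above:
# inputs are encoded on the unit circle, trainable parameters are phase
# shifts, and a unitary DFT provides global mixing through interference.
# This is a schematic reading of the architecture, not the PhasorFlow code.
import numpy as np

rng = np.random.default_rng(0)
n_channels = 32

def phasor_layer(x: np.ndarray, phase_params: np.ndarray) -> np.ndarray:
    z = np.exp(1j * (np.pi * np.tanh(x) + phase_params))  # unit-magnitude states
    mixed = np.fft.fft(z, norm="ortho")                    # unitary mixing
    return np.angle(mixed)                                 # back to phases

phases1 = rng.uniform(-np.pi, np.pi, size=n_channels)      # trainable in practice
phases2 = rng.uniform(-np.pi, np.pi, size=n_channels)
readout = rng.normal(size=n_channels)

x = rng.normal(size=n_channels)                            # one EEG-like sample
h = phasor_layer(phasor_layer(x, phases1), phases2)        # stacked VPC blocks
logit = float(np.cos(h) @ readout)                         # phase-aligned readout
print("class:", int(logit > 0))
```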
Results
The VPC achieved competitive accuracy in decoding complex mental states from EEG-like data, outperforming traditional classifiers in terms of parameter efficiency. The results highlight the effectiveness of phase-aligned representations in capturing the oscillatory nature of EEG signals, demonstrating that VPC can serve as a robust alternative to dense neural networks.
Implications
The VPC presents a promising approach for BCI applications, particularly in real-time mental-state classification. Its design also suggests potential for integration with quantum computing frameworks, paving the way for future hybrid systems that leverage both classical and quantum processing capabilities.
A Family of Adaptive Activation Functions for Mitigating Failure Modes in Physics-Informed Neural Networks
Theory
Optimization
Efficient ML
- Introduction of adaptive wavelet-based activation functions to enhance PINNs.
- Improved training stability and expressive power compared to traditional activation functions.
- Evaluation across multiple classes of PDEs demonstrating superior performance.
- Validation against various models, including baseline PINNs and transformer architectures.
Read more
A Family of Adaptive Activation Functions for Mitigating Failure Modes in Physics-Informed Neural Networks
Summary
This paper introduces a novel family of adaptive wavelet-based activation functions designed to enhance the performance of Physics-Informed Neural Networks (PINNs). PINNs have shown promise in solving complex scientific and engineering problems by integrating physical laws into neural network training. However, they often encounter failure modes, particularly in handling oscillatory patterns and high-frequency features. The proposed activation functions combine trainable wavelet functions with traditional activation functions like hyperbolic tangent and softplus, improving training stability and expressive power. Five distinct activation functions are developed and evaluated across four classes of partial differential equations (PDEs). The results demonstrate that these adaptive functions significantly outperform traditional activation functions in terms of robustness and accuracy. The effectiveness of the proposed approach is validated through comparisons with baseline PINNs, transformer-based architectures, and other deep learning models, showcasing its generality and applicability in various domains.
Methodology
The study develops five distinct adaptive activation functions that integrate trainable wavelet functions with either trainable or fixed hyperbolic tangent and softplus functions. These functions are systematically evaluated in the context of PINNs applied to four classes of PDEs, with performance metrics compared against traditional activation functions and other neural network architectures.
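One plausible member of such a family is sketched below in PyTorch: a trainable Ricker (Mexican hat) wavelet term added to a trainable-frequency tanh. The paper defines five specific variants; this example only illustrates the trainable-wavelet pattern and is not one of the authors' exact functions.

```python
# One plausible member of such a family, sketched in PyTorch: a trainable
# Ricker ("Mexican hat") wavelet term added to tanh. The paper's five
# variants differ; this only illustrates the trainable-wavelet pattern.
import torch
import torch.nn as nn

class WaveletTanh(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(1.0))  # wavelet amplitude
        self.b = nn.Parameter(torch.tensor(1.0))  # wavelet width
        self.w = nn.Parameter(torch.tensor(1.0))  # tanh frequency

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        wavelet = self.a * (1 - (self.b * x) ** 2) * torch.exp(-0.5 * (self.b * x) ** 2)
        return torch.tanh(self.w * x) + wavelet

# Drop-in use inside a small PINN-style MLP:
net = nn.Sequential(
    nn.Linear(1, 32), WaveletTanh(),
    nn.Linear(32, 32), WaveletTanh(),
    nn.Linear(32, 1),
)
print(net(torch.linspace(-1, 1, 5).unsqueeze(-1)).shape)  # torch.Size([5, 1])
```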
Results
The proposed wavelet-based activation functions showed significant improvements in robustness and accuracy when applied to PINNs. Comprehensive comparisons indicated that these functions effectively mitigate common failure modes associated with traditional PINN architectures, leading to enhanced performance in solving PDEs.
Implications
The findings suggest that incorporating adaptive wavelet-based activation functions can substantially improve the performance of PINNs in scientific computing and engineering applications. This approach may facilitate more accurate solutions to complex PDEs, potentially impacting fields such as fluid dynamics, medical imaging, and quantum systems.
Self-Tuning Sparse Attention: Multi-Fidelity Hyperparameter Optimization for Transformer Acceleration
NLP
Large Language Models
Efficient ML
- AFBS-BO automates the hyperparameter tuning process for sparse attention, eliminating the need for manual grid search.
- The method achieves a 3.4x speedup in hyperparameter discovery and requires 8.8x fewer evaluations than traditional methods.
- Configurations discovered by AFBS-BO outperform existing sparse attention baselines while closely matching the quality of dense attention.
- The framework leverages multi-fidelity evaluations to efficiently explore the hyperparameter space.
Read more
Self-Tuning Sparse Attention: Multi-Fidelity Hyperparameter Optimization for Transformer Acceleration
Summary
This paper addresses the challenges of deploying sparse attention mechanisms in transformer models, which are hindered by the need for manual hyperparameter tuning. The authors introduce AFBS-BO (Adaptive Fidelity Binary Search with Bayesian Optimization), an automated framework that optimizes layer- and head-specific hyperparameters for sparse attention without human intervention. AFBS-BO combines Bayesian Optimization for global exploration with binary search for local refinement, utilizing multi-fidelity evaluations to enhance efficiency. The proposed method accelerates hyperparameter discovery by 3.4 times and reduces evaluations by 8.8 times compared to traditional grid search methods. The results demonstrate that AFBS-BO can identify high-sparsity configurations that outperform existing sparse attention techniques while maintaining comparable performance to dense attention. This advancement transforms sparse attention from a manually tuned heuristic into a self-optimizing component, facilitating its integration across various transformer architectures and applications.
Methodology
The AFBS-BO framework employs a three-stage hybrid algorithm: (1) Bayesian Optimization for global exploration of the hyperparameter space using low-fidelity evaluations, (2) Binary Search Refinement for precise tuning using high-fidelity evaluations, and (3) Multi-Input Validation to ensure robustness across diverse inputs.
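Stage (2) can be illustrated with a short sketch: binary search for the highest sparsity whose perplexity stays within a tolerance of the dense model. The `evaluate_ppl` callable and the tolerance value are hypothetical stand-ins for the framework's high-fidelity evaluations.

```python
# Sketch of the binary-search refinement stage: find the highest sparsity
# whose quality degradation stays within a tolerance. `evaluate_ppl` is a
# hypothetical stand-in for a high-fidelity perplexity evaluation.
def refine_sparsity(evaluate_ppl, dense_ppl: float, tol: float = 0.05,
                    lo: float = 0.0, hi: float = 1.0, iters: int = 12) -> float:
    """Return the largest sparsity with ppl <= dense_ppl * (1 + tol)."""
    best = lo
    for _ in range(iters):
        mid = (lo + hi) / 2
        if evaluate_ppl(mid) <= dense_ppl * (1 + tol):
            best, lo = mid, mid          # quality OK: push sparsity higher
        else:
            hi = mid                     # too lossy: back off
    return best

# Toy monotone quality model standing in for real evaluations.
fake_ppl = lambda s: 7.0 + 200.0 * max(0.0, s - 0.7) ** 2
print(f"refined sparsity: {refine_sparsity(fake_ppl, dense_ppl=7.0):.3f}")
```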
Results
AFBS-BO demonstrated a hyperparameter discovery time of 3.0 seconds for the 12-layer Llama-2-7B model, significantly faster than the 10.1 seconds required by grid search. It achieved a perplexity of 7.45 at 70.7% sparsity on WikiText-2, outperforming the state-of-the-art H2O method and nearing the theoretical Top-K oracle performance.
Implications
The introduction of AFBS-BO could facilitate the broader adoption of sparse attention mechanisms in production environments, allowing for more efficient and scalable transformer models across various applications in natural language processing and beyond.
Discovering Decoupled Functional Modules in Large Language Models
Large Language Models
Interpretability
- Introduces the function module discovery problem in LLMs, addressing a critical gap in interpretability research.
- Develops the ULCMOD framework with a novel objective function and IterD algorithm for effective module identification.
- Demonstrates superior performance in module discovery compared to existing clustering methods.
- Provides insights into the organization of functions within LLMs, highlighting comprehensiveness and hierarchy.
Read more
Discovering Decoupled Functional Modules in Large Language Models
Summary
This paper addresses the challenge of understanding the internal functional organization of Large Language Models (LLMs) by proposing a novel framework called Unsupervised LLM Cross-layer MOdule Discovery (ULCMOD). The authors formulate the problem of discovering decoupled function modules within LLMs, which has been largely unexplored. The ULCMOD framework disentangles the vast set of neurons in LLMs into distinct modules while simultaneously identifying the topics of input samples associated with these modules. A new objective function is introduced to optimize intra-module activation density and maintain module size balance. The Iterative Decoupling (IterD) algorithm is developed to navigate the complex search space efficiently. Extensive experiments conducted on the Qwen2.5 LLM family demonstrate that the proposed method outperforms existing clustering algorithms in identifying high-quality, disentangled modules that capture meaningful semantic information. Qualitative analyses reveal that the discovered modules exhibit comprehensive functionality, hierarchical organization, and spatial arrangement, contributing significantly to the interpretability of LLMs.
Methodology
The authors propose the ULCMOD framework, which formulates a dual row-column partition problem on the LLM activation matrix. A novel objective function is introduced to optimize the density of activations within modules while balancing their sizes. The IterD algorithm iteratively adjusts neuron and sample partitions to optimize this objective, allowing for effective discovery of decoupled function modules.
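To make the objective concrete, the sketch below scores one candidate row-column partition of a binary activation matrix by intra-module activation density plus an entropy-based size-balance term. The specific balance penalty is an illustrative stand-in for the paper's objective; IterD would iteratively update the labels to improve such a score.

```python
# Sketch of scoring a dual row-column partition of an activation matrix:
# samples (rows) and neurons (columns) share module labels, and the score
# rewards dense activation inside modules. The size-balance penalty is an
# illustrative stand-in for the paper's exact objective.
import numpy as np

def module_score(acts: np.ndarray, sample_labels: np.ndarray,
                 neuron_labels: np.ndarray, n_modules: int,
                 balance_weight: float = 0.1) -> float:
    density = 0.0
    for m in range(n_modules):
        block = acts[np.ix_(sample_labels == m, neuron_labels == m)]
        if block.size:
            density += block.mean()
    sizes = np.bincount(neuron_labels, minlength=n_modules) / len(neuron_labels)
    balance = -np.sum(sizes * np.log(sizes + 1e-12))  # entropy: high = balanced
    return density + balance_weight * balance

rng = np.random.default_rng(0)
acts = (rng.random((200, 50)) > 0.8).astype(float)   # binary activation matrix
s_lab = rng.integers(0, 3, size=200)
n_lab = rng.integers(0, 3, size=50)
print(f"score: {module_score(acts, s_lab, n_lab, 3):.3f}")
```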
Results
The experimental results indicate that the ULCMOD framework successfully identifies high-quality, disentangled modules that outperform existing clustering algorithms. The discovered modules are shown to capture more meaningful semantic information and enhance performance in various downstream tasks. Qualitative analyses confirm the presence of function comprehensiveness, hierarchy, and spatial arrangement within the identified modules.
Implications
The findings of this research provide a significant advancement in the interpretability of LLMs, offering insights into their internal functional organization. This can lead to improved trustworthiness, better diagnosis of model performance, and potential enhancements in model capabilities. The ULCMOD framework can serve as a foundational tool for future research in LLM interpretability.
HEP Statistical Inference for UAV Fault Detection: CLs, LRT, and SBI Applied to Blade Damage
Robotics
Time Series
Theory
- Introduction of HEP statistical methods to UAV fault detection.
- Unified inference framework providing binary detection, false alarm control, and fault characterization.
- Significant performance improvement over traditional methods in detecting blade damage.
- SNPE offers calibrated uncertainty estimates for fault severity.
Read more
HEP Statistical Inference for UAV Fault Detection: CLs, LRT, and SBI Applied to Blade Damage
Summary
This paper presents a novel approach to UAV fault detection by applying statistical methods from high-energy physics (HEP) to detect blade damage in multirotor UAVs. The authors introduce three key techniques: the likelihood ratio test (LRT) for binary detection, the CLs method for controlling false alarm rates, and sequential neural posterior estimation (SNPE) for characterizing fault severity. The system operates on spectral features related to rotor harmonic physics, yielding outputs that include binary detection, controlled false alarm rates, and calibrated posteriors regarding fault severity and motor location. The proposed methods were evaluated on two datasets: UAV-FD, which includes 18 real flights with varying levels of blade damage, and PADRE, a quadrotor platform. The results indicate that the composite LRT achieved an area under the curve (AUC) of 0.862 ± 0.007 on UAV-FD, significantly outperforming traditional methods such as CUSUM and autoencoders. Furthermore, at a 5% false alarm rate, the system detected 93% of significant and 81% of subtle blade damage. The SNPE method provided a comprehensive posterior distribution over fault severity, allowing for a more nuanced understanding of damage compared to existing methods. Overall, this work demonstrates the effectiveness of HEP statistical tools in enhancing UAV fault detection capabilities.
Methodology
The methodology involves applying three statistical techniques from high-energy physics: the likelihood ratio test (LRT) for binary detection, the CLs method for controlling false alarm rates, and sequential neural posterior estimation (SNPE) for estimating fault severity. The system analyzes spectral features from rotor vibrations to differentiate between healthy and faulty conditions.
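The binary-detection component can be illustrated as follows: a log-likelihood-ratio statistic for a spectral feature under Gaussian healthy and damaged models, with the decision threshold calibrated on healthy data for a 5% false-alarm rate. The two fitted distributions below are hypothetical placeholders, not values from the paper.

```python
# Sketch of the binary-detection piece: a log-likelihood-ratio statistic for
# a spectral feature under Gaussian "healthy" vs "damaged" models. The
# feature and the two fitted distributions are illustrative placeholders.
import numpy as np
from scipy.stats import norm

# Hypothetical fitted models of a rotor-harmonic amplitude feature.
healthy = norm(loc=1.0, scale=0.2)
damaged = norm(loc=1.6, scale=0.35)

def llr(features: np.ndarray) -> float:
    # sum over i of log p(x_i | damaged) - log p(x_i | healthy)
    return float(np.sum(damaged.logpdf(features) - healthy.logpdf(features)))

rng = np.random.default_rng(0)
flight = damaged.rvs(size=50, random_state=rng)        # simulate a damaged flight
# Calibrate the decision threshold on healthy flights for a 5% false-alarm rate.
null_stats = [llr(healthy.rvs(size=50, random_state=rng)) for _ in range(2000)]
threshold = np.quantile(null_stats, 0.95)
print("fault detected:", llr(flight) > threshold)
```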
Results
The composite LRT achieved an AUC of 0.862 ± 0.007 on the UAV-FD dataset and 0.986 on the PADRE dataset, outperforming baseline methods such as CUSUM and autoencoders. The system detected 93% of significant and 81% of subtle blade damage at a 5% false alarm rate, with SNPE providing a posterior distribution that covered 90% of credible intervals with a mean absolute error of 0.012.
Implications
The findings suggest that HEP statistical methods can significantly enhance the reliability and accuracy of UAV fault detection systems, particularly in safety-critical applications. This approach could lead to improved operational safety in autonomous UAVs by providing more detailed diagnostics and reducing false alarms.
Physics-Aware Machine Learning for Seismic and Volcanic Signal Interpretation
Time Series
- Machine learning is becoming essential for processing seismic and volcanic data, but must adapt to domain shifts.
- Models need to provide uncertainty estimates to support decision-making in monitoring.
- Integrating classical signal processing with ML can enhance performance and reliability.
- Evaluation protocols should reflect the challenges of transferring models across different regions and conditions.
Read more
Physics-Aware Machine Learning for Seismic and Volcanic Signal Interpretation
Summary
This paper discusses the integration of machine learning (ML) techniques into seismic and volcanic monitoring, emphasizing the need for models that can adapt to changing conditions and provide reliable uncertainty estimates. It highlights the limitations of traditional signal processing methods in the face of evolving data characteristics and the importance of incorporating physical understanding into ML models. The author surveys recent advancements in ML for signal analysis, focusing on the necessity of robust architectures, training strategies, and evaluation protocols that account for domain shifts. The paper concludes by identifying open challenges in developing AI-assisted monitoring systems that are interpretable and maintainable, particularly in the context of nonstationary and noisy wavefields.
Methodology
The paper organizes and surveys various ML approaches for seismic and volcanic signal analysis, discussing the integration of classical signal processing techniques, self-supervised learning, and generative modeling. It emphasizes the importance of uncertainty quantification and the need for evaluation protocols that assess model transferability across different monitoring conditions.
Results
The author identifies key areas where ML can improve seismic and volcanic monitoring, such as event detection, phase picking, and anomaly tracking. The paper highlights the necessity for models that can handle nonstationary data and provides insights into effective training strategies and evaluation methods.
Implications
The findings suggest that a physics-aware approach to ML can significantly enhance the reliability and interpretability of seismic and volcanic monitoring systems, leading to better decision-making in response to geological events. This could improve disaster preparedness and response strategies in affected regions.
Large-Scale 3D Ground-Motion Synthesis with Physics-Inspired Latent Operator Flow Matching
Generative Models
Time Series
Efficient ML
- GMFLOW achieves a 10,000-fold speedup in generating ground-motion time histories compared to traditional methods.
- The framework operates in two stages: low-frequency wavefield generation followed by high-resolution reconstruction.
- GMFLOW supports zero-shot super-resolution, allowing for flexible spatial resolution in predictions.
- The method is validated on simulated earthquake scenarios in the San Francisco Bay Area.
Read more
Large-Scale 3D Ground-Motion Synthesis with Physics-Inspired Latent Operator Flow Matching
Summary
This paper presents Ground-Motion Flow (GMFLOW), a novel framework for generating large-scale, realistic ground-motion time histories for earthquake hazard analysis. Traditional methods for producing these time histories, such as physics-based simulations, are computationally intensive and impractical for engineering applications. GMFLOW leverages a physics-inspired latent operator flow matching approach to generate spatially coherent ground motions conditioned on physical parameters. The framework operates in two stages: first, it generates a low-frequency wavefield capturing the large-scale spatiotemporal structure of earthquake shaking, and then it reconstructs a full-band wavefield from this low-frequency representation. This method achieves a remarkable 10,000-fold speedup over conventional simulation workflows, allowing for rapid generation of ground motions across over 9 million grid points. GMFLOW's design also supports zero-shot super-resolution, enabling predictions at finer spatial resolutions than those used during training. The framework not only enhances the efficiency of seismic hazard assessments but also has the potential to be adapted for modeling other large-scale spatiotemporal phenomena, thereby advancing risk mitigation and infrastructure resilience.
Methodology
GMFLOW employs a two-stage generative modeling approach. The first stage uses an autoencoding operator to map low-frequency wavefields into a latent space, followed by a conditional flow-matching model to generate spatially coherent regional realizations. The second stage utilizes a discretization-agnostic super-resolution neural operator to reconstruct the full-band wavefield from the low-frequency representation.
Results
GMFLOW demonstrated unprecedented computational scalability, generating realistic ground motions across a vast spatial domain in seconds. The framework effectively captured the complex multiscale physics of large-magnitude earthquakes and enabled the generation of large ensembles of scenario-consistent ground motions.
Implications
The advancements presented in GMFLOW can significantly improve the efficiency of seismic hazard assessments, allowing for rapid and uncertainty-aware evaluations of infrastructure resilience. Furthermore, the framework's flexibility and scalability may facilitate its application to other scientific domains involving large-scale spatiotemporal phenomena.
Online Learning and Equilibrium Computation with Ranking Feedback
Theory
Optimization
- Introduces a ranking-based online learning model that does not rely on numeric utility feedback.
- Establishes that sublinear regret is unattainable with instantaneous utility rankings.
- Develops algorithms achieving sublinear regret under specific conditions related to utility variation.
- Demonstrates that the algorithms lead to approximate coarse correlated equilibria in game-theoretic settings.
Read more
Online Learning and Equilibrium Computation with Ranking Feedback
Summary
This paper explores a novel online learning framework where feedback is provided in the form of rankings rather than numeric utilities, addressing scenarios where numeric feedback is unavailable or restricted due to privacy concerns. The authors investigate two types of ranking mechanisms: one based on instantaneous utility and another based on time-average utility. They establish that achieving sublinear regret is impossible under instantaneous utility rankings and remains unattainable under time-average rankings when the ranking model is overly deterministic. To overcome these limitations, the authors propose new algorithms that achieve sublinear regret under the assumption of sublinear total variation in utility sequences. Notably, this assumption can be relaxed in full-information settings. The paper also connects these findings to equilibrium computation in game theory, demonstrating that when all players employ their algorithms in repeated normal-form games, the resulting play approaches a coarse correlated equilibrium. The effectiveness of the proposed algorithms is validated through an application in online large-language-model routing.
Methodology
The authors analyze two ranking feedback mechanisms and derive theoretical results regarding regret minimization. They develop algorithms that utilize these rankings to achieve sublinear regret under certain assumptions and validate their approach through experimental studies.
Results
The paper shows that sublinear regret is impossible with instantaneous utility rankings and under certain conditions with time-average utility rankings. New algorithms are proposed that achieve sublinear regret when the utility sequence has sublinear total variation, and this assumption can be removed in full-information settings. The algorithms also yield approximate coarse correlated equilibria in repeated normal-form games.
Implications
The findings suggest that ranking feedback can be a viable alternative to numeric feedback in online learning scenarios, particularly in human-in-the-loop systems. The results have potential applications in various domains, including recommendation systems and game-theoretic platforms.
AGRI-Fidelity: Evaluating the Reliability of Listenable Explanations for Poultry Disease Detection
Audio & Speech
Interpretability
- AGRI-Fidelity introduces a reliability-centered evaluation paradigm for XAI in bioacoustic disease detection.
- The framework integrates cross-model consensus with fidelity-based causal validation to quantify explanation stability.
- A novel permutation-based null construction is developed to estimate empirical False Discovery Rates, suppressing irrelevant artifacts.
- Extensive experiments show that AGRI-Fidelity consistently differentiates between meaningful signals and domain-irrelevant artifacts.
Read more
AGRI-Fidelity: Evaluating the Reliability of Listenable Explanations for Poultry Disease Detection
Summary
The paper introduces AGRI-Fidelity, a novel evaluation framework designed to assess the reliability of listenable explanations in poultry disease detection systems. Traditional explainable AI (XAI) metrics focus on faithfulness to a single model, neglecting the variability and potential unreliability of explanations across different models. In the context of noisy farm environments, where background artifacts can mislead models, AGRI-Fidelity addresses these shortcomings by combining cross-model consensus with cyclic temporal permutation to create null distributions and compute a False Discovery Rate (FDR). This approach effectively suppresses stationary artifacts while preserving relevant bioacoustic signals. The authors demonstrate that AGRI-Fidelity outperforms standard masking-based metrics in reliably distinguishing genuine acoustic markers from spurious ones, thereby enhancing trust in AI systems used for animal health monitoring.
Methodology
AGRI-Fidelity employs a consensus-based stability score across multiple model families to evaluate explanation consistency. It integrates stability and fidelity into a single reliability score and uses cyclic temporal shifts to create a null distribution for estimating False Discovery Rates, effectively distinguishing between relevant and irrelevant signals in bioacoustic data.
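The permutation null can be sketched compactly: cyclic time shifts of a saliency map preserve stationary background structure but destroy alignment with a localized event, so the fidelity of shifted explanations forms a null distribution against which the true explanation is tested. The toy `model_confidence` function and saliency below are stand-ins; repeating this test across many candidate explanations and applying a step-up correction such as Benjamini-Hochberg yields the empirical false-discovery control described above.

```python
# Sketch of the cyclic-permutation null: shifted saliency no longer aligns
# with the genuine event, so masking by shifted explanations gives a null
# distribution of fidelity scores. All components here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
T = 256
x = 0.1 * rng.normal(size=T)
x[100:110] += 2.0                          # genuine localized acoustic event

def model_confidence(sig: np.ndarray) -> float:
    # Hypothetical classifier stand-in: responds to signal energy.
    return float(np.mean(sig ** 2))

def fidelity(sig: np.ndarray, saliency: np.ndarray, top: int = 10) -> float:
    keep = np.argsort(saliency)[-top:]     # mask *out* the top-salient bins
    masked = sig.copy(); masked[keep] = 0.0
    return model_confidence(sig) - model_confidence(masked)  # confidence drop

saliency = np.abs(x)                       # toy explanation
obs = fidelity(x, saliency)
# Cyclic shifts preserve stationary structure but break event alignment.
null = [fidelity(x, np.roll(saliency, int(rng.integers(1, T))))
        for _ in range(500)]
p = (1 + np.sum(np.array(null) >= obs)) / (1 + len(null))
print(f"fidelity={obs:.3f}, empirical p={p:.4f}")
```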
Results
The experiments conducted on real-world and controlled poultry vocalization datasets demonstrate that AGRI-Fidelity achieves reliable differentiation between genuine bioacoustic markers and irrelevant artifacts, outperforming traditional masking-based evaluation metrics. The results indicate stable committee-level behavior across different model initializations.
Implications
The AGRI-Fidelity framework has significant implications for the deployment of AI systems in agricultural settings, particularly in enhancing the interpretability and trustworthiness of models used for animal health monitoring. By providing reliable explanations, it can facilitate expert validation and adoption of AI technologies in livestock management.
From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning
NLP
Large Language Models
- Introduction of AISA-AR-FunctionCall framework for Arabic function calling.
- Significant reduction in parse failures and improvement in function name accuracy.
- Error analysis reveals distinct challenges in serialization stability and reasoning.
- Exploration of reasoning-augmented LoRA variant for enhanced decision-making.
Read more
From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning
Summary
The paper presents AISA-AR-FunctionCall, a framework designed to enhance the reliability of function-calling language models for Arabic, addressing significant structural instability issues. The authors utilize a 270M-parameter FunctionGemma backbone and implement a series of systematic improvements including dataset auditing, schema repair, and tool-aware prompt restructuring. The fine-tuning process leads to a dramatic reduction in parse failures from 87% to below 1%, while also improving function name accuracy by over eight times and enhancing argument alignment across various Arabic dialects and domains. The study highlights the transition from structural collapse to semantic misalignment through error analysis, indicating that serialization stability and decision-level reasoning are distinct challenges. Additionally, a reasoning-augmented LoRA variant is explored, which introduces intermediate reasoning steps before tool invocation. The authors emphasize the importance of a systems-level approach, framing their work within the AISA architecture to ensure reproducibility, auditability, and production readiness for Arabic agentic systems. All datasets and models are publicly available under the AISA framework.
Methodology
The authors employed systematic dataset auditing, schema repair, tool-aware prompt restructuring, and full-parameter supervised fine-tuning on the FunctionGemma backbone to create a reliable function-calling framework for Arabic. They also introduced reasoning supervision in a reason-before-call format to enhance model performance.
Results
The fine-tuning process resulted in a reduction of parse failures from 87% to below 1%, an increase in function name accuracy by more than eightfold, and improved argument alignment across dialects and domains. The error analysis indicated a shift from structural issues to semantic misalignment.
Implications
The findings suggest that the AISA-AR-FunctionCall framework can significantly improve the reliability of Arabic language models in executing structured actions, paving the way for more effective agentic AI systems in Arabic-speaking contexts. The public availability of datasets and models encourages further research and development in this area.
Towards Infinitely Long Neural Simulations: Self-Refining Neural Surrogate Models for Dynamical Systems
Generative Models
Theory
Time Series
- Introduces a mathematical framework for balancing short-term accuracy and long-term consistency in neural simulations.
- Proposes the Self-refining Neural Surrogate model (SNS) as a hyperparameter-free solution.
- Demonstrates the effectiveness of SNS in high-fidelity simulations of dynamical systems over long time horizons.
- Addresses the issue of distribution drift in autoregressive models, enhancing their reliability.
Read more
Towards Infinitely Long Neural Simulations: Self-Refining Neural Surrogate Models for Dynamical Systems
Summary
This paper addresses the limitations of autoregressive neural surrogate models in simulating dynamical systems, particularly the issue of distribution drift that leads to degraded performance over long time horizons. The authors propose a unifying mathematical framework that explicitly represents the trade-off between short-term accuracy and long-term consistency, which has traditionally been managed through heuristic hyperparameter tuning. They introduce the Self-refining Neural Surrogate model (SNS), a conditional diffusion model designed to dynamically balance the trade-off between conditional approximation error and out-of-distribution (OOD) error. SNS can function independently to refine its outputs or complement existing models to enhance long-term performance. The authors demonstrate the numerical feasibility of SNS through high-fidelity simulations of complex dynamical systems over extended time periods, showcasing its potential for significantly improving simulation quality.
Methodology
The authors develop a multi-noise-level denoising oracle that formalizes the trade-off between conditional approximation error and OOD error. SNS is implemented as a conditional diffusion model trained on a denoising diffusion probabilistic model (DDPM) objective. The model dynamically adjusts the noise level during inference to optimize performance.
Results
The SNS model successfully maintains high fidelity in simulations of complex dynamical systems over arbitrarily long time horizons, outperforming traditional autoregressive models by mitigating the effects of distribution drift and enhancing long-term consistency.
Implications
The proposed SNS model has significant implications for various fields that rely on accurate simulations of dynamical systems, including climate modeling, weather forecasting, and robotics. Its ability to maintain performance over long time periods could lead to more reliable predictions and analyses in these domains.
Approximate Subgraph Matching with Neural Graph Representations and Reinforcement Learning
Reinforcement Learning
Graph Learning
Optimization
- RL-ASM is the first approach to apply reinforcement learning to approximate subgraph matching.
- The method utilizes a Graph Transformer to fully exploit graph information for improved matching.
- RL-ASM optimizes node pair selection based on long-term rewards rather than greedy heuristics.
- Extensive experiments show RL-ASM outperforms traditional ASM techniques in various scenarios.
Read more
Approximate Subgraph Matching with Neural Graph Representations and Reinforcement Learning
Summary
This paper addresses the problem of Approximate Subgraph Matching (ASM), which is crucial for various applications in graph analysis but is known to be NP-hard. Traditional methods often rely on heuristic search strategies that fail to fully leverage graph information, resulting in sub-optimal solutions. The authors propose a novel Reinforcement Learning based Approximate Subgraph Matching (RL-ASM) algorithm that utilizes Graph Transformers to extract comprehensive graph representations and employs reinforcement learning policies for ASM. The RL-ASM model is built on a branch-and-bound algorithm that iteratively selects node pairs from the query and target graphs for potential matches. Unlike heuristic methods, RL-ASM optimizes the selection process through a trained neural network agent that maximizes long-term rewards. The training involves an imitation learning phase guided by supervised signals, followed by fine-tuning using Proximal Policy Optimization (PPO). The experimental results demonstrate that RL-ASM significantly outperforms existing ASM methods in both effectiveness and efficiency across synthetic and real-world datasets.
Methodology
The RL-ASM algorithm employs a Graph Transformer to extract feature representations from the input graphs. It uses a branch-and-bound approach in which a trained neural agent, rather than a fixed heuristic, selects the next node pair to match. The training process includes an imitation learning phase followed by fine-tuning with Proximal Policy Optimization (PPO) to optimize the agent's policy for long-term rewards.
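The toy sketch below shows how a learned scorer can replace a greedy heuristic inside a branch-and-bound loop. The `policy` and `pair_cost` callables and the loop structure are illustrative assumptions, not the paper's algorithm.

```python
import torch

def rl_guided_match(h_q, h_t, policy, pair_cost, max_cost):
    """h_q: (Nq, d) query-node embeddings; h_t: (Nt, d) target-node embeddings.
    policy(hu, hv) -> scalar score; pair_cost(u, v, mapping) -> cost lower bound."""
    mapping = {}
    while len(mapping) < h_q.shape[0]:
        # Enumerate currently unmatched (query, target) node pairs.
        frontier = [(u, v) for u in range(h_q.shape[0]) if u not in mapping
                    for v in range(h_t.shape[0]) if v not in mapping.values()]
        if not frontier:
            break
        with torch.no_grad():
            # The learned scorer replaces a hand-crafted selection heuristic.
            scores = torch.stack([policy(h_q[u], h_t[v]) for u, v in frontier])
        u, v = frontier[int(scores.argmax())]
        if pair_cost(u, v, mapping) > max_cost:   # bound: prune this branch
            break
        mapping[u] = v                            # branch: commit the pair
    return mapping
```

During training, the argmax selection would be replaced by sampling from the policy so that PPO can optimize the pair-selection distribution against long-term rewards.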
Results
The experiments conducted on both synthetic and real-world datasets indicate that RL-ASM achieves superior performance compared to existing ASM methods, demonstrating improvements in both matching effectiveness and computational efficiency.
Implications
The proposed RL-ASM method has significant implications for various fields that rely on graph analysis, including database systems, network science, biochemistry, and privacy protection. Its ability to handle noisy data makes it particularly valuable for real-world applications where exact matching is often infeasible.
Enactor: From Traffic Simulators to Surrogate World Models
Generative Models
Reinforcement Learning
Robotics
- Introduces a transformer-based generative model for traffic simulation.
- Addresses limitations of traditional microsimulators in capturing realistic actor interactions.
- Demonstrates improved performance in generating long-horizon, physically consistent trajectories.
- Requires fewer training samples compared to traditional agent-centric approaches.
Read more
Enactor: From Traffic Simulators to Surrogate World Models
Summary
The paper presents 'Enactor', a novel actor-centric generative model that leverages a transformer-based architecture to improve the simulation of traffic dynamics, particularly at intersections. Traditional traffic microsimulators like SUMO often rely on simplistic behavior models that fail to accurately capture complex actor interactions and the dynamics of traffic environments. The authors propose a solution inspired by the World Model paradigm, which allows for the generation of physically grounded trajectories based on learned behaviors from data. The model is tested in a 'simulation-in-the-loop' framework, where initial conditions are generated using SUMO, and the model controls actor dynamics over 40,000 timesteps. Results indicate that Enactor effectively captures complex interactions and generates long-horizon, consistent trajectories while requiring significantly fewer training samples than existing methods. The framework outperforms baseline models in various traffic engineering metrics, demonstrating its potential for realistic traffic simulation and planning.
Methodology
The authors developed an actor-centric generative model using a transformer architecture, which captures actor-actor interactions and the geometric context of traffic intersections. The model was tested in a simulation-in-the-loop setting, where it controlled actor dynamics based on initial conditions generated by SUMO.
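A minimal sketch of such a simulation-in-the-loop rollout appears below; `world_model` and `encode_scene` are hypothetical stand-ins for Enactor's transformer and its scene encoding, and the loop structure is an assumption about how the handoff from SUMO works.

```python
import torch

@torch.no_grad()
def rollout(init_states, geometry, world_model, encode_scene, n_steps=40_000):
    """init_states: (A, d) per-actor states from a SUMO warm-up phase."""
    states, trajectory = init_states, [init_states]
    for _ in range(n_steps):
        # The transformer attends jointly over all actor tokens plus the
        # static intersection geometry, capturing actor-actor interactions.
        tokens = encode_scene(states, geometry)
        states = world_model(tokens)        # predicted next per-actor states
        trajectory.append(states)
    return torch.stack(trajectory)          # (n_steps + 1, A, d)
```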
Results
The experimental results show that the proposed framework effectively captures complex interactions and generates long-horizon, physically consistent trajectories. It outperformed baseline models on most metrics, including aggregate speed and travel-time metrics, achieving an over-10x improvement in KL divergence.
Implications
The findings suggest that Enactor can significantly enhance traffic simulation accuracy, providing a more reliable tool for urban planning and traffic management. Its ability to generate realistic trajectories with fewer data requirements could facilitate broader applications in autonomous driving and smart city initiatives.
From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models
Multimodal
Robotics
Efficient ML
- Conventional efficiency metrics do not capture the embodied performance of VLA models.
- Embodied efficiency metrics reveal hidden performance differences in learned action policies.
- Common adaptation methods yield limited improvements and may introduce trade-offs.
- Reducing computational costs does not guarantee improved embodied execution efficiency.
Read more
From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models
Summary
This paper critiques the current efficiency metrics used in Vision-Language-Action (VLA) models, which typically focus on parameters, FLOPs, or decoding throughput. The authors argue that these metrics do not accurately reflect real-world performance on robotic platforms. Instead, they propose that embodied efficiency, characterized by system-level behaviors such as task completion time, trajectory smoothness, and energy consumption, should be prioritized. Through controlled studies involving model compression, token sparsification, and action sequence compression, the authors reveal that conventional efficiency improvements can lead to increased execution costs and degraded motion quality. They find that while traditional adaptation methods yield some improvements in specific embodied efficiency metrics, these gains often come with trade-offs in other areas, such as longer completion times. The findings suggest that a shift towards embodied efficiency metrics is necessary for a more comprehensive evaluation of VLA models, enabling better comparisons and practical applications in real-world scenarios.
Methodology
The authors conducted controlled studies across three domains of VLA design: model compression (weight pruning and quantization), token sparsification (visual token pruning), and action sequence compression (reducing temporal redundancy). They evaluated the impact of these techniques using multiple embodied efficiency metrics.
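The snippet below sketches embodied efficiency metrics of the kind the paper advocates, computed from an executed trajectory. The precise metric definitions here are assumptions and may differ from those used in the paper.

```python
import numpy as np

def embodied_metrics(positions, dt):
    """positions: (T, 3) executed end-effector path sampled every dt seconds."""
    vel = np.diff(positions, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    jerk = np.diff(acc, axis=0) / dt
    return {
        "completion_time_s": (len(positions) - 1) * dt,    # wall-clock task time
        "path_length_m": np.linalg.norm(np.diff(positions, axis=0), axis=1).sum(),
        "mean_jerk": np.linalg.norm(jerk, axis=1).mean(),  # smoothness proxy
        # Kinetic-energy proxy (unit mass); a real robot would integrate joint torques.
        "energy_proxy_J": 0.5 * (np.linalg.norm(vel, axis=1) ** 2).sum() * dt,
    }
```

Metrics like these are properties of the executed motion rather than of the network, which is why a pruned or quantized model can score well on FLOPs yet still complete tasks more slowly or less smoothly.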
Results
The analysis showed that methods aimed at reducing computational costs often resulted in increased task completion times and decreased motion quality, despite maintaining task success rates. For instance, models with different levels of weight pruning exhibited significant differences in execution efficiency, highlighting the inadequacy of traditional metrics.
Implications
The findings emphasize the need for a paradigm shift in evaluating VLA models, advocating for the incorporation of embodied efficiency metrics to better reflect real-world performance. This could lead to more effective deployment of VLA models in practical applications, particularly in robotics.
FlowMS: Flow Matching for De Novo Structure Elucidation from Mass Spectra
Generative Models
Graph Learning
- FlowMS is the first discrete flow matching framework for de novo molecular generation from mass spectra.
- It achieves state-of-the-art performance on 5 out of 6 metrics on the NPLIB1 benchmark.
- FlowMS demonstrates a 9.15% top-1 accuracy, surpassing previous models like DiffMS and MS-BART.
- The framework effectively enforces chemical formula constraints during molecular generation.
Read more
FlowMS: Flow Matching for De Novo Structure Elucidation from Mass Spectra
Summary
This paper presents FlowMS, a novel discrete flow matching framework designed for de novo molecular structure elucidation from mass spectrometry (MS) data. The challenge of reconstructing molecular structures from mass spectra is compounded by the combinatorial complexity of chemical space and the ambiguity of spectral fragmentation patterns. While previous deep learning approaches have made strides in this area, they often face computational limitations. FlowMS addresses these challenges by generating molecular graphs through iterative refinement in probability space, conditioned on spectral embeddings obtained from a pretrained formula transformer encoder. The framework enforces chemical formula constraints during the generation process. Experimental results on the NPLIB1 benchmark demonstrate that FlowMS achieves state-of-the-art performance on five out of six evaluation metrics, including a top-1 accuracy of 9.15%, which represents a 9.7% relative improvement over the previous best method, DiffMS. Additionally, FlowMS shows enhanced structural similarity metrics compared to MS-BART. The visualizations of generated molecules indicate that FlowMS produces structurally plausible candidates that closely resemble ground truth structures, highlighting the potential of discrete flow matching for mass spectrometry-based structure elucidation in fields such as metabolomics and natural product discovery.
Methodology
FlowMS employs a discrete flow matching approach that utilizes linear interpolation noising and continuous-time Markov chain (CTMC) denoising. It integrates a spectrum encoder based on a pretrained formula transformer to condition the generation of molecular graphs on spectral embeddings. The model is trained using an encoder-decoder pretraining strategy on large-scale fingerprint-molecule pairs.
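As a rough sketch of the linear-interpolation noising step, the code below corrupts discrete tokens toward a uniform prior and trains a model to recover them. The tokenization, the conditioning interface, and the cross-entropy loss form are simplifying assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def noise_tokens(x1, t, vocab_size):
    """x1: (B, N) clean discrete tokens (e.g. atom/bond types); t: (B,) in [0, 1].
    Linear-interpolation path: each token stays clean with probability t,
    otherwise it is resampled from a uniform prior."""
    keep = torch.rand(x1.shape) < t[:, None]
    x0 = torch.randint(0, vocab_size, x1.shape)
    return torch.where(keep, x1, x0)

def flow_matching_loss(model, x1, cond, vocab_size):
    t = torch.rand(x1.shape[0])                 # random interpolation time per example
    xt = noise_tokens(x1, t, vocab_size)
    logits = model(xt, t, cond)                 # (B, N, vocab) clean-token posterior
    return F.cross_entropy(logits.reshape(-1, vocab_size), x1.reshape(-1))
```

Only the training side is sketched here; generation would instead simulate the CTMC denoising process from t = 0 to t = 1 using the model's clean-token posterior, with the chemical formula constraints applied during that iterative refinement.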
Results
FlowMS achieved a top-1 accuracy of 9.15% and a top-10 MCES of 7.96 on the NPLIB1 benchmark, outperforming DiffMS and MS-BART in multiple metrics. The results indicate significant improvements in both accuracy and structural similarity, validating the effectiveness of the proposed framework.
Implications
The introduction of FlowMS suggests a promising new direction for mass spectrometry-based structure elucidation, potentially enhancing the discovery of new compounds in metabolomics and natural product research. Its ability to generate plausible molecular structures from spectral data could streamline the identification process in various chemical and biological applications.
Procedural Generation of Algorithm Discovery Tasks in Machine Learning
Optimization
Reinforcement Learning
Theory
- DiscoGen generates over 400 million diverse algorithm discovery tasks, addressing limitations of existing benchmarks.
- The framework ensures principled evaluation by separating meta-train and meta-test datasets.
- DiscoBench provides a curated set of tasks for evaluating algorithm discovery agents (ADAs).
- The methodology supports an ADA optimization loop, enhancing the iterative development of algorithms.
Read more
Procedural Generation of Algorithm Discovery Tasks in Machine Learning
Summary
This paper introduces DiscoGen, a procedural generator designed to create a vast array of algorithm discovery tasks in machine learning. The authors argue that existing benchmarks for algorithm discovery agents (ADAs) are limited due to issues such as data contamination, poor evaluation methodologies, and a lack of diverse problems. DiscoGen addresses these limitations by generating over 400 million unique tasks across various machine learning domains, ensuring that the evaluation of ADAs is principled through distinct meta-train and meta-test datasets. The paper also presents DiscoBench, a curated benchmark of tasks for evaluating ADAs. The authors demonstrate DiscoGen's utility through experiments that optimize an ADA for prompt optimization, highlighting its potential to facilitate new research directions in algorithm discovery. The open-source release of DiscoGen is intended to encourage further exploration and development in this area.
Methodology
The authors developed DiscoGen as a procedural generator that creates algorithm discovery tasks by specifying a small number of configuration parameters. This modular approach allows for the generation of tasks across various machine learning subfields, ensuring diversity and complexity. The evaluation framework includes distinct meta-train and meta-test datasets to prevent data contamination and overfitting, while DiscoBench serves as a benchmark for evaluating ADAs.
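The sketch below illustrates the general pattern of config-driven task generation with a contamination-free meta-split. The configuration fields and hashing scheme are invented for illustration and do not reflect DiscoGen's actual parameters.

```python
import hashlib
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskConfig:
    domain: str        # e.g. "tabular_classification"
    n_features: int
    n_classes: int
    noise: float
    seed: int

def make_task(cfg: TaskConfig, n_examples: int = 256):
    """Deterministically materialize a toy dataset from a config."""
    rng = random.Random(cfg.seed)
    X = [[rng.gauss(0, 1) for _ in range(cfg.n_features)] for _ in range(n_examples)]
    y = [rng.randrange(cfg.n_classes) for _ in range(n_examples)]  # placeholder labels
    return {"config": cfg, "X": X, "y": y}

def meta_split(cfg: TaskConfig) -> str:
    # Hash the config so the meta-train/meta-test assignment is stable and
    # disjoint, preventing contamination between the two pools.
    h = int(hashlib.sha256(repr(cfg).encode()).hexdigest(), 16)
    return "meta-test" if h % 10 == 0 else "meta-train"
```

Because tasks are fully determined by their configs, enumerating the configuration space yields the large, reproducible task pool, and the split function guarantees that an ADA tuned on meta-train is never evaluated on a task it has seen.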
Results
The experiments conducted using DiscoGen demonstrated its effectiveness in optimizing an ADA for prompt optimization tasks. The results indicated that the procedural generation of tasks could significantly enhance the evaluation and development of algorithm discovery agents, leading to improved performance and novel algorithmic solutions.
Implications
DiscoGen has the potential to revolutionize the field of algorithm discovery by providing a scalable and diverse set of tasks for training and evaluating ADAs. This could lead to breakthroughs in automating machine learning algorithm development, ultimately accelerating research and innovation in the field.
Contextual Preference Distribution Learning
Optimization
- Introduces a method for learning context-dependent preference distributions in decision-making problems.
- Utilizes a sequential learning-and-optimization pipeline to address human preference uncertainty.
- Achieves maximum likelihood estimates with desirable statistical properties.
- Demonstrates substantial reductions in post-decision surprise in a ridesharing context.
Read more
Contextual Preference Distribution Learning
Summary
This paper addresses the challenge of decision-making under uncertainty due to heterogeneous and context-dependent human preferences. The authors propose a sequential learning-and-optimization pipeline that learns preference distributions from observed choices and utilizes these distributions to minimize risk in downstream decision-making problems, particularly in integer linear programming (ILP) contexts. Existing methods often yield point estimates or fail to account for contextual variations, making them inadequate for risk-averse scenarios. The proposed method employs a bounded-variance score function gradient estimator to train a predictive model that maps contextual features to parameterized preference distributions, yielding maximum likelihood estimates. The authors demonstrate their approach in a synthetic ridesharing environment, showing significant reductions in average post-decision surprise compared to both risk-neutral approaches and leading risk-averse baselines.
Methodology
The authors develop a sequential learning-and-optimization framework that first predicts human preference distributions based on contextual features. They employ a bounded-variance score function gradient estimator to train a model that generates maximum likelihood estimates of these distributions. This model is then used to create scenarios for unseen contexts during the optimization phase, allowing for risk-averse decision-making.
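A minimal sketch of a score-function gradient for fitting a context-conditional choice model appears below. The Gaussian parameterization, the argmax choice model, and the simple mean baseline are illustrative assumptions; the paper's bounded-variance estimator is more refined than this sketch.

```python
import torch

def preference_loss(net, context, options, chosen_idx, n_samples=64):
    """context: (d,) features; options: (K, d) item features; chosen_idx: int.
    net maps context to Gaussian parameters over utility weights
    (same dimension as the option features)."""
    mu, log_sigma = net(context).chunk(2, dim=-1)
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    w = dist.sample((n_samples,))                  # sampled preference weights
    utilities = w @ options.T                      # (n_samples, K)
    # Reward: did the sampled preferences reproduce the observed choice?
    reward = (utilities.argmax(-1) == chosen_idx).float()
    log_prob = dist.log_prob(w).sum(-1)            # gradient flows through the density
    baseline = reward.mean()                       # simple variance-reduction baseline
    # Score-function estimate of the negative log-likelihood gradient
    return -((reward - baseline) * log_prob).mean()
```

The score-function estimator is what makes this trainable even though the choice (an argmax, or in the paper's setting an ILP solve) is non-differentiable: gradients flow through the distribution's density rather than through the decision itself.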
Results
In experiments conducted within a synthetic ridesharing environment, the proposed approach reduces average post-decision surprise by up to 114 times compared to a risk-neutral approach with perfect predictions, and by up to 25 times compared to existing leading risk-averse baselines.
Implications
The findings suggest that incorporating context-dependent preference distributions can significantly enhance decision-making processes in environments characterized by human choice variability. This methodology could be applied to various domains, including transportation, finance, and any area where human preferences impact outcomes.