AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
- 59 papers today
- Updated every 8 hours
- 7 days of history
Off-Policy Learning with Limited Supply
Reinforcement Learning
Theory
Optimization
- Conventional greedy OPL methods are suboptimal in limited supply scenarios.
- The paper introduces OPLS, which focuses on relative expected rewards for better item allocation.
- Theoretical analysis proves the existence of superior policies in limited supply settings.
- OPLS does not incur additional computational costs compared to existing methods.
Summary
This paper addresses the challenges of off-policy learning (OPL) in contextual bandits when items are subject to limited supply, a common scenario in applications like recommendation systems and online advertising. Traditional OPL methods assume an unconstrained environment where items can be selected infinitely, which can lead to suboptimal performance in real-world situations where items may run out. The authors provide a theoretical analysis demonstrating that conventional greedy approaches can fail to maximize policy performance under these constraints. They introduce a novel method called Off-Policy Learning with Limited Supply (OPLS), which prioritizes allocations by how much a user's expected reward for an item exceeds that item's average expected reward across users, rather than simply selecting the item with the highest expected reward for each user. This method allows for more efficient allocation of limited resources. Empirical results on both synthetic and real-world datasets indicate that OPLS outperforms existing OPL methods in scenarios with limited supply, highlighting its effectiveness in improving policy performance in constrained environments.
Methodology
The authors analyze the problem of OPL with limited supply theoretically and propose the OPLS method, which selects items based on the relative reward gap. This gap is calculated as a user's expected reward minus the average expected reward across all users, allowing for prioritization of users who can generate higher rewards. The method is validated through empirical experiments on synthetic and real-world datasets.
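The relative-reward-gap idea can be sketched in a few lines. This is a toy illustration under our own assumptions (random reward estimates, per-item supply counts, a greedy gap-ordered assignment), not the paper's algorithm:

```python
import numpy as np

# Toy sketch of the relative reward gap behind OPLS (illustrative, not the
# paper's implementation). Rows = users, columns = items.
rng = np.random.default_rng(0)
n_users, n_items = 6, 3
rewards = rng.uniform(size=(n_users, n_items))   # estimated expected rewards
supply = np.array([2, 2, 2])                     # limited copies per item

# Relative reward gap: a user's expected reward for an item minus the
# average expected reward for that item across all users.
gap = rewards - rewards.mean(axis=0, keepdims=True)

# Allocate each item's limited supply to the users with the largest gap,
# instead of greedily giving every user their individually best item.
allocation = -np.ones(n_users, dtype=int)
for flat in np.argsort(-gap, axis=None):
    u, i = divmod(int(flat), n_items)
    if allocation[u] < 0 and supply[i] > 0:
        allocation[u] = i
        supply[i] -= 1

print(allocation)  # every user gets an item while respecting supply
```

The contrast with the greedy baseline is the ordering: greedy ranks by absolute reward per user, while the gap ranking favors the users for whom a scarce item matters most relative to everyone else.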
Results
The empirical results show that OPLS significantly outperforms traditional OPL methods in contextual bandit problems with limited supply, achieving higher policy performance and demonstrating the effectiveness of the proposed approach.
Implications
The findings suggest that OPLS can be effectively applied in various real-world applications, such as e-commerce and coupon allocation, where item availability is constrained. This could lead to improved user satisfaction and better resource management in recommendation systems and online advertising.
SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
Optimization
Efficient ML
Generative Models
- SOL-ExecBench benchmarks GPU kernels against hardware Speed-of-Light limits rather than software baselines.
- The framework includes 235 optimization problems from diverse AI models, ensuring broad applicability.
- SOL Score quantifies performance improvements relative to analytically derived hardware limits.
- A sandboxed evaluation harness enhances reliability and prevents reward-hacking in kernel optimization.
Summary
The paper introduces SOL-ExecBench, a novel benchmarking framework designed to evaluate GPU kernel optimizations against hardware limits rather than traditional software baselines. It comprises 235 CUDA kernel optimization problems derived from 124 production and emerging AI models across various domains, including language, vision, and audio. The benchmark targets NVIDIA Blackwell GPUs and assesses both forward and backward workloads in different precision formats (BF16, FP8, NVFP4). A key innovation is the use of Speed-of-Light (SOL) bounds, which provide a fixed target for optimization by analytically deriving performance limits based on hardware capabilities. The SOL Score quantifies the extent to which a kernel closes the gap between a baseline score and the SOL bounds, promoting a focus on hardware-efficient execution. Additionally, the framework includes a sandboxed evaluation environment to ensure reliable assessments and mitigate reward-hacking strategies. This approach reframes GPU kernel benchmarking, emphasizing the importance of approaching hardware limits in the context of rapidly evolving AI workloads and GPU architectures.
Methodology
The authors developed SOL-ExecBench by extracting computational subgraphs from AI models and curating benchmark problems. They implemented a pipeline called SOLAR to analytically derive Speed-of-Light bounds based on FLOP counts and GPU throughput. The evaluation process includes a sandboxed environment to ensure reproducibility and mitigate optimization manipulation.
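The bound-and-score arithmetic can be illustrated with a roofline-style sketch. The formulas and names below are our assumptions for exposition; the paper's SOLAR pipeline derives its bounds analytically per kernel from FLOP counts and hardware throughput:

```python
# Illustrative sketch of a Speed-of-Light bound and a gap-closing score
# (assumed formulas, not SOL-ExecBench's exact definitions).

def sol_time(flops: float, bytes_moved: float,
             peak_flops: float, peak_bw: float) -> float:
    """Roofline-style lower bound: a kernel can be no faster than the
    slower of its compute-bound and memory-bound minimum times."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

def sol_score(t_baseline: float, t_kernel: float, t_sol: float) -> float:
    """Fraction of the baseline-to-SOL gap the kernel closes
    (1.0 means the kernel reaches the hardware limit)."""
    return (t_baseline - t_kernel) / (t_baseline - t_sol)

t_sol = sol_time(flops=2e12, bytes_moved=1e9, peak_flops=2e15, peak_bw=8e12)
print(sol_score(t_baseline=4e-3, t_kernel=2e-3, t_sol=t_sol))  # ~0.667
```

The key property is that the target is fixed by hardware, so the score does not inflate as software baselines improve.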
Results
The introduction of SOL-ExecBench provides a comprehensive benchmarking framework that effectively measures GPU kernel performance against hardware limits. The SOL Score offers a new metric for evaluating optimization efforts, highlighting the remaining potential for performance improvements relative to hardware capabilities.
Implications
SOL-ExecBench has significant implications for the development of AI systems that require efficient GPU kernel optimization. By focusing on hardware limits, it can guide future hardware designs and improve the performance of emerging AI models, ultimately enhancing computational efficiency in data centers.
Engineering Verifiable Modularity in Transformers via Per-Layer Supervision
Interpretability
- Introduces per-layer supervision to enhance modularity in transformer models.
- Demonstrates that per-layer supervision leads to significantly larger ablation effects.
- Establishes a methodology for transforming interpretability into active control.
- Validates findings through engineered features and causal experiments.
Summary
This paper addresses the challenge of interpretability in transformer models, which often exhibit a 'Hydra effect' where surgical interventions yield minimal behavioral changes due to redundancy in their architecture. The author proposes a novel approach that combines dual-stream processing, per-layer supervision, and gated attention to expose hidden modularity within transformers. By implementing per-layer supervision, the study demonstrates that models can achieve significantly larger ablation effects (5 to 23 times greater) compared to standard training methods. This enhanced sensitivity allows for greater control over specific behaviors, such as capitalization, by enabling predictable changes in model output when manipulating attention heads. The findings indicate that per-layer supervision not only increases variance in ablation effects but also reveals which predictions depend on specific circuits, thus transforming interpretability from passive observation to active control. The paper validates its approach through engineered features that capture computational dynamics, an architecture that allows for positive control of modularity, and causal experiments that show functional reorganization of tasks across attention heads.
Methodology
The methodology involves three main components: dual-stream processing to separate token and contextual representations, per-layer supervision to provide independent gradient signals at each layer, and gated attention to regularize activation patterns. The study compares two models with identical architecture, one trained with standard objectives and the other with per-layer supervision, to assess the impact on ablation sensitivity and modularity.
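The per-layer supervision component can be sketched as follows. This is a minimal numpy toy (hypothetical readout heads and loss, not the paper's architecture) showing how every layer receives its own loss term rather than only the final layer:

```python
import numpy as np

# Minimal sketch of per-layer supervision: each layer gets its own readout
# and its own loss, so every layer receives an independent gradient signal.
# The transforms and heads here are illustrative stand-ins.
rng = np.random.default_rng(0)
d, n_layers = 8, 3
x, target = rng.normal(size=d), 1.0

layers = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_layers)]
readouts = [rng.normal(scale=0.1, size=d) for _ in range(n_layers)]

h, per_layer_losses = x, []
for W, r in zip(layers, readouts):
    h = np.tanh(W @ h)                     # layer transform
    pred = r @ h                           # per-layer readout head
    per_layer_losses.append((pred - target) ** 2)

# Standard training would optimize only per_layer_losses[-1]; per-layer
# supervision optimizes the sum, supervising every layer directly.
total_loss = sum(per_layer_losses)
print(total_loss)
```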
Results
Models trained with per-layer supervision exhibited a mean ablation effect of 1.15% with a standard deviation of 6.32%, indicating a wide variance and revealing dependencies of predictions on specific circuits. In contrast, control models showed minimal ablation effects (mean 0.05%, standard deviation 0.63%). The supervised model provided four times greater control leverage on targeted behaviors, demonstrating smooth and predictable changes in output when manipulating attention heads.
Implications
The findings suggest that transformer architectures can be engineered for better interpretability and control, potentially leading to more reliable AI systems in applications requiring transparency and accountability. This approach may also inform future research on model design and training strategies that prioritize modularity and interpretability.
Enhancing the Parameterization of Reservoir Properties for Data Assimilation Using Deep VAE-GAN
Generative Models
- Introduces VAE-GAN to enhance parameterization in reservoir data assimilation.
- Addresses limitations of traditional ensemble methods in handling non-Gaussian distributions.
- Demonstrates improved geological plausibility and history matching in reservoir simulations.
- Validates methodology through two distinct case studies with categorical and continuous data.
Summary
This paper addresses the limitations of traditional data assimilation methods in petroleum reservoir simulation, particularly the Ensemble Smoother with Multiple Data Assimilation (ESMDA). The authors highlight the challenges posed by finite ensemble sizes and Gaussian assumptions in parameter and data uncertainties, which are often inadequate for non-Gaussian reservoir properties. To overcome these issues, they propose a novel deep learning model, the Variational Autoencoder Generative Adversarial Network (VAE-GAN), which combines the strengths of Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). The VAE-GAN is designed to enhance the parameterization of reservoir properties, allowing for better representation of non-Gaussian distributions while maintaining geological plausibility. The methodology was tested on two case studies: one involving categorical data and the other with continuous permeability values. Results indicate that the VAE-GAN model successfully produces high-quality reservoir descriptions akin to GANs while achieving effective history matching of production curves similar to VAEs. This dual capability represents a significant advancement in the field of data assimilation for reservoir simulation.
Methodology
The authors developed a deep learning model that integrates Variational Autoencoders and Generative Adversarial Networks to create a VAE-GAN framework. This model is used to parameterize reservoir properties by mapping non-Gaussian parameters to a Gaussian field and vice versa, facilitating effective data assimilation within the ESMDA framework.
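The key mechanism, running the ensemble update in a Gaussian latent space while a generator maps latents to reservoir properties, can be sketched with a standard ESMDA step. The decoder and forward simulator below are placeholders; the paper trains a deep VAE-GAN and uses a real reservoir simulator:

```python
import numpy as np

# Sketch of one ESMDA update performed in a Gaussian latent space, with
# stand-ins for the VAE-GAN generator and the reservoir simulator.
rng = np.random.default_rng(0)
n_ens, n_latent, n_obs = 50, 4, 2
alpha, obs_err = 4.0, 0.1                 # ESMDA inflation and noise std
d_obs = np.array([1.0, -0.5])             # observed production data

def decode(z):                            # placeholder for the generator
    return np.tanh(z)

def forward(m):                           # placeholder reservoir simulator
    return np.stack([m.sum(axis=1), m[:, 0] - m[:, 1]], axis=1)

Z = rng.normal(size=(n_ens, n_latent))    # Gaussian latent ensemble
D = forward(decode(Z))                    # simulated data per member

# Standard ESMDA Kalman-like update, applied to the latent variables so the
# Gaussian assumption holds even when decoded properties are non-Gaussian.
Zc, Dc = Z - Z.mean(0), D - D.mean(0)
C_zd = Zc.T @ Dc / (n_ens - 1)
C_dd = Dc.T @ Dc / (n_ens - 1)
K = C_zd @ np.linalg.inv(C_dd + alpha * obs_err**2 * np.eye(n_obs))
perturbed = d_obs + np.sqrt(alpha) * obs_err * rng.normal(size=(n_ens, n_obs))
Z_updated = Z + (perturbed - D) @ K.T

print(Z_updated.shape)  # updated latents stay in the Gaussian space
```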
Results
The application of the VAE-GAN model resulted in high-quality reservoir descriptions that maintain geological realism and effective history matching of production curves. The findings from both case studies demonstrated the model's capability to handle both categorical and continuous data effectively.
Implications
The proposed VAE-GAN model has the potential to significantly improve data assimilation processes in petroleum reservoir simulations, leading to more accurate predictions and better management of reservoir resources. This approach could be extended to other fields where non-Gaussian distributions are prevalent.
STEP: Scientific Time-Series Encoder Pretraining via Cross-Domain Distillation
Time Series
- STEP framework leverages cross-domain distillation to enhance scientific time-series representation learning.
- Introduces adaptive patching and statistics compensation to handle diverse and heterogeneous scientific signals.
- Demonstrates the transferability of knowledge from foundation models in related domains to improve performance on scientific tasks.
- Achieves strong performance across various scientific time series tasks, indicating its effectiveness as a pretraining paradigm.
Summary
The paper introduces STEP, a framework designed for pretraining scientific time-series encoders through cross-domain distillation. Scientific time series data is characterized by its sparsity, heterogeneity, and limited scale, which complicates unified representation learning. The authors explore the transferability of foundation models from related domains (audio, general time series, and neural signals) and demonstrate their complementary strengths for scientific tasks. STEP employs adaptive patching to manage extreme-length sequences and a statistics compensation scheme to address diverse numerical scales. By leveraging cross-domain distillation, STEP integrates knowledge from multiple foundation models, resulting in a unified encoder that learns general-purpose features tailored for scientific signals. The effectiveness of STEP is validated through experiments on seven scientific time series tasks, showcasing its potential to advance representation learning in scientific AI.
Methodology
The authors systematically evaluate foundation models from related time series domains to understand their transferability and complementary strengths. They propose the STEP encoder, which incorporates adaptive patching for sequence length management and a statistics compensation scheme for numerical scale diversity. The framework utilizes cross-domain distillation to integrate knowledge from multiple foundation models, enhancing the learning of general-purpose features for scientific time series.
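The two preprocessing ideas can be sketched briefly. The implementations below are illustrative assumptions, not the paper's code: adaptive patching bounds the token count for extreme-length sequences, and statistics compensation preserves the scale information that per-sequence normalization would otherwise discard:

```python
import numpy as np

def adaptive_patch(x: np.ndarray, max_patches: int = 128) -> np.ndarray:
    """Choose the patch length so the sequence yields at most max_patches."""
    patch_len = max(1, int(np.ceil(len(x) / max_patches)))
    pad = (-len(x)) % patch_len           # zero-pad to a whole number of patches
    return np.pad(x, (0, pad)).reshape(-1, patch_len)

def stats_compensation(x: np.ndarray):
    """Normalize, but keep (mean, std) as side features for the encoder."""
    mu, sigma = x.mean(), x.std() + 1e-8
    return (x - mu) / sigma, np.array([mu, sigma])

patches = adaptive_patch(np.sin(np.linspace(0, 100, 100_000)))
print(patches.shape)  # patch count stays bounded regardless of length
```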
Results
Experiments conducted on seven scientific time series tasks demonstrate that STEP establishes an effective pretraining paradigm, achieving significant improvements in performance compared to existing methods. The integration of knowledge from diverse domains allows STEP to effectively handle the challenges posed by scientific time series data.
Implications
The findings suggest that STEP can facilitate advancements in scientific AI by providing robust representation learning for time series data across various scientific domains. This could lead to improved modeling, understanding, and prediction of complex scientific phenomena.
NANOZK: Layerwise Zero-Knowledge Proofs for Verifiable Large Language Model Inference
Large Language Models
Theory
Efficient ML
- NANOZK provides a cryptographic verification mechanism for LLM inference, addressing trust issues in LLM-as-a-service.
- The layerwise proof framework allows for independent layer proofs, significantly reducing computational overhead.
- Lookup table approximations for non-arithmetic operations ensure zero accuracy loss during verification.
- Fisher information is used to prioritize layer verification, enhancing efficiency in resource-constrained scenarios.
Summary
The paper introduces NANOZK, a novel zero-knowledge proof system designed to ensure verifiable inference for large language models (LLMs). As users increasingly rely on proprietary LLM APIs, the lack of cryptographic assurance raises concerns about the authenticity of the models being used. NANOZK addresses this issue by allowing users to cryptographically verify that the outputs they receive correspond to the computations of a specific model. The authors leverage the natural layerwise structure of transformer models, enabling independent proofs for each layer, which significantly reduces the computational burden compared to traditional monolithic approaches. This layerwise decomposition not only allows for constant-size proofs regardless of model width but also facilitates parallel proving, resulting in substantial speed improvements. The paper also presents lookup table approximations for non-arithmetic operations, ensuring no degradation in model accuracy. Additionally, a Fisher information-guided verification method is introduced to prioritize layer verification based on importance, optimizing resource usage. The results demonstrate that NANOZK can generate proofs for GPT-2 scale transformers in just 43 seconds with a proof size of 6.9KB and a verification time of 23ms, achieving a 52× speedup over existing systems while maintaining high accuracy.

Methodology
The authors propose a layerwise decomposition of transformer inference, allowing each layer to generate independent, constant-size zero-knowledge proofs. They develop lookup table approximations for non-arithmetic operations and introduce a Fisher information-guided approach for selective layer verification. This modular framework supports parallel proving and efficient resource usage.
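The layerwise decomposition can be illustrated with a toy transcript of hash commitments. To be clear about the hedge: hashes are not zero-knowledge proofs, and the layer function below is a stand-in; the sketch only shows why committing to each layer's input/output pair lets a verifier check layers independently and selectively:

```python
import hashlib
import json

# Toy illustration of layerwise verification via hash commitments
# (NOT zero-knowledge; real NANOZK uses ZK proofs per layer).

def layer(x, w):                      # stand-in for a transformer layer
    return [xi * w for xi in x]

def commit(obj) -> str:
    return hashlib.sha256(json.dumps(obj).encode()).hexdigest()

weights = [2, 3, 5]
x, transcript = [1.0, -1.0], []
for w in weights:
    y = layer(x, w)
    transcript.append({"in": x, "out": y, "commitment": commit([x, y])})
    x = y

# A verifier can check any single layer without re-running the others,
# e.g. prioritizing important layers (Fisher-information-guided in the paper).
step = transcript[1]
assert commit([step["in"], step["out"]]) == step["commitment"]
print("layer 1 checked independently")
```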
Results
NANOZK successfully generates proofs for GPT-2 scale transformers in 43 seconds, with a proof size of 6.9KB and a verification time of 23ms. The system achieves a 52× speedup over the EZKL toolkit, particularly excelling with larger models where memory constraints are an issue. The lookup approximations maintain model perplexity without any measurable accuracy loss.
Implications
The introduction of NANOZK has significant implications for industries relying on LLMs, such as healthcare and legal sectors, where trust and verification of model outputs are critical. It enables users to ensure they receive the promised model capabilities, thereby enhancing the reliability of AI-driven decision-making processes.
Position: Spectral GNNs Are Neither Spectral Nor Superior for Node Classification
Graph Learning
Theory
- Spectral GNNs do not utilize true Fourier bases for graph signals.
- Polynomial approximations in Spectral GNNs are theoretically flawed.
- The performance of GCNs is attributed to message-passing dynamics rather than spectral filtering.
- Empirical success of models like MagNet and HoloNet is linked to implementation issues, not spectral properties.
Summary
This position paper critically examines the theoretical foundations of Spectral Graph Neural Networks (Spectral GNNs) in the context of node classification. The authors argue that these networks, which are often praised for their frequency-domain filtering capabilities, are fundamentally flawed. They identify two main theoretical issues: first, the commonly used 'graph Fourier bases' do not function as true Fourier bases for graph signals; second, the polynomial approximation methods employed in Spectral GNNs are not justified, as they can exactly interpolate spectral responses using (n−1)-degree polynomials. The paper challenges the prevailing notion that the effectiveness of Graph Convolutional Networks (GCNs) is due to spectral low-pass filtering, demonstrating instead that their performance is primarily a result of message-passing dynamics. The authors analyze two specific directed spectral models, MagNet and HoloNet, revealing that their success is not due to spectral mechanisms but rather to implementation issues that align them more closely with simpler Message Passing Neural Networks (MPNNs). Overall, the paper concludes that Spectral GNNs do not effectively capture graph spectra or enhance performance in node classification, and their competitive results can be better explained through their equivalence to MPNNs.
Methodology
The authors employ theoretical analysis to critique the foundations of Spectral GNNs, demonstrating the inadequacies of graph Fourier bases and polynomial approximations. They also analyze the performance of specific spectral models through a comparative lens with MPNNs.
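The interpolation point is easy to check numerically. Given n distinct eigenvalues, a degree-(n−1) polynomial matches any target spectral response exactly, so polynomial "approximation" imposes no real restriction. The eigenvalues below are an arbitrary illustrative choice for a hypothetical graph Laplacian, not taken from the paper:

```python
import numpy as np

# Numerical check: with n distinct eigenvalues, a degree-(n-1) polynomial
# exactly interpolates ANY spectral filter response.
rng = np.random.default_rng(0)
eigvals = np.array([0.0, 0.7, 1.3, 2.4, 3.1])   # distinct Laplacian eigenvalues
n = len(eigvals)

target = rng.uniform(size=n)                    # arbitrary spectral response
coeffs = np.polyfit(eigvals, target, deg=n - 1)
residual = np.max(np.abs(np.polyval(coeffs, eigvals) - target))

print(residual)  # at numerical precision, effectively zero
```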
Results
The paper reveals that Spectral GNNs fail to capture the graph spectrum meaningfully and do not provide reliable performance improvements for node classification. The analysis shows that the effectiveness of certain directed spectral models is due to their implementation rather than their theoretical design.
Implications
This work suggests a need for reevaluation of the theoretical underpinnings of Spectral GNNs and encourages researchers to focus on the message-passing dynamics that may offer more reliable performance in graph-based tasks.
Discovering What You Can Control: Interventional Boundary Discovery for Reinforcement Learning
Reinforcement Learning
Robotics
Theory
- IBD formalizes the Causal Sphere of Influence, distinguishing between causally relevant and confounded dimensions.
- The method requires no learned models and can be applied as a preprocessing step to any RL algorithm.
- Empirical results show that IBD outperforms traditional observational feature selection, especially in high distractor scenarios.
- IBD provides a diagnostic framework to decompose environmental difficulties into representational confusion and exploration challenges.
Summary
This paper addresses the challenge of selecting relevant state dimensions in reinforcement learning (RL) when confounded by distractors. The authors formalize this problem as discovering the agent's Causal Sphere of Influence (SoI) and introduce a novel approach called Interventional Boundary Discovery (IBD). IBD utilizes Pearl's do-operator to apply interventions through the agent's actions, allowing for a clear distinction between causally influenced dimensions and confounded distractors. The method employs two-sample testing to generate a binary mask indicating which observation dimensions are causally relevant. IBD is model-free, lightweight, and can be integrated with any downstream RL algorithm. The empirical evaluation across 12 continuous control tasks reveals that traditional observational feature selection methods often fail, particularly when distractors outnumber relevant features. In contrast, IBD consistently tracks oracle performance, demonstrating its effectiveness in identifying causal dimensions. The framework also provides diagnostic insights into the environment's complexity, helping practitioners identify whether issues arise from representational confusion or exploration bottlenecks.
Methodology
Interventional Boundary Discovery (IBD) applies the do-operator to the agent's actions, performing two-sample testing with multiple-testing correction to produce a binary mask over observation dimensions. This mask indicates which dimensions are causally influenced by the agent's actions, allowing for effective feature selection without the need for learned models.
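The masking procedure can be sketched with a toy environment of our own design (the paper evaluates on continuous control benchmarks, and its two-sample test may differ; a Bonferroni-corrected mean-difference z-test stands in here):

```python
import numpy as np

# Sketch of interventional masking: intervene on the action (do-operator),
# collect next-observation samples under two action settings, and keep the
# dimensions whose distributions differ. Toy environment, not the paper's.
rng = np.random.default_rng(0)
n, d = 2000, 4

def step(action):
    obs = rng.normal(size=(n, d))
    obs[:, 0] += action          # dim 0 is causally controlled
    obs[:, 1] += 0.5 * action    # dim 1 is partially controlled
    return obs                   # dims 2-3 are pure distractors

a0, a1 = step(action=0.0), step(action=2.0)
diff = a1.mean(0) - a0.mean(0)
se = np.sqrt(a0.var(0) / n + a1.var(0) / n)
z = np.abs(diff / se)

alpha = 0.05 / d                 # Bonferroni multiple-testing correction
z_crit = 2.50                    # ~two-sided critical value for alpha = 0.0125
mask = z > z_crit
print(mask)                      # dims 0 and 1 are flagged as controllable
```

The resulting binary mask is exactly the preprocessing artifact described above: it can be handed to any downstream RL algorithm to restrict the observation space.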
Results
The evaluation across 12 continuous control settings shows that IBD closely tracks oracle performance across varying levels of distractors, while traditional full-state RL methods degrade significantly when distractors outnumber relevant features by a ratio of approximately 3:1. IBD also successfully identifies partially controllable dimensions down to about 5% causal variance.
Implications
The findings suggest that IBD can significantly enhance the performance of RL agents in complex environments by improving feature selection and understanding the causal structure of the state space. This has potential applications in robotics and other domains where agents must learn from high-dimensional sensory inputs with irrelevant distractions.
FlowMS: Flow Matching for De Novo Structure Elucidation from Mass Spectra
Generative Models
Graph Learning
- FlowMS is the first discrete flow matching framework for de novo molecular generation from mass spectra.
- It achieves state-of-the-art performance on 5 out of 6 metrics on the NPLIB1 benchmark.
- FlowMS demonstrates a 9.15% top-1 accuracy, surpassing previous methods.
- The framework effectively enforces chemical formula constraints during molecular generation.
Summary
The paper presents FlowMS, a novel discrete flow matching framework designed for de novo molecular structure elucidation from mass spectrometry (MS) data. Traditional methods for interpreting mass spectra face challenges due to the combinatorial complexity of chemical structures and the ambiguity of fragmentation patterns. While recent deep learning approaches have made strides in this area, they often require extensive computational resources. FlowMS leverages discrete flow matching, a technique that has shown promise in graph generation, to generate molecular graphs conditioned on spectral embeddings. The framework incorporates a pretrained formula transformer encoder to enforce chemical formula constraints during the generation process. The authors demonstrate that FlowMS achieves state-of-the-art performance on the NPLIB1 benchmark, outperforming existing methods such as DiffMS and MS-BART in multiple evaluation metrics. Visualizations of the generated molecules indicate that FlowMS produces structurally plausible candidates that closely resemble known structures, highlighting its potential for applications in metabolomics and natural product discovery.
Methodology
FlowMS employs a discrete flow matching approach that utilizes iterative refinement in probability space to generate molecular graphs. It conditions on spectral embeddings produced by a pretrained formula transformer encoder and enforces chemical formula constraints throughout the generation process. The model uses linear interpolation noising and continuous-time Markov chain denoising techniques to enhance the generation quality.
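The linear-interpolation noising can be sketched for discrete tokens. The vocabulary, sizes, and uniform prior below are our assumptions; the idea is that at time t each position holds the clean token with probability t and a prior sample otherwise, i.e. p_t(x) = t · one_hot(x₁) + (1 − t) · p_prior(x):

```python
import numpy as np

# Toy sketch of linear-interpolation noising for discrete flow matching
# (illustrative setup; FlowMS operates on molecular graphs, not token arrays).
rng = np.random.default_rng(0)
vocab, seq_len = 10, 8
x1 = rng.integers(0, vocab, size=seq_len)      # "clean" discrete sequence

def noise(x1, t):
    keep = rng.uniform(size=x1.shape) < t      # keep clean token w.p. t
    noise_tokens = rng.integers(0, vocab, size=x1.shape)
    return np.where(keep, x1, noise_tokens)

# A trained denoiser predicts x1 from x_t; generation then iteratively
# refines from t=0 (pure prior) toward t=1 (data), as in CTMC-based samplers.
print(noise(x1, t=0.0))  # all prior noise
print(noise(x1, t=1.0))  # exactly the clean sequence
```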
Results
FlowMS achieved a top-1 accuracy of 9.15%, representing a 9.7% relative improvement over the previous best method, DiffMS. It also recorded a top-10 MCES of 7.96, outperforming MS-BART by 4.2%. The framework's performance on the NPLIB1 benchmark indicates significant advancements in structure elucidation from mass spectra.
Implications
The introduction of FlowMS establishes discrete flow matching as a viable and effective method for mass spectrometry-based structure elucidation, which could significantly enhance molecular discovery in fields such as metabolomics and natural product research.
Transformers Learn Robust In-Context Regression under Distributional Uncertainty
Theory
- Transformers can perform in-context learning for linear regression under distributional uncertainties.
- The study systematically evaluates the performance of Transformers against classical regression methods under various distributional shifts.
- Transformers demonstrate robustness and adaptability beyond traditional estimators in non-Gaussian and non-i.i.d. settings.
- The research identifies specific regimes where Transformers outperform classical methods, including sample size and feature covariance.
Summary
This paper investigates the ability of Transformers to perform in-context learning (ICL) for linear regression tasks under realistic distributional uncertainties, which deviate from the standard i.i.d. Gaussian assumptions typically used in prior studies. The authors highlight that real-world data often presents challenges such as non-Gaussian noise, heavy-tailed distributions, and dependencies among input features. They conduct a systematic empirical analysis comparing Transformer performance against classical regression methods like Ordinary Least Squares (OLS) and Ridge regression across various distributional shifts. The findings reveal that Transformers can effectively adapt to these shifts, often matching or outperforming classical estimators, thus demonstrating their robustness in ICL scenarios. The study isolates the effects of different distributional shifts, providing insights into the conditions under which Transformers exhibit superior performance compared to traditional methods.
Methodology
The authors conducted a comprehensive empirical investigation by varying the distributions of input features, regression coefficients, and noise away from standard i.i.d. Gaussian settings. They compared the performance of Transformers against classical regression methods, including OLS and Ridge regression, using matched priors to evaluate both in-distribution performance and generalization under distributional shifts.
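The classical side of such a comparison is easy to sketch. The toy setup below (our own sizes and a Student-t noise shift) shows the kind of baseline fits the Transformer's in-context predictions are measured against:

```python
import numpy as np

# Sketch of the classical baselines under a heavy-tailed distributional shift
# (illustrative dimensions; the paper's experimental grid is broader).
rng = np.random.default_rng(0)
n, d = 40, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + rng.standard_t(df=2, size=n)   # heavy-tailed, non-Gaussian noise

# Ordinary Least Squares and Ridge regression baselines.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for name, w in [("OLS", w_ols), ("Ridge", w_ridge)]:
    print(name, np.linalg.norm(w - w_true))     # parameter recovery error
```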
Results
The results indicate that Transformers consistently match or outperform classical regression methods across various distributional shifts, demonstrating their robustness in in-context learning scenarios. The experiments reveal that Transformers can adapt to non-Gaussian and non-i.i.d. distributions, highlighting their potential advantages over traditional estimators in real-world applications.
Implications
The findings suggest that Transformers could be effectively utilized in practical applications involving noisy and complex data distributions, such as finance, healthcare, and environmental modeling. This research opens avenues for further exploration of Transformer capabilities in diverse real-world scenarios, potentially leading to improved predictive modeling techniques.
VC-Soup: Value-Consistency Guided Multi-Value Alignment for Large Language Models
NLP
Large Language Models
Reinforcement Learning
- VC-Soup addresses the limitations of existing multi-value alignment methods for LLMs.
- The framework introduces a value consistency metric to filter low-consistency data.
- It trains value-consistent policy models that enhance linear mode connectivity.
- The approach combines policies and applies Pareto filtering for balanced performance.
Summary
The paper addresses the challenge of aligning large language models (LLMs) with multiple, potentially conflicting human values, which is crucial for trustworthy AI applications. Existing methods for multi-value alignment, such as reward reweighting and model merging, face significant limitations, including the high cost of training separate models for each value combination and performance degradation due to value conflicts. To overcome these issues, the authors propose VC-Soup, a novel framework that emphasizes value consistency in data. They introduce a value consistency metric based on cosine similarity to filter out low-consistency preference pairs from value datasets. This filtered data is then used to train value-consistent policy models that maintain linear mode connectivity. The final step involves linearly combining these policies and applying Pareto filtering to achieve balanced multi-value performance. The authors demonstrate through extensive experiments and theoretical analysis that VC-Soup effectively mitigates conflicts and outperforms existing multi-value alignment methods, providing a more efficient and coherent approach to aligning LLMs with diverse human values.
Methodology
The methodology involves designing a value consistency metric based on cosine similarity to assess the coherence of preference pairs across different values. Low-consistency pairs are filtered out, and the remaining data is used to train smooth, value-consistent policy models. These models are then combined linearly, and Pareto filtering is applied to balance performance across multiple values.
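The consistency filter can be sketched as follows. The embedding scheme and threshold are illustrative assumptions; the idea is that a preference pair is kept only if its "preferred minus rejected" direction agrees, on average, across value dimensions:

```python
import numpy as np

# Sketch of a cosine-similarity consistency filter over preference pairs
# (illustrative embeddings and threshold, not the paper's exact metric).

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

rng = np.random.default_rng(0)
n_pairs, n_values, dim = 100, 3, 16
# pref_dirs[i, v] = embedding of pair i's preferred-minus-rejected direction
# as judged under value v (e.g. helpfulness, harmlessness, honesty).
pref_dirs = rng.normal(size=(n_pairs, n_values, dim))

def consistency(dirs):
    """Mean pairwise cosine similarity across the value dimensions."""
    sims = [cosine(dirs[i], dirs[j])
            for i in range(n_values) for j in range(i + 1, n_values)]
    return float(np.mean(sims))

scores = np.array([consistency(p) for p in pref_dirs])
kept = pref_dirs[scores > 0.0]        # drop low-consistency pairs
print(len(kept), "of", n_pairs, "pairs kept")
```

The filtered data then trains the per-value policies that are linearly combined and Pareto-filtered in the later stages.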
Results
The results indicate that VC-Soup significantly reduces conflicts in multi-value alignment and consistently outperforms existing methods. The framework demonstrates improved alignment performance across diverse human values, showcasing its effectiveness in real-world applications.
Implications
The findings suggest that VC-Soup can enhance the reliability and safety of LLMs in various applications, including content generation and decision-making, by ensuring better alignment with human values. This could lead to more trustworthy AI systems that cater to a broader range of user preferences.
Beyond Passive Aggregation: Active Auditing and Topology-Aware Defense in Decentralized Federated Learning
Federated Learning
Graph Learning
Theory
- Introduction of an active auditing framework for DFL to counter adaptive backdoor attacks.
- Development of three novel auditing metrics to expose hidden backdoors in local models.
- Implementation of a topology-aware defense placement strategy to enhance resilience.
- Theoretical analysis of convergence rates under co-evolving attack and defense dynamics.
Summary
This paper addresses the vulnerabilities of Decentralized Federated Learning (DFL) to adaptive backdoor attacks that can bypass traditional passive defenses. The authors propose a novel active auditing framework that shifts the defensive paradigm from passive aggregation to proactive intervention. They establish a dynamical model to characterize the spatiotemporal diffusion of adversarial updates across complex graph topologies. The framework includes three new auditing metrics: stochastic entropy anomaly, randomized smoothing Kullback-Leibler divergence, and activation kurtosis, which utilize private probes to stress-test local models and expose latent backdoors. Additionally, a topology-aware defense placement strategy is introduced to enhance global aggregation resilience. The paper provides theoretical insights into the system's convergence under co-evolving attack and defense dynamics. Empirical evaluations demonstrate that the proposed active framework effectively mitigates stealthy, adaptive backdoors while maintaining the utility of primary tasks, outperforming existing state-of-the-art defenses.
Methodology
The authors establish a dynamical model to analyze the diffusion of adversarial updates in DFL networks. They introduce three new metrics for auditing local models and implement a Multi-Armed Bandit framework for dynamic evaluation of neighbor reliability. A topology-aware defense allocation mechanism is designed to optimize the placement of auditing nodes based on the underlying graph structure.
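Of the three auditing metrics, activation kurtosis is the most self-contained to illustrate. The toy sketch below flags a neighbor whose probe activations are far more heavy-tailed than a Gaussian baseline (a pattern a few backdoor-dominated units could produce); the threshold and decision rule are illustrative assumptions, not the paper's calibrated procedure:

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis: heavy-tailed activation distributions
    (a few dominant units) give large positive values; a Gaussian gives ~0."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return float(np.mean(((x - mu) / (sigma + 1e-12)) ** 4) - 3.0)

def audit_neighbor(activations, threshold=3.0):
    """Toy decision rule: flag a neighbor's model as suspicious if its
    probe activations are strongly heavy-tailed."""
    return excess_kurtosis(activations) > threshold
```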
Results
The proposed active auditing framework significantly reduces the success rate of adaptive backdoor attacks while preserving the performance of primary tasks. The theoretical diffusion bounds correlate with the attack success rate, indicating the effectiveness of the auditing metrics and defense strategies.
Implications
The findings suggest that active auditing can enhance the security of decentralized federated learning systems, making them more resilient to sophisticated attacks. This approach could be applied to various domains where federated learning is utilized, such as healthcare, finance, and IoT, to ensure model integrity and reliability.
Are complicated loss functions necessary for teaching LLMs to reason?
NLP
Large Language Models
Reinforcement Learning
- Negative feedback is essential for effective learning in LLMs.
- PPO-style constraints are not necessary for improving mathematical reasoning.
- RGRA, a simplified variant of GRPO, shows superior performance on reasoning tasks.
- Simpler reinforcement learning methods can enhance reasoning in LLMs.
Summary
This paper investigates the necessity of complex loss functions in training large language models (LLMs) for reasoning tasks. The authors analyze Group Relative Policy Optimization (GRPO), a method that combines group-relative advantage estimation, PPO-style clipping, and KL regularization. They identify two critical insights: (1) negative feedback is crucial for effective learning, as training only on positive actions limits model performance; and (2) PPO-style constraints are not essential for enhancing mathematical reasoning. Based on these findings, the authors propose a simplified method called REINFORCE with Group Relative Advantage (RGRA), which retains the group-relative advantage estimation while omitting the more complex PPO-style components. Experiments demonstrate that RGRA can outperform GRPO on standard mathematical benchmarks, suggesting that simpler reinforcement learning approaches can effectively improve reasoning capabilities in LLMs, offering a more transparent and efficient alternative to GRPO.
Methodology
The authors conducted a systematic analysis of GRPO by isolating and removing components of its loss function to determine their necessity for effective learning. They proposed RGRA, which simplifies the GRPO approach by retaining group-relative advantage estimation while eliminating PPO-style clipping and policy ratio terms. The performance of RGRA was evaluated against standard mathematical benchmarks.
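The simplification described above fits in a few lines: keep the group-relative advantage, drop the clipping and ratio terms. This is a minimal sketch; the paper's exact normalization (e.g., whether the group standard deviation is included) may differ:

```python
import numpy as np

def group_relative_advantage(rewards):
    """Advantage of each sampled completion relative to its group:
    reward minus the group mean, scaled by the group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def rgra_loss(log_probs, rewards):
    """REINFORCE with group-relative advantage: no PPO clipping and no
    policy-ratio term, just -advantage * log pi(completion), averaged."""
    adv = group_relative_advantage(rewards)
    return float(-(adv * np.asarray(log_probs, dtype=float)).mean())
```

Completions scoring above the group mean get positive advantage (their log-probability is pushed up); below-mean completions supply the negative feedback the authors found essential.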
Results
The experiments indicated that RGRA achieved stronger performance than GRPO on mathematical reasoning tasks, demonstrating that simpler reinforcement learning methods can be effective in enhancing reasoning capabilities in LLMs.
Implications
The findings suggest that LLMs can be trained more efficiently with simpler loss functions, which may lead to more interpretable and robust models. This could influence future research and applications in LLM training methodologies, particularly in reasoning-focused tasks.
BoundAD: Boundary-Aware Negative Generation for Time Series Anomaly Detection
Time Series
Reinforcement Learning
Optimization
- Introduces a reconstruction-driven framework for generating hard negatives in TSAD.
- Utilizes reinforcement learning to adaptively optimize the negative sample generation process.
- Improves anomaly representation learning by focusing on boundary-aware negative samples.
- Achieves competitive detection performance on benchmark datasets.
Summary
The paper presents BoundAD, a novel framework for time series anomaly detection (TSAD) that focuses on improving the quality of negative sample generation through a boundary-aware approach. Traditional contrastive learning methods in TSAD often rely on random perturbations or pseudo-anomaly injections, which can fail to maintain temporal semantic consistency and provide effective decision-boundary supervision. BoundAD addresses these limitations by employing a reconstruction-driven strategy that generates hard negatives directly from normal samples. The framework utilizes a reconstruction network to capture normal temporal patterns and incorporates reinforcement learning to adaptively adjust the optimization process based on the reconstruction state. This allows for the generation of boundary-shifted samples that are closer to the normal data manifold, enhancing the contrastive representation learning process. The experimental results demonstrate that BoundAD significantly improves anomaly representation learning and achieves competitive performance on benchmark datasets, indicating its effectiveness in TSAD without the need for predefined anomaly patterns.
Methodology
The BoundAD framework employs a reconstruction network to learn normal temporal patterns and utilizes reinforcement learning to dynamically adjust the generation of negative samples. This approach focuses on creating hard negatives that are located near the boundary of normality, thereby improving the contrastive learning process without requiring explicit anomaly injections.
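As a purely illustrative sketch of what a boundary-shifted negative could look like: push a normal window a small step along its reconstruction residual, so the sample leaves the normal manifold but stays close to it. The paper learns this generation with a reconstruction network and reinforcement learning; the fixed step size and residual direction below are assumptions, not its method:

```python
import numpy as np

def boundary_shifted_negative(x, x_recon, alpha=0.5):
    """Toy hard-negative: move a normal sample `x` a step of length `alpha`
    along its reconstruction residual. `alpha` stands in for the step size
    that the paper adapts via RL (fixed here)."""
    residual = x - x_recon
    norm = np.linalg.norm(residual) + 1e-8
    return x + alpha * residual / norm
```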
Results
The proposed method demonstrated significant improvements in anomaly representation learning and achieved competitive performance on various benchmark datasets, outperforming traditional methods that rely on random perturbations or fixed anomaly injections.
Implications
BoundAD has potential applications in various fields requiring time series anomaly detection, such as industrial monitoring, healthcare, and cybersecurity. Its ability to generate meaningful negative samples can lead to more robust anomaly detection systems, reducing false positive rates and improving overall detection accuracy.
Approximate Subgraph Matching with Neural Graph Representations and Reinforcement Learning
Reinforcement Learning
Graph Learning
Optimization
- RL-ASM is the first approach to apply reinforcement learning for high-quality approximate subgraph matching.
- The method utilizes a Graph Transformer to fully exploit graph information for improved matching.
- RL-ASM optimizes node pair selection based on long-term rewards rather than greedy heuristics.
- Extensive experiments show RL-ASM's superior performance compared to traditional ASM algorithms.
Summary
This paper addresses the problem of Approximate Subgraph Matching (ASM), which is crucial for various applications in graph analysis but is known to be NP-hard. Traditional methods often rely on heuristic search strategies that do not fully leverage graph information, resulting in sub-optimal solutions. The authors propose a novel Reinforcement Learning-based Approximate Subgraph Matching (RL-ASM) algorithm that utilizes Graph Transformers to extract comprehensive graph representations and employs reinforcement learning policies to enhance matching accuracy. The RL-ASM algorithm is built on a branch-and-bound framework, selecting node pairs from the query and target graphs iteratively. The model is trained using supervised signals in an imitation learning phase, followed by fine-tuning with Proximal Policy Optimization (PPO) to maximize long-term rewards. Experimental results on synthetic and real-world datasets demonstrate that RL-ASM significantly outperforms existing ASM methods in both effectiveness and efficiency, marking a substantial advancement in the field.
Methodology
The RL-ASM algorithm employs a Graph Transformer for feature extraction from graphs and uses reinforcement learning to optimize the selection of node pairs during the matching process. The training involves an imitation learning phase followed by fine-tuning using Proximal Policy Optimization (PPO) to maximize cumulative rewards.
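The PPO fine-tuning stage maximizes the standard clipped surrogate objective, which can be sketched as follows (a generic PPO sketch, not the paper's full training loop):

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate: the probability ratio is clipped to
    [1-eps, 1+eps] so a single update cannot move the node-pair selection
    policy too far from the one that collected the data."""
    ratio = np.exp(np.asarray(logp_new, dtype=float)
                   - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    # pessimistic min: clipping only removes incentive, never adds it
    return float(np.minimum(unclipped, clipped).mean())
```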
Results
The RL-ASM algorithm outperformed existing ASM methods in terms of both effectiveness and efficiency across various synthetic and real-world datasets, demonstrating its capability to achieve solutions closer to optimal matches.
Implications
The findings suggest that RL-ASM can be applied in diverse fields such as database systems, network science, biochemistry, and privacy protection, where approximate subgraph matching is essential, especially in the presence of noisy data.
Towards Differentiating Between Failures and Domain Shifts in Industrial Data Streams
Time Series
Interpretability
- The proposed method effectively distinguishes between failures and healthy domain shifts in industrial data streams.
- Integration of a modified Page-Hinkley changepoint detector enhances the identification of changes in data distribution.
- Supervised domain-adaptation algorithms facilitate fast online anomaly detection.
- Explainable AI components support human operators in decision-making processes.
Summary
This paper addresses the critical challenge of distinguishing between failures and domain shifts in industrial data streams, particularly in the context of anomaly detection in manufacturing processes. The authors highlight that traditional anomaly and failure detection methods often misinterpret normal changes in data distribution, such as those arising from production line adjustments, as failures, leading to unnecessary alerts and maintenance costs. To tackle this issue, the authors propose a novel method that integrates a modified Page-Hinkley changepoint detector to identify both domain shifts and failures, alongside supervised domain-adaptation algorithms for rapid online anomaly detection. This approach is augmented with an explainable AI (XAI) component that aids human operators in differentiating between genuine failures and healthy domain shifts. The effectiveness of the proposed method is demonstrated through experiments conducted on a data stream from a steel factory, showcasing its ability to reduce false positives and improve operational robustness.
Methodology
The methodology involves a combination of a modified Page-Hinkley changepoint detector for identifying domain shifts and potential failures, along with supervised domain-adaptation algorithms for online anomaly detection. An explainable AI component is incorporated to assist human operators in interpreting the results and making informed decisions.
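For reference, the standard (unmodified) Page-Hinkley test for an upward shift in the stream mean can be sketched as follows; the paper's modification is not reproduced here:

```python
class PageHinkley:
    """Classic Page-Hinkley test for an upward shift in the stream mean.

    delta: magnitude of change to tolerate; lam: alarm threshold.
    The statistic is the cumulative deviation from the running mean minus
    its running minimum; crossing lam signals a changepoint.
    """
    def __init__(self, delta=0.005, lam=10.0):
        self.delta, self.lam = delta, lam
        self.n, self.mean, self.cum, self.cum_min = 0, 0.0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n       # running mean
        self.cum += x - self.mean - self.delta      # cumulative deviation
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.lam  # True -> changepoint
```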
Results
The experiments conducted on the steel factory data stream showed that the proposed method significantly reduces false positive alarms associated with misidentified failures, thereby enhancing the reliability of anomaly detection systems in industrial settings.
Implications
The findings suggest that the proposed method can lead to more efficient monitoring of industrial processes, reducing unnecessary maintenance costs and improving overall operational efficiency. It also highlights the importance of integrating explainable AI in industrial applications to support human decision-making.
MST-Direct: Matching via Sinkhorn Transport for Multivariate Geostatistical Simulation with Complex Non-Linear Dependencies
Optimization
Theory
Generative Models
- MST-Direct preserves complex non-linear dependencies in multivariate geostatistical simulations.
- The method utilizes Optimal Transport theory and the Sinkhorn algorithm for distribution matching.
- It processes all variables simultaneously, enhancing computational efficiency.
- Experimental validation shows 100% shape preservation across diverse relationship types.
Summary
This paper introduces MST-Direct, a novel algorithm designed for multivariate geostatistical simulation that effectively captures complex non-linear dependencies among geological variables. Traditional methods like Gaussian Copula and LU Decomposition often fail to preserve intricate joint distribution shapes due to their reliance on linear correlation structures. MST-Direct leverages Optimal Transport theory, specifically the Sinkhorn algorithm, to directly match multivariate distributions while maintaining spatial correlation. The method processes all variables as a single multidimensional vector and employs relational matching with k-nearest neighbor adjacency to ensure spatial coherence. The authors validate MST-Direct through extensive experiments, comparing it against traditional methods on synthetic data exhibiting five types of complex bivariate relationships. The results demonstrate that MST-Direct achieves perfect shape preservation across all tested relationship types, while also maintaining competitive variogram reproduction, marking a significant advancement in accurately modeling non-linear geological dependencies.
Methodology
The methodology involves formulating the problem using Optimal Transport to find a coupling between source and target distributions that minimizes transport costs. The Sinkhorn algorithm is employed to solve the entropy-regularized optimal transport problem, allowing for efficient computation. The approach maintains spatial coherence through k-nearest neighbor adjacency and processes all variables simultaneously as a multidimensional vector.
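The Sinkhorn iteration at the heart of the method can be sketched as follows (a generic entropic-OT sketch; the paper's relational matching and spatial-coherence machinery are not shown, and the regularization strength here is chosen only for quick convergence in a toy demo):

```python
import numpy as np

def sinkhorn(cost, mu, nu, eps=0.5, n_iter=1000):
    """Entropy-regularized optimal transport via Sinkhorn iterations:
    alternately rescale the Gibbs kernel K = exp(-cost/eps) so the
    coupling's row sums match mu and its column sums match nu."""
    K = np.exp(-cost / eps)
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan
```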
Results
The experiments reveal that MST-Direct achieves perfect histogram similarity (100%) across all tested complex bivariate relationships, including step functions, Gaussian mixtures, sinusoidal patterns, random branching, and heteroscedastic relationships. Additionally, it maintains competitive variogram reproduction compared to traditional methods.
Implications
MST-Direct has significant implications for fields requiring accurate modeling of geological phenomena, such as resource management, environmental studies, and risk assessment in geostatistical applications. Its ability to handle complex non-linear dependencies opens new avenues for more realistic simulations in various scientific and engineering domains.
Communication-Efficient and Robust Multi-Modal Federated Learning via Latent-Space Consensus
Federated Learning
Multimodal
- Introduction of CoMFed, a novel framework for multi-modal federated learning.
- Utilization of learnable projection matrices for generating compressed latent representations.
- Implementation of a robust alignment regularizer based on geometric-median consensus.
- Demonstration of competitive performance on real-world multi-modal datasets.
Summary
This paper addresses the challenges of applying Federated Learning (FL) in multi-modal settings, where clients possess heterogeneous modalities and model architectures. The proposed framework, CoMFed, utilizes learnable projection matrices to create compressed latent representations, allowing for effective collaboration among clients without the need for shared data or architectures. A latent-space regularizer is introduced to align these representations across clients, enhancing cross-modal consistency and robustness against outliers. The authors demonstrate that CoMFed achieves competitive accuracy in human activity recognition tasks while minimizing communication overhead, thus addressing the critical issues of communication efficiency and robustness in multi-modal FL scenarios.
Methodology
The CoMFed framework employs learnable projection matrices to map client-specific features into a shared latent space, facilitating the exchange of semantically comparable information. A robust latent-space consensus mechanism is introduced, leveraging geometric-median regularization to enhance resilience against outliers and Byzantine clients. The framework operates in a single-stage training process, avoiding the need for public datasets.
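The geometric median underlying the consensus mechanism is commonly computed with the Weiszfeld iteration, sketched below; this is one standard way to compute the robust aggregate that the regularizer builds on, not necessarily the paper's exact procedure:

```python
import numpy as np

def geometric_median(points, n_iter=100, tol=1e-9):
    """Weiszfeld iteration for the geometric median: the point minimizing
    the sum of Euclidean distances to all inputs, which down-weights
    outlying (possibly Byzantine) client representations."""
    pts = np.asarray(points, dtype=float)
    z = pts.mean(axis=0)
    for _ in range(n_iter):
        d = np.linalg.norm(pts - z, axis=1)
        d = np.maximum(d, tol)              # avoid division by zero
        w = 1.0 / d                         # inverse-distance weights
        z_new = (w[:, None] * pts).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z
```

Unlike the mean, the median stays with the honest majority even when one client submits an arbitrarily corrupted representation.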
Results
Experimental evaluations on human activity recognition benchmarks indicate that CoMFed achieves competitive accuracy compared to existing methods while maintaining minimal communication overhead. The results highlight the framework's effectiveness in enabling collaboration among heterogeneous clients in multi-modal settings.
Implications
The proposed framework has significant implications for real-world applications where data privacy is paramount, such as healthcare and smart environments. By enabling effective collaboration among diverse devices without sharing raw data, CoMFed could enhance the performance of machine learning models in various multi-modal contexts.
Foundations of Schrödinger Bridges for Generative Modeling
Generative Models
Theory
Optimization
- Schrödinger bridges unify various generative modeling frameworks, including diffusion models and flow matching.
- The paper develops a mathematical foundation for the Schrödinger bridge problem, linking it to optimal transport and stochastic control.
- A comprehensive toolkit for constructing Schrödinger bridges is introduced, facilitating task-specific computational methods.
- The framework is applicable to diverse problems, including data translation and single-cell state dynamics modeling.
Summary
This paper presents a comprehensive exploration of Schrödinger bridges as a foundational framework for generative modeling. It emphasizes the transformation of simple prior distributions into complex target distributions through stochastic paths in probability space. The author develops the mathematical underpinnings of the Schrödinger bridge problem, integrating concepts from optimal transport, stochastic control, and path-space optimization. The guide details both static and dynamic formulations of the Schrödinger bridge problem, illustrating their connections to modern generative modeling techniques such as diffusion models and score-based methods. A toolkit for constructing Schrödinger bridges from first principles is introduced, leading to the development of generalized and task-specific computational methods. The paper also discusses various applications of generative modeling using Schrödinger bridges, including data translation, modeling single-cell state dynamics, and sampling Boltzmann distributions, thereby providing a unified theoretical perspective that enhances the understanding and application of generative models.
Methodology
The methodology involves a detailed mathematical analysis of the Schrödinger bridge problem, including static and dynamic formulations. The author employs concepts from optimal transport, stochastic control, and path-space optimization to derive a toolkit for constructing Schrödinger bridges. The paper also explores various algorithms, including Sinkhorn's algorithm and stochastic optimal control techniques, to facilitate the implementation of these concepts in generative modeling.
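In one standard formulation (notation assumed here, following the entropic optimal transport literature), the static Schrödinger bridge problem seeks the coupling closest in KL divergence to a Gibbs kernel:

```latex
\min_{\pi \in \Pi(\mu,\nu)} \mathrm{KL}\!\left(\pi \,\middle\|\, K\right),
\qquad
K(x,y) \propto \exp\!\left(-\frac{c(x,y)}{\varepsilon}\right),
```

where Π(μ,ν) is the set of couplings with prescribed marginals μ and ν, c is a transport cost, and ε the noise level. As ε → 0 this recovers the optimal transport coupling, and Sinkhorn's algorithm computes the minimizer by alternating rescalings that enforce each marginal in turn.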
Results
The main results include the establishment of a theoretical framework that connects Schrödinger bridges with modern generative modeling techniques. The author demonstrates how to construct Schrödinger bridges from first principles and outlines their applications in various generative tasks, showcasing the flexibility and generalizability of the framework.
Implications
The implications of this work are significant for the field of generative modeling, as it provides a unified theoretical foundation that can simplify the understanding and application of various generative techniques. This framework can lead to improved algorithms for generative tasks, enhancing performance in areas such as data translation and biological modeling.
Frayed RoPE and Long Inputs: A Geometric Perspective
NLP
Large Language Models
Theory
- RoPE causes performance issues for long inputs due to the dispersion of key/query clusters.
- Attention mechanisms create sink tokens that help prevent over-mixing of information.
- RoPE-ID modifies RoPE by applying high-frequency rotation to a fraction of channels, improving generalization.
- The proposed method shows strong performance on long-context tasks compared to previous tuning-free approaches.
Summary
This paper addresses the limitations of Rotary Positional Embedding (RoPE) in transformer models, particularly when handling long input sequences that exceed the training context length. The authors provide a unified geometric perspective on the behavior of attention mechanisms with RoPE, revealing that long inputs disrupt the clustering of key and query latent point clouds, leading to performance degradation. They introduce RoPE-ID (In Distribution), a modification that applies high-frequency rotation to a subset of channels, allowing attention layers to generalize better to longer inputs. The effectiveness of RoPE-ID is validated through experiments on the LongBench and RULER benchmarks using 1B and 3B parameter Transformers, demonstrating significant improvements in context length generalization without the need for extensive tuning.
Methodology
The authors conducted both empirical and theoretical analyses to explore the geometric properties of attention mechanisms with RoPE. They identified the clustering behavior of key and query points and proposed RoPE-ID as a modification to enhance performance on long inputs. Experiments were performed using large-scale Transformers on established benchmarks to validate the effectiveness of their approach.
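The modification can be sketched on a single vector as follows; the fraction and placement of the high-frequency "ID" channels are illustrative assumptions, not the paper's exact assignment:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0, id_frac=0.25):
    """Toy RoPE: rotate consecutive channel pairs of `x` by pos * theta_i,
    with theta_i = base**(-2i/d). Following the RoPE-ID idea, the last
    `id_frac` fraction of pairs is reassigned the highest frequency
    (theta = 1) instead of its usual slow frequency."""
    d = x.shape[-1]
    n_pairs = d // 2
    theta = base ** (-2 * np.arange(n_pairs) / d)
    n_id = int(id_frac * n_pairs)
    if n_id > 0:
        theta[-n_id:] = 1.0                 # high-frequency "ID" channels
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out
```

The rotations preserve vector norms, and query-key dot products depend only on the relative position, which is the property the geometric analysis builds on.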
Results
The introduction of RoPE-ID led to improved context length generalization in transformer models, as evidenced by strong performance on the LongBench and RULER benchmarks. The results indicate that RoPE-ID effectively maintains the necessary separation of key/query clusters, allowing for better handling of longer input sequences.
Implications
The findings suggest that RoPE-ID can be a valuable tool for enhancing the performance of large language models on long-context tasks, potentially leading to more robust applications in natural language processing and other fields that require handling of extended sequences.
SLEA-RL: Step-Level Experience Augmented Reinforcement Learning for Multi-Turn Agentic Training
Reinforcement Learning
Large Language Models
Robotics
- SLEA-RL retrieves experiences at each decision step, enhancing adaptability in multi-turn tasks.
- The framework includes a self-evolving experience library that maintains quality through score-based admission and rate-limited extraction.
- Empirical validation shows SLEA-RL outperforms standard RL and experience-augmented methods on multiple benchmarks.
Summary
The paper introduces SLEA-RL, a novel framework designed to enhance the training of Large Language Model (LLM) agents in multi-turn tool-use tasks by leveraging step-level experience retrieval. Traditional reinforcement learning methods often operate in isolation, failing to utilize past experiences across episodes, which can hinder learning and performance. SLEA-RL addresses this limitation by retrieving relevant experiences at each decision step based on the current observation, rather than relying on a static set of experiences. The framework consists of three main components: (1) step-level observation clustering for efficient retrieval of structurally equivalent states, (2) a self-evolving experience library that distills successful strategies and failure patterns, and (3) policy optimization with step-level credit assignment for improved advantage estimation. The empirical results demonstrate that SLEA-RL significantly outperforms various reinforcement learning baselines on long-horizon multi-turn benchmarks, showcasing its effectiveness in adapting to dynamic environments.
Methodology
SLEA-RL integrates step-level experience retrieval into the training loop, clustering observations for efficient retrieval and employing a self-evolving experience library that adapts based on semantic analysis rather than gradient updates. The framework optimizes policy through fine-grained credit assignment, allowing agents to learn from relevant experiences dynamically.
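A toy version of the step-level library might look like the following; the cluster key here is a crude token signature standing in for the paper's observation clustering, and the admission and retention rules are illustrative sketches of its score-based admission:

```python
from collections import defaultdict

class StepExperienceLibrary:
    """Toy step-level experience store: observations are reduced to a coarse
    cluster key, and each key maps to a score-ranked list of experiences.
    Admission is score-gated, mirroring the quality-controlled library."""
    def __init__(self, min_score=0.5, per_key=3):
        self.min_score, self.per_key = min_score, per_key
        self.store = defaultdict(list)

    @staticmethod
    def key(observation):
        # crude stand-in for embedding-based clustering of observations
        return frozenset(w for w in observation.lower().split() if len(w) > 3)

    def add(self, observation, experience, score):
        if score < self.min_score:          # score-based admission
            return False
        bucket = self.store[self.key(observation)]
        bucket.append((score, experience))
        bucket.sort(reverse=True)
        del bucket[self.per_key:]           # keep only the best few
        return True

    def retrieve(self, observation):
        return [e for _, e in self.store.get(self.key(observation), [])]
```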
Results
SLEA-RL demonstrated superior performance on benchmarks such as ALFWorld and WebShop, achieving faster convergence and higher success rates compared to existing reinforcement learning methods like GiGPO and GRPO.
Implications
The proposed framework has the potential to significantly improve the training of LLM agents in complex, dynamic environments, making it applicable to various multi-turn interaction scenarios such as web navigation, interactive search, and tool-integrated reasoning.
Self-Tuning Sparse Attention: Multi-Fidelity Hyperparameter Optimization for Transformer Acceleration
NLP
Large Language Models
Efficient ML
- AFBS-BO automates the hyperparameter tuning process for sparse attention, eliminating the need for manual grid search.
- The method achieves a 3.4× speedup in hyperparameter discovery and requires significantly fewer evaluations than traditional methods.
- Configurations discovered by AFBS-BO outperform existing sparse attention baselines while closely matching the quality of dense attention.
- The framework enables robust performance across diverse transformer architectures and input distributions.
Summary
This paper addresses the challenges of deploying sparse attention mechanisms in transformer models, which are hindered by the need for manual hyperparameter tuning. The authors propose AFBS-BO (Adaptive Fidelity Binary Search with Bayesian Optimization), an automated framework that optimizes hyperparameters specific to layers and heads in sparse attention without human intervention. The method combines Bayesian Optimization for global exploration with binary search for local refinement, utilizing multi-fidelity evaluations to enhance efficiency. The results demonstrate that AFBS-BO accelerates hyperparameter discovery by 3.4 times and requires 8.8 times fewer evaluations compared to traditional grid search methods. Additionally, it identifies high-sparsity configurations that outperform existing sparse attention techniques while maintaining performance close to dense attention. This advancement transforms sparse attention from a manually tuned heuristic into a self-optimizing component, facilitating its integration across various transformer architectures and applications.
Methodology
AFBS-BO employs a hybrid algorithm consisting of three stages: (1) Bayesian Optimization for initial global exploration of the hyperparameter space using low-fidelity evaluations, (2) Binary Search Refinement for precise tuning within promising regions using high-fidelity evaluations, and (3) Multi-Input Validation to ensure robustness across diverse inputs.
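Stage (2) can be sketched as a binary search over sparsity against a user-supplied quality metric; the `evaluate` callback, the quality budget, and the monotonicity assumption below are illustrative, not the paper's exact procedure:

```python
def refine_sparsity(evaluate, low=0.0, high=1.0, budget=0.1, n_steps=20):
    """Toy stage-2 refinement: binary-search the largest sparsity whose
    quality degradation, relative to dense attention, stays within `budget`.
    `evaluate(s)` is a high-fidelity metric (lower is better) assumed to be
    monotone non-decreasing in sparsity s."""
    baseline = evaluate(0.0)                 # dense-attention quality
    best = low
    for _ in range(n_steps):
        mid = (low + high) / 2
        if evaluate(mid) - baseline <= budget:
            best, low = mid, mid             # still acceptable: push sparsity up
        else:
            high = mid
    return best
```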
Results
AFBS-BO demonstrated a 3.4× faster hyperparameter discovery time (3.0s vs. 10.1s) and required 8.8× fewer evaluations (240 vs. 2100) compared to grid search. It achieved a perplexity of 7.45 at 70.7% sparsity on WikiText-2, outperforming the state-of-the-art H2O method and coming within 0.03 PPL of the theoretical Top-K oracle.
Implications
The automation of hyperparameter tuning in sparse attention mechanisms can significantly enhance the usability and deployment of transformer models in various applications, particularly in dynamic environments where expert tuning is impractical.
Hierarchical Latent Structure Learning through Online Inference
Theory
Time Series
Efficient ML
- HOLMES model integrates hierarchical representation with online inference for latent structure learning.
- The model uses a nested Chinese Restaurant Process prior and sequential Monte Carlo methods.
- HOLMES achieves compact representations that support one-shot transfer to higher-level categories.
- In simulations, HOLMES improves predictive performance in context-dependent tasks compared to flat models.
Summary
This paper introduces the Hierarchical Online Learning of Multiscale Experience Structure (HOLMES) model, which addresses the challenge of balancing generalization and discrimination in learning systems. Traditional online latent-cause models assume flat partitions, while hierarchical Bayesian models require offline inference. HOLMES combines a nested Chinese Restaurant Process prior with sequential Monte Carlo inference to enable online hierarchical latent structure learning. The model allows for trial-by-trial inference over hierarchical representations without explicit supervision. In simulations, HOLMES demonstrated predictive performance comparable to flat models while learning more compact representations that facilitated one-shot transfer to higher-level latent categories. Additionally, in a context-dependent task with nested temporal structure, HOLMES outperformed flat models in outcome prediction, showcasing its effectiveness in discovering hierarchical structures in sequential data.
Methodology
The HOLMES model is formalized as a Bayesian nonparametric model where observations are assigned to paths through a latent tree. It combines a hierarchical prior over tree structures with sequential Monte Carlo inference, allowing for online updates of latent representations based on sequential observations. The model utilizes a nested Chinese Restaurant Process to manage the hierarchical structure and employs particle filtering for approximate posterior inference.
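The Chinese Restaurant Process seating rule, which the nested prior applies at every level of the tree along a root-to-leaf path, can be sketched as:

```python
import numpy as np

def crp_probabilities(counts, alpha=1.0):
    """Chinese Restaurant Process seating probabilities: an observation
    joins existing cluster k with probability n_k / (n + alpha) and opens
    a new cluster with probability alpha / (n + alpha). Returned vector:
    one entry per existing cluster, plus a final new-cluster entry."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    return np.append(counts, alpha) / (n + alpha)
```

Larger clusters attract more observations (generalization) while the concentration parameter alpha keeps a constant door open for new latent causes (discrimination).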
Results
In synthetic tasks, HOLMES matched the predictive performance of flat models while learning more compact representations. It also demonstrated improved outcome prediction in a context-dependent decision-making task, effectively capturing latent rule structures across different contexts and timescales.
Implications
The HOLMES model provides a computational framework for discovering hierarchical structures in sequential data, which could have applications in various fields such as cognitive science, artificial intelligence, and any domain requiring effective learning from complex, structured data.
Context Bootstrapped Reinforcement Learning
Reinforcement Learning
Large Language Models
- CBRL effectively addresses exploration inefficiency in RLVR by using few-shot demonstrations.
- The method employs a stochastic injection mechanism that reduces assistance over time.
- CBRL shows consistent performance improvements across diverse tasks and model families.
- The approach is algorithm-agnostic, yielding gains with different reinforcement learning algorithms.
Read more
Context Bootstrapped Reinforcement Learning
Summary
This paper introduces Context Bootstrapped Reinforcement Learning (CBRL), a novel approach designed to enhance exploration efficiency in Reinforcement Learning from Verifiable Rewards (RLVR). The authors identify that RLVR often suffers from exploration inefficiency, particularly in tasks requiring novel reasoning patterns or domain-specific knowledge. CBRL addresses this issue by incorporating few-shot demonstrations into training prompts, with a stochastic injection mechanism that gradually reduces the reliance on these demonstrations as training progresses. The method was validated across two model families and five Reasoning Gym tasks, demonstrating significant improvements in success rates and exploration efficiency. The results indicate that CBRL is algorithm-agnostic and can be effectively applied to various tasks, including a domain-specific programming language, Q, which diverges from mainstream language conventions. The findings suggest that CBRL not only enhances learning dynamics but also fosters durable behaviors in models, making it a promising approach for improving RLVR methodologies.
Methodology
CBRL augments the RLVR training process by maintaining a bank of few-shot examples that are dynamically injected into training prompts with a decreasing probability. This method leverages in-context learning capabilities of large language models to guide early exploration and gradually encourages independent performance as training progresses.
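The decaying injection schedule can be sketched as follows; the paper does not specify its exact schedule here, so the linear decay, the two-demonstration sample size, and all names are illustrative assumptions:

```python
import random

def build_prompt(task_prompt, demo_bank, step, total_steps, p0=1.0, rng=random):
    """Inject few-shot demonstrations with a probability that decays
    (linearly, as one possible choice) from p0 to 0 over training,
    weaning the model off the demonstrations as training progresses."""
    p_inject = p0 * max(0.0, 1.0 - step / total_steps)
    if rng.random() < p_inject and demo_bank:
        demos = "\n\n".join(rng.sample(demo_bank, k=min(2, len(demo_bank))))
        return demos + "\n\n" + task_prompt   # demonstrations prepended
    return task_prompt                        # unassisted prompt
```

Early in training nearly every prompt carries demonstrations; by the final step none do, matching the paper's description of gradually reduced reliance.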
Results
CBRL consistently outperformed the GRPO-only baseline across all tested model-environment pairs, with accuracy improvements ranging from +1.3% to +22.3%. In the domain-specific Q programming language, CBRL improved test-pass rates from 27.3% to 43.0% and Pass@1 from 5.0% to 26.3%. The method demonstrated significant gains in exploration efficiency, achieving higher mean rewards early in training.
Implications
The findings suggest that CBRL can be a valuable technique for enhancing reinforcement learning applications, particularly in domains with limited pretraining data or complex reasoning requirements. Its algorithm-agnostic nature allows for broader applicability across various reinforcement learning frameworks.
Adapting Methods for Domain-Specific Japanese Small LMs: Scale, Architecture, and Quantization
NLP
Large Language Models
Efficient ML
- Optimal training scale for domain-specific adaptation is identified as 4,000 samples, balancing underfitting and overfitting.
- Llama-3 models with Japanese pre-training outperform multilingual models in technical domain tasks.
- Architecture-specific effects of quantization are documented, with Llama-3 models improving under Q4 quantization while GQA models degrade.
- A complete reproducible pipeline is provided for practitioners to replicate results on consumer hardware.
Read more
Adapting Methods for Domain-Specific Japanese Small LMs: Scale, Architecture, and Quantization
Summary
This paper introduces a systematic approach for developing domain-specific Japanese small language models (LMs) using QLoRA fine-tuning. The study addresses three critical questions: the optimal training scale, the selection of base models, and the effects of architecture-aware quantization. Through a three-stage methodology, the author first conducts scale-learning experiments, determining that a training dataset of 4,000 samples yields the best performance, with a minimum negative log-likelihood (NLL) of 1.127. In the second stage, the performance of four Japanese LLMs is compared, revealing that Llama-3 models with Japanese continual pre-training (Swallow-8B and ELYZA-JP-8B) significantly outperform multilingual models like Qwen2.5-7B. The final stage investigates the impact of quantization, finding that Llama-3 architectures benefit from Q4_K_M quantization, while GQA architectures degrade under similar conditions. The paper concludes with actionable insights for practitioners, emphasizing the importance of optimal training scale, model selection, and quantization strategies for deploying compact Japanese specialist LMs on consumer hardware.
Methodology
The methodology consists of three stages: (1) conducting scale-learning experiments to determine the optimal training dataset size, (2) comparing the performance of various Japanese LLMs under identical QLoRA training conditions, and (3) analyzing the effects of quantization on model performance, focusing on architecture-specific vulnerabilities.
Results
The study finds that 4,000 training samples yield the best performance with minimal NLL. Llama-3 models with Japanese continual pre-training achieve superior scores compared to multilingual models. Additionally, Llama-3 architectures show improved performance under Q4 quantization, while GQA architectures experience significant degradation.
Implications
The findings provide essential guidance for the development and deployment of domain-specific Japanese LMs, particularly in technical fields. The insights on training scale, model selection, and quantization can help practitioners optimize performance on consumer-grade hardware.
Quotient Geometry and Persistence-Stable Metrics for Swarm Configurations
Robotics
Theory
Optimization
- Introduction of a new metric for comparing swarm configurations that is both persistence-stable and symmetry-invariant.
- Establishment of a quotient formation space and a corresponding formation matching metric, enhancing the understanding of multi-agent systems.
- Analysis of the metric geometry reveals compactness and completeness under specific conditions, linking to classical configuration spaces.
- Identification of symmetry-mismatch and persistence-compression mechanisms that affect the expressivity of signatures.
Read more
Quotient Geometry and Persistence-Stable Metrics for Swarm Configurations
Summary
This paper presents a novel approach to analyzing swarm and constellation reconfiguration by introducing persistence-stable, symmetry-invariant geometric representations for multi-agent configuration data. The author proposes a quotient formation space, denoted S_n(M, G), and a formation matching metric, d_{M,G}, which optimizes assignment errors over symmetries and relabelings. This metric serves as a structured relaxation of the Gromov–Hausdorff distance, ensuring that the induced inter-agent metric spaces satisfy a specific bound. The paper further explores the stability of Vietoris–Rips persistence, establishing that the persistence signatures can be used for effective reconfiguration monitoring. The metric geometry of the quotient space is analyzed, revealing properties such as compactness and completeness under certain assumptions. The study identifies mechanisms for symmetry-mismatch and persistence-compression, contributing to the understanding of non-injectivity in signatures. Additionally, a phase-circle model is introduced, demonstrating a conditional inverse theorem that relates the H0 signature to the formation matching metric, providing a framework for local bi-Lipschitz control. Examples on S^2 and T^m illustrate practical applications in satellite constellations and formation settings.
Methodology
The methodology involves the introduction of a quotient formation space and a formation matching metric that optimizes assignment errors over symmetries and relabelings. The paper employs concepts from persistent homology and metric geometry to analyze the stability and expressivity of the proposed signatures, alongside theoretical proofs and examples to illustrate the findings.
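The relabeling part of such a matching metric can be sketched by brute force; optimization over the symmetry group G is omitted here, so this only illustrates the permutation quotient, and the max-error choice is one possible convention:

```python
from itertools import permutations
from math import dist

def formation_matching_distance(X, Y):
    """Relabeling-invariant matching distance between two agent
    configurations (lists of coordinate tuples): minimise over agent
    permutations the largest per-agent placement error.  O(n!), so
    toy sizes only; the paper's d_{M,G} additionally optimises over
    ambient symmetries, which is not shown."""
    n = len(X)
    best = float("inf")
    for perm in permutations(range(n)):
        err = max(dist(X[i], Y[perm[i]]) for i in range(n))
        best = min(best, err)
    return best
```

Relabeled copies of the same formation come out at distance zero, which is exactly the quotient-space behaviour the paper formalizes.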
Results
The results indicate that the proposed formation matching metric is a robust tool for monitoring swarm configurations, with established bounds relating it to the Gromov–Hausdorff distance. The analysis confirms the compactness and completeness of the quotient space under certain conditions, and the conditional inverse theorem provides a framework for understanding the relationship between the H0 signature and the formation matching metric.
Implications
The findings have significant implications for the design and analysis of multi-agent systems, particularly in applications involving drone swarms, satellite constellations, and mobile sensor networks. The persistence-stable metrics can enhance the monitoring and control of these systems, leading to improved performance and reliability.
OCP: Orthogonal Constrained Projection for Sparse Scaling in Industrial Commodity Recommendation
Optimization
- OCP addresses embedding collapse in sparse scaling by enforcing orthogonality in the projection matrix.
- The method quantifies representation isotropy using Singular Entropy (SE) to analyze the impact of long-tail sparsity.
- Empirical results show OCP accelerates loss convergence and improves model scalability.
- Large-scale deployment on JD.com resulted in a 12.97% increase in UCXR and an 8.9% uplift in GMV.
Read more
OCP: Orthogonal Constrained Projection for Sparse Scaling in Industrial Commodity Recommendation
Summary
This paper addresses the challenges faced by industrial commodity recommendation systems, particularly the issues arising from low-frequency information interference in traditional Item-Id vocabularies during sparse scaling. The authors propose a novel method called Orthogonal Constrained Projection (OCP) to optimize embedding representation. By enforcing orthogonality in the projection matrix, OCP aligns the singular value spectrum of learned embeddings with an orthogonal basis, which helps maintain high singular entropy and prevents overfitting to rare items. The methodology includes characterizing embedding collapse in sparse scaling and quantifying representation isotropy using Singular Entropy (SE). The authors validate OCP through both offline experiments and large-scale online deployment, demonstrating its effectiveness in enhancing model scalability and performance. The results indicate a significant increase in user conversion rate (UCXR) and gross merchandise volume (GMV) in a real-world setting, confirming the utility of OCP in optimizing both sparse vocabularies and dense architectures.
Methodology
The authors formalize the concept of sparse scaling and analyze embedding collapse in recommendation systems. They introduce the Orthogonal Constrained Projection method, which constrains the projection matrix on the Stiefel manifold to stabilize the singular-value distribution of embeddings. The methodology includes both theoretical analysis and empirical validation through offline experiments and large-scale online deployment.
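Singular Entropy as described here reduces to the Shannon entropy of the normalized singular-value spectrum; the normalization convention below (probabilities proportional to each singular value) is a common choice and not necessarily the paper's exact definition:

```python
from math import log

def singular_entropy(singular_values):
    """Shannon entropy of the normalised singular-value spectrum of an
    embedding matrix.  High SE = isotropic, well-spread embeddings;
    SE near zero = spectral collapse onto a few directions, the failure
    mode OCP is designed to prevent."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    return -sum(p * log(p) for p in probs)
```

A uniform spectrum of r equal singular values gives the maximum SE of log(r); a rank-collapsed spectrum gives zero.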
Results
The implementation of OCP led to a 12.97% increase in user conversion rate (UCXR) and an 8.9% uplift in gross merchandise volume (GMV) during large-scale deployment on JD.com, demonstrating its effectiveness in enhancing both sparse vocabulary representation and dense architecture performance.
Implications
The findings suggest that OCP can significantly improve the performance of recommendation systems, particularly in scenarios with large and sparse item sets. This has implications for enhancing user experience and increasing sales in e-commerce platforms.
SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding
NLP
Large Language Models
Efficient ML
- SpecForge provides a scalable and efficient framework for training speculative decoding models.
- The framework supports EAGLE-3, enabling significant speed improvements in training and inference.
- SpecBundle offers a suite of high-quality draft models, addressing the scarcity of effective draft models in the community.
- The proposed methods lead to substantial reductions in inference latency without compromising output quality.
Read more
SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding
Summary
The paper introduces SpecForge, an open-source training framework designed to enhance speculative decoding for large language models (LLMs). Speculative decoding aims to reduce inference latency by utilizing a lightweight draft model to propose multiple tokens for verification by a larger target model. However, its practical application has been limited due to the scarcity of high-quality draft models and the lack of scalable training infrastructure. SpecForge addresses these challenges by implementing target-draft decoupling, hybrid parallelism, and optimized training kernels, achieving up to 9.9× faster training for the EAGLE-3 model on Qwen3-235B-A22B. Additionally, the authors present SpecBundle, a collection of production-grade draft models trained with SpecForge, which significantly improves the availability of high-quality drafts for the community. The results demonstrate that these draft models can achieve up to 4.48× end-to-end inference speedup on SGLang, establishing SpecForge as a robust foundation for deploying speculative decoding in real-world applications.
Methodology
The authors developed SpecForge by incorporating several innovative techniques, including target-draft decoupling to separate the training of draft and target models, hybrid parallelism to optimize resource utilization, and specialized training kernels to enhance performance. They also conducted a systematic study of training recipes for speculative decoding, leading to the creation of SpecBundle, a collection of draft models trained on large datasets to ensure robustness and effectiveness.
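The draft-propose/target-verify loop at the heart of speculative decoding can be sketched as follows; this is a greedy toy, not EAGLE-3's tree-based drafting, and the callable names are illustrative:

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One greedy speculative-decoding step: the cheap draft model
    proposes k tokens, the target model verifies them left to right,
    and we keep the accepted run plus the target's own next token.
    draft_next/target_next map a token sequence to the next token."""
    proposal = []
    ctx = list(prefix)
    for _ in range(k):                 # cheap drafting pass
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    accepted = []
    ctx = list(prefix)
    for tok in proposal:               # verification pass
        t = target_next(ctx)
        if t == tok:                   # target agrees: token accepted for free
            accepted.append(tok)
            ctx.append(tok)
        else:                          # first disagreement: take target's token
            accepted.append(t)
            break
    else:
        accepted.append(target_next(ctx))  # all accepted: bonus target token
    return list(prefix) + accepted
```

When the draft agrees with the target, several tokens are emitted per target-model call, which is where the latency reduction comes from.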
Results
SpecForge demonstrated up to 9.9× faster training times for the EAGLE-3 model and enabled draft models to achieve up to 4.48× end-to-end inference speedup on SGLang. The systematic approach to training and the release of high-quality draft models significantly improved the practical applicability of speculative decoding.
Implications
The development of SpecForge and SpecBundle has the potential to enhance the efficiency of LLMs in various applications, particularly in real-time and high-throughput scenarios. By providing a robust framework and high-quality draft models, the authors aim to facilitate broader adoption of speculative decoding techniques in both research and industry.
MolRGen: A Training and Evaluation Setting for De Novo Molecular Generation with Reasoning Models
Large Language Models
Reinforcement Learning
Generative Models
- Introduction of MOLRGEN, a large-scale dataset for de novo molecular generation.
- Development of a diversity-aware top-k scoring mechanism for evaluating generated molecules.
- Successful training of a 24B LLM using reinforcement learning for molecular generation.
- Identification of challenges in exploring the chemical space during molecular generation.
Read more
MolRGen: A Training and Evaluation Setting for De Novo Molecular Generation with Reasoning Models
Summary
The paper introduces MOLRGEN, a novel benchmark and dataset aimed at training and evaluating reasoning-based large language models (LLMs) for de novo molecular generation. Traditional approaches in molecular design often rely on ground-truth labels, which are not available in de novo generation where the goal is to create new molecules without prior knowledge of high-scoring candidates. The authors propose a new evaluation setting that includes a diversity-aware top-k score to assess both the quality and diversity of generated molecules. They demonstrate the effectiveness of their framework by training a 24B parameter LLM using reinforcement learning, showcasing its ability to identify high-reward molecules in unseen chemical spaces. The study highlights the challenges of exploring the chemical space and provides insights into the limitations of current methodologies.
Methodology
The authors created a large-scale dataset comprising 4.5k protein structures and associated molecular property prediction tasks. They employed reinforcement learning to train a 24B parameter LLM, utilizing a scoring function based on generated molecular properties to guide the generation process. The diversity-aware top-k score was introduced to evaluate the generated molecules based on both quality and diversity.
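One plausible reading of a diversity-aware top-k score is greedy MMR-style selection; the paper's exact formula is not reproduced here, so the trade-off parameter `lam` and the similarity callback are assumptions:

```python
def diversity_aware_topk(candidates, scores, sim, k, lam=0.5):
    """Greedy MMR-style selection: repeatedly pick the candidate whose
    score, penalised by its similarity to the already-selected set, is
    highest.  `sim` is any pairwise similarity in [0, 1]; lam trades
    quality against diversity.  A stand-in for the paper's metric."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def gain(i):
            penalty = max((sim(candidates[i], candidates[j]) for j in selected),
                          default=0.0)
            return lam * scores[i] - (1 - lam) * penalty
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]
```

With such a criterion, a slightly lower-scoring but structurally distinct molecule can outrank a near-duplicate of one already selected.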
Results
The study demonstrated that the proposed framework could effectively train LLMs for de novo molecular generation, enabling the identification of high-reward molecules in previously unseen targets. The analysis revealed the limitations of the approach, particularly regarding the exploration of the chemical space and the challenges in generating diverse molecular candidates.
Implications
The findings suggest that reasoning-based LLMs can significantly enhance the de novo molecular generation process, potentially accelerating drug discovery and molecular design. The introduction of a structured evaluation framework may lead to improved methodologies in computational chemistry and related fields.
Epistemic Generative Adversarial Networks
Generative Models
Theory
Interpretability
- Introduction of Epistemic Generative Adversarial Networks (EGANs) to enhance output diversity in GANs.
- Utilization of Dempster-Shafer theory for uncertainty quantification in generative processes.
- Modification of the discriminator to predict belief functions instead of probability distributions.
- Architectural enhancements to the generator for region-wise uncertainty estimation.
Read more
Epistemic Generative Adversarial Networks
Summary
This paper addresses the challenge of output diversity in Generative Adversarial Networks (GANs), which often produce similar samples due to mode collapse. The authors propose a novel approach called Epistemic Generative Adversarial Networks (EGANs) that incorporates Dempster-Shafer theory of evidence into the GAN framework. This approach modifies the loss function for both the generator and discriminator, allowing them to predict belief functions instead of traditional probability distributions. Additionally, the generator is enhanced to estimate region-wise uncertainty for each pixel, enabling it to quantify uncertainty in its outputs. The experimental results demonstrate that EGANs significantly improve generation variability and provide a principled framework for modeling uncertainty in generative processes. This work represents a significant advancement in the field of generative models, offering new methods to create more robust and interpretable outputs.
Methodology
The authors extend the traditional GAN framework by integrating Dempster-Shafer theory, modifying both the generator and discriminator to predict belief functions. The generator is architecturally enhanced to allow for region-wise uncertainty estimation, and a generalized GAN loss function is developed based on belief functions.
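The belief-function output described above can be illustrated with a toy Dempster-Shafer mass assignment over {real, fake}; the paper's actual parameterization may differ:

```python
def belief_and_plausibility(mass):
    """Toy Dempster-Shafer discriminator output.  Instead of one
    probability, mass is split between 'real', 'fake', and the
    ambiguous set 'either' (epistemic uncertainty).  Belief is the
    committed support; plausibility adds whatever is not ruled out."""
    m_real, m_fake, m_either = mass["real"], mass["fake"], mass["either"]
    assert abs(m_real + m_fake + m_either - 1.0) < 1e-9  # masses sum to 1
    belief = {"real": m_real, "fake": m_fake}
    plausibility = {"real": m_real + m_either,
                    "fake": m_fake + m_either}
    return belief, plausibility
```

The gap between plausibility and belief is exactly the unassigned mass, giving the discriminator a way to say "uncertain" rather than forcing a probability.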
Results
The experimental results indicate that EGANs outperform traditional GANs in terms of output diversity and variability. The approach successfully quantifies uncertainty in generated samples, leading to more representative outputs and a better understanding of the generative process.
Implications
The findings suggest that EGANs can be particularly beneficial in applications requiring diverse outputs, such as medical imaging and other fields where uncertainty quantification is crucial. This work opens avenues for further research into robust generative models that can handle uncertainty more effectively.
SINDy-KANs: Sparse identification of non-linear dynamics through Kolmogorov-Arnold networks
Theory
Interpretability
Time Series
- SINDy-KANs combine the strengths of KANs and SINDy for improved interpretability.
- The framework allows for symbolic regression at the activation function level.
- SINDy-KANs facilitate the discovery of sparse, interpretable equations from data.
- The methodology is validated through experiments on various dynamical systems.
Read more
SINDy-KANs: Sparse identification of non-linear dynamics through Kolmogorov-Arnold networks
Summary
The paper introduces SINDy-KANs, a novel framework that integrates Kolmogorov-Arnold networks (KANs) with the Sparse Identification of Nonlinear Dynamics (SINDy) approach to enhance the interpretability of machine learning models. KANs utilize trainable activation functions to approximate multivariate functions, but their outputs can lack interpretability due to complexity. SINDy, on the other hand, is effective in learning sparse equations from data but is limited by its predefined library of functions. SINDy-KANs address these limitations by allowing for symbolic regression at the level of each activation function within KANs, thus enabling the discovery of parsimonious and interpretable equations while maintaining the capacity for complex function compositions. The authors demonstrate the effectiveness of SINDy-KANs through various symbolic regression tasks, showcasing accurate equation discovery across different dynamical systems. This work represents a significant step towards creating interpretable machine learning models that can effectively capture the dynamics of complex systems.
Methodology
The SINDy-KANs framework simultaneously trains a KAN and applies a SINDy-like representation to each activation function. This dual approach allows for the learning of complex function compositions while ensuring that the resulting equations are sparse and interpretable. The methodology involves minimizing a loss function that captures the difference between the KAN's output and the provided data, while also enforcing sparsity in the learned representations.
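The sparse-regression core that SINDy applies (here, per activation function) is sequentially thresholded least squares; this standalone pure-Python sketch illustrates the idea on a generic library matrix, with a hand-rolled normal-equations solver standing in for a linear-algebra library:

```python
def lstsq_normal(A, b):
    """Least squares via normal equations A^T A x = A^T b (Gauss-Jordan)."""
    n = len(A[0])
    M = [[sum(A[k][i] * A[k][j] for k in range(len(A))) for j in range(n)]
         + [sum(A[k][i] * b[k] for k in range(len(A)))] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[col][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def stlsq(A, b, threshold=0.1, iters=5):
    """Sequentially thresholded least squares: fit, zero out small
    coefficients, refit on the surviving library terms.  This yields
    the sparse, interpretable equations SINDy is known for."""
    n = len(A[0])
    active = list(range(n))
    xi = [0.0] * n
    for _ in range(iters):
        sub = [[row[j] for j in active] for row in A]
        coef = lstsq_normal(sub, b)
        xi = [0.0] * n
        for j, c in zip(active, coef):
            xi[j] = c
        active = [j for j in range(n) if abs(xi[j]) >= threshold]
        if not active:
            break
    return xi
```

On data generated by dx/dt = 2x - 0.5x^3 with library [1, x, x^2, x^3], thresholding recovers exactly the two true terms.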
Results
The experiments conducted demonstrate that SINDy-KANs successfully identify accurate and interpretable equations for a variety of dynamical systems. The results indicate that the integration of SINDy with KANs significantly enhances the interpretability of the learned models compared to traditional KAN approaches, which often yield complex and less interpretable outputs.
Implications
The SINDy-KANs framework has significant implications for fields requiring interpretable machine learning models, such as physics, engineering, and biology. By enabling the discovery of sparse and interpretable equations, this approach can facilitate better understanding and control of complex dynamical systems, potentially leading to advancements in model predictive control, system identification, and other applications where interpretability is crucial.
AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models
Reinforcement Learning
Multimodal
Robotics
- AcceRL eliminates synchronization barriers in reinforcement learning, maximizing hardware utilization.
- The framework integrates a trainable world model, improving online sample efficiency by 200x.
- AcceRL achieves state-of-the-art performance on the LIBERO benchmark across all evaluation categories.
- The architecture demonstrates super-linear scaling in throughput with increased computational resources.
Read more
AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models
Summary
The paper introduces AcceRL, a novel asynchronous reinforcement learning (RL) framework designed to enhance the efficiency of Vision-Language-Action (VLA) models. Traditional RL approaches face significant challenges due to synchronization barriers that limit throughput and increase latency. AcceRL addresses these issues by decoupling training, inference, and rollouts, allowing for independent asynchronous streams. A key innovation of AcceRL is the integration of a trainable world model that generates virtual experiences, significantly improving sample efficiency and training stability. The framework demonstrates super-linear scaling in throughput and achieves state-of-the-art performance on the LIBERO benchmark, showcasing its effectiveness in complex control tasks. The results indicate that AcceRL can enhance data utilization and reduce reliance on costly real-world interactions, marking a significant advancement in the field of embodied artificial intelligence.
Methodology
AcceRL employs a fully asynchronous and decoupled architecture that separates training, inference, and sampling processes. It incorporates a plug-and-play world model that generates synthetic experiences, allowing for continuous refinement of both the world model and the policy. This design mitigates the limitations of traditional synchronous RL frameworks, enhancing throughput and efficiency.
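The decoupled core can be sketched with a queue and two threads; real AcceRL adds separate inference workers and the world model, so this only illustrates the absence of a lockstep barrier between rollout and training:

```python
import queue
import threading

def run_decoupled(num_rollouts=8):
    """Minimal producer/consumer sketch of asynchronous RL: a rollout
    worker streams experience into a queue while a learner consumes it,
    with neither side waiting for the other at a synchronisation
    barrier.  The 'experience' and 'gradient step' are stand-ins."""
    buffer = queue.Queue()
    trained = []

    def rollout_worker():
        for step in range(num_rollouts):
            buffer.put({"obs": step, "reward": float(step)})  # fake experience
        buffer.put(None)                                      # end-of-stream marker

    def learner():
        while True:
            item = buffer.get()
            if item is None:
                break
            trained.append(item["reward"])                    # stand-in for a gradient step

    t1 = threading.Thread(target=rollout_worker)
    t2 = threading.Thread(target=learner)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return trained
```

Because the queue buffers experience, a slow learner never stalls the rollout worker and vice versa, which is the throughput argument the framework makes at scale.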
Results
AcceRL achieved state-of-the-art performance on the LIBERO benchmark, demonstrating significant improvements in sample efficiency and training stability. The framework exhibited super-linear scaling in throughput, indicating effective utilization of computational resources. The world model's integration led to a remarkable 200x improvement in online sample efficiency.
Implications
The advancements presented in AcceRL could lead to more efficient training of VLA models, reducing the need for extensive real-world data and interactions. This framework has potential applications in various domains, including robotics, autonomous systems, and any task requiring complex decision-making based on visual and linguistic inputs.
ALIGN: Adversarial Learning for Generalizable Speech Neuroprosthesis
Audio & Speech
- ALIGN is designed to improve the generalization of BCIs for speech decoding across different sessions.
- The framework employs adversarial learning to align session representations and extract invariant features.
- Evaluation results show significant reductions in phoneme and word error rates compared to existing methods.
- The approach addresses the challenges posed by nonstationarities in neural recordings, enhancing long-term usability of BCIs.
Read more
ALIGN: Adversarial Learning for Generalizable Speech Neuroprosthesis
Summary
The paper introduces ALIGN, a novel framework designed to enhance the generalization capabilities of intracortical brain-computer interfaces (BCIs) for speech decoding. Traditional BCIs often struggle with cross-session variability due to factors such as electrode shifts and neural turnover, leading to performance degradation when models are deployed in new sessions. ALIGN addresses this challenge through a multi-domain adversarial learning approach that promotes the extraction of session-invariant features while maintaining task-relevant information. The framework incorporates a feature encoder, a phoneme classifier, and a domain classifier, which are trained jointly using adversarial optimization. This method effectively suppresses session-specific cues, enabling robust performance across different recording sessions. The authors evaluate ALIGN on intracortical speech decoding tasks and demonstrate significant improvements in phoneme and word error rates compared to baseline models, indicating its effectiveness in mitigating session-level distribution shifts and enhancing the reliability of BCIs for longitudinal use.
Methodology
ALIGN utilizes a multi-domain adversarial learning framework that consists of a feature encoder, a phoneme classifier, and a domain classifier. The adversarial optimization encourages the encoder to learn session-invariant features while preserving relevant task information. Additionally, the framework incorporates Temporal Stretch Augmentation (TSA) to enhance robustness against temporal nonstationarities in neural data.
Results
The application of ALIGN resulted in consistent improvements in both phoneme error rate and word error rate when evaluated on intracortical speech decoding tasks. The framework demonstrated superior generalization capabilities to previously unseen sessions, outperforming baseline models and effectively addressing the challenges of session-dependent distribution shifts.
Implications
The findings suggest that ALIGN could significantly enhance the usability of BCIs for individuals with severe communication impairments, reducing the need for frequent recalibration and improving the reliability of speech neuroprostheses in real-world applications.
AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science
Multimodal
- AI agents struggle with domain-specific reasoning and multimodal integration.
- Human expertise is essential for diagnosing modeling failures and making strategic decisions.
- Human-AI collaboration outperforms both humans and AI working independently.
- The benchmark challenges are designed to reward domain-specific insights over generic methods.
Read more
AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science
Summary
The AgentDS Technical Report presents a benchmark and competition aimed at evaluating the performance of AI agents and human-AI collaboration in domain-specific data science tasks. The study introduces 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. The competition involved 29 teams and 80 participants, allowing for a systematic comparison between human-AI collaborative approaches and AI-only baselines. The findings reveal that current AI agents struggle with domain-specific reasoning, often performing below the median of human participants. The strongest solutions emerged from human-AI collaboration, emphasizing the importance of human expertise in data science. The report highlights the limitations of AI in fully automating data science workflows and suggests that effective performance in domain-specific tasks relies on human insight and strategic decision-making. The AgentDS benchmark serves as a tool for further research into enhancing human-AI collaboration in data science.
Methodology
The study utilized a competition format with 17 challenges designed to test domain-specific data science tasks. Participants were encouraged to employ both AI agents and human expertise, with a focus on integrating multimodal data sources. The evaluation framework compared the performance of human-AI collaborations against AI-only approaches.
Results
The results indicated that AI-only solutions performed near or below the median of human participants, while the most successful approaches involved human-AI collaboration. AI agents showed significant limitations in tasks requiring domain-specific reasoning, leading many teams to favor human-guided workflows.
Implications
The findings suggest that while AI can assist in data science, human expertise remains crucial for effective problem formulation and decision-making. This has implications for the design of future AI systems that aim to support rather than replace human data scientists.
Rigorous Error Certification for Neural PDE Solvers: From Empirical Residuals to Solution Guarantees
Theory
- Establishes a theoretical link between residual-based training objectives and solution space error for PINNs.
- Derives generalization bounds that can be computed without access to the true solution.
- Demonstrates the necessity of structural regularity in addition to residual control for reliable neural PDE solutions.
- Provides formal verification of error bounds across multiple PDE types, ensuring empirical stability.
Read more
Rigorous Error Certification for Neural PDE Solvers: From Empirical Residuals to Solution Guarantees
Summary
This paper addresses the challenge of uncertainty quantification in neural PDE solvers, particularly physics-informed neural networks (PINNs), which traditionally rely on minimizing residual losses at collocation points. The authors establish a theoretical framework that connects residual control to solution-space error, proving that when neural approximations are confined within a compact subset of the solution space, a vanishing residual error guarantees convergence to the true solution. They derive both deterministic and probabilistic convergence results, providing certified generalization bounds that translate residual, boundary, and initial errors into explicit solution error guarantees. The paper also includes numerical experiments on various types of PDEs, demonstrating the practical applicability of their theoretical findings and the effectiveness of their error certification approach.
Methodology
The authors utilize compactness arguments to conduct a convergence analysis of PINNs. They derive generalization bounds that relate various error sources (residual, boundary, initial) to the true solution error. The methodology includes formal verification techniques to certify the derived error bounds through numerical experiments on ordinary and partial differential equations.
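A schematic form of the kind of certified bound described above, with C an unspecified constant depending on the PDE and the compact subset, and the ε terms denoting the empirical residual, boundary, and initial errors (notation is illustrative, not the paper's exact statement):

```latex
\|u_\theta - u^\ast\|
\;\le\;
C\,\bigl(\varepsilon_{\mathrm{res}} + \varepsilon_{\mathrm{bdry}} + \varepsilon_{\mathrm{init}}\bigr),
\qquad
\varepsilon_{\mathrm{res}},\,\varepsilon_{\mathrm{bdry}},\,\varepsilon_{\mathrm{init}} \to 0
\;\Longrightarrow\;
u_\theta \to u^\ast .
```

The key point is that every quantity on the right-hand side is computable during training, so the bound certifies the solution error without access to the true solution u*.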
Results
The main results include the establishment of convergence guarantees under compactness assumptions, showing that a vanishing residual error implies convergence to the true solution. The authors provide explicit generalization bounds that can be computed without knowing the true solution, and their numerical experiments validate the effectiveness of these bounds in practice.
Implications
The findings have significant implications for the reliability of neural PDE solvers in scientific computing, engineering, and other fields where PDEs are prevalent. The rigorous error certification framework enhances the trustworthiness of neural network-based solutions, enabling better decision-making based on these models.
Best-of-Both-Worlds Multi-Dueling Bandits: Unified Algorithms for Stochastic and Adversarial Preferences under Condorcet and Borda Objectives
Theory
- Introduces best-of-both-worlds algorithms for multi-dueling bandits under Condorcet and Borda objectives.
- MetaDueling achieves optimal pseudo-regret in adversarial and stochastic settings simultaneously.
- SA-MiDEX adapts from stochastic to adversarial strategies based on observed deviations.
- Establishes matching lower bounds for the proposed algorithms, confirming their optimality.
Read more
Best-of-Both-Worlds Multi-Dueling Bandits: Unified Algorithms for Stochastic and Adversarial Preferences under Condorcet and Borda Objectives
Summary
This paper addresses the challenge of developing a single algorithm that can optimally handle both stochastic and adversarial environments in multi-dueling bandits, a framework where multiple arms are compared, and only the winner is observed. The authors introduce two novel algorithms: MetaDueling for the Condorcet setting and SA-MiDEX for the Borda setting. MetaDueling is a black-box reduction that transforms any dueling bandit algorithm into a multi-dueling bandit algorithm, achieving O(√(KT)) pseudo-regret against adversarial preferences and instance-optimal stochastic pseudo-regret without prior knowledge of the environment. SA-MiDEX employs a stochastic-and-adversarial approach, yielding competitive regret bounds in both stochastic and adversarial settings. The authors also provide matching lower bounds for their results, demonstrating the optimality of their algorithms. This work represents a significant advancement in the field of online learning, particularly in applications such as ranking and recommendation systems, where understanding user preferences is crucial.
Methodology
The authors propose two algorithms: MetaDueling, which reduces multi-dueling to dueling bandits by extracting unbiased pairwise comparison information, and SA-MiDEX, which employs a successive elimination strategy to estimate performance while monitoring for adversarial deviations. The algorithms are evaluated against both stochastic and adversarial preferences, with rigorous proofs of upper and lower bounds for regret.
Results
MetaDueling achieves O(√(KT)) pseudo-regret against adversarial preferences and instance-optimal stochastic pseudo-regret. SA-MiDEX achieves O(K² log KT + K log² T + Σ_{i: Δ_i^B > 0} K log KT / (Δ_i^B)²) regret in stochastic environments and O(K√(T log KT) + K^(1/3) T^(2/3) (log K)^(1/3)) against adversaries. The authors also prove matching lower bounds for both settings, establishing the optimality of their results.
Implications
The proposed algorithms can significantly enhance the performance of ranking and recommendation systems by effectively adapting to varying environments. This work opens avenues for further research in online learning and preference-based systems, potentially leading to more robust and efficient algorithms in real-world applications.
Towards Noise-Resilient Quantum Multi-Armed and Stochastic Linear Bandits
Theory
Optimization
- Introduction of a noise-robust QMC algorithm (BQMC) that improves estimation accuracy in noisy quantum environments.
- Development of noise-resilient quantum bandit algorithms (NR-QUCB and NR-QLinUCB) that maintain performance advantages over classical methods.
- Extensive experimental validation showing improved regret performance under multiple quantum noise models.
Read more
Towards Noise-Resilient Quantum Multi-Armed and Stochastic Linear Bandits
Summary
This paper addresses the challenges of implementing quantum multi-armed bandits (QMAB) and stochastic linear bandits (QSLB) in the presence of noise, which is prevalent in current noisy intermediate-scale quantum (NISQ) devices. The authors propose a noise-robust quantum Monte Carlo (QMC) algorithm based on Bayesian estimation, termed BQMC, which enhances the accuracy of reward estimations in noisy environments. Building on this foundation, they develop two new algorithms: noise-resilient quantum upper confidence bound (NR-QUCB) and noise-resilient quantum LinUCB (NR-QLinUCB). These algorithms maintain logarithmic regret performance while effectively mitigating the impact of noise. The paper includes extensive simulations that demonstrate the improved performance of the proposed algorithms across various quantum noise models, showcasing their potential to outperform classical methods in practical applications.
Methodology
The authors developed a Bayesian estimation-based noise-robust QMC algorithm (BQMC) to enhance reward estimation accuracy in the presence of noise. They then integrated this algorithm into quantum bandit frameworks, resulting in NR-QUCB and NR-QLinUCB algorithms. The performance of these algorithms was evaluated through extensive simulations under various noise models.
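The Bayesian-estimation idea can be illustrated with a classical analogue: a Beta-Bernoulli posterior over a reward rate whose observations are flipped with a known noise probability. The debiasing step and all numbers below are illustrative, not the paper's quantum BQMC routine.

```python
import random

def noisy_bernoulli_estimate(obs, flip_p, a=1.0, b=1.0):
    """Beta-Bernoulli posterior mean, debiased for a known symmetric flip rate."""
    for o in obs:
        a, b = a + o, b + 1 - o
    raw = a / (a + b)                         # posterior mean of the *observed* rate
    return (raw - flip_p) / (1 - 2 * flip_p)  # invert the noise channel

random.seed(0)
mu, flip_p = 0.7, 0.1
obs = []
for _ in range(5000):
    o = 1 if random.random() < mu else 0      # true reward draw
    if random.random() < flip_p:              # toy measurement-noise model
        o = 1 - o
    obs.append(o)
est = noisy_bernoulli_estimate(obs, flip_p)   # close to the true mu = 0.7
```

With flip_p = 0 the correction is the identity and the estimator reduces to the plain posterior mean.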
Results
The proposed noise-resilient quantum bandit algorithms demonstrated significant improvements in estimation accuracy and reduced regret compared to existing quantum bandit methods, particularly in noisy settings. The simulations confirmed that the algorithms maintain logarithmic regret behavior while effectively addressing the challenges posed by NISQ noise.
Implications
The findings suggest that the proposed noise-resilient quantum bandit algorithms could enhance decision-making processes in various applications, such as recommendation systems, online advertising, and adaptive control, by leveraging quantum computing's advantages even in noisy environments.
Attack by Unlearning: Unlearning-Induced Adversarial Attacks on Graph Neural Networks
Graph Learning
Optimization
Theory
- Introduces unlearning corruption attacks that exploit the unlearning process in GNNs.
- Formulates the attack as a bi-level optimization problem to address black-box unlearning challenges.
- Demonstrates that carefully designed unlearning requests can lead to significant accuracy degradation.
- Raises concerns about the robustness of GNNs under real-world regulatory demands.
Read more
Attack by Unlearning: Unlearning-Induced Adversarial Attacks on Graph Neural Networks
Summary
This paper addresses the vulnerabilities of Graph Neural Networks (GNNs) in the context of approximate graph unlearning, which is essential for complying with privacy regulations. The authors introduce the concept of 'unlearning corruption attacks,' where adversaries inject specific nodes into the training graph and subsequently request their deletion. This process exploits the performance degradation that occurs during unlearning, leading to significant accuracy loss in the GNN. The authors formulate this attack as a bi-level optimization problem, using gradient-based updates and a surrogate model to generate pseudo-labels for optimization. Through extensive experiments, they demonstrate that small, strategically crafted unlearning requests can severely degrade model performance, highlighting the urgent need for improved robustness in GNN unlearning methods.
Methodology
The authors propose a novel framework for unlearning corruption attacks, mathematically formalizing the attacker's problem as a bi-level optimization task. They approximate the unlearning process using gradient-based updates and utilize a surrogate model to generate pseudo-labels, enabling the optimization of node injection strategies that maximize performance degradation post-unlearning.
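The bi-level structure can be sketched on a 1-D regression surrogate: the inner level trains on clean-plus-injected data and then approximately unlearns the injected point with one first-order step, while the outer level searches over injections. The model, step sizes, and candidate grid are illustrative, not the paper's GNN formulation.

```python
CLEAN = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]   # clean data, true slope w = 1

def grad(w, data):
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def train(data, lr=0.05, steps=200):
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w, data)
    return w

def unlearn(w, point, lr=0.05):
    # crude approximate unlearning: one gradient-ascent step on the removed point
    return w + lr * grad(w, [point])

def clean_loss(w):
    return sum((w * x - y) ** 2 for x, y in CLEAN) / len(CLEAN)

# Outer level: pick the injected point whose deletion degrades the model most.
candidates = [(x, y) for x in (1.0, 3.0) for y in (-5.0, 5.0)]
worst = max(candidates, key=lambda p: clean_loss(unlearn(train(CLEAN + [p]), p)))
```

Here the attack succeeds because the large approximate-unlearning step for an extreme injected point overshoots, leaving the model worse on clean data than if the point had never been injected.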
Results
The experiments conducted across various benchmarks and unlearning algorithms reveal that even minor, well-crafted unlearning requests can lead to substantial declines in GNN accuracy. This finding underscores the vulnerabilities of current unlearning methods and the potential for adversarial exploitation.
Implications
The findings of this study have significant implications for the design of GNNs and unlearning algorithms, emphasizing the need for enhanced robustness against adversarial attacks. It also raises awareness about the security risks associated with compliance-driven unlearning processes in real-world applications.
From ex(p) to poly: Gaussian Splatting with Polynomial Kernels
Computer Vision
Efficient ML
- Introduction of an N-th-order polynomial kernel approximation for Gaussian Splatting.
- Compatibility with existing datasets while significantly reducing computational overhead.
- Demonstrated performance improvements of 4% to 15% with minimal impact on image quality.
- Formal mathematical derivation proving invariance of anti-aliasing normalization factors.
Read more
From ex(p) to poly: Gaussian Splatting with Polynomial Kernels
Summary
This paper addresses the limitations of existing Gaussian Splatting (3DGS) techniques, which have seen performance improvements through various kernel modifications. However, many of these modifications are incompatible with datasets optimized for the original Gaussian kernel, hindering their widespread adoption. The authors propose a new polynomial kernel approximation that retains compatibility with existing datasets while enhancing computational efficiency. By replacing the original exponential kernel with a polynomial approximation combined with a ReLU function, the authors achieve more aggressive culling of splats, resulting in performance improvements of 4% to 15% without significant degradation in image quality. The paper includes a mathematical analysis of the new kernel, demonstrating its advantages for 3DGS implementations, particularly on NPU hardware. The authors also present a methodology for fitting polynomial coefficients optimized for real-world rendering scenarios, ensuring practical applicability of their approach.
Methodology
The authors developed a polynomial kernel approximation for Gaussian Splatting, focusing on maintaining compatibility with existing datasets. They derived a universal bounding radius for splats and implemented a sampling strategy for fitting polynomial coefficients. The performance of the new kernel was evaluated across multiple 3DGS implementations and rendering APIs, with a focus on optimizing runtime performance and image quality.
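The exp-to-poly substitution can be sketched as follows: the Gaussian falloff exp(-s/2), with s the squared distance in the splat's local frame, is replaced by a ReLU-clamped cubic. The truncated-Taylor coefficients below are for illustration only; the paper fits coefficients to real rendering workloads.

```python
import math

def poly_kernel(s):
    p = 1.0 - s / 2.0 + s * s / 8.0 - s ** 3 / 48.0  # cubic Taylor sketch of exp(-s/2)
    return max(0.0, p)   # ReLU: exact zero at finite s enables aggressive culling

def gaussian_kernel(s):
    return math.exp(-s / 2.0)
```

Unlike the true Gaussian, which never reaches zero and so forces a heuristic cutoff, the clamped polynomial has compact support, which is what allows the more aggressive splat culling reported above.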
Results
The proposed polynomial kernel achieved performance improvements ranging from 4% to 15% in various 3DGS implementations, with negligible effects on image quality. The mathematical analysis confirmed the invariance of anti-aliasing normalization factors across different kernel functions, supporting the robustness of the new approach.
Implications
This work has significant implications for real-time rendering applications, particularly in scenarios where computational efficiency is critical. The compatibility of the new kernel with existing datasets facilitates easier adoption of improved techniques in practical applications, potentially enhancing the performance of 3D rendering systems across various platforms.
Revisiting Label Inference Attacks in Vertical Federated Learning: Why They Are Vulnerable and How to Defend
Federated Learning
Theory
- Existing assumptions about the effectiveness of bottom models in representing labels are misleading.
- Mutual information analysis reveals that bottom models focus on feature extraction, while top models handle label mapping.
- The 'model compensation' phenomenon highlights the vulnerabilities of LIAs in VFL.
- A novel defense technique involving cut layer adjustment significantly reduces LIA attack accuracy.
Read more
Revisiting Label Inference Attacks in Vertical Federated Learning: Why They Are Vulnerable and How to Defend
Summary
This paper addresses the vulnerabilities of label inference attacks (LIAs) in vertical federated learning (VFL), where passive parties attempt to infer private labels held by an active party. The authors challenge the prevailing assumption that well-trained bottom models can effectively represent labels, revealing that the mutual information between layer outputs and labels increases with layer depth. This indicates that bottom models primarily focus on feature extraction, while the top model is responsible for label mapping. The study introduces the 'model compensation' phenomenon, where the increasing number of passive parties leads to a diminished label representation capability in bottom models, thereby amplifying the top model's role in label mapping. To demonstrate the vulnerability of existing LIAs, the authors conduct task reassignment experiments that disrupt the alignment between features and labels, showing that the success of LIAs is largely dependent on this alignment. They propose a novel defense strategy involving the advancement of the cut layer in the model architecture, which effectively reduces LIA attack accuracy and enhances predictive performance. Extensive experiments across multiple datasets and model architectures validate the effectiveness of this defense technique.
Methodology
The authors utilize mutual information analysis to quantify the correlation between layer outputs and labels in VFL. They conduct task reassignment experiments to disrupt feature-label alignment and evaluate the impact on LIA success. A defense technique is proposed that involves advancing the cut layer in the model architecture, and its effectiveness is tested across various datasets and model architectures.
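The mutual-information analysis can be illustrated with a plug-in estimator for discrete samples; this is a generic textbook estimator, not the authors', and the toy samples stand in for (discretized) layer outputs and labels.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for paired discrete samples."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Outputs perfectly aligned with labels carry full label information
# (I = ln 2 for two balanced classes); independent outputs carry none.
aligned = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

In the paper's setting, the claim is that this quantity stays low for bottom-model outputs and rises with layer depth, which is what undermines LIAs that read labels off the bottom model.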
Results
The experiments demonstrate that advancing the cut layer by one layer significantly improves resistance to LIAs, while advancing it by three layers reduces attack accuracy to random guessing levels. The proposed defense technique also enhances the predictive performance of VFL models, indicating its dual benefits for privacy and accuracy.
Implications
The findings suggest that existing LIA methods may be fundamentally flawed due to their reliance on misleading assumptions. The proposed defense technique offers a practical approach to enhance privacy in VFL settings, with potential applications in sensitive data environments where label confidentiality is critical.
Automatic Configuration of LLM Post-Training Pipelines
Large Language Models
Reinforcement Learning
Optimization
- AutoPipe is a budget-aware framework for LLM post-training configuration selection.
- It employs a dataset-conditioned ranking surrogate to provide transferable guidance across datasets.
- The framework adapts online using Bayesian optimization and a Gaussian-process residual surrogate.
- Early-stopping mechanisms are implemented to minimize evaluation costs.
Read more
Automatic Configuration of LLM Post-Training Pipelines
Summary
The paper introduces AutoPipe, a novel framework designed to automate the configuration of post-training pipelines for large language models (LLMs). These pipelines typically involve a combination of supervised fine-tuning (SFT) and reinforcement learning (RL), which can be challenging to optimize due to the high-dimensional and heterogeneous nature of the configuration space. AutoPipe addresses these challenges through a two-stage approach: an offline phase where it learns a dataset-conditioned learning-to-rank surrogate from historical runs, and an online phase where it utilizes this guidance to perform Bayesian optimization tailored to new datasets. The framework also incorporates an early-stop predictor to reduce evaluation costs by assessing candidates based on early training signals. Experimental results demonstrate that AutoPipe outperforms traditional offline-only baselines and achieves competitive performance with leading online hyperparameter optimization methods while utilizing less than 10% of their computational resources.
Methodology
AutoPipe operates in two phases: an offline phase where it learns a ranking surrogate from historical data to guide configuration selection, and an online phase where it applies Bayesian optimization to refine configurations for new datasets. It also uses an early-stop predictor to evaluate candidates based on early training signals, reducing the need for full evaluations.
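The early-stop idea can be sketched as a predictor that extrapolates each configuration's final loss from a least-squares line over its first few checkpoints and ranks candidates by the prediction. Function names and loss curves are assumptions, not AutoPipe's actual interface.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predicted_final_loss(early_losses, final_step=100):
    steps = list(range(1, len(early_losses) + 1))
    slope, intercept = fit_line(steps, early_losses)
    return max(0.0, slope * final_step + intercept)   # losses cannot go negative

candidates = {
    "cfg_a": [2.00, 1.80, 1.70, 1.60],   # improving quickly: keep training
    "cfg_b": [2.00, 1.99, 1.98, 1.97],   # near-flat: early-stop candidate
}
ranked = sorted(candidates, key=lambda c: predicted_final_loss(candidates[c]))
```

Only the top-ranked configurations would receive full evaluations, which is the mechanism behind the reported sub-10% compute budget.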
Results
In experiments focused on biomedical reasoning tasks, AutoPipe consistently outperformed offline-only baselines and matched the performance of the best online hyperparameter optimization methods, achieving these results with less than 10% of the computational cost typically required.
Implications
The development of AutoPipe has significant implications for practitioners working with LLMs, as it offers a more efficient method for configuring post-training pipelines, potentially leading to faster deployment and reduced costs in real-world applications.
Robustness, Cost, and Attack-Surface Concentration in Phishing Detection
Theory
Optimization
- High accuracy in phishing detection does not guarantee robustness against feature manipulation.
- The proposed cost-aware evasion framework reveals critical insights into the economics of feature edits.
- Robustness is significantly influenced by a small number of low-cost features, highlighting the need for strategic feature selection.
- The concept of action-set-limited invariance indicates that improving robustness requires changes to feature representation or cost models.
Read more
Robustness, Cost, and Attack-Surface Concentration in Phishing Detection
Summary
This paper addresses the gap between high accuracy of phishing detection models under i.i.d. evaluation and their robustness against post-deployment feature manipulation. The authors propose a cost-aware evasion framework that models discrete feature edits under attacker budgets, introducing three diagnostics: minimal evasion cost (MEC), evasion survival rate (S(B)), and robustness concentration index (RCI). The study employs the UCI Phishing Websites benchmark, demonstrating that popular classifiers like Logistic Regression and Random Forests achieve high AUC scores but exhibit vulnerabilities when faced with budgeted evasion strategies. The findings reveal that over 80% of successful evasions concentrate on a small number of low-cost features, indicating that robustness is influenced more by feature economics than by model complexity. The paper formalizes a convergence phenomenon where no classifier can improve robustness quantiles without altering the feature representation or cost model, emphasizing the importance of understanding the economic aspects of feature manipulation in adversarial settings.
Methodology
The authors model post-deployment evasion as a shortest-path problem on a directed graph, where nodes represent discrete feature vectors and edges represent admissible monotone manipulations with associated costs. They evaluate classifiers under bounded attacker budgets, focusing on sanitization-style evasion that removes phishing indicators.
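The shortest-path formulation can be sketched with Dijkstra's algorithm over binary feature vectors, where each admissible edit removes one phishing indicator at a cost and the minimal evasion cost (MEC) is the cheapest path to a non-flagged vector. The linear classifier, weights, and edit costs are toy values chosen for illustration.

```python
import heapq

WEIGHTS = [1.0, 1.0, 0.5]   # illustrative indicator weights
BIAS = -1.2
EDIT_COST = [1, 1, 2]       # cost to sanitize each indicator (1 -> 0)

def is_phishing(x):
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS > 0

def minimal_evasion_cost(start):
    """Dijkstra over the graph of monotone sanitization edits."""
    pq, best = [(0, start)], {start: 0}
    while pq:
        cost, x = heapq.heappop(pq)
        if cost > best.get(x, float("inf")):
            continue
        if not is_phishing(x):
            return cost                  # cheapest evading state reached
        for i, xi in enumerate(x):
            if xi == 1:                  # admissible edit: drop indicator i
                y = x[:i] + (0,) + x[i + 1:]
                c = cost + EDIT_COST[i]
                if c < best.get(y, float("inf")):
                    best[y] = c
                    heapq.heappush(pq, (c, y))
    return float("inf")

mec = minimal_evasion_cost((1, 1, 1))    # evades by dropping the two cheap indicators
```

In this toy instance the MEC is 2 and both edits fall on the low-cost features, mirroring the concentration effect the paper reports at scale.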
Results
The study finds that across various models, the median minimal evasion cost (MEC) is 2, with over 80% of successful evasions concentrating on just three low-cost features. The research establishes that if a significant number of phishing instances can be evaded through minimal-cost feature transitions, no classifier can enhance robustness without modifying the feature space or cost model.
Implications
The findings suggest that improving phishing detection systems requires a deeper understanding of feature economics and the strategic selection of features to mitigate vulnerabilities. This research can inform the design of more robust detection systems that account for adversarial manipulation.
MLOW: Interpretable Low-Rank Frequency Magnitude Decomposition of Multiple Effects for Time Series Forecasting
Time Series
- MLOW provides a novel frequency-based decomposition approach for time series forecasting.
- Introduces Hyperplane-NMF to enhance interpretability and efficiency in low-rank decomposition.
- Demonstrates robustness to noise and effective separation of multiple effects in time series.
- Allows for flexible selection of input horizons and frequency levels to address spectral leakage.
Read more
MLOW: Interpretable Low-Rank Frequency Magnitude Decomposition of Multiple Effects for Time Series Forecasting
Summary
The paper introduces MLOW, a novel framework for time series forecasting that focuses on interpretable low-rank frequency magnitude decomposition of multiple effects. Traditional time series forecasting models often struggle with effectively separating trends, seasonal patterns, and residuals due to their reliance on smoothing techniques, which can be sensitive to noise and fail to disentangle intertwined effects. MLOW addresses these challenges by proposing a frequency-based decomposition approach that represents a time series as a magnitude spectrum multiplied by phase-aware basis functions. The authors explore various low-rank methods, including PCA, NMF, and Semi-NMF, and identify their limitations in achieving an interpretable and generalizable decomposition. To overcome these issues, they introduce Hyperplane-NMF, a new low-rank method that ensures interpretability and efficiency. MLOW also incorporates a flexible mechanism for selecting input horizons and frequency levels to mitigate spectral leakage. The framework demonstrates robust performance in hierarchical multiple-effect decomposition and can be integrated into existing time series forecasting models with minimal modifications, leading to significant improvements in forecasting accuracy.
Methodology
MLOW employs a frequency-based decomposition strategy that utilizes Fourier basis expansion to represent time series data as a magnitude spectrum and phase-aware basis functions. It introduces Hyperplane-NMF to achieve low-rank representation while ensuring interpretability and efficiency. The framework allows for flexible input horizon and frequency level selection to mitigate spectral leakage.
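The magnitude-spectrum view can be illustrated with a naive DFT: superposed effects at different frequencies separate into distinct magnitude peaks. This shows the representation MLOW decomposes, not Hyperplane-NMF itself.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) DFT with 1/n normalization."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) / n
            for k in range(n)]

n = 64
# two superposed "effects": amplitude 1.0 at frequency 4, amplitude 0.5 at 10
series = [math.sin(2 * math.pi * 4 * t / n) + 0.5 * math.sin(2 * math.pi * 10 * t / n)
          for t in range(n)]
mags = [abs(c) for c in dft(series)]   # peaks at k = 4 and k = 10 (plus mirrors)
```

A real-amplitude sine of amplitude A contributes magnitude A/2 at its frequency bin, so the two effects appear as peaks of 0.5 and 0.25 while other bins stay near zero.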
Results
MLOW shows significant improvements in forecasting accuracy compared to traditional smoothing-based methods. The visual analysis confirms its capability for interpretable and hierarchical decomposition of multiple effects, demonstrating robustness against noise.
Implications
The proposed MLOW framework can enhance the interpretability and accuracy of time series forecasting across various applications, including demand prediction, financial risk assessment, and environmental monitoring. Its integration into existing models can lead to better performance in real-world scenarios.
Data-efficient pre-training by scaling synthetic megadocs
NLP
Large Language Models
Efficient ML
- Synthetic data augmentation can significantly improve pre-training efficiency in language models.
- The introduction of 'megadocs' enhances data efficiency and model performance.
- Optimal mixing and epoching strategies are crucial for leveraging synthetic data effectively.
- The study demonstrates that improvements in loss scaling are more pronounced with increased synthetic data generation.
Read more
Data-efficient pre-training by scaling synthetic megadocs
Summary
This paper explores the use of synthetic data augmentation to enhance the efficiency of pre-training language models, particularly when real data is limited. The authors propose algorithms that improve loss scaling, demonstrating that mixing web data with synthetically generated rephrases can lower i.i.d. validation loss, even when the synthetic data comes from a different distribution. They introduce the concept of 'megadocs,' which are constructed by stitching synthetic rephrases or inserting rationales into documents, leading to significant improvements in data efficiency. The study shows that using megadocs can increase data efficiency from 1.48× to 1.80× at 32 generations per document, with notable gains in long-context loss and downstream benchmark accuracy. The findings indicate that synthetic data can be effectively utilized to model the original data distribution, providing a pathway for more efficient pre-training in scenarios where data is scarce.
Methodology
The authors conducted experiments using a fixed dataset of web text and a synthetic data generator to evaluate how different configurations of synthetic data (rephrases and megadocs) affect i.i.d. validation loss. They employed two methods for constructing megadocs: stitching rephrases from the same document and inserting rationales to stretch documents. The performance was measured by tracking loss changes as the number of synthetic generations increased.
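The stitching step can be sketched in a few lines: a document and its synthetic rephrases are concatenated into one long training document. The separator and helper name are assumptions, not the paper's exact recipe.

```python
SEP = "\n\n"  # assumed document separator

def stitch_megadoc(original, rephrases):
    """Concatenate a document with its synthetic rephrases into one megadoc."""
    return SEP.join([original] + rephrases)

doc = "The cat sat on the mat."
rephrases = ["A cat was sitting on a mat.", "On the mat, the cat sat."]
megadoc = stitch_megadoc(doc, rephrases)
```

Each additional generation lengthens the megadoc, which is what produces the long-context training signal described above.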
Results
The study found that pre-training with synthetic rephrases improved i.i.d. validation loss, achieving a plateau at 3.41 loss with 32 rephrases compared to a baseline of 3.55. The introduction of megadocs further improved data efficiency to 1.80× at 32 generations, with downstream benchmark accuracy increasing by 6% to 9% over the baseline. The results indicate that the benefits of synthetic data algorithms compound with increased generation counts.
Implications
The findings suggest that leveraging synthetic data can be a viable strategy for training language models in data-constrained environments. This approach could lead to more efficient use of computational resources and better model performance, particularly in applications requiring long-context understanding.
BVSIMC: Bayesian Variable Selection-Guided Inductive Matrix Completion for Improved and Interpretable Drug Discovery
Interpretability
- BVSIMC improves predictive accuracy and interpretability in drug discovery by incorporating variable selection from side features.
- The model utilizes spike-and-slab priors to filter out irrelevant or noisy side information.
- BVSIMC outperforms several existing methods in both synthetic and real-world drug discovery applications.
- The approach reveals clinically meaningful side features that can guide drug development.
Read more
BVSIMC: Bayesian Variable Selection-Guided Inductive Matrix Completion for Improved and Interpretable Drug Discovery
Summary
The paper introduces BVSIMC, a novel Bayesian model designed for drug discovery that incorporates variable selection from side information, such as chemical properties and genomic data. The model addresses the challenges posed by high-dimensional and noisy side features, which can hinder predictive performance. By employing spike-and-slab priors, BVSIMC effectively filters out irrelevant side features, enhancing both predictive accuracy and interpretability. The authors validate their approach through simulation studies and two practical applications: predicting drug resistance in Mycobacterium tuberculosis and identifying new drug-disease associations in drug repositioning. BVSIMC demonstrates superior performance compared to existing state-of-the-art methods, revealing clinically significant side features in the process.
Methodology
BVSIMC employs a Bayesian framework for inductive matrix completion, utilizing spike-and-slab priors for variable selection. This allows the model to perform selective shrinkage of side feature effects, isolating the most relevant features for predicting drug-disease interactions while managing the noise in high-dimensional data.
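The spike-and-slab prior referenced above has the standard generic form below (the π, τ parameterization is the textbook convention, not necessarily the paper's exact notation):

```latex
\beta_j \mid \gamma_j \;\sim\; \gamma_j\,\mathcal{N}(0, \tau^2) \;+\; (1 - \gamma_j)\,\delta_0,
\qquad
\gamma_j \;\sim\; \mathrm{Bernoulli}(\pi),
```

where γ_j = 1 retains side feature j under the Gaussian slab and γ_j = 0 shrinks its effect exactly to zero via the point-mass spike, which is the mechanism that filters out noisy side information.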
Results
The results indicate that BVSIMC significantly outperforms existing methods in terms of prediction accuracy in both simulation studies and real-world applications, specifically in predicting drug resistance in Mycobacterium tuberculosis and in drug repositioning tasks. The model also successfully identifies the most clinically relevant side features.
Implications
The findings suggest that BVSIMC can enhance the efficiency and effectiveness of drug discovery processes, particularly in predicting drug resistance and repositioning existing drugs. The model's ability to filter and interpret side information could lead to more targeted and successful drug development strategies.
Improving RCT-Based Treatment Effect Estimation Under Covariate Mismatch via Calibrated Alignment
Theory
- Introduction of CALM framework for embedding alignment in treatment effect estimation.
- Derivation of finite-sample risk bounds that clarify when embedding alignment is superior to imputation.
- Demonstration of CALM's effectiveness through extensive simulations, particularly in nonlinear settings.
Read more
Improving RCT-Based Treatment Effect Estimation Under Covariate Mismatch via Calibrated Alignment
Summary
This paper addresses the challenge of estimating heterogeneous treatment effects (HTEs) in randomized controlled trials (RCTs) when there is a mismatch in covariates between RCTs and large observational studies (OS). The authors propose a novel framework called CALM (Calibrated ALignment under covariate Mismatch), which avoids traditional imputation methods by learning embeddings that map features from both sources into a common representation space. This allows for the transfer of outcome models from the OS to the RCT embedding space while maintaining causal identification through randomization. The authors derive finite-sample risk bounds that highlight the advantages of embedding alignment over imputation, particularly under linear Gaussian models. Extensive simulations across 51 settings demonstrate that CALM outperforms existing methods, especially in nonlinear scenarios, thus providing a robust approach for CATE estimation in the presence of covariate mismatch.
Methodology
The CALM framework integrates representation learning into the R-OSCAR calibration pipeline. It learns embedding functions to map covariates from RCTs and OS into a shared representation space, allowing for the transfer and calibration of outcome models without direct imputation of missing covariates. The authors also derive risk bounds that account for alignment error and model complexities.
Results
Simulations across 51 settings confirm that CALM's calibration-based approach matches imputation for linear CATEs and significantly outperforms traditional imputation methods in nonlinear scenarios, performing at least as well as existing methods across all tested cases.
Implications
The proposed CALM framework has significant implications for precision medicine, as it enhances the ability to estimate treatment effects in diverse patient populations by effectively integrating data from RCTs and observational studies, thus facilitating more personalized treatment strategies.
Fundamental Limits of Neural Network Sparsification: Evidence from Catastrophic Interpretability Collapse
Interpretability
- Extreme sparsification leads to a significant collapse in local feature interpretability despite stable global representation quality.
- The phenomenon of interpretability collapse is intrinsic to the sparsification process, not limited to specific algorithms or training durations.
- The collapse scales with dataset complexity, indicating more severe interpretability issues for complex real-world data.
Summary
This paper investigates the relationship between extreme neural network sparsification and mechanistic interpretability, particularly focusing on whether interpretable features survive under severe capacity constraints. The authors introduce a novel adaptive sparsity scheduling framework that progressively reduces the number of active neurons in hybrid Variational Autoencoder-Sparse Autoencoder (VAE-SAE) architectures from 500 to 50 over 50 training epochs. The study provides empirical evidence of a paradox where global representation quality, measured by Mutual Information Gap (MIG), remains stable while local feature interpretability collapses significantly. The experiments conducted on two benchmark datasets, dSprites and Shapes3D, reveal alarming rates of dead neurons: 34.4% on dSprites and 62.7% on Shapes3D under Top-k sparsification, and even worse under L1 regularization. The findings suggest that interpretability collapse is intrinsic to the compression process and scales with dataset complexity, indicating that as datasets become more complex, the interpretability of neural networks is likely to degrade further. This work highlights critical concerns for deploying interpretable AI systems in resource-constrained environments, especially as regulatory frameworks increasingly demand explainability.
Methodology
The authors employed a hybrid VAE-SAE architecture and introduced an adaptive sparsity scheduling framework to progressively reduce active neurons. They conducted experiments using both Top-k and L1 sparsification methods across two datasets, measuring global and local interpretability metrics to analyze the effects of extreme sparsification.
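A minimal sketch of Top-k sparsification and the dead-neuron metric, assuming a linear 500 -> 50 schedule and synthetic activations (both illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(1)

def topk_mask(acts, k):
    """Keep each sample's k largest activations, zero the rest."""
    idx = np.argsort(acts, axis=1)[:, -k:]
    mask = np.zeros(acts.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return acts * mask

def k_schedule(epoch, n_epochs=50, k_start=500, k_end=50):
    """Anneal the number of active neurons over training, mirroring
    the paper's 500 -> 50 schedule (the linear shape is our
    assumption)."""
    frac = epoch / (n_epochs - 1)
    return int(round(k_start + frac * (k_end - k_start)))

# Simulated latents: 1000 samples x 500 neurons with heterogeneous
# scales, so some neurons rarely reach the top-k.
scales = rng.uniform(0.05, 1.0, size=500)
acts = np.abs(rng.normal(size=(1000, 500))) * scales

# Dead-neuron rate at the final sparsity level: a neuron is "dead" if
# Top-k never selects it for any sample (the paper considers several
# threshold definitions; this is the simplest one).
sparse = topk_mask(acts, k_schedule(49))
dead_rate = float(1.0 - (sparse != 0).any(axis=0).mean())
```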
Results
The study found that under Top-k sparsification, dead neuron rates reached 34.4% on dSprites and 62.7% on Shapes3D, while L1 regularization resulted in 41.7% and 90.6% dead neurons, respectively. Extended training did not recover dead neurons, and the collapse was consistent across various threshold definitions.
Implications
These findings raise concerns about the viability of deploying interpretable AI in resource-constrained environments, especially as the demand for explainability in AI systems grows. The results suggest that extreme sparsification may compromise the interpretability of neural networks, necessitating further research into maintaining interpretability under compression.
SHAPCA: Consistent and Interpretable Explanations for Machine Learning Models on Spectroscopy Data
Interpretability
- SHAPCA combines PCA and SHAP to improve interpretability of machine learning models on spectroscopy data.
- The framework allows for explanations in the original input space, enhancing practical applicability.
- SHAPCA provides both global and local perspectives on model predictions.
- The method demonstrates improved consistency and stability in feature importance across training runs.
Summary
The paper presents SHAPCA, an innovative framework designed to enhance the interpretability of machine learning models applied to high-dimensional spectroscopy data. Spectroscopy is widely used in various scientific fields, but the complexity and collinearity of the data pose significant challenges for model explainability, particularly in clinical and safety-critical applications. The authors argue that traditional feature extraction methods, such as Principal Component Analysis (PCA), often obscure the connection between model predictions and the original spectral data. SHAPCA addresses this issue by integrating PCA for dimensionality reduction with Shapley Additive Explanations (SHAP) for post-hoc interpretability. This approach allows for the generation of explanations in the original input space, making it easier for practitioners to relate model outputs to biological components. The framework provides both global and local analysis capabilities, revealing the spectral bands that influence overall model behavior and instance-specific predictions. The authors demonstrate that SHAPCA yields more consistent and interpretable results across different training runs, thereby enhancing trust in machine learning applications in spectroscopy.
Methodology
The SHAPCA framework utilizes Principal Component Analysis (PCA) to reduce the dimensionality of spectroscopic data while capturing the correlation structure among features. It then applies Shapley Additive Explanations (SHAP) to quantify the contributions of these latent components to model predictions. The contributions are back-projected to the original feature space, allowing for interpretable explanations that are directly linked to the biological significance of the data.
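The back-projection idea can be sketched with a linear surrogate: for a linear model on PCA scores, exact Shapley values are available in closed form (standing in here for the paper's SHAP estimator), and distributing them through the loadings preserves additivity in the original wavelength space:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "spectra": 200 samples x 40 strongly collinear wavelengths.
base = rng.normal(size=(200, 5))
X = base @ rng.normal(size=(5, 40)) + 0.05 * rng.normal(size=(200, 40))
y = X[:, 3] - 0.5 * X[:, 17] + 0.1 * rng.normal(size=200)

# Step 1: PCA via SVD on centered data.
mu = X.mean(axis=0)
Xc = X - mu
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt[:5].T                         # loadings, shape (40, 5)
scores = Xc @ V                      # latent components

# Step 2: linear model in PCA space. For a linear model the exact
# Shapley value of component c is w_c * (score_c - mean score_c).
w = np.linalg.lstsq(scores, y, rcond=None)[0]
phi_comp = w * (scores - scores.mean(axis=0))

# Step 3: back-project the component attributions to the original
# wavelengths through the loadings, as SHAPCA does; this keeps the
# additivity (efficiency) property exact.
phi_feat = (Xc[:, :, None] * V[None, :, :] * w[None, None, :]).sum(axis=2)
pred = scores @ w
```

The per-wavelength attributions `phi_feat` sum to the same totals as the per-component ones, so explanations can be read directly in the input space.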
Results
Numerical analysis indicates that SHAPCA provides consistent and interpretable explanations across multiple training runs. The framework successfully identifies key spectral bands that influence model predictions, demonstrating enhanced stability in feature importance measures compared to traditional methods.
Implications
The SHAPCA framework has significant implications for the deployment of machine learning models in clinical and safety-critical settings, where understanding model predictions is essential. By improving interpretability, SHAPCA can facilitate the integration of machine learning in automated decision-making systems and enhance trust among practitioners in the biomedical and chemical analysis domains.
Lightweight Adaptation for LLM-based Technical Service Agent: Latent Logic Augmentation and Robust Noise Reduction
NLP
Large Language Models
Reinforcement Learning
- Introduces Latent Logic Augmentation to enhance decision-making capabilities in LLMs.
- Develops a Multiple Ground Truths dataset to reduce noise and capture semantic diversity.
- Presents a Hybrid Reward mechanism for efficient reinforcement learning.
- Demonstrates improved stability and performance in real-world Cloud service tasks.
Summary
This paper addresses the challenges of adapting Large Language Models (LLMs) for complex technical service domains, which are hindered by the lack of explicit cognitive chains in human demonstrations and the ambiguity of valid responses. The authors propose a lightweight adaptation framework that includes three main contributions: (1) Latent Logic Augmentation, which enhances the model's ability to internalize decision-making processes through Planning-Aware Trajectory Modeling and Decision Reasoning Augmentation; (2) Robust Noise Reduction, which creates a Multiple Ground Truths dataset using a dual-filtering method to capture semantic diversity and reduce noise; and (3) Lightweight Adaptation, which introduces a Hybrid Reward mechanism that combines an LLM-based judge with a lightweight relevance-based Reranker to provide high-fidelity reward signals while minimizing computational costs. Empirical evaluations on real-world Cloud service tasks demonstrate that this framework achieves improved stability and performance, with the Hybrid Reward mechanism aligning closely with traditional methods but requiring less training time, highlighting its practical value for deploying technical service agents.
Methodology
The methodology involves three main components: (1) Latent Logic Augmentation through Planning-Aware Trajectory Modeling and Decision Reasoning Augmentation to instill decision logic; (2) Robust Noise Reduction by constructing a Multiple Ground Truths dataset using dual-filtering to validate diverse responses; and (3) Lightweight Adaptation via a Hybrid Reward mechanism that integrates an LLM-based judge with a relevance-based Reranker for efficient reward signal generation.
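A minimal sketch of the reward blending, with a cosine-similarity reranker over the Multiple Ground Truths set and an assumed mixing weight `alpha` (both illustrative; the paper's exact combination rule and reranker are not reproduced here):

```python
import numpy as np

def hybrid_reward(judge_score, relevance, alpha=0.5):
    """Blend an LLM-judge score with a lightweight reranker's
    relevance score. alpha and the linear blend are illustrative
    assumptions, not the paper's exact rule."""
    return alpha * judge_score + (1.0 - alpha) * relevance

def relevance_score(response_vec, reference_vecs):
    """Stand-in reranker: max cosine similarity between a response
    embedding and the validated ground-truth embeddings."""
    ref = np.asarray(reference_vecs, dtype=float)
    r = np.asarray(response_vec, dtype=float)
    sims = ref @ r / (np.linalg.norm(ref, axis=1) * np.linalg.norm(r))
    return float(sims.max())

refs = [[1.0, 0.0], [0.7, 0.7]]
r_good = hybrid_reward(0.9, relevance_score([0.9, 0.1], refs))
r_bad = hybrid_reward(0.2, relevance_score([-1.0, 0.2], refs))
```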
Results
The proposed framework was empirically validated on real-world Cloud service tasks, showing significant improvements in stability and performance. The Hybrid Reward mechanism provided alignment comparable to traditional LLM-as-a-judge methods while reducing training time, indicating its effectiveness and efficiency.
Implications
The findings suggest that the proposed lightweight adaptation framework can be effectively utilized in deploying technical service agents, enhancing their performance in complex domains without incurring high computational costs. This could lead to broader applications in various technical service environments.
GAPSL: A Gradient-Aligned Parallel Split Learning on Heterogeneous Data
Federated Learning
Efficient ML
Optimization
- GAPSL mitigates gradient directional inconsistency in parallel split learning.
- The framework includes Leader Gradient Identification (LGI) and Gradient Direction Alignment (GDA) components.
- GAPSL outperforms state-of-the-art methods in training accuracy and latency.
- The approach is particularly beneficial for resource-constrained client devices in federated learning scenarios.
Summary
The paper introduces GAPSL, a novel framework designed to enhance parallel split learning (PSL) in the context of federated learning (FL) on heterogeneous data. PSL has emerged as a solution to the computational burdens faced by resource-constrained client devices by offloading significant workloads to a server. However, it suffers from severe training divergence due to gradient directional inconsistencies among clients, which can lead to poor model convergence. GAPSL addresses this issue through two main components: Leader Gradient Identification (LGI) and Gradient Direction Alignment (GDA). LGI dynamically selects a subset of consistent client gradients to form a leader gradient that reflects the global convergence trend. GDA applies direction-aware regularization to align each client's gradient with the leader gradient, thus reducing inconsistencies and improving convergence. The authors evaluate GAPSL on a prototype computing testbed, demonstrating its superiority over existing benchmarks in terms of training accuracy and latency, making it a promising approach for federated learning applications in heterogeneous environments.
Methodology
GAPSL employs a two-pronged approach: first, it uses LGI to identify a leader gradient from a selection of directionally consistent client gradients. Second, it implements GDA, which applies a direction-aware regularization technique to align each client's gradient with the identified leader gradient, thereby addressing the issue of gradient divergence during training.
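The two components can be sketched as follows; the cosine-based consistency score, the kept fraction, and the regularization strength `lam` are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

def leader_gradient(grads, keep=0.5):
    """LGI (sketch): score each client gradient by its mean cosine
    similarity to the others, keep the most consistent fraction,
    and average them into a leader gradient."""
    G = np.asarray(grads, dtype=float)
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
    cos = Gn @ Gn.T
    score = (cos.sum(axis=1) - 1.0) / (len(G) - 1)   # exclude self
    idx = np.argsort(score)[-max(1, int(round(keep * len(G)))):]
    return G[idx].mean(axis=0)

def align(grad, leader, lam=0.5):
    """GDA (sketch): pull a client gradient toward the leader
    direction while preserving its magnitude; lam is an assumed
    regularization strength."""
    leader_dir = leader / np.linalg.norm(leader)
    return (1 - lam) * grad + lam * np.linalg.norm(grad) * leader_dir

# Three directionally consistent clients and one divergent client.
grads = [np.array([1.0, 0.1]), np.array([0.9, -0.1]),
         np.array([1.1, 0.0]), np.array([-1.0, 0.0])]
g_lead = leader_gradient(grads)
aligned = [align(g, g_lead) for g in grads]
```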
Results
Extensive experiments conducted on a prototype computing testbed show that GAPSL consistently achieves higher training accuracy and lower latency compared to existing benchmarks, indicating improved model convergence and efficiency in federated learning settings.
Implications
The findings suggest that GAPSL can significantly enhance the performance of federated learning systems, particularly in environments with heterogeneous data and resource-constrained devices. This has potential applications in various IoT scenarios where data privacy and efficient computation are critical.
Neural Galerkin Normalizing Flow for Transition Probability Density Functions of Diffusion Models
Generative Models
Theory
Optimization
- Introduces Neural Galerkin Normalizing Flow (NGNF) for approximating TPDFs of diffusion processes.
- Ensures structural integrity of the solution through Normalizing Flows, maintaining positivity and mass conservation.
- Derives a system of ODEs for time evolution of flow parameters, enhancing the learning of causal relationships.
- Utilizes adaptive sampling to effectively evaluate Fokker-Planck residuals in high-dimensional PDEs.
Summary
This paper introduces a novel framework called Neural Galerkin Normalizing Flow (NGNF) aimed at approximating the transition probability density function (TPDF) of diffusion processes by solving the Fokker-Planck equation with an atomic initial distribution. The authors leverage Normalizing Flows to represent the solution as a transformation of a reference stochastic process, ensuring that the approximation adheres to structural constraints such as positivity and mass conservation. By extending Neural Galerkin methods to Normalizing Flows, the authors derive a system of ordinary differential equations (ODEs) that govern the time evolution of the flow parameters. The methodology employs adaptive sampling techniques to effectively evaluate the Fokker-Planck residual in high-dimensional spaces, which is crucial for addressing complex PDEs. The numerical experiments demonstrate that the NGNF framework successfully captures essential features of the true solution while maintaining the causal relationship between the initial condition and the density function over time. The offline training phase of the model allows for significantly more efficient online evaluations compared to traditional PDE solving methods, making it a promising surrogate model for various applications involving stochastic differential equations, including Bayesian inference and simulation.
Methodology
The methodology involves using Neural Galerkin schemes to derive ODEs that describe the time evolution of Normalizing Flow parameters. The framework incorporates adaptive sampling routines to evaluate the Fokker-Planck residual, ensuring that the solution remains accurate in high-dimensional spaces. The Normalizing Flow is designed to be the identity at the initial time, allowing for accurate propagation of the initial density.
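The parameter-ODE idea can be illustrated on a case with a known answer: for an Ornstein-Uhlenbeck process started at an atom, the transition density stays Gaussian, so the Fokker-Planck equation collapses to ODEs for the mean and variance (a Gaussian ansatz standing in for the paper's Normalizing Flow):

```python
import numpy as np

theta, sigma, x0 = 1.0, 0.8, 2.0

# For dX = -theta*X dt + sigma dW with X(0) = x0 (atomic initial
# condition), the Fokker-Planck equation reduces to parameter ODEs:
#   m'(t) = -theta * m(t),            m(0) = x0
#   v'(t) = sigma**2 - 2*theta*v(t),  v(0) = 0
# Euler-integrate the parameter ODEs (the NGNF analogue evolves the
# flow parameters instead).
m, v, dt, T = x0, 0.0, 1e-4, 1.0
for _ in range(int(T / dt)):
    m += dt * (-theta * m)
    v += dt * (sigma**2 - 2 * theta * v)

# Closed-form transition density parameters for comparison.
m_exact = x0 * np.exp(-theta * T)
v_exact = sigma**2 / (2 * theta) * (1 - np.exp(-2 * theta * T))
```

The integrated parameters track the exact ones, mirroring how NGNF propagates the initial density forward through parameter ODEs rather than solving the full PDE on a grid.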
Results
The numerical results indicate that the NGNF framework effectively captures key features of the true transition probability density function, enforcing the causal relationship between the initial condition and the subsequent density function. The method shows improved performance in high-dimensional settings and offers a more efficient alternative for online evaluations after an initial training phase.
Implications
The proposed NGNF framework has significant implications for various applications involving stochastic differential equations, such as Bayesian inference, simulation, and diffusion bridge generation. Its efficiency and accuracy make it a valuable tool for solving complex PDEs in practical scenarios.
Enactor: From Traffic Simulators to Surrogate World Models
Generative Models
Reinforcement Learning
Robotics
- Introduction of Enactor, a transformer-based generative model for traffic simulation.
- Model captures complex actor-actor interactions and generates physically consistent trajectories.
- Demonstrated effectiveness in a live simulation setting with significant performance improvements over traditional models.
- Requires fewer training samples compared to conventional agent-centric approaches.
Summary
The paper presents 'Enactor', a novel generative model designed to enhance traffic microsimulation by addressing the limitations of traditional behavior models in capturing realistic actor interactions at traffic intersections. While existing microsimulators like SUMO provide a framework for evaluating road network performance, they often rely on simplistic models that fail to accurately represent complex interactions among vehicles and pedestrians. The authors propose a transformer-based architecture that focuses on actor-centric generative modeling, allowing for the generation of physically grounded trajectories based on learned behaviors from data. The model is tested in a 'simulation-in-the-loop' setting, where it controls actor dynamics over 40,000 timesteps. Results indicate that Enactor effectively captures complex interactions and generates long-horizon, consistent trajectories, outperforming baseline models in key traffic engineering metrics, including a significant reduction in KL-Divergence. This work highlights the potential of integrating deep learning with traffic simulation to improve urban traffic modeling and planning.
Methodology
The authors developed an actor-centric generative model using a transformer-based architecture, which captures actor interactions and the geometry of traffic intersections. The model was tested in a simulation-in-the-loop framework, where initial conditions were generated using SUMO, and the model controlled actor dynamics over an extended period.
Results
The experimental results showed that Enactor effectively captures complex interactions among actors and generates long-horizon, physically consistent trajectories. It significantly outperformed baseline models across traffic engineering metrics, achieving more than a 10x reduction in KL-Divergence while requiring fewer training samples.
Implications
The findings suggest that integrating advanced generative models with traffic microsimulation can lead to more accurate and reliable urban traffic modeling. This could enhance the evaluation of infrastructure changes and traffic management strategies, ultimately improving urban mobility and safety.
HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning
Reinforcement Learning
Large Language Models
- HISR improves credit assignment in multi-turn RL by aligning rewards with sub-goals.
- A segment-level process reward model is introduced to avoid overly fine-grained reward allocation.
- The hindsight model captures action importance based on trajectory outcomes.
- Extensive experiments show HISR achieves state-of-the-art performance on agentic benchmarks.
Summary
The paper introduces HISR (Hindsight Information Modulated Segmental Process Rewards), a novel approach aimed at improving the performance of large language models (LLMs) in complex long-horizon decision-making tasks through multi-turn reinforcement learning (RL). Traditional reward models struggle with delayed rewards and unreliable credit assignment, particularly in tasks requiring multiple sub-goals. HISR addresses these challenges by aligning rewards with sub-goals and emphasizing significant segments of the task trajectory. The authors propose a segment-level process reward model that assigns rewards to each sub-goal, avoiding overly granular turn-level rewards. Additionally, a hindsight model is developed to assess action importance based on the likelihood of actions after knowing the trajectory outcomes. This model aggregates segment importance scores to modulate the segmental process rewards, thereby enhancing credit assignment reliability. The effectiveness of HISR is validated through extensive experiments on three publicly available benchmarks, demonstrating state-of-the-art performance in agentic tasks.
Methodology
The HISR approach utilizes a segment-level process reward model to assign rewards for sub-goals, thereby avoiding the pitfalls of turn-level granularity. A hindsight model is employed to evaluate the likelihood of actions post-outcome, which helps in determining the importance of actions. The ratios of sequence likelihoods between the hindsight and policy models are calculated to aggregate segment importance scores, which modulate the segmental process rewards.
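A toy sketch of the modulation step, assuming per-token log-probabilities from both models; the softmax normalization of the segment ratios is our assumption, not necessarily the paper's exact aggregation:

```python
import numpy as np

def modulated_rewards(logp_policy, logp_hindsight, seg_bounds, seg_rewards):
    """Sketch of HISR's modulation: per-segment importance is the
    likelihood ratio between a hindsight model (conditioned on the
    outcome) and the policy, aggregated over the segment's tokens,
    then used to re-weight the segmental process rewards."""
    lp = np.asarray(logp_policy, dtype=float)
    lh = np.asarray(logp_hindsight, dtype=float)
    ratios = np.array([np.sum(lh[a:b] - lp[a:b]) for a, b in seg_bounds])
    weights = np.exp(ratios - ratios.max())
    weights = weights / weights.sum() * len(seg_bounds)  # mean weight = 1
    return weights * np.asarray(seg_rewards, dtype=float)

# Two segments; the hindsight model finds segment 0's actions far more
# likely once the outcome is known, so it gets up-weighted.
logp_policy = [-2.0, -2.0, -2.0, -2.0]
logp_hindsight = [-0.5, -0.5, -2.0, -2.0]
rew = modulated_rewards(logp_policy, logp_hindsight, [(0, 2), (2, 4)], [1.0, 1.0])
```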
Results
The experimental results indicate that HISR outperforms existing methods on three publicly available agentic benchmarks, achieving state-of-the-art performance. The case studies further validate the effectiveness of the proposed reward modulation strategy.
Implications
The HISR framework has significant implications for enhancing the capabilities of LLMs in complex decision-making scenarios, potentially leading to more effective applications in areas such as household assistance and other multi-turn tasks requiring agentic behavior.
A Family of Adaptive Activation Functions for Mitigating Failure Modes in Physics-Informed Neural Networks
Theory
Optimization
Efficient ML
- Introduction of adaptive wavelet-based activation functions to improve PINNs.
- Systematic evaluation across multiple PDE classes shows enhanced performance.
- Demonstrated robustness and accuracy over traditional activation functions.
- Validated against various models including PINNsFormer and other deep learning architectures.
Summary
This paper addresses the common failure modes in Physics-Informed Neural Networks (PINNs) by introducing a novel family of adaptive wavelet-based activation functions. The proposed functions enhance training stability and expressive power by integrating trainable wavelet functions with traditional activation functions like hyperbolic tangent and softplus. Five distinct activation functions are developed and evaluated within the PINN framework across four classes of partial differential equations (PDEs). The results demonstrate improved robustness and accuracy compared to conventional activation functions. The effectiveness of the proposed approach is validated through comparisons with baseline PINNs, transformer-based architectures, and other deep learning models, showcasing its generality and applicability in scientific computing.
Methodology
The study develops five adaptive activation functions that combine wavelet functions with either trainable or fixed traditional activation functions. These functions are systematically evaluated in the context of PINNs applied to four representative PDE classes. The performance is compared using comprehensive bar plots and direct comparisons with existing models.
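One plausible member of such a family, shown for illustration only: tanh plus a trainable Morlet-like wavelet term, where `a` and `b` play the role of the trainable wavelet parameters (the paper's exact functional forms may differ):

```python
import numpy as np

def wavelet_tanh(x, a=1.0, b=1.0):
    """Illustrative adaptive activation: hyperbolic tangent plus a
    trainable wavelet term a * x * exp(-b * x**2). Setting a = 0
    recovers plain tanh; the wavelet term adds localized oscillatory
    capacity near the origin."""
    return np.tanh(x) + a * x * np.exp(-b * x**2)

x = np.linspace(-4, 4, 9)
y0 = wavelet_tanh(x, a=0.0)          # a = 0 -> plain tanh
y1 = wavelet_tanh(x, a=0.5, b=0.8)   # wavelet-augmented variant
```

In a PINN these parameters would be optimized alongside the network weights, letting each layer adapt its activation shape to the PDE residual landscape.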
Results
The proposed adaptive activation functions significantly improve the training stability and accuracy of PINNs. The evaluations reveal that these functions outperform traditional activation functions, demonstrating enhanced robustness in handling complex PDEs and mitigating common failure modes associated with standard PINN architectures.
Implications
The findings suggest that integrating wavelet-based activation functions can lead to more effective and reliable PINN models, which could be beneficial in various applications involving complex physical phenomena, such as fluid dynamics, medical imaging, and other scientific computing tasks.
HEP Statistical Inference for UAV Fault Detection: CLs, LRT, and SBI Applied to Blade Damage
Robotics
Time Series
Theory
- Introduces a unified inference framework for UAV fault detection using HEP statistical methods.
- Achieves high detection rates and low false alarm rates through the application of LRT, CLs, and SNPE.
- Demonstrates superior performance compared to traditional fault detection methods like CUSUM and autoencoders.
- Provides calibrated uncertainty estimates for fault severity, enhancing decision-making for operators.
Summary
This paper presents a novel approach to UAV fault detection by applying statistical methods from high-energy physics (HEP) to detect blade damage in multirotor UAVs. The study focuses on three key statistical techniques: the likelihood ratio test (LRT) for binary detection, the CLs method for controlling false alarm rates, and sequential neural posterior estimation (SNPE) for characterizing fault severity. By analyzing spectral features related to rotor harmonics, the proposed system provides three outputs: binary detection, controlled false alarm rates, and calibrated posteriors indicating fault severity and motor location. The methodology was validated using the UAV-FD dataset, which includes 18 real flights with varying levels of blade damage. The results showed an area under the curve (AUC) of 0.862 ± 0.007, significantly outperforming traditional methods like CUSUM and autoencoders. Additionally, the system achieved a 93% detection rate for significant blade damage and an 81% rate for subtle damage at a 5% false alarm rate. On a quadrotor platform (PADRE), the AUC reached 0.986 after model refitting. The SNPE method provided a full posterior over fault severity, offering operators a continuous estimate of damage with calibrated uncertainty, which is a significant advancement over existing UAV fault detection methods.
Methodology
The methodology involves applying three statistical techniques from high-energy physics: the likelihood ratio test (LRT) for binary detection, the CLs method for controlling false alarm rates, and sequential neural posterior estimation (SNPE) for fault characterization. The system analyzes spectral features related to rotor harmonics and employs leave-one-flight-out cross-validation to ensure robust evaluation.
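The CLs construction can be sketched with a simple counting experiment standing in for the paper's spectral likelihood-ratio statistic: CLs divides the signal-plus-background p-value by the background-only one, which protects against excluding faults the data has no power to see:

```python
import numpy as np

rng = np.random.default_rng(4)

def cls_value(n_obs, b, s, n_toys=200_000):
    """CLs for a Poisson counting experiment via toy Monte Carlo:
    CLs = P(n <= n_obs | s+b) / P(n <= n_obs | b-only).
    A fault hypothesis is excluded at 95% CL when CLs < 0.05."""
    toys_sb = rng.poisson(s + b, size=n_toys)
    toys_b = rng.poisson(b, size=n_toys)
    return float(np.mean(toys_sb <= n_obs) / np.mean(toys_b <= n_obs))

# Observing roughly the background expectation excludes a large
# "fault signal" (small CLs) but not a tiny one (CLs near 1).
cls_big = cls_value(n_obs=10, b=10.0, s=15.0)
cls_small = cls_value(n_obs=10, b=10.0, s=0.5)
```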
Results
The proposed system achieved an AUC of 0.862 ± 0.007 on the UAV-FD dataset, outperforming CUSUM (0.708 ± 0.010) and autoencoder methods (0.753 ± 0.009). At a 5% false alarm rate, it detected 93% of significant and 81% of subtle blade damage. On the PADRE dataset, the AUC reached 0.986 after model refitting, with SNPE providing a 90% credible interval coverage of 92-100% and a mean absolute error of 0.012.
Implications
The findings suggest that HEP statistical methods can significantly enhance UAV fault detection capabilities, particularly in safety-critical applications. The ability to provide calibrated uncertainty estimates allows operators to make more informed decisions regarding UAV maintenance and operation, potentially improving safety and efficiency in various UAV applications.
An Optimised Greedy-Weighted Ensemble Framework for Financial Loan Default Prediction
Optimization
Interpretability
- Proposes a dynamic ensemble framework for loan default prediction that adapts to changing data conditions.
- Utilizes Particle Swarm Optimization for hyperparameter tuning of multiple classifiers.
- Achieves significant improvements in predictive performance over traditional models and static ensemble methods.
- Identifies key features influencing loan defaults, enhancing interpretability of the model.
Summary
This paper addresses the challenge of accurately predicting loan defaults in the context of credit risk management, particularly given the complexities of modern financial datasets characterized by nonlinear relationships and class imbalances. The authors propose an Optimised Greedy-Weighted Ensemble framework that dynamically allocates model weights based on empirical predictive performance, integrating multiple machine learning classifiers. The hyperparameters of these classifiers are optimized using Particle Swarm Optimization, and predictions are combined through a regularized greedy weighting mechanism. Additionally, a neural-network-based meta-learner is employed within a stacked-ensemble architecture to capture higher-order relationships among model outputs. Experiments conducted on the Lending Club dataset demonstrate that the proposed framework significantly outperforms individual classifiers, achieving an Area Under the Curve (AUC) of 0.80, a macro-average F1-score of 0.73, and a default recall of 0.81. The study also identifies key predictors of loan default, such as revolving utilization, annual income, and debt-to-income ratio, highlighting the importance of performance-driven ensemble weighting in enhancing predictive accuracy and interpretability in credit risk modeling.
Methodology
The study employs an Optimised Greedy-Weighted Ensemble framework that integrates multiple machine learning classifiers. Hyperparameters are optimized using Particle Swarm Optimization, and model predictions are combined through a regularized greedy weighting mechanism. A neural-network-based meta-learner is used in a stacked-ensemble architecture to capture complex relationships among model outputs.
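A common concrete instance of greedy weighting is Caruana-style ensemble selection, sketched here as a stand-in for the paper's regularized mechanism: repeatedly add (with replacement) the model whose inclusion most lowers validation loss, and read the weights off the selection counts:

```python
import numpy as np

def greedy_weights(preds, y, n_rounds=50):
    """Greedy ensemble selection (sketch): at each round, add with
    replacement the model whose inclusion most lowers validation MSE;
    final weights are normalized selection counts."""
    preds = np.asarray(preds, dtype=float)      # (n_models, n_samples)
    counts = np.zeros(len(preds))
    ens = np.zeros_like(y, dtype=float)
    for t in range(1, n_rounds + 1):
        losses = [np.mean(((ens * (t - 1) + p) / t - y) ** 2) for p in preds]
        j = int(np.argmin(losses))
        counts[j] += 1
        ens = (ens * (t - 1) + preds[j]) / t
    return counts / counts.sum()

rng = np.random.default_rng(5)
y = rng.normal(size=200)
preds = np.stack([y + 0.3 * rng.normal(size=200),   # good model
                  y + 0.3 * rng.normal(size=200),   # good model
                  rng.normal(size=200)])            # uninformative model
w = greedy_weights(preds, y)
```

The selection loop is a simple performance-driven allocator: accurate models accumulate weight while the uninformative one is rarely, if ever, chosen.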
Results
The proposed BlendNet ensemble achieved an AUC of 0.80, a macro-average F1-score of 0.73, and a default recall of 0.81 on the Lending Club dataset. Calibration analysis indicated that tree-based ensembles like Extra Trees and Gradient Boosting provided reliable probability estimates, while the stacked ensemble demonstrated superior ranking capability.
Implications
The findings suggest that the proposed framework can enhance predictive accuracy and interpretability in credit risk modeling, providing a scalable, data-driven approach for financial institutions in credit assessment and risk monitoring.
Taming Epilepsy: Mean Field Control of Whole-Brain Dynamics
Graph Learning
Optimization
Theory
- Introduction of the GK-MFG framework for controlling epileptic seizures.
- Integration of Reservoir Computing and graph-theoretic modeling for enhanced robustness.
- Use of graph Laplacian constraints to respect the brain's functional topology.
- Demonstration of the framework's effectiveness in suppressing seizures in high-dimensional networks.
Summary
This paper addresses the challenge of controlling high-dimensional neural dynamics during epileptic seizures, which is complicated by the brain's nonlinear characteristics and complex connectivity. The authors propose a novel framework called Graph-Regularized Koopman Mean-Field Game (GK-MFG), which integrates Reservoir Computing (RC) for approximating the Koopman operator with an Alternating Population and Agent Control Network (APAC-Net) to solve distributional control problems. By embedding Electroencephalogram (EEG) dynamics into a linear latent space and applying graph Laplacian constraints derived from the Phase Locking Value (PLV), the GK-MFG framework achieves robust seizure suppression while maintaining the functional topological structure of the brain. The study emphasizes the importance of graph-theoretic modeling, reservoir computing, and mean field optimal control in effectively managing the dynamics of complex stochastic systems like the brain, ultimately leading to improved outcomes in epilepsy treatment.
Methodology
The GK-MFG framework is developed through a logical chain that first establishes a mean field distribution control objective using APAC-Net, incorporates graph Laplacian regularization to embed the brain's physical connectivity into the control objective, and utilizes the RC-Koopman operator to address computational challenges in high-frequency nonlinear predictions.
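The graph-regularization ingredient can be sketched directly: build a Laplacian from a PLV connectivity matrix and penalize control inputs that differ across strongly phase-locked nodes (the quadratic form is the standard choice; the paper's exact weighting may differ):

```python
import numpy as np

def graph_laplacian(W):
    """Combinatorial Laplacian L = D - W of a symmetric functional
    connectivity graph, e.g. one built from Phase Locking Values."""
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W

def smoothness_penalty(x, L):
    """x^T L x = (1/2) * sum_ij W_ij (x_i - x_j)^2: the regularizer
    that keeps control inputs consistent with the brain's functional
    topology."""
    return float(x @ L @ x)

# Toy 3-node PLV graph: nodes 0 and 1 are strongly phase-locked.
plv = np.array([[0.0, 0.9, 0.1],
                [0.9, 0.0, 0.1],
                [0.1, 0.1, 0.0]])
L = graph_laplacian(plv)

u_aligned = np.array([1.0, 1.0, 0.0])     # respects the 0-1 coupling
u_misaligned = np.array([1.0, -1.0, 0.0]) # fights it
p_aligned = smoothness_penalty(u_aligned, L)
p_misaligned = smoothness_penalty(u_misaligned, L)
```

Adding this penalty to the mean-field control objective steers the optimizer toward control fields that vary smoothly over the functional graph.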
Results
The proposed GK-MFG framework demonstrates unprecedented robustness and accuracy in suppressing seizures across high-dimensional epileptic networks, effectively managing the complex dynamics of brain activity during seizures.
Implications
The findings suggest that the GK-MFG framework could significantly advance the treatment of epilepsy by providing a more effective means of controlling neural dynamics, potentially leading to improved therapeutic strategies and better patient outcomes.
InfoMamba: An Attention-Free Hybrid Mamba-Transformer Model
Efficient ML
Computer Vision
NLP
- Introduces InfoMamba, an attention-free hybrid architecture for sequence modeling.
- Balances fine-grained local interactions with long-range dependencies without quadratic complexity.
- Employs a concept-bottleneck linear filtering layer and information-maximizing fusion for dynamic context integration.
- Demonstrates superior performance compared to existing Transformer and SSM models across multiple benchmarks.
InfoMamba: An Attention-Free Hybrid Mamba-Transformer Model
Summary
The paper presents InfoMamba, a novel attention-free hybrid architecture that addresses the challenges of sequence modeling by balancing fine-grained local modeling with long-range dependency capture. Traditional Transformers excel in token mixing but suffer from quadratic complexity, while Mamba-style selective state-space models (SSMs) scale linearly but struggle with high-rank global interactions. The authors conduct a consistency boundary analysis to identify the regimes where diagonal short-memory SSMs can approximate causal attention and highlight existing structural gaps. InfoMamba replaces self-attention with a concept-bottleneck linear filtering layer, which serves as a minimal-bandwidth global interface, and integrates it with a selective recurrent stream through information-maximizing fusion (IMF). This approach allows for dynamic global context injection into SSM dynamics while enforcing complementary information usage through a mutual-information-inspired objective. Extensive experiments demonstrate that InfoMamba outperforms state-of-the-art Transformer and SSM baselines across various tasks, achieving a favorable accuracy-efficiency trade-off with near-linear scaling.
Methodology
The methodology involves a consistency boundary analysis to identify the limitations of existing models, followed by the design of InfoMamba, which integrates a concept-bottleneck linear filtering layer with a selective recurrent stream. The information-maximizing fusion (IMF) mechanism is employed to couple these two pathways, ensuring effective global context aggregation and local detail preservation while maintaining computational efficiency.
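A minimal sketch of the two-pathway idea, under the assumption that the concept bottleneck can be caricatured as a rank-k causal mixing step and the selective stream as a gated diagonal recurrence. The projections, gates, and fusion rule below are simplified stand-ins for InfoMamba's actual layers and IMF objective, shown only to make the global/local split and the gated fusion concrete.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, k = 32, 16, 4   # sequence length, model dim, bottleneck width (k << d)
x = rng.normal(size=(T, d))

# --- Concept-bottleneck filtering: project tokens onto k "concept" channels,
# mix them causally with a running average, and project back up. This is a
# rank-k global interface with O(T) cost instead of a T x T attention map.
P_down = rng.normal(size=(d, k)) / np.sqrt(d)
P_up = rng.normal(size=(k, d)) / np.sqrt(k)
concepts = x @ P_down                                        # (T, k)
causal_mix = np.cumsum(concepts, axis=0) / np.arange(1, T + 1)[:, None]
global_stream = causal_mix @ P_up                            # (T, d)

# --- Selective recurrent stream: diagonal recurrence with input-dependent
# decay, a caricature of the Mamba-style local pathway.
decay = 1.0 / (1.0 + np.exp(-(x @ rng.normal(size=(d,)) / np.sqrt(d))))  # (T,)
h = np.zeros(d)
local_stream = np.zeros((T, d))
for i in range(T):
    h = decay[i] * h + (1.0 - decay[i]) * x[i]
    local_stream[i] = h

# --- Fusion: a per-token, per-channel gate decides how much global context
# to inject into the local dynamics (a stand-in for the paper's
# information-maximizing fusion, which additionally enforces that the two
# pathways carry complementary information).
gate = 1.0 / (1.0 + np.exp(-(x @ rng.normal(size=(d, d)) / np.sqrt(d))))
y = gate * global_stream + (1.0 - gate) * local_stream       # (T, d)
```

Because both pathways are causal and cost O(T) per channel, the fused model keeps near-linear scaling in sequence length, which is the trade-off the paper targets.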
Results
InfoMamba consistently outperformed state-of-the-art Transformer and SSM baselines in classification, dense prediction, and non-vision tasks. The model achieved significant performance gains while maintaining near-linear scaling, demonstrating a favorable accuracy-efficiency trade-off.
Implications
The development of InfoMamba has potential applications in various domains requiring efficient sequence modeling, such as natural language processing, computer vision, and time-series forecasting. Its ability to handle long contexts and high resolutions efficiently could facilitate advancements in real-world applications.
LLM-Augmented Computational Phenotyping of Long Covid
NLP
Large Language Models
Time Series
- Introduction of 'Grace Cycle', an LLM-augmented framework for phenotyping Long Covid.
- Identification of three clinical phenotypes: Protected, Responder, and Refractory.
- Significant differences in symptom severity and treatment response among identified phenotypes.
- Demonstration of LLMs' capability to enhance clinical data analysis and hypothesis testing.
LLM-Augmented Computational Phenotyping of Long Covid
Summary
This paper presents a novel framework called 'Grace Cycle' that leverages large language models (LLMs) for computational phenotyping of Long Covid. The study addresses the challenge of understanding the heterogeneity of Long Covid by proposing an iterative process that integrates hypothesis generation, evidence extraction, and feature refinement using longitudinal patient data. The framework successfully identifies three distinct clinical phenotypes: Protected, Responder, and Refractory, based on an analysis of 13,511 participants from the NIH RECOVER program. These phenotypes show significant differences in symptom severity, baseline disease burden, and longitudinal response patterns to treatments. The study highlights the potential of LLMs to enhance phenotyping processes in clinical research, providing a disease-agnostic approach that can be adapted for various health conditions. The findings emphasize the importance of integrating AI tools in healthcare to improve personalized interventions and understanding of complex diseases.
Methodology
The proposed framework, 'Grace Cycle', employs an iterative process where initial hypotheses are generated, and LLMs extract evidence from longitudinal patient data. The process is repeated, refining hypotheses based on evidence alignment until convergence is achieved. Statistical analyses are then conducted to validate the findings.
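The loop's control flow can be sketched as below. `query_llm`, the `min_severity` feature, and the prevalence `target` are hypothetical stand-ins (the mock replaces the paper's actual LLM evidence extraction over longitudinal records), so only the generate-extract-refine-until-convergence shape mirrors the described cycle.

```python
# Hypothetical stand-in for the LLM evidence-extraction step: here,
# "evidence" is just the fraction of records matching the hypothesis.
def query_llm(hypothesis, records):
    matches = sum(1 for r in records if r["severity"] >= hypothesis["min_severity"])
    return matches / len(records)

def grace_cycle(records, initial_hypothesis, target=0.3, max_iters=20, tol=0.05):
    """Refine a phenotype hypothesis until the extracted evidence
    aligns with the expected prevalence (a toy convergence criterion)."""
    hyp = dict(initial_hypothesis)
    support = query_llm(hyp, records)          # evidence extraction
    for _ in range(max_iters):
        if abs(support - target) <= tol:       # evidence aligns: converged
            break
        # Feature refinement: tighten or relax the defining criterion.
        hyp["min_severity"] += 1 if support > target else -1
        support = query_llm(hyp, records)      # re-extract on refined hypothesis
    return hyp, support

# Toy longitudinal cohort reduced to one severity feature per participant.
records = [{"severity": s} for s in [1, 2, 2, 3, 5, 6, 7, 8, 9, 9]]
phenotype, support = grace_cycle(records, {"min_severity": 5})
```

In the real framework the refinement and convergence check operate on LLM-extracted evidence rather than a threshold, and the converged hypotheses are then validated with the statistical analyses the paper describes.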
Results
The analysis of data from 13,511 Long Covid participants revealed three distinct clinical phenotypes, each characterized by varying levels of symptom severity and treatment responses. The Protected phenotype maintained low symptom scores, while the Responder and Refractory phenotypes exhibited higher severity and different longitudinal patterns. Statistical tests confirmed the significance of these findings across multiple dimensions.
Implications
The study suggests that integrating LLMs into clinical research can significantly enhance the understanding of complex diseases like Long Covid, leading to better-targeted interventions and improved patient outcomes. The disease-agnostic nature of the framework allows for its application in various medical contexts, potentially transforming how chronic diseases are studied and managed.