AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 43
Update frequency: 8h
Days of history: 7
Revisiting Decentralized Online Convex Optimization with Compressed Communication
Optimization
Theory
Efficient ML
- Introduction of two FTRL-type algorithms for D-OCO with compressed communication.
- First algorithm matches existing regret bounds in full-information settings.
- Second algorithm improves regret bounds and communication costs in bandit settings.
- The dual update mechanism of FTRL facilitates effective communication compression.
Summary
This paper addresses the decentralized online convex optimization (D-OCO) problem, particularly focusing on the use of compressed communication to alleviate communication bottlenecks in distributed applications with streaming data. The authors introduce two novel follow-the-regularized-leader (FTRL)-type algorithms designed for D-OCO with compressed communication. These algorithms are shown to be more elegant in design and theoretical analysis compared to existing online gradient descent (OGD)-type algorithms. The first algorithm operates in a full-information setting and matches existing regret bounds, while the second algorithm is tailored for a bandit setting, significantly improving both regret bounds and communication costs. The key insight lies in leveraging the dual update mechanism of FTRL, which allows for effective application of average consensus techniques with communication compression. The paper demonstrates that the proposed algorithms not only simplify the existing approaches but also enhance performance metrics, making them suitable for decentralized learning scenarios.
Methodology
The authors propose two FTRL-type algorithms for D-OCO, utilizing a dual update mechanism to manage local approximations of cumulative average gradients. The first algorithm is designed for full-information scenarios, while the second addresses bandit settings. Both algorithms incorporate communication compression techniques, specifically leveraging Choco-Gossip for average consensus.
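A minimal sketch of why the FTRL dual state pairs naturally with compression, assuming a squared-Euclidean regularizer over a ball and a top-k compressor. Both choices, and the names `top_k_compress` and `ftrl_step`, are illustrative assumptions; this is not the paper's algorithm and does not implement Choco-Gossip. The point is that the state each node maintains is a plain vector of accumulated gradients, which can be averaged or compressed before the decision is recovered in closed form.

```python
import numpy as np


def top_k_compress(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out


def ftrl_step(grad_sum, eta, radius):
    """FTRL decision for a squared-Euclidean regularizer on a ball.

    Solves argmin_x <grad_sum, x> + ||x||^2 / (2 * eta) subject to
    ||x|| <= radius, which reduces to projecting -eta * grad_sum
    onto the ball.
    """
    x = -eta * grad_sum
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)
```

In a decentralized variant, each node would exchange a compressed version of its dual vector with neighbors and call `ftrl_step` on its local estimate of the network-wide average.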
Results
The first algorithm achieves regret bounds comparable to existing methods, while the second algorithm improves upon previous regret bounds and reduces communication rounds significantly. Specifically, it achieves O(nT^{3/4}) and O(nT^{2/3}(log T)^{1/3}) for convex and strongly convex functions, respectively, with fewer communication rounds than prior algorithms.
Implications
The proposed algorithms can enhance the efficiency of decentralized learning systems, particularly in environments with limited communication resources. This work opens avenues for further research in optimizing decentralized algorithms and can be applied in various distributed applications, such as federated learning and real-time data processing.
Probabilistic Low-Voltage Peak Load Forecasting with Time Series Foundation Models Evaluated on Application-Oriented Metrics
Time Series
- Extensive evaluation of time series foundation models for low-voltage load forecasting.
- Chronos-2 demonstrated superior performance in peak load prediction.
- Ablation study indicates TSFMs can handle uncertainty without weather covariates.
- Introduction of a novel application-oriented metric for evaluating forecasting performance.
Summary
This paper addresses the challenges of low-voltage load forecasting in energy systems characterized by high electrification and decentralized generation. Traditional forecasting methods often require extensive manual effort and lack uncertainty estimation, particularly in peak load prediction. The authors evaluate the performance of three time series foundation models (TSFMs), namely Chronos-Bolt, Chronos-2, and TabPFN-TS, against six baseline models using data from 200 real-world low-voltage feeders. The study highlights the superior performance of Chronos-2, particularly in peak load forecasting. An ablation study reveals that while weather covariates are significant, TSFMs can adapt to increased uncertainty when such information is omitted. The authors introduce a novel application-oriented metric that connects forecasting capabilities to grid asset planning, emphasizing the trade-off between cost reduction and failure risk. This work contributes to the understanding of TSFMs in low-voltage load forecasting and provides insights for distribution system operators (DSOs) in managing grid capacity and planning.
Methodology
The study employs an extensive evaluation framework comparing three TSFMs with baseline models using a recently published dataset (FeederBW). An ablation study is conducted to assess the impact of weather covariates on forecasting performance. A novel application-oriented metric is developed to evaluate the forecasting capabilities in relation to grid asset management.
Results
The results indicate that the TSFM Chronos-2 outperforms other models, particularly in peak load forecasting. The ablation study shows that while weather information enhances forecasting accuracy, TSFMs can still provide reliable forecasts under increased uncertainty. The proposed application-oriented metric effectively links forecasting performance to practical grid management needs.
Implications
The findings suggest that TSFMs can significantly improve low-voltage load forecasting, aiding DSOs in better managing grid capacity and planning for future overload situations. The novel metric can assist in aligning forecasting capabilities with operational requirements, potentially influencing policy and operational strategies in energy distribution.
WARP: Weight-Space Analysis for Recovering Training Data Portfolios
NLP
Large Language Models
Interpretability
- WARP recovers domain mixtures from fine-tuned model weights, addressing the access asymmetry in AI research.
- The framework uses model merging to simulate training trajectories, allowing for the extraction of geometric features related to training data.
- WARP demonstrates superior performance compared to traditional membership inference methods.
- The method is robust across different training recipes, including overtraining scenarios.
Summary
The paper introduces WARP, a novel framework designed to recover the training data mixtures of fine-tuned models directly from their released weights. The authors highlight the issue of access asymmetry in AI, where the training data used for foundation models is often proprietary and undisclosed, limiting researchers' ability to understand model behavior. Existing methods for inferring training data, such as membership inference, focus on individual samples and fail to provide a comprehensive view of the training corpus. WARP addresses this gap by utilizing model merging techniques to interpolate between base and fine-tuned models, generating pseudo-checkpoints that reveal the geometric structure of the training data in weight space. The framework employs geometric feature extraction to estimate domain proportions using either a parameter-free softmax readout or a trained MLP projector. Empirical evaluations on BERT and GPT-2 demonstrate that WARP can recover domain mixtures with high accuracy, achieving mean absolute errors of 0.046 and 0.104, respectively, and outperforming existing methods, including those with access to true training trajectories. The robustness of WARP across various training scenarios, including early-stop and overtrained checkpoints, further underscores its practical applicability.
Methodology
WARP employs model merging techniques to interpolate between base and fine-tuned models, creating pseudo-checkpoints that approximate the training trajectory. It extracts geometric features from these checkpoints and maps them to domain proportions using either a parameter-free softmax readout or a multi-layer perceptron (MLP) trained on synthetic mixtures.
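The pseudo-checkpoint idea can be sketched as follows, assuming plain linear interpolation between base and fine-tuned weights and a generic softmax readout. The summary does not say which merging technique or which geometric features WARP uses, so `pseudo_checkpoints`, the dictionary-of-arrays weight format, and the per-domain scores fed to `softmax` are all hypothetical stand-ins.

```python
import numpy as np


def pseudo_checkpoints(base, tuned, num_steps):
    """Interpolate linearly from base to fine-tuned weights.

    base and tuned map parameter names to arrays of matching shape.
    Returns num_steps weight dictionaries, from base to tuned.
    """
    alphas = np.linspace(0.0, 1.0, num_steps)
    return [
        {name: (1.0 - a) * base[name] + a * tuned[name] for name in base}
        for a in alphas
    ]


def softmax(scores):
    """Parameter-free readout: turn per-domain scores into proportions."""
    shifted = np.asarray(scores, dtype=float) - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()
```

Each returned dictionary plays the role of one point on an approximate training trajectory; a feature extractor (not shown) would score each candidate domain before the readout.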
Results
In controlled experiments with BERT and GPT-2, WARP achieved mean absolute errors of 0.046 and 0.104 for domain mixture recovery, respectively, outperforming sample-level membership inference baselines and a variant that utilized the true training trajectory.
Implications
WARP has significant implications for the AI research community by providing a method to recover training data distributions, facilitating better model auditing, understanding model behaviors, and enabling more informed fine-tuning practices without access to proprietary training datasets.
Conditional Inference Trees and Forests for Feature Selection
Theory
Efficient ML
- CIT and CIF effectively reduce split-selection bias in feature selection.
- CIF ranks 4th among 17 classification methods and 3rd among 18 regression methods in benchmark tests.
- Adaptive stopping and threshold search parameters significantly influence runtime efficiency.
- High-dimensional simulations reveal potential shortcomings in feature sampling strategies.
Summary
This paper investigates Conditional Inference Trees (CIT) and Conditional Inference Forests (CIF) as methods for feature selection, emphasizing their ability to reduce split-selection bias by separating feature testing from threshold optimization. The authors present a benchmark that evaluates the performance of CIT and CIF against other classification and regression methods across multiple datasets. They conduct runtime ablation studies to understand the impact of hyperparameters on computational efficiency while maintaining ranking quality. The findings indicate that CIF ranks favorably among existing methods, particularly in terms of feature ranking for downstream prediction tasks. The paper also highlights the computational challenges associated with permutation tests and threshold searches, suggesting that adaptive stopping and the number of thresholds searched significantly affect runtime. Additionally, the authors explore the limitations of forest feature sampling in high-dimensional settings, where informative features may be overlooked. Overall, the study supports the use of CIF as a robust top-k feature-ranking method in various predictive modeling scenarios.
Methodology
The authors utilize a conditional inference framework where feature selection is decoupled from threshold optimization. They conduct empirical evaluations using real-data benchmarks and synthetic experiments to assess the performance of CIT and CIF. The study includes runtime ablation analyses to determine the effects of various hyperparameters on computational efficiency and ranking quality.
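The decoupling described above can be illustrated with a small sketch: first pick the split feature by a permutation test of association with the target, and only then search for a threshold on that feature (threshold search not shown). The test statistic (absolute Pearson correlation), the helper names, and the absence of a multiple-testing correction are simplifying assumptions, not details taken from the paper.

```python
import numpy as np


def permutation_p_value(x, y, n_perm=500, seed=0):
    """Permutation p-value for association between one feature and y."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = 0
    for _ in range(n_perm):
        permuted = rng.permutation(y)
        if abs(np.corrcoef(x, permuted)[0, 1]) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def select_split_feature(X, y, alpha=0.05):
    """Return (feature index or None, p-values) without touching thresholds."""
    p_values = [
        permutation_p_value(X[:, j], y, seed=j) for j in range(X.shape[1])
    ]
    best = int(np.argmin(p_values))
    return (best if p_values[best] <= alpha else None), p_values
```

Because every feature is judged by a p-value rather than by its best achievable split, features with many candidate thresholds gain no advantage at the selection step. The permutation loop is also where adaptive stopping would cut runtime.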
Results
CIF demonstrated strong performance in feature ranking, achieving high ranks in both classification and regression tasks across multiple datasets. The runtime ablation studies revealed that disabling adaptive stopping and using exact threshold searches significantly increased fitting times, while only marginally affecting downstream model performance. The analysis of high-dimensional simulations indicated that forest feature sampling could lead to the exclusion of informative features in split decisions.
Implications
The findings suggest that CIF can be a valuable tool for feature selection in machine learning applications, particularly in scenarios where reducing computational costs is essential. The insights into runtime efficiency and feature sampling limitations may guide future research and practical implementations in predictive modeling.
Gaming Consensus: Coordinated Manipulation in Crowdsourced Fact-Checking
Theory
- Demonstrates a coordinated manipulation attack on crowdsourced fact-checking systems.
- Empirical findings show that a small number of coordinated ratings can significantly alter note quality scores.
- Reveals a counterintuitive property of the rating system where 'Not Helpful' ratings can enhance a note's perceived helpfulness.
- Develops a cost model for manipulation efforts, informing mitigation strategies.
Summary
This paper investigates the vulnerabilities of crowdsourced fact-checking systems, specifically focusing on the matrix factorization algorithms used in platforms like X and Meta. These systems aim to combat misinformation by requiring diverse agreement on notes that flag misleading content. The authors present a novel adversarial attack that demonstrates how coordinated users can manipulate these systems to create a synthetic consensus. Through a two-phase attack strategy, adversarial accounts establish diverse positions in latent factor space and then coordinate to boost a target note's helpfulness score. The empirical analysis reveals that up to 10.7% of lower-quality notes can be manipulated above consensus thresholds with fewer than 10 coordinated ratings. Additionally, the paper uncovers a counterintuitive finding where rating a note as 'Not Helpful' can inadvertently increase its helpfulness score. The authors also propose a cost model for manipulation efforts and discuss mitigations implemented in X's Community Notes to address these vulnerabilities.
Methodology
The authors employed a two-phase attack strategy involving empirical analysis of production data from X Community Notes, theoretical analysis of the matrix factorization algorithm, and a cost model to quantify manipulation efforts. They also examined the open data and source code of the Community Notes system to facilitate their study.
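For orientation, the rating model behind this kind of system can be written compactly. The form below follows the publicly documented Community Notes ranking model as commonly described; the notation is an assumption and is not taken from the paper.

```latex
\hat{r}_{un} = \mu + i_u + i_n + f_u \cdot f_n
```

Here $\hat{r}_{un}$ is the predicted rating of note $n$ by rater $u$, $\mu$ is a global intercept, $i_u$ and $i_n$ are rater and note intercepts, and $f_u$, $f_n$ are latent factors. The note intercept $i_n$ serves as the helpfulness score: agreement that can be explained by aligned factors is absorbed by $f_u \cdot f_n$, so $i_n$ rises mainly when raters with differing factors agree. This is why an attack that first spreads accounts across latent factor space can then raise a target note's score.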
Results
The study found that coordinated manipulation could push up to 10.7% of lower-quality notes above the consensus threshold with fewer than 10 coordinated ratings. The theoretical analysis revealed unexpected dynamics in the rating system, indicating that certain ratings, including 'Not Helpful' ratings, could counterintuitively improve a note's score.
Implications
The findings underscore the vulnerabilities in crowdsourced fact-checking systems and the potential for adversarial manipulation. This research has significant implications for the design of more robust algorithms that can resist coordinated attacks, thereby enhancing the integrity of fact-checking mechanisms on social media platforms.
On the Utility and Factual Reliability of Pruned Mixture-of-Experts Models in the Biomedical Domain
NLP
Large Language Models
Efficient ML
- First systematic study on the impact of expert pruning on factual reliability in MoE models within the biomedical domain.
- Moderate pruning preserves utility while extreme pruning increases hallucination risks.
- Utility and reliability degrade significantly when shifting from in-domain to general-domain tasks.
- Reliability assessments are essential for high-stakes deployments, beyond mere utility evaluations.
Summary
This paper investigates the effects of expert pruning on the utility and factual reliability of Mixture-of-Experts (MoE) models, particularly in the biomedical domain. While MoE models enhance inference speed by activating a subset of experts, they also require significant memory resources. The authors explore how structured expert pruning can mitigate deployment costs but note that prior research has largely focused on benchmark utility, neglecting the crucial aspect of factual reliability. The study evaluates four MoE models using six pruning methods and various pruning ratios across generation and classification tasks, both in-domain (biomedical) and cross-domain. The findings indicate that moderate pruning maintains in-domain utility without immediate reliability degradation, although extreme pruning ratios increase the risk of hallucinations. In contrast, transitioning to general domains leads to rapid declines in both utility and reliability. The authors conclude that safe compression is highly task- and domain-dependent, emphasizing the necessity of reliability assessments alongside utility evaluations for high-stakes applications.
Methodology
The authors conducted a systematic evaluation of four MoE models, applying six different pruning methods and testing multiple pruning ratios. They assessed the models' performance on generation and classification tasks, comparing results in both biomedical and general domains to analyze the effects of pruning on utility and reliability.
Results
The study found that moderate pruning does not significantly affect in-domain utility or reliability, while extreme pruning ratios lead to increased hallucination risks. In general-domain tasks, both utility and reliability showed rapid degradation. The results underscore the importance of considering both utility and reliability when evaluating pruned MoE models, especially in high-stakes scenarios.
Implications
The findings suggest that careful consideration of pruning strategies is crucial for deploying MoE models in sensitive areas like biomedicine. The necessity for reliability assessments indicates a shift in evaluation criteria for model deployment in high-stakes environments, potentially influencing future research and practical applications in the field.
Multi-Head Recurrent Memory Agents
Large Language Models
NLP
Optimization
- Identifies memory retention failure as the key issue in recurrent memory agents for long contexts.
- Proposes Multi-Head Recurrent Memory (MHM) to structurally prevent overwriting of retained information.
- Introduces MHM-LRU, a lightweight implementation that guarantees uniform memory head utilization.
- Demonstrates substantial improvements in memory retention and accuracy across various benchmarks.
Summary
This paper addresses the limitations of recurrent memory agents in handling long contexts, particularly the degradation of performance as context length increases. The authors diagnose this issue by analyzing memory capture and retention, finding that retention is the primary bottleneck due to the monolithic nature of existing memory designs. To overcome this, they propose the Multi-Head Recurrent Memory (MHM) framework, which partitions memory into independent heads and employs a stage-wise select-then-update strategy. This approach allows for one head to be updated while others remain unchanged, thus protecting retained information from being overwritten. The paper introduces a specific implementation called Least-Recently-Updated MHM (MHM-LRU), which ensures uniform head utilization without additional token overhead. Extensive experiments demonstrate that MHM-LRU significantly enhances memory retention and end-to-end accuracy across a range of context lengths, outperforming existing methods. Notably, it improves memory retention rates from below 30% to 73.96% at 896K tokens, showcasing the effectiveness of architectural optimization in recurrent memory agents.
Methodology
The authors conducted a quantitative analysis of memory capture and retention rates in existing recurrent memory agents. They then designed the Multi-Head Recurrent Memory framework, which utilizes multiple independent memory heads and a selective update mechanism to improve retention. The MHM-LRU variant was implemented to ensure efficient head utilization without increasing token overhead.
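A minimal sketch of the select-then-update idea with a least-recently-updated policy. The class name, the string-valued heads, and the `update_fn` callback (a stand-in for the LLM call that rewrites a memory head) are hypothetical; the paper's actual prompts and data structures are not described in this summary.

```python
class LRUMultiHeadMemory:
    """Memory split into independent heads; one head is rewritten per step."""

    def __init__(self, num_heads):
        self.heads = [""] * num_heads
        self.last_updated = [-1] * num_heads  # -1 means never written
        self.step = 0

    def select_head(self):
        """Pick the head that was updated longest ago (ties: lowest index)."""
        return min(range(len(self.heads)), key=lambda h: self.last_updated[h])

    def write(self, update_fn, chunk):
        """Update only the selected head; all other heads stay untouched."""
        h = self.select_head()
        self.heads[h] = update_fn(self.heads[h], chunk)
        self.last_updated[h] = self.step
        self.step += 1
        return h
```

Selection here depends only on bookkeeping, so it costs no extra tokens, and cycling through heads keeps utilization uniform. Content written to one head survives until the policy returns to it.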
Results
The experiments showed that MHM-LRU significantly improved memory retention rates, achieving 73.96% retention at 896K tokens, compared to less than 30% in baseline models. Additionally, MHM-LRU maintained strong end-to-end accuracy across context lengths from 100K to 1M tokens, while baseline methods exhibited sharp performance degradation.
Implications
The findings suggest that architectural innovations can effectively enhance the reliability of long-context processing in LLMs, making them more suitable for complex tasks requiring extensive information tracking and integration. This could lead to advancements in applications such as document analysis, multi-step research, and extended dialogue systems.
Program-as-Weights: A Programming Paradigm for Fuzzy Functions
NLP
Large Language Models
Efficient ML
- Introduces the Program-as-Weights (PAW) paradigm for fuzzy function programming.
- Utilizes a two-stage compilation process to convert natural language specifications into neural binaries.
- Demonstrates significant efficiency gains with a smaller interpreter outperforming larger models.
- Releases FuzzyBench, a dataset with 10 million examples for fuzzy tasks.
Summary
The paper introduces a novel programming paradigm called Program-as-Weights (PAW) aimed at addressing the challenges of implementing fuzzy functions that resist clean rule-based programming. Traditional programming approaches often fall short for tasks like log filtering or JSON repair due to their inherent fuzziness and ambiguity. PAW allows developers to describe functions in natural language, which are then compiled into compact, locally-executable neural artifacts. The authors present a two-stage compilation process using a 4B Qwen3 model, resulting in a small neural binary that can be executed by a lightweight interpreter. The system demonstrates significant efficiency, with a 0.6B Qwen3 interpreter outperforming direct prompting of a larger model while using substantially less memory. The paper also showcases five case studies illustrating PAW's versatility across various fuzzy tasks and highlights its potential for local execution, reducing reliance on large language model APIs. The authors release a dataset, FuzzyBench, containing 10 million examples for training and evaluation, further contributing to the field of fuzzy function programming.
Methodology
The methodology involves a two-stage compilation process where a natural language specification is first transformed into a pseudo-program using a pre-trained model, followed by the generation of a parameter-efficient adapter (LoRA) that tunes a frozen interpreter for the specific task. This approach allows for the creation of small, reusable neural programs that can be executed locally.
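The adapter mechanism can be sketched with the standard LoRA formulation, in which a frozen weight matrix is augmented by a scaled low-rank product. This is a generic illustration in NumPy; the function name, shapes, and scaling convention are assumptions, and how PAW generates the adapter weights is not shown.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha):
    """Linear layer with a LoRA adapter on top of a frozen weight W.

    Shapes: x (batch, d_in), W (d_out, d_in), A (r, d_in), B (d_out, r).
    Equivalent to x @ (W + (alpha / r) * B @ A).T, but never forms the
    full-size update matrix.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T
```

Only `A` and `B` differ between tasks, so one frozen interpreter can host many small task-specific programs.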
Results
The results indicate that the Qwen3-0.6B interpreter executing PAW programs achieves a performance of 73.78% exact match, surpassing the 68.70% exact match of direct prompting with the larger Qwen3-32B model. The PAW system operates at approximately one fiftieth of the inference memory usage and runs at 30 tokens per second on a MacBook M3.
Implications
The PAW paradigm has significant implications for the future of programming, particularly in enabling local execution of fuzzy functions without the need for costly API calls to large language models. This could enhance reproducibility, reduce costs, and improve the efficiency of software development in various applications.
A More Accurate Algorithm Comparison through A/B Testing using Offline Evaluation Methods
Theory
Optimization
- A/B testing can lead to higher algorithm selection error rates than offline evaluation methods.
- The proposed estimator introduces a middle algorithm to induce positive correlation, improving selection accuracy.
- The new method achieves the same selection error rate as traditional methods with only half the data.
- The findings challenge the established view of A/B testing as the superior method for algorithm selection.
Summary
This paper challenges the conventional wisdom that A/B testing is the gold standard for algorithm selection in online services, revealing that it can produce a higher selection error rate than offline evaluation methods. The authors identify that the sample mean estimator used in A/B testing lacks positive correlation, which is crucial for minimizing selection errors. In contrast, offline evaluation methods, such as Inverse Propensity Scoring (IPS), unintentionally induce beneficial correlations by utilizing shared offline data. To address this issue, the authors propose a novel estimator that intentionally induces positive correlation by introducing a hypothetical middle algorithm. This method estimates performance differences in a stepwise manner, leveraging shared data at each step, thereby enhancing the accuracy of A/B testing. The paper also derives the optimal middle algorithm to minimize variance and conducts bias-variance analysis to demonstrate the advantages of the proposed method. Experimental results on real-world data show that the new estimator achieves the same selection error rate as existing methods while requiring only half the A/B testing data, indicating a significant improvement in sample efficiency.
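The role of positive correlation follows from a standard variance identity. Writing $\hat{V}_A$ and $\hat{V}_B$ for the estimated performance of the two algorithms (notation assumed here, not taken from the paper):

```latex
\operatorname{Var}\!\left(\hat{V}_A - \hat{V}_B\right)
  = \operatorname{Var}\!\left(\hat{V}_A\right)
  + \operatorname{Var}\!\left(\hat{V}_B\right)
  - 2\,\operatorname{Cov}\!\left(\hat{V}_A, \hat{V}_B\right)
```

In a standard A/B test the two arms use disjoint users, so the covariance term is zero. When both estimates are computed from shared data, a positive covariance shrinks the variance of the estimated difference, which is the quantity that determines whether the better algorithm is selected.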
Methodology
The authors conducted a pre-experiment comparing the selection error rates of A/B testing (using the sample mean estimator) and offline evaluation (using IPS). They proposed a new estimator that incorporates a hypothetical middle algorithm to induce positive correlation in performance estimation, thus enhancing the accuracy of A/B testing.
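For reference, a minimal sketch of the IPS estimator and of comparing two algorithms on the same logged data. The function names and array-based interface are assumptions for illustration; the paper's proposed middle-algorithm estimator is not reproduced here.

```python
import numpy as np


def ips_value(rewards, logging_probs, target_probs):
    """Inverse Propensity Scoring estimate of a target policy's value.

    rewards[i]       : observed reward for logged action i
    logging_probs[i] : probability the logging policy chose that action
    target_probs[i]  : probability the target policy would choose it
    """
    weights = np.asarray(target_probs) / np.asarray(logging_probs)
    return float(np.mean(weights * np.asarray(rewards)))


def ips_difference(rewards, logging_probs, probs_a, probs_b):
    """Estimated value gap between policies A and B from shared logs."""
    return ips_value(rewards, logging_probs, probs_a) - ips_value(
        rewards, logging_probs, probs_b
    )
```

Because both terms in `ips_difference` reuse the same rewards and logging probabilities, their errors tend to move together, which is the shared-data correlation the summary refers to.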
Results
The proposed estimator matched the selection error rate of existing methods while utilizing only 50% of the A/B testing data, demonstrating a twofold improvement in sample efficiency. The experiments indicated that A/B testing could be less reliable than offline evaluation under certain conditions.
Implications
The findings suggest that practitioners should reconsider the reliance on A/B testing as the sole method for algorithm selection. The proposed methodology could lead to more efficient and accurate decision-making in various online services, such as recommender systems and digital advertising.
SABER: A Semantic-Aligned Brain Network Analysis Framework via Multi-scale Hypergraphs
Graph Learning
Large Language Models
Interpretability
- SABER integrates LLM-derived semantics directly into the brain network classification process.
- The framework employs multi-scale hypergraphs to effectively model complex interactions among brain regions.
- A decision-level semantic alignment mechanism allows for direct influence of semantic information on predictions.
- Extensive evaluations show SABER outperforms existing methods on datasets like ABIDE and ADHD-200.
Summary
The paper presents SABER, a novel framework for brain network analysis that integrates high-level semantic knowledge from large language models (LLMs) into the decision-making process for brain disease diagnosis. Traditional methods often treat semantic information as auxiliary features, limiting their effectiveness in classification tasks. SABER addresses this by incorporating ROI-level semantics through global self-attention, enhancing node representations with whole-brain context. The framework constructs multi-scale hypergraphs to model functional subnetworks and multi-ROI interactions, overcoming the limitations of conventional graph neural networks (GNNs) in capturing high-order dependencies. A decision-level semantic alignment mechanism is introduced to inject patient-specific textual embeddings into graph representations, allowing semantics to directly influence predictions without altering the underlying network structure. The proposed method demonstrates state-of-the-art performance on public brain network datasets, particularly in small-sample scenarios, enhancing both stability and interpretability.
Methodology
SABER's methodology consists of three main stages: (1) Multi-scale node-level encoding where ROI semantics are integrated using global self-attention, (2) Construction of multi-scale hypergraphs to capture high-order and multi-ROI interactions, and (3) A decision-level semantic alignment mechanism that injects patient-specific textual embeddings into the graph representation, facilitating direct semantic influence on predictions.
Results
The experiments conducted on the ABIDE and ADHD-200 datasets demonstrate that SABER achieves state-of-the-art performance, showing significant improvements in classification accuracy, stability, and interpretability compared to existing methods, particularly in scenarios with limited sample sizes.
Implications
The integration of LLM-derived semantics into brain network analysis has the potential to enhance diagnostic capabilities for brain diseases, providing clinicians with more robust tools for interpreting neuroimaging data. This framework could lead to better understanding and treatment of conditions such as Autism Spectrum Disorder and Attention-Deficit/Hyperactivity Disorder.
Beyond Adam: SOAP and Muon for Faster, Label-Efficient Training of Machine Learning Interatomic Potentials
Optimization
Efficient ML
- SOAP and SOAP-Muon optimizers consistently improve energy and force accuracy compared to AdamW.
- These optimizers demonstrate robust performance even with reduced force supervision, suggesting a pathway to lower force-label requirements.
- The resulting MLIPs maintain physical fidelity, accurately reproducing ab initio calculations and experimental data.
- Muon shows limited benefits over AdamW, indicating that not all optimizers provide substantial improvements.
Summary
This paper investigates the impact of optimizer choice on the training of machine learning interatomic potentials (MLIPs), which are crucial for scientific simulations in chemistry and materials science. While the MLIP community has focused on improving model architectures and datasets, the choice of optimizer has been largely static, with Adam and its variants being the default. The authors implement and compare three recently proposed matrix-structured optimizers (Muon, SOAP, and the hybrid SOAP-Muon) against AdamW in training NequIP and Allegro MLIP models. The study reveals that SOAP and SOAP-Muon significantly outperform Adam in terms of convergence speed and final accuracy, especially under conditions of partial force supervision. The findings suggest that the choice of optimizer is a critical yet often overlooked aspect of MLIP design, with implications for reducing the reliance on force labels during training, which can be costly in certain computational contexts.
Methodology
The authors integrated Muon, SOAP, and SOAP-Muon optimizers into the NequIP MLIP framework and benchmarked their performance against AdamW on two significant physical systems: liquid water and solid acid electrolyte CsH2PO4. They examined the optimizers' behavior under varying levels of force supervision, including energy-only training, to assess their efficiency and accuracy.
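To give a sense of what "matrix-structured" means for Muon, the sketch below approximately orthogonalizes a gradient (or momentum) matrix with a Newton-Schulz iteration, which is the core operation Muon applies to each weight matrix's update. It uses the classic cubic iteration for simplicity; Muon itself uses a tuned higher-order polynomial, and nothing here reflects the authors' NequIP integration.

```python
import numpy as np


def newton_schulz_orthogonalize(G, steps=20):
    """Approximate the orthogonal polar factor of G without an SVD.

    Normalizing by the Frobenius norm puts all singular values in
    (0, 1], where the iteration X <- 1.5 X - 0.5 X X^T X pushes each
    singular value toward 1 while keeping the singular vectors.
    """
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X
```

The result keeps the directions of the update but equalizes their scale, in contrast to the elementwise rescaling performed by Adam-style optimizers.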
Results
The results indicate that SOAP and SOAP-Muon optimizers lead to faster convergence and higher accuracy in energy and force predictions compared to AdamW. Notably, SOAP-Muon trained with only 50% of force labels achieved accuracy comparable to AdamW trained with 100% force labels. Furthermore, SOAP-Muon maintained physical fidelity even when trained with just 5% of force labels, while the AdamW model became unstable under the same conditions.
Implications
The findings suggest that adopting more advanced optimizers like SOAP and SOAP-Muon can enhance the efficiency and effectiveness of MLIP training, potentially reducing computational costs associated with force label generation. This could facilitate the development of more accurate MLIPs for complex systems in materials science and chemistry, where computational resources are often limited.
Predictive Conformal Slip Monitoring: An Empirical Evaluation of Rolling Split Conformal Prediction for Pre-Incident Traction Loss Detection
Theory
Time Series
- The study evaluates Rolling Split Conformal Prediction for detecting pre-incident traction loss in motorsport telemetry.
- A significant methodological correction was made by including vehicle speed as an explicit feature in the model.
- The detector achieved a mean precision and recall of 0.0 against real incident labels, so it failed as an early-warning signal on this data.
- High false-alarm rates were attributed to violations of the exchangeability assumption in the conformal prediction framework.
Summary
This paper investigates the efficacy of Rolling Split Conformal Prediction as a method for pre-incident traction loss detection in motorsport telemetry data. The study utilizes telemetry data from the 2023 Italian Grand Prix, focusing on 19 drivers and employing a Random Forest model to predict expected slip behavior. The research aims to provide an early warning signal for traction loss by monitoring the volatility of non-conformity residuals. However, the evaluation reveals a significant shortcoming: the method achieves a mean precision and recall of essentially 0.0 against real incident labels while flagging 15.3% of samples as anomalous, so nearly all of its alarms are false alarms. The study identifies that the underlying assumption of exchangeability in the conformal prediction framework is violated, contributing to the poor performance. The paper concludes with a rigorous negative finding, diagnosing the causes of underperformance and outlining necessary changes for future predictive models.
Methodology
The methodology involved training a Random Forest Regressor on telemetry data to predict longitudinal slip, using features such as throttle position, brake pressure, gear selection, and vehicle speed. The non-conformity scores were calculated from the residuals of the model's predictions, and a rolling volatility metric was used to flag potential anomalies. The evaluation was conducted against real incident records from FIA Race Control Messages.
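A minimal sketch of the rolling split conformal idea may help here. It assumes the non-conformity score is the absolute residual, uses the previous `window` scores as the calibration set, and adds a simple rolling standard deviation as the volatility metric. The function names, window sizes, and synthetic residuals are illustrative assumptions, not the paper's settings.

```python
import math
import numpy as np

def rolling_conformal_flags(scores, window=200, alpha=0.05):
    """Flag scores[t] when it exceeds the split-conformal quantile of the previous `window` scores."""
    scores = np.asarray(scores, dtype=float)
    flags = np.zeros(len(scores), dtype=bool)
    k = math.ceil((window + 1) * (1 - alpha))   # finite-sample conformal rank
    for t in range(window, len(scores)):
        calib = np.sort(scores[t - window:t])
        threshold = calib[min(k, window) - 1]
        flags[t] = scores[t] > threshold
    return flags

def rolling_volatility(scores, window=20):
    """Standard deviation of the most recent `window` scores at each step."""
    scores = np.asarray(scores, dtype=float)
    return np.array([scores[max(0, t - window + 1):t + 1].std() for t in range(len(scores))])

rng = np.random.default_rng(0)
residuals = np.abs(rng.normal(size=2000))   # synthetic, exchangeable residuals
flags = rolling_conformal_flags(residuals)
flag_rate = flags[200:].mean()              # stays near alpha when exchangeability holds
volatility = rolling_volatility(residuals)
```

On exchangeable synthetic data the flag rate stays close to the nominal level. A rate far above it on real telemetry would be one symptom of the kind of exchangeability violation the paper diagnoses.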
Results
The rolling-volatility detector achieved a mean precision of 0.0 and mean recall of 0.0 across 55,563 telemetry samples, flagging 15.3% of samples as anomalous. The static 95th-percentile threshold baseline did not perform better, and diagnostics indicated a violation of the exchangeability assumption for all drivers.
Implications
The findings suggest that while conformal prediction methods hold promise for early warning systems, significant methodological adjustments are necessary for practical applications in real-time scenarios. Future research should focus on addressing the identified limitations to enhance predictive capabilities.
SA-HGNN: Sample-Adaptive Hyperbolic Graph Neural Network for EEG-Based Depression Recognition
Graph Learning
- Introduction of SA-HGNN, a model specifically designed for EEG-based depression recognition.
- Dynamic construction of personalized brain network topologies to better capture complex spatial relationships.
- Utilization of hyperbolic geometry to address the limitations of Euclidean space in modeling hierarchical structures.
- Incorporation of an Attention Pooling module to mitigate noise interference in EEG signals.
Summary
This paper presents the Sample-Adaptive Hyperbolic Graph Neural Network (SA-HGNN), a novel approach for recognizing Major Depressive Disorder (MDD) through EEG data. The authors highlight the limitations of existing Graph Neural Networks (GNNs) in capturing the hierarchical structure of brain networks affected by depression. The proposed SA-HGNN consists of three main components: a Sample-Adaptive Graph Construction module that creates personalized brain network topologies, a Hyperbolic Graph Convolution module that utilizes hyperbolic geometry to model complex hierarchical relationships, and an Attention Pooling module that filters out noise from EEG signals. The model is designed to improve the accuracy of depression recognition by effectively capturing abnormal functional connectivity patterns in the brain. Extensive experiments on public EEG datasets demonstrate that SA-HGNN outperforms traditional GNNs based on Euclidean metrics in both resting-state and task-related paradigms, showcasing its robustness against noise and its efficacy in identifying the hierarchical organization of brain connectivity in MDD patients.
Methodology
The SA-HGNN model employs three core modules: (1) Sample-Adaptive Graph Construction for personalized topology creation, (2) Hyperbolic Graph Convolution to leverage hyperbolic space for capturing hierarchical relationships, and (3) Attention Pooling to filter out redundant noise from EEG signals. This combination allows for a more accurate representation of the brain's functional connectivity in patients with depression.
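To illustrate why hyperbolic space suits hierarchies, the sketch below computes the geodesic distance in the Poincaré ball with curvature -1. The summary does not say which hyperbolic model SA-HGNN uses, so the Poincaré ball is an assumption made here for illustration.

```python
import numpy as np

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    sq_gap = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_gap / denom))

origin = [0.0, 0.0]
near = [0.5, 0.0]
far = [0.9, 0.0]
d_near = poincare_distance(origin, near)   # equals ln(3)
d_far = poincare_distance(origin, far)     # equals ln(19)
```

Distances grow rapidly toward the boundary of the ball, which leaves exponentially more room for the leaves of a tree-like structure than Euclidean space does.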
Results
The experimental results indicate that SA-HGNN significantly outperforms traditional GNNs that operate in Euclidean space, achieving better accuracy in recognizing MDD across various EEG datasets and conditions, including both resting-state and task-related paradigms.
Implications
The findings suggest that SA-HGNN could serve as a powerful tool for the automated diagnosis of Major Depressive Disorder, potentially leading to earlier and more accurate identification of the condition. This could have significant implications for mental health treatment and intervention strategies.
Finite-Lag Operator Geometry of Recurrent Representations
Theory
Time Series
Optimization
- Introduces finite-lag operator geometry for recurrent representations, focusing on dynamics rather than static snapshots.
- Defines key constructs such as the conditional transport law Q and the source-centered transport tensor G.
- Proves structural results including affine covariance and stability of the Gaussian estimator on bounded trajectory clouds.
- Demonstrates the ability to detect deterministic recurrent motion not captured by traditional methods.
Summary
This paper introduces a novel framework for analyzing recurrent representations in machine learning through finite-lag operator geometry. Unlike traditional methods that assess representation geometry from static snapshots, this approach focuses on the dynamics of recurrent hidden states by examining source-successor pairs over a fixed lag. The core concept is the conditional transport law Q, which is estimated using a dense Gaussian source-smoothing operator. From this, the author derives a source-centered transport tensor G that captures both conditional spread and coherent displacement, as well as an antisymmetric coordinate circulation W that summarizes directed lagged flow. The paper establishes several structural results, including affine covariance and stability of the estimator, and demonstrates that deterministic recurrent motion can be detected even when traditional infinitesimal methods fail. Controlled experiments validate the theoretical predictions, revealing architecture-dependent differences in transport scale and coherent displacement in performance-matched networks. This framework provides a comprehensive geometric perspective on recurrent representations, enhancing our understanding of their dynamics and potential applications.
Methodology
The methodology involves defining a finite-lag conditional transport law Q based on observed source-successor pairs. The author employs a dense Gaussian source-smoothing operator to estimate this law, leading to the derivation of the source-centered transport tensor G and antisymmetric circulation W. The paper also includes theoretical proofs of structural properties and conducts controlled experiments to validate the proposed framework.
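One plausible reading of a source-centered tensor that splits into spread and displacement can be sketched numerically. The code below weights source-successor pairs with a Gaussian kernel around a query point and checks the standard identity that the weighted second moment of successors about the query equals the weighted covariance plus the outer product of the mean displacement. The estimator form, bandwidth, and all names are assumptions for illustration. The paper's exact definitions may differ.

```python
import numpy as np

def transport_tensor(sources, successors, query, bandwidth=1.0):
    """Kernel-weighted second moment of successors about `query`, with its two parts."""
    w = np.exp(-np.sum((sources - query) ** 2, axis=1) / (2 * bandwidth ** 2))
    w /= w.sum()                                  # Gaussian source-smoothing weights
    centered = successors - query
    tensor = (w[:, None] * centered).T @ centered
    mean_succ = w @ successors                    # smoothed conditional mean successor
    dev = successors - mean_succ
    spread = (w[:, None] * dev).T @ dev           # conditional spread
    shift = mean_succ - query
    displacement = np.outer(shift, shift)         # coherent displacement
    return tensor, spread, displacement

rng = np.random.default_rng(0)
hidden = rng.normal(size=(201, 3))                # stand-in trajectory of hidden states
lag = 5
sources, successors = hidden[:-lag], hidden[lag:]
tensor, spread, displacement = transport_tensor(sources, successors, query=sources[0])
```

Because the cross terms cancel under the normalized weights, `tensor` equals `spread + displacement` up to floating-point error.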
Results
The main results include the successful derivation of the transport tensor G, which decomposes into conditional spread and coherent displacement, and the antisymmetric circulation W that captures directed flow. The framework demonstrates stability and affine covariance, and it reveals significant differences in transport characteristics across different network architectures in controlled experiments.
Implications
The implications of this work extend to improved analysis and understanding of recurrent neural networks, particularly in how they process and represent temporal information. The finite-lag operator geometry framework can enhance model interpretability and may lead to better performance in tasks involving sequential data.
Population-Based Multi-Objective Training of Discriminators for Semi-Supervised GANs
Generative Models
Optimization
Theory
- Introduces a population-based training framework for SSL-GANs that separates supervised and unsupervised losses.
- Utilizes Pareto-based selection to maintain diverse discriminator populations, improving training stability.
- Demonstrates improved classification accuracy and robustness over existing SSL-GAN methods.
- Explores various evolutionary strategies, including elitist replacement and mono-objective ablation.
Summary
This paper addresses the instability issues in training Semi-Supervised Generative Adversarial Networks (SSL-GANs) by proposing a novel population-based evolutionary training strategy. The authors formulate the training of discriminators as a multi-objective optimization problem, separating the supervised and unsupervised learning tasks instead of aggregating them into a single scalar loss. This approach maintains a population of discriminators ranked by Pareto dominance, allowing for the exploration of various trade-offs between classification accuracy and real/fake discrimination. The proposed method, named CO-evolutionary Multi-Objective Discriminator SSL-GAN (COMOD-SSL-GAN), is evaluated against state-of-the-art baselines, demonstrating improved robustness in training and superior classification accuracy, particularly with an elitist variant. The paper also includes an empirical analysis of different evolutionary strategies, highlighting the effectiveness of multi-objective selection in enhancing the performance of SSL-GANs.
Methodology
The authors propose a multi-objective optimization framework for training discriminators in SSL-GANs, where the supervised and unsupervised losses are treated as separate objectives. They maintain a population of discriminators and apply Pareto dominance for selection, allowing for diverse trade-offs between the two objectives. The methodology includes an analysis of different evolutionary strategies, such as elitist replacement and mono-objective ablation, to assess their impact on training performance.
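Pareto-based selection over two losses can be sketched in a few lines. The example below treats each discriminator as a pair of (supervised loss, unsupervised loss) to be minimized and keeps the non-dominated ones. The loss values are made up, and the helper names are hypothetical rather than taken from the paper.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(population):
    """Keep the members that no other member dominates."""
    return [p for p in population
            if not any(dominates(q, p) for q in population if q is not p)]

# (supervised_loss, unsupervised_loss) per discriminator; illustrative values only
losses = [(0.30, 0.70), (0.50, 0.40), (0.60, 0.80), (0.25, 0.90), (0.50, 0.45)]
front = pareto_front(losses)
# front == [(0.30, 0.70), (0.50, 0.40), (0.25, 0.90)]
```

Keeping the whole front instead of the single best scalarized loss is what preserves discriminators with different trade-offs between classification and real/fake discrimination.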
Results
Experiments conducted on the MNIST dataset with limited labeled samples show that the proposed COMOD-SSL-GAN outperforms existing state-of-the-art methods, including SSL-GAN and CE-SSL-GAN, in terms of training robustness and classification accuracy. The elitist variant consistently achieves the highest classification accuracy among the tested strategies.
Implications
The proposed method has significant implications for improving the training of SSL-GANs, particularly in scenarios with limited labeled data. It can enhance the performance of generative models in various applications, such as image generation and semi-supervised learning tasks, by providing a more stable and effective training approach.
Online Resource Allocation with Continuous Random Consumption: Regret under Degeneracy
Theory
Optimization
- Introduces a model for online resource allocation with continuous random consumption.
- Defines an active weighted-mass exponent to analyze additive regret.
- Demonstrates that continuous random consumption can lead to polynomial regret in certain cases.
- Shows that a sample-path marginal policy can achieve logarithmic regret under specific conditions.
Summary
This paper investigates the challenges of online resource allocation in scenarios where both rewards and consumption sizes are continuously distributed. The author presents a model where requests arrive sequentially and must be accepted or rejected without the possibility of reversal, under fixed resource capacities. Each request is categorized into observable types, with both the reward and size being random variables. The study highlights that the additive regret, which measures the loss incurred due to online decision-making compared to an optimal offline strategy, is influenced by the size-weighted mass of requests near the acceptance cutoffs. The paper introduces an active weighted-mass exponent, p, to formalize this relationship. When p > 1, the regret is at least T^(1/2 - 1/(2p)), indicating a hard problem, while a sample-path marginal policy can achieve regret close to this lower bound, with O((log T)^2) regret when p = 1. The findings suggest that continuous random consumption alters the worst-case regret exponent, revealing that polynomial regret can occur in certain bounded-density instances. The results emphasize the significance of degeneracy in practical applications, where the expected demand can lead to fluctuating resource bottlenecks.
Methodology
The author develops a stochastic online allocation model that incorporates accept/reject decisions for requests with continuous distributions of rewards and sizes. The analysis focuses on the relationship between the size-weighted mass of requests and the regret incurred by online policies, using the active weighted-mass exponent to characterize the problem's complexity.
Results
The main findings indicate that when the active weighted-mass exponent p > 1, online policies incur a minimum regret of order T^(1/2 - 1/(2p)). A sample-path marginal policy achieves regret close to this lower bound, while for p = 1, the policy can attain O((log T)^2) regret. The paper also shows that allowing continuous random consumption can significantly alter the worst-case regret exponent, leading to polynomial regret in certain bounded-density instances.
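The lower-bound exponent 1/2 - 1/(2p) is easy to tabulate, which makes the role of p concrete. The helper name below is hypothetical, and the restriction to p >= 1 simply reflects the range discussed in this summary.

```python
def regret_lower_bound_exponent(p):
    """Exponent e such that regret is at least of order T**e, per the stated bound."""
    if p < 1:
        raise ValueError("this summary only covers p >= 1")
    return 0.5 - 1.0 / (2.0 * p)

exponents = {p: regret_lower_bound_exponent(p) for p in (1, 2, 4)}
# exponents == {1: 0.0, 2: 0.25, 4: 0.375}
```

At p = 1 the exponent is zero, which is consistent with the polylogarithmic regret reported for that case, and the exponent approaches 1/2 as p grows.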
Implications
The results have important implications for fields such as network revenue management, online advertising, and order fulfillment, where resource allocation decisions are made in real-time under uncertainty. Understanding the effects of degeneracy and continuous distributions can help in designing more effective online policies that minimize regret.
Hybrid quantum-classical neural network for sentiment analysis
NLP
- Hybrid quantum-classical neural networks can effectively perform sentiment analysis.
- The study utilizes a dataset of COVID-19-related tweets for sentiment classification.
- Hybrid models show comparable accuracy to classical models but with enhanced learning dynamics.
- Transfer learning experiments reveal significant improvements in performance for spam classification tasks.
Summary
This paper explores the integration of hybrid quantum-classical neural networks (HNNs) for sentiment analysis, specifically focusing on a dataset of tweets related to COVID-19. The authors utilize a combination of classical feedforward networks and parameterized quantum circuits to analyze sentiment in text data. The tweets are vectorized using the Term Frequency-Inverse Document Frequency (TF-IDF) method, and the performance of hybrid models is compared against classical baselines. The findings indicate that hybrid models achieve comparable accuracy to classical models while demonstrating distinct learning dynamics, suggesting a richer representational capacity. Furthermore, the hybrid models excel in transfer learning scenarios, significantly improving accuracy in a spam classification task. This research highlights the potential of quantum machine learning (QML) in natural language processing (NLP) and suggests that hybrid architectures may offer advantages as quantum technology advances.
Methodology
The authors employed a dataset of tweets annotated with sentiment labels, which were preprocessed and vectorized using TF-IDF. They compared classical feedforward neural networks with hybrid architectures that incorporate parameterized quantum circuits. The quantum components were simulated classically, and the models were evaluated on sentiment classification and transfer learning tasks.
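As a rough picture of what a classically simulated parameterized quantum circuit looks like, the sketch below encodes one feature as a rotation angle on a single qubit, applies one trainable rotation, and reads out the expectation of Pauli-Z. The circuit layout, the single-qubit size, and the function names are assumptions for illustration. The summary does not describe the paper's actual circuit.

```python
import numpy as np

def ry(theta):
    """Single-qubit rotation about the Y axis."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def quantum_layer(x, weight):
    """Encode feature x as an RY angle, apply a trainable RY(weight), return <Z>."""
    state = ry(weight) @ ry(x) @ np.array([1.0, 0.0])   # start from |0>
    pauli_z = np.diag([1.0, -1.0])
    return float(state @ pauli_z @ state)

out = quantum_layer(x=0.3, weight=0.5)   # equals cos(0.3 + 0.5) for this circuit
```

In a hybrid model, outputs like this would feed into (or be fed by) ordinary dense layers, with `weight` trained alongside the classical parameters.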
Results
The hybrid models achieved accuracy comparable to classical baselines in sentiment analysis. Notably, in transfer learning for SMS spam classification, the hybrid models outperformed classical counterparts, achieving an accuracy increase of 15 percentage points (from 66% to 81%) on the spam class, indicating enhanced generalization capabilities.
Implications
The findings suggest that hybrid quantum-classical models could improve sentiment analysis and other NLP tasks, particularly as quantum hardware continues to develop. This research may guide future algorithmic advancements and the design of quantum systems for practical applications in social media monitoring, public health, and political analysis.
Black-Box Inference of LLM Architectural Properties with Restrictive API Access
Large Language Models
NLP
Theory
- Introduces NightVision, an attack for inferring LLM architectural properties under restrictive API access.
- Utilizes a novel common-set prompting technique to recover hidden dimensions without logit bias or top-k access.
- Employs timing measurements to estimate depth and parameter count based on model characteristics.
- Achieves significant accuracy in recovering architectural parameters across various open-source LLMs.
Summary
This paper addresses the challenge of inferring architectural properties of large language models (LLMs) when access to their APIs is restricted. The authors introduce NightVision, a novel attack that estimates key architectural parameters such as hidden dimension, depth, and parameter count using only limited API access, specifically single logit outputs and timing measurements. The methodology involves a common-set prompting technique that allows for the recovery of hidden dimensions without needing logit bias or top-k access. Additionally, the authors leverage end-to-end timing to infer depth and parameter count based on the scaling characteristics of transformer inference. The empirical evaluation of NightVision on 32 open-source LLMs demonstrates its effectiveness, achieving an average relative error of 23% for hidden dimensions and 53% for depth and parameter count in larger models. The findings suggest that current API restrictions are insufficient to fully conceal architectural details, raising concerns about the security and intellectual property of LLMs.
Methodology
The methodology consists of two main components: (1) a common-set prompting technique that allows for the recovery of hidden dimensions using single logit outputs, and (2) a timing-based recovery procedure that estimates depth and parameter count by analyzing the scaling of inference time with respect to model parameters. The authors also provide theoretical bounds on the number of API calls required for effective recovery.
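A general observation behind hidden-dimension recovery is that logits are a linear image of the final hidden state, so a matrix of logit vectors collected over many prompts has rank at most the hidden dimension. The toy sketch below shows only that low-rank structure and assumes full logit vectors are available. It is not the common-set prompting technique, which per the summary works from single logit outputs, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, num_prompts = 500, 32, 100   # toy sizes, hypothetical

unembedding = rng.normal(size=(vocab_size, hidden_dim))    # final projection matrix
final_states = rng.normal(size=(hidden_dim, num_prompts))  # one hidden state per prompt
logits = unembedding @ final_states                         # vocab_size x num_prompts

# Count singular values that are not numerically zero.
singular_values = np.linalg.svd(logits, compute_uv=False)
estimated_dim = int(np.sum(singular_values > 1e-8 * singular_values[0]))
# estimated_dim == 32
```

The number of prompts must exceed the hidden dimension for the rank to reveal it, which is one reason query budgets matter for this kind of attack.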
Results
NightVision successfully recovers hidden dimensions with an average relative error of 23% across 32 models, achieving exact recovery in 4 cases and within 10% in 12 cases. For models with over three billion parameters, the method estimates depth and parameter count with an average relative error of approximately 53%. The accuracy of these estimates is shown to depend on the token budget and model properties.
Implications
The findings have significant implications for LLM developers and security researchers, indicating that even with restricted API access, sensitive architectural properties can still be inferred. This raises concerns about the protection of intellectual property and the need for more robust API designs that account for potential inference attacks.
Dynamic Neural Graph Encoding of Inference Processes in Deep Weight Space
Graph Learning
Optimization
Theory
- Introduction of dynamic neural graphs for modeling neural network parameters.
- Development of the Dynamic Neural Graph Encoder (DNG-Encoder) to process dynamic graphs.
- Creation of INR2JLS for mapping INR weights into a joint latent space.
- Demonstration of improved classification accuracy on CIFAR-10 and CIFAR-100 datasets.
Summary
This paper addresses the challenges of processing high-dimensional weight spaces in neural networks by introducing a novel approach that utilizes dynamic graphs to represent neural network parameters. The authors propose the Dynamic Neural Graph Encoder (DNG-Encoder), which captures the temporal dynamics of inference processes in a sequential manner, reflecting the layer-by-layer processing inherent in neural networks. Additionally, the DNG-Encoder is employed to develop INR2JLS (Implicit Neural Representation to Joint Latent Space), which facilitates downstream applications such as classifying Implicit Neural Representations (INRs). The proposed method shows significant improvements in classification accuracy on the CIFAR-100-INR dataset, outperforming state-of-the-art methods by approximately 10%. The work emphasizes the importance of dynamic representations in effectively modeling neural network behavior during inference, thus providing a more informative latent space for various applications.
Methodology
The authors propose a recurrent-like graph neural network, the DNG-Encoder, to process dynamic neural graphs that evolve over time. This method mirrors the forward propagation mechanism of neural networks, preserving the sequential characteristics of data flow through layers. The DNG-Encoder is then utilized to create INR2JLS, which learns a joint latent space between deep weights and original data, enhancing the representation for downstream tasks.
Results
The proposed DNG-Encoder and INR2JLS demonstrate significant performance improvements, achieving approximately 10% higher classification accuracy on the CIFAR-100-INR dataset compared to existing state-of-the-art methods. Similar improvements are noted on the CIFAR-10 dataset, showcasing the effectiveness of the dynamic graph approach.
Implications
The findings suggest that dynamic graph representations can significantly enhance the modeling of neural network parameters, leading to better performance in tasks involving implicit neural representations. This approach could be beneficial for various applications in machine learning, particularly in optimizing neural networks and improving classification tasks.
The Rollout Infrastructure Tax in Coding-Agent Reinforcement Learning
Reinforcement Learning
Efficient ML
- Introduction of the 'rollout infrastructure tax' concept, highlighting the impact of execution substrate on RL efficiency.
- Significant variations in performance metrics (cold-start latency and worker-hours) across different execution substrates.
- Proposal of design requirements for optimizing rollout-native substrates to enhance coding-agent RL performance.
- Emphasis on the need to treat execution infrastructure as a core concern in the development of RL systems.
Summary
This paper addresses the often-overlooked impact of execution infrastructure on the efficiency of coding-agent reinforcement learning (RL). The authors introduce the concept of the 'rollout infrastructure tax,' which refers to the latency and costs associated with the systems that execute coding-agent RL trajectories. They conduct a comparative study of four execution substrates: single containers, hosted sandboxes, Kubernetes-orchestrated containers, and cloud virtual machines. The study reveals significant variations in cold-start latency (up to 110×) and projected worker-hours (1.8× spread) for one million 150-step trajectories. The authors argue that optimizing execution substrates should be an integral part of the training system, rather than merely a deployment consideration. They propose three design requirements for rollout-native substrates: locality-aware warm pools, low-latency action APIs, and appropriate isolation mechanisms. This work emphasizes the necessity of measuring and optimizing infrastructure overhead to improve the efficiency of coding-agent RL systems.
Methodology
The authors conducted a measurement study comparing four common execution substrates while holding the coding-agent workload fixed. They defined and quantified the rollout infrastructure tax by analyzing various components such as environment creation time, readiness time, per-action costs, and orchestration overhead.
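The components listed above can be combined into a back-of-the-envelope projection of worker-hours. The sketch below is an assumed cost model with entirely hypothetical substrate numbers and an assumed agent step time. It is meant to show how per-rollout overheads scale to one million 150-step trajectories, not to reproduce the paper's measurements.

```python
from dataclasses import dataclass

@dataclass
class Substrate:
    name: str
    cold_start_s: float      # environment creation plus readiness time
    per_action_s: float      # overhead added to each agent action
    orchestration_s: float   # per-trajectory scheduling overhead

def worker_hours(sub, trajectories=1_000_000, steps=150, agent_step_s=2.0):
    """Projected worker-hours: fixed per-trajectory costs plus per-step costs."""
    per_trajectory = (sub.cold_start_s + sub.orchestration_s
                      + steps * (agent_step_s + sub.per_action_s))
    return trajectories * per_trajectory / 3600.0

# Hypothetical substrates; none of these numbers come from the paper.
fast = Substrate("hypothetical-fast", cold_start_s=1.0, per_action_s=0.05, orchestration_s=0.5)
slow = Substrate("hypothetical-slow", cold_start_s=110.0, per_action_s=0.50, orchestration_s=5.0)
spread = worker_hours(slow) / worker_hours(fast)
```

Even when cold-start latency differs by two orders of magnitude, the spread in total worker-hours is much smaller whenever the per-step work dominates each trajectory, which matches the qualitative pattern the summary reports.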
Results
The study found that the choice of execution substrate significantly affects rollout performance, with cold-start latency varying by up to 110× and a 1.8× spread in projected worker-hours for one million 150-step trajectories. These findings indicate that small per-rollout savings can lead to substantial efficiency gains at scale.
Implications
The findings suggest that optimizing execution infrastructure can lead to more efficient coding-agent RL systems, particularly as workloads scale. This work encourages future research to focus on infrastructure optimization as a critical aspect of RL system design.
Rank-Then-Act: Reward-Free Control from Frame-Order Progress
Reinforcement Learning
Computer Vision
Multimodal
- RTA provides a framework for learning control policies without environment rewards.
- The method utilizes a Vision-Language Model trained on shuffled video segments to derive ordinal progress rankings.
- A correlation-based reward signal, computed as a Spearman rank correlation, is introduced, allowing for stable learning across tasks.
- RTA outperforms existing video-based reward learning methods and demonstrates strong generality across tasks.
Summary
The paper introduces Rank-Then-Act (RTA), a novel framework for learning control policies from expert video demonstrations without relying on environment rewards. RTA employs a Vision-Language Model (VLM) trained offline as a progress-based ordinal scorer using a Group Relative Policy Optimization (GRPO) objective on shuffled video frames. This approach encourages the model to discern temporal order based on visual semantics rather than simple time cues. Instead of using the VLM as a scalar reward model, RTA defines a correlation-based reward function for reinforcement learning, specifically computing the Spearman rank correlation between predicted progress rankings and actual temporal indices within a sliding window. This method provides a bounded, scale-invariant learning signal, facilitating stable transfer across different tasks and environments. The evaluation of RTA on various discrete and continuous control benchmarks demonstrates that it consistently matches or surpasses prior video-based reward learning methods and rank-based baselines, while also showcasing strong cross-task reuse of a single pretrained progress scorer. The findings suggest that correlation-structured supervision derived from video ordinal signals is a viable alternative to explicit reward design, enabling effective policy learning in reward-free settings.
Methodology
RTA consists of two main stages: (1) training a Vision-Language Model (VLM) as a progress-based ordinal estimator using Group Relative Policy Optimization (GRPO) on shuffled video segments, and (2) defining a correlation-based reinforcement learning signal that computes the Spearman rank correlation between predicted progress ranks and true temporal indices over a sliding window.
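The second stage can be sketched directly: rank the predicted progress scores inside a window and correlate those ranks with the frame indices. The sketch assumes the scores contain no ties, and the function names and example scores are hypothetical.

```python
import numpy as np

def ranks(values):
    """Ranks 0..n-1 of the entries; assumes no ties for simplicity."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(len(values))
    return result

def spearman_reward(progress_scores):
    """Spearman correlation between predicted progress ranks and temporal indices."""
    predicted = ranks(np.asarray(progress_scores, dtype=float))
    temporal = np.arange(len(predicted), dtype=float)
    return float(np.corrcoef(predicted, temporal)[0, 1])

forward = spearman_reward([0.1, 0.2, 0.4, 0.7, 0.9])    # monotone progress: reward 1.0
backward = spearman_reward([0.9, 0.7, 0.4, 0.2, 0.1])   # reversed progress: reward -1.0
```

Because only the ordering of the scores matters, the reward always lies in [-1, 1] and is unaffected by any monotone rescaling of the scorer's outputs, which is the bounded, scale-invariant property the summary highlights.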
Results
RTA was evaluated on discrete control benchmarks (PyBoy: Catrap, Kirby) and continuous control tasks (PointMaze, MetaWorld), consistently matching or outperforming prior methods. The results indicate effective control without environment rewards and demonstrate the ability of a single pretrained progress scorer to transfer across various tasks and environments.
Implications
The findings suggest that RTA can facilitate the development of generalist agents capable of learning from video demonstrations in environments where traditional reward design is challenging or impractical. This could have applications in robotics, gaming, and other domains where reward signals are difficult to engineer.
Role-Aware Neural Convex Divergence Heads for Asymmetric Representation Learning
NLP
Graph Learning
Theory
- Introduction of a role-aware neural convex divergence head for asymmetric representation learning.
- Theoretical characterization of the proposed method, retaining classical Bregman properties.
- Empirical validation shows improved directional accuracy on multiple benchmarks.
- The method serves as a plug-in distance module for various encoder architectures.
Summary
This paper addresses the challenges of representation learning in scenarios involving directed relations, such as lexical entailment and citation links. Traditional symmetric distance metrics (e.g., Euclidean, cosine) fail to capture the asymmetry inherent in these relationships. The authors propose a novel approach called the role-aware neural convex divergence head, which incorporates source- and target-role projections into the evaluation of a neural Bregman divergence. This method yields a nonnegative structured score that respects the directional nature of the relationships. The paper provides a theoretical framework for understanding the properties of the proposed head, including its identity in projected space and local curvature analysis. Empirical evaluations across various benchmarks demonstrate that the role-aware projections significantly enhance directional accuracy compared to standard methods while maintaining a zero observed negative divergence rate. However, in specific cases, such as large fixed-feature citation prediction, traditional symmetric methods still outperform the proposed approach in ranking accuracy. Overall, the proposed head serves as a structured, interpretable module for tasks requiring directional representation.
Methodology
The proposed method utilizes role-specific projections to transform input embeddings into source and target roles before computing a neural Bregman divergence. This approach allows for the integration of asymmetric distance metrics into existing representation learning frameworks. The authors conducted systematic experiments comparing their method against symmetric distances, unstructured asymmetric scorers, and other baselines across multiple tasks.
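The structure described above can be sketched with a fixed convex function standing in for the learned one. The code below applies separate source and target projections and then evaluates the Bregman divergence of log-sum-exp, whose gradient is the softmax. The choice of log-sum-exp, the random projections, and all names are assumptions for illustration. The paper uses a neural convex function whose details are not given in this summary.

```python
import numpy as np

def logsumexp(z):
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def role_aware_divergence(x, y, src_proj, tgt_proj):
    """Bregman divergence D_phi(P_s x, P_t y) with phi = log-sum-exp (convex)."""
    u, v = src_proj @ x, tgt_proj @ y
    # D_phi(u, v) = phi(u) - phi(v) - <grad phi(v), u - v>, and grad phi = softmax
    return float(logsumexp(u) - logsumexp(v) - softmax(v) @ (u - v))

rng = np.random.default_rng(0)
src_proj, tgt_proj = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
a, b = rng.normal(size=8), rng.normal(size=8)
d_ab = role_aware_divergence(a, b, src_proj, tgt_proj)   # score for a -> b
d_ba = role_aware_divergence(b, a, src_proj, tgt_proj)   # score for b -> a
```

Convexity guarantees a nonnegative score, and swapping the roles of the two inputs generally changes the value, which is what lets the head encode direction.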
Results
The experiments revealed that the role-aware neural convex divergence head consistently improved directional accuracy across ten random seeds on semantic and ontology benchmarks. It maintained a zero observed negative divergence rate, indicating effective handling of asymmetry. However, in the context of large fixed-feature citation prediction, specialized symmetric or hyperbolic baselines demonstrated superior ranking accuracy.
Implications
The findings suggest that the role-aware neural convex divergence head can enhance representation learning in applications where directional relationships are crucial. This could be particularly beneficial in natural language processing tasks, knowledge graph embeddings, and other domains where asymmetric relations are prevalent.
Generalization in offline RL: The structure is more important than the amount of pessimism
Reinforcement Learning
Theory
Robotics
- The structure of pessimism is more critical for generalization than the amount of pessimism in offline RL.
- A symmetric value function can lead to better generalization compared to a mildly pessimistic, non-symmetric one.
- Data augmentation should prioritize symmetry preservation during policy extraction to improve generalization.
- Empirical results validate the effectiveness of the proposed methods in a rotationally symmetric environment.
Summary
This paper investigates the role of pessimism in offline reinforcement learning (RL) and its impact on generalization in contextual Markov decision processes (CMDPs). The authors argue that the structure of pessimism, rather than its quantity, is crucial for achieving optimal generalization. They demonstrate that a mildly pessimistic, non-symmetric value function can generalize worse than an overly pessimistic, symmetric one. The paper introduces the concept of generalization-through-invariance in zero-shot policy transfer settings, emphasizing that successful generalization relies on learning the correct symmetries from training data. The authors propose that data augmentation (DA) should focus on maintaining symmetry during policy extraction, rather than merely expanding the dataset. Empirical validation is conducted using two offline RL algorithms, IQL and CQL, in a rotationally symmetric reacher environment, showing that applying a consistency loss during policy extraction significantly enhances generalization performance.
Methodology
The authors employ theoretical analysis to establish the importance of symmetry in pessimistic value functions for generalization. They introduce the generalization-through-invariance framework and validate their findings empirically using IQL and CQL algorithms in a rotationally symmetric continuous control environment, applying data augmentation techniques focused on symmetry.
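A symmetry consistency loss of the kind described can be sketched as follows. The setup is an assumption made for illustration: states and actions are 2D vectors, the symmetry group is planar rotation, and `policy` is any function mapping a batch of states to a batch of actions. The loss penalizes the gap between acting on a rotated state and rotating the action taken in the original state; the paper's exact loss may differ.

```python
import numpy as np

def rot(theta):
    """2D rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def consistency_loss(policy, states, thetas):
    """Mean squared gap between pi(g s) and g pi(s) over the sampled rotations g."""
    gaps = []
    for th in thetas:
        g = rot(th)
        act_on_rotated = policy(states @ g.T)     # actions for rotated states
        rotated_actions = policy(states) @ g.T    # rotated actions for original states
        gaps.append(np.mean(np.sum((act_on_rotated - rotated_actions) ** 2, axis=1)))
    return float(np.mean(gaps))

rng = np.random.default_rng(0)
states = rng.normal(size=(32, 2))
thetas = rng.uniform(0, 2 * np.pi, size=8)

equivariant = lambda s: -0.5 * s                              # commutes with every rotation
biased = lambda s: s @ np.array([[1.0, 0.0], [0.0, 0.2]])     # treats the two axes differently

loss_eq = consistency_loss(equivariant, states, thetas)       # numerically zero
loss_biased = consistency_loss(biased, states, thetas)        # clearly positive
```

Adding such a term to the policy extraction objective pushes the policy toward the rotational symmetry of the environment, rather than relying on augmented samples alone.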
Results
The study finds that enforcing a symmetric value function, even with high levels of pessimism, can lead to optimal generalization. The empirical results demonstrate that using a consistency loss during policy extraction significantly improves generalization performance compared to traditional data augmentation methods.
Implications
The findings suggest that offline RL methods should focus on the structural aspects of pessimism to enhance generalization capabilities, particularly in environments with complex symmetries. This can lead to more robust policy learning in real-world applications where data is limited.
Wind-Aware Reinforcement Learning Control of a Small Quadrotor Using Learned Onboard Wind Estimation in Simulated Atmospheric Turbulence
Reinforcement Learning
Robotics
- Introduces a two-stage learning pipeline for wind estimation and control in small quadrotors.
- Achieves high accuracy in wind estimation using an attention-augmented GRU network.
- Demonstrates a significant reduction in trajectory tracking error with a wind-aware RL controller.
- Highlights the regime-dependent value of wind perception in improving control performance.
Summary
This paper addresses the challenges faced by small multirotor aircraft operating in turbulent wind conditions, where conventional control methods often fail. The authors propose a two-stage learning pipeline that first estimates local wind conditions using an attention-augmented gated recurrent unit (GRU) network trained on simulated flights through von Kármán turbulence. The wind estimator achieves a root-mean-square error (RMSE) of 0.40 m/s and a direction error of 3.2° in unseen wind regimes. The second stage is a reinforcement learning (RL) controller trained with proximal policy optimization (PPO), which uses the wind estimates to improve trajectory tracking. The results show that the wind-aware controller reduces horizontal trajectory tracking error by 48% compared to a traditional wind-blind proportional-derivative (PD) controller across various wind speeds. The paper also highlights the importance of wind perception, demonstrating that its contribution to performance grows with wind speed. Overall, the findings suggest that integrating learned wind perception into control systems can significantly enhance the autonomy of small uncrewed aircraft systems (sUAS) in challenging wind conditions.
Methodology
The methodology involves a two-stage architecture. First, a wind estimator built on an attention-augmented gated recurrent unit (GRU) network processes onboard kinematics and dynamics to infer the ambient wind vector. Second, a controller trained with proximal policy optimization (PPO) uses the wind estimates to enhance flight control, allowing it to anticipate disturbances rather than only react to them.
Results
The wind estimator achieved a per-flight RMSE of 0.40 m/s and a direction error of 3.2°, demonstrating strong generalization to unseen wind conditions. The PPO controller outperformed the wind-blind PD baseline by reducing horizontal tracking error by 48% and vertical-regime horizontal-axis RMSE by 39.5%. The performance improvement was decomposed into kinematic and wind-perception components, with the latter becoming more significant at higher wind speeds.
Implications
The findings suggest that integrating learned wind perception into control systems can greatly enhance the autonomy and performance of small uncrewed aircraft systems (sUAS) in turbulent environments, potentially leading to more reliable operations in various applications such as search and rescue, environmental monitoring, and agricultural surveying.
Ask the Right Comparison: Bias-Aware Bayesian Active Top-$k$ Ranking with LLM Judges
NLP
Large Language Models
Optimization
- Introduces a bias-aware Bayesian model for ranking with LLM judges, addressing systematic biases in their evaluations.
- Develops a top-k-aware active acquisition strategy that focuses on identifying the top-k items efficiently under a fixed comparison budget.
- Demonstrates significant improvements in identifying top-k items compared to naive aggregation methods, especially with biased judges.
- Finds that verbosity bias is prevalent among cheaper judges, while frontier judges exhibit minimal bias.
Summary
This paper addresses the challenges of using large language models (LLMs) as judges for pairwise comparisons in ranking tasks, highlighting their inherent biases and noise. The authors propose a novel Bayesian framework that incorporates judge-specific biases, such as verbosity and position effects, into the ranking process. They introduce a bias-aware Bayesian model that utilizes a shrinkage prior to adaptively estimate the biases of LLM judges, allowing for more accurate recovery of latent quality in the top-k items. Additionally, the paper presents a top-k-aware active acquisition rule that optimally selects comparisons to minimize uncertainty about the top-k membership, rather than focusing on the entire ranking. The methodology is validated through experiments with sixteen real LLMs on a controlled benchmark, demonstrating that naive aggregation methods fail to identify the correct top-k items, while the proposed bias-aware model significantly improves performance, particularly with cheaper, more biased judges. The findings reveal that bias is heterogeneous and capability-dependent, emphasizing the need for online estimation of biases for effective LLM evaluation.
Methodology
The authors model the judging process as a Bayesian Bradley-Terry process, incorporating observable bias covariates (e.g., verbosity and position) into the ranking framework. A shrinkage prior is employed to adaptively learn the biases of each judge. The active acquisition rule is designed to select comparisons that most effectively reduce uncertainty regarding the top-k membership, leveraging the posterior distribution of item quality.
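A minimal sketch of a Bradley-Terry comparison with bias covariates is shown below. The covariate encoding (a standardized length difference and a position indicator) and the coefficient values are assumptions for illustration; the paper's shrinkage prior and posterior inference are not reproduced here.

```python
import numpy as np

def win_prob(theta_i, theta_j, x_ij, beta):
    """P(judge prefers i over j) = sigmoid(theta_i - theta_j + beta . x_ij)."""
    logit = theta_i - theta_j + np.dot(beta, x_ij)
    return 1.0 / (1.0 + np.exp(-logit))

# x_ij = [standardized length difference (i minus j), +1 if i is shown first else -1]
beta_biased = np.array([0.8, 0.3])    # hypothetical verbosity and position bias of one judge
beta_clean = np.zeros(2)              # an unbiased judge

x = np.array([1.5, 1.0])              # item i is longer and shown first
p_clean = win_prob(0.0, 0.0, x, beta_clean)     # equal quality, no bias
p_biased = win_prob(0.0, 0.0, x, beta_biased)   # equal quality, biased judge
```

With equal latent quality the unbiased judge is indifferent, while the biased judge prefers the longer, first-shown item about 82% of the time. Estimating `beta` per judge is what lets the model separate item quality from such effects.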
Results
The experiments reveal that naive aggregation methods converge to incorrect top-k rankings due to biases in LLM judges. In contrast, the proposed bias-aware model successfully recovers the correct top-k items, achieving high recall rates (up to 1.0) for biased judges. The top-k-aware acquisition strategy also demonstrates efficiency, requiring fewer comparisons to reach optimal performance compared to traditional methods.
Implications
The findings suggest that incorporating bias-aware modeling in LLM evaluations can lead to more accurate rankings in various applications, such as chatbot response selection, model comparison, and document triaging. This approach can enhance the reliability of automated evaluations in natural language processing tasks, particularly when using cost-effective LLM judges.
X-LogSMask: Expand Transformer for Graph-Structured Data
Graph Learning
- X-LogSMask introduces a logarithmic structural mask for graph data, enhancing interpretability and efficiency.
- The method allows multi-hop information propagation within a single Transformer layer by assigning different powers of the adjacency matrix to attention heads.
- X-LogSMask achieves state-of-the-art performance on 13 out of 20 benchmark datasets.
- The approach maintains the core Transformer architecture while improving its applicability to graph-structured data.
Summary
The paper introduces X-LogSMask, a novel approach to adapt Transformer architectures for graph-structured data, addressing the limitations of traditional self-attention mechanisms in handling sparse and structured interactions typical of graphs. The authors propose a multi-head logarithmic structural mask that integrates symmetrically normalized graph topology directly into attention logits. This method transforms structural connectivity into a topology-aware gating signal, effectively suppressing irrelevant node interactions while maintaining feature-dependent attention. By assigning different powers of the normalized adjacency matrix to various attention heads, X-LogSMask enables multi-hop information propagation within a single layer, enhancing interpretability and efficiency. The authors demonstrate that a standard Transformer encoder can be viewed as a one-step message passing mechanism on a complete graph, positioning X-LogSMask as a topology-constrained alternative to unrestricted self-attention. The proposed method achieves state-of-the-art performance on 13 of 20 benchmarks while remaining competitive in a lightweight one-layer configuration. This work highlights the potential of simple, interpretable structural masks to enhance self-attention mechanisms for effective graph learning without altering the fundamental Transformer architecture.
Methodology
X-LogSMask is constructed from a symmetrically normalized adjacency matrix with self-loops, which is injected additively into the attention logits of a standard Transformer. The logarithmic transformation creates a topology-aware gating signal that suppresses unsupported node interactions while preserving the discriminative properties of attention. Different powers of the normalized adjacency matrix are assigned to different attention heads, allowing each head to specialize in a specific structural radius.
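The masking step can be sketched with NumPy as follows. This is an illustration of the description above, not the authors' implementation: the `eps` floor inside the logarithm, the function names, and the toy path graph are assumptions.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_loop = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_loop.sum(axis=1))
    return A_loop * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def masked_attention(Q, K, A_hat, power, eps=1e-9):
    """Attention weights for one head whose additive mask is log(A_hat^power + eps)."""
    logits = Q @ K.T / np.sqrt(Q.shape[1])
    mask = np.log(np.linalg.matrix_power(A_hat, power) + eps)  # about log(eps) where no path exists
    z = logits + mask
    z -= z.max(axis=1, keepdims=True)       # stabilize the softmax
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

# Toy path graph 0 - 1 - 2 - 3
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1.0
A_hat = normalized_adjacency(A)

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
head1 = masked_attention(Q, K, A_hat, power=1)   # reaches direct neighbors only
head2 = masked_attention(Q, K, A_hat, power=2)   # reaches nodes up to two hops away
```

Because the mask is added to the logits, the softmax weight is proportional to `exp(logit) * (A_hat^power + eps)`: pairs with no connecting path are suppressed to near zero, while reachable pairs still compete on their feature-dependent logits.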
Results
The implementation of X-LogSMask led to state-of-the-art performance on 13 datasets across 20 node-, edge-, and graph-level benchmarks. The method demonstrated competitive results even in a lightweight one-layer configuration, indicating its efficiency and effectiveness in graph learning tasks.
Implications
The findings suggest that incorporating simple and interpretable structural masks can significantly enhance the performance of self-attention mechanisms in graph learning. This approach opens avenues for more efficient and interpretable models in various applications involving graph-structured data, such as social networks, molecular structures, and transportation systems.
kNNGuard: Turning LLM Hidden Activations into a Training-Free Configurable Guardrail
NLP
Large Language Models
- kNNGuard is a training-free guardrail framework that utilizes LLM hidden activations for prompt classification.
- It achieves competitive or superior F1 scores compared to fine-tuned guardrails while being significantly faster.
- The framework allows for rapid domain adaptation by updating a small reference bank of labeled examples.
- kNNGuard employs a multi-layer kNN approach that fuses activation-space and embedding-space scores.
Summary
The paper introduces kNNGuard, a novel guardrail framework designed for large language models (LLMs) that operates without the need for training or fine-tuning. Traditional guardrails often rely on fine-tuned classifiers, which can suffer from low generalization and high inference latency. In contrast, kNNGuard leverages the hidden activations of a frozen LLM to classify prompts as safe or unsafe. By utilizing a small labeled bank of 50 examples, kNNGuard extracts hidden activations and employs a multi-layer k-nearest neighbors (kNN) approach that fuses activation-space and embedding-space scores for improved classification. The framework demonstrates competitive performance across six diverse domains, achieving an average F1 score of 87.4% and a false positive rate of 12.9%, while operating at 2.7 times lower latency than the best existing guardrails. The adaptability of kNNGuard allows for rapid domain-specific updates, requiring only a single forward pass per bank example, making it practical for real-time applications. The paper also discusses the influence of system prompts and layer selection on performance, providing insights for integrating kNNGuard into production LLM pipelines.
Methodology
kNNGuard operates by extracting hidden activations from a frozen LLM using a small bank of labeled prompts. It employs a multi-layer kNN classification approach, fusing activation-space and embedding-space scores through adaptive confidence-based fusion. The framework allows for quick updates to the labeled bank, enabling rapid domain adaptation without the need for fine-tuning.
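A rough sketch of the classification step is given below. Everything specific in it is an assumption: the activations are synthetic stand-ins rather than real LLM hidden states, cosine similarity is used as the distance, and the fusion rule (weighting each score by its distance from 0.5) is a hypothetical proxy for the paper's adaptive confidence-based fusion.

```python
import numpy as np

def knn_score(query, bank, labels, k=5):
    """Fraction of the k nearest bank items (by cosine similarity) labeled unsafe (1)."""
    q = query / np.linalg.norm(query)
    B = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    nearest = np.argsort(-(B @ q))[:k]
    return labels[nearest].mean()

def fused_score(layer_queries, layer_banks, emb_query, emb_bank, labels, k=5):
    """Average kNN scores over layers, then fuse with the embedding-space score.

    Each score is weighted by its distance from 0.5 (hypothetical confidence proxy).
    """
    act = np.mean([knn_score(q, b, labels, k) for q, b in zip(layer_queries, layer_banks)])
    emb = knn_score(emb_query, emb_bank, labels, k)
    w_act, w_emb = abs(act - 0.5) + 1e-6, abs(emb - 0.5) + 1e-6
    return (w_act * act + w_emb * emb) / (w_act + w_emb)

rng = np.random.default_rng(0)
labels = np.array([0] * 25 + [1] * 25)           # 50-example bank: 0 = safe, 1 = unsafe

def make_space(dim):
    """Synthetic feature space with one direction per class plus noise."""
    centers = np.eye(dim)[:2]
    bank = centers[labels] + 0.1 * rng.normal(size=(50, dim))
    return centers, bank

spaces = [make_space(16) for _ in range(3)]      # three hypothetical layers
emb_centers, emb_bank = make_space(16)

def classify(cls):
    """Score a synthetic query drawn near the center of class `cls`."""
    layer_q = [c[cls] + 0.1 * rng.normal(size=16) for c, _ in spaces]
    emb_q = emb_centers[cls] + 0.1 * rng.normal(size=16)
    return fused_score(layer_q, [b for _, b in spaces], emb_q, emb_bank, labels)

score_unsafe, score_safe = classify(1), classify(0)
```

Adapting to a new domain in this sketch only means replacing the rows of the bank and their labels; nothing is trained.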
Results
kNNGuard achieved an average F1 score of 87.4% with a false positive rate of 12.9% across six domains, demonstrating competitive performance against state-of-the-art fine-tuned guardrails. It operates at a per-prompt latency of 45.9 ms, which is 2.7 times faster than the best comparable guardrail and 10 times faster than a fine-tuned safety classifier.
Implications
The kNNGuard framework presents a practical solution for integrating guardrails into LLM applications, particularly in critical domains where safety and security are paramount. Its training-free nature and rapid adaptability make it suitable for real-time applications, enhancing the robustness of LLMs against adversarial prompts and ensuring compliance with safety policies.
Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding
NLP
Large Language Models
Generative Models
- Set diffusion allows for flexible-length and flexible-position token generation, improving decoding flexibility.
- The set-causal diffusion architecture supports KV cache updates after every inference step, enhancing efficiency.
- Set diffusion achieves better speed-quality tradeoffs than prior diffusion models across various tasks.
- The method outperforms block diffusion in infilling performance.
Summary
The paper introduces a novel class of language models called set diffusion, which enhances the capabilities of discrete diffusion models by allowing for flexible-position and flexible-length token sets. Traditional autoregressive (AR) models and block diffusion methods are limited by fixed-length generation and lack of key-value (KV) caching. Set diffusion overcomes these limitations by enabling arbitrary-order token generation and supporting KV cache updates after each inference step. This flexibility allows for faster inference and improved performance in tasks requiring variable-length generation, such as mathematical reasoning, summarization, and unconditional generation. The authors demonstrate that set diffusion achieves superior speed-quality tradeoffs compared to previous diffusion models and outperforms block diffusion in infilling tasks. The proposed method not only enhances decoding flexibility and parallelism but also provides a more efficient approach to language modeling.
Methodology
The authors propose set diffusion as a new autoregressive probability distribution over flexible-length, flexible-position sets of discrete random variables. This is achieved through a set-causal diffusion architecture that allows for KV cache updates after each generation step. The methodology includes a likelihood parameterization that factorizes over token sets, enabling arbitrary-order generation and improved inference efficiency.
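One way to write a likelihood that factorizes over token sets is sketched below. The notation is an assumption for illustration and is not taken from the paper: the positions $\{1,\dots,L\}$ are partitioned into disjoint sets $S_1,\dots,S_K$, which may differ in size and need not be contiguous.

```latex
p_\theta(x) \;=\; \prod_{k=1}^{K} p_\theta\!\left(x_{S_k} \,\middle|\, x_{S_1 \cup \dots \cup S_{k-1}}\right),
\qquad S_1 \cup \dots \cup S_K = \{1,\dots,L\}, \quad S_i \cap S_j = \emptyset \ \ (i \neq j).
```

Under this notation, singleton sets taken left to right give the usual autoregressive factorization, and contiguous equal-size sets give a block-wise factorization. Since each factor conditions only on sets that are already generated, their keys and values can be cached once a set is complete.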
Results
Set diffusion demonstrates significant improvements in speed and quality tradeoffs on tasks such as mathematical reasoning, summarization, and unconditional generation. It achieves state-of-the-art performance among diffusion models and shows substantial gains in infilling tasks compared to block diffusion.
Implications
The introduction of set diffusion has the potential to enhance various applications in natural language processing, particularly in tasks that require flexible and efficient token generation. This could lead to advancements in real-time language modeling, interactive AI systems, and applications where rapid response times are critical.
Predicting Closed-Loop Performance of Latent World Models: Offline Checkpoint Selection for MPC and Model-Based RL Under Non-Markovian Rewards in LunarLander
Reinforcement Learning
Robotics
Optimization
- Introduces a suite of 40 structural validation-time metrics for evaluating world models.
- Presents the Composite Reward Observability Fraction (CROF) for offline checkpoint selection.
- Demonstrates that CROF correlates with better closed-loop performance compared to traditional metrics.
- Achieves a significant improvement in return using the CROF-selected model with fewer real-environment interactions.
Summary
This paper addresses the challenge of selecting the optimal checkpoint from a latent world model training run to predict closed-loop performance in model-based reinforcement learning (RL). The author highlights that traditional metrics such as validation loss and multi-step prediction RMSE may not correlate with actual closed-loop performance, particularly in environments with non-Markovian rewards, like LunarLander-v3. The study introduces a suite of 40 structural validation-time diagnostics based on optimal control theory to evaluate world models. The key contribution is the Composite Reward Observability Fraction (CROF), which combines various structural metrics into a single score for offline checkpoint selection. The CROF-selected world model significantly outperforms a model-free A2C baseline in terms of return while requiring substantially fewer real-environment interactions. This work demonstrates the importance of structural properties in learned dynamics for effective planning and policy optimization in model-based RL.
Methodology
The methodology involves training a Recurrent State Space Model (RSSM) on the LunarLander-v3 environment and evaluating it using a set of 40 structural metrics derived from optimal control theory. The CROF score is computed by combining the Reward Observability Fraction with other structural regularizers. The performance of the selected models is assessed through closed-loop policies using both Model Predictive Control (MPC) and model-based actor-critic methods.
Results
The results indicate that the CROF-selected world model leads to a model-based A2C policy that outperforms a model-free A2C baseline by approximately 24.5 return points while utilizing around 65 times fewer real-environment interactions. The study also shows that the CROF effectively tracks the performance of zero-shot CEM-MPC returns and downstream A2C training quality.
Implications
The findings suggest that incorporating structural validation-time diagnostics can enhance the selection process of world models, leading to improved performance in model-based RL applications. This approach could be beneficial in various environments with complex reward structures, potentially advancing the efficiency and effectiveness of reinforcement learning strategies.
EHHN: An Event-driven Heterogeneous Hypergraph Network for Object-Centric Next Activity Prediction
Graph Learning
Time Series
Theory
- EHHN introduces a heterogeneous hypergraph representation for object-centric next activity prediction, preserving multi-object participation.
- The dual-stream architecture effectively models both local event-driven state changes and global execution patterns.
- EHHN achieves superior performance compared to existing OCEL-based predictors, with notable improvements in accuracy and efficiency.
Summary
This paper presents EHHN, an approach for next activity prediction in service-oriented processes that focuses on object-centric event logs (OCELs). Traditional methods often rely on single-case event logs, which do not adequately capture the complexity of interactions among multiple business objects. EHHN addresses this limitation by employing a heterogeneous hypergraph representation that binds events to all co-participating objects, thus preserving the multi-object context. The model utilizes a dual-stream architecture: a micro-spatial stream that models event-driven object-state evolution and a macro-evolution stream that captures temporal dynamics through global prototypes. This architecture allows EHHN to predict the next activity by integrating local event-object interactions and global execution patterns. Experimental results demonstrate that EHHN outperforms existing methods in accuracy and macro F1-score across multiple OCEL benchmarks, while also reducing peak GPU memory usage by up to 24 times.
Methodology
EHHN employs a heterogeneous hypergraph to represent event-object relationships, utilizing a dual-stream architecture consisting of a micro-spatial stream for local state evolution and a macro-evolution stream for temporal dynamics. The micro-spatial encoder updates object representations based on transient event signals, while the macro-evolution encoder applies time-aware attention to model inter-event timing and retrieve global execution patterns.
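The event-object hypergraph can be illustrated with an incidence matrix in which each event is a hyperedge over all objects that take part in it. The log fragment, object names, and the single averaging step below are hypothetical; they only show the data structure, not EHHN's encoders.

```python
import numpy as np

# Hypothetical OCEL fragment: each event lists every object that takes part in it.
events = [
    {"activity": "place order",  "objects": ["order1", "item1", "item2"]},
    {"activity": "pick item",    "objects": ["item1"]},
    {"activity": "pack items",   "objects": ["item1", "item2", "package1"]},
    {"activity": "ship package", "objects": ["order1", "package1"]},
]

objects = sorted({o for e in events for o in e["objects"]})
col = {o: j for j, o in enumerate(objects)}

# Incidence matrix H: H[e, o] = 1 when object o participates in event e.
H = np.zeros((len(events), len(objects)))
for i, e in enumerate(events):
    for o in e["objects"]:
        H[i, col[o]] = 1.0

# One aggregation step: each object averages the features of its events.
event_feats = np.eye(len(events))                      # one-hot event features for the demo
object_state = (H.T @ event_feats) / H.sum(axis=0, keepdims=True).T
```

Flattening the same log to a single case notion would have to duplicate or drop the multi-object events; the incidence matrix keeps each of them as one hyperedge.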
Results
EHHN demonstrated the best accuracy and macro F1-score on four public OCEL benchmarks, outperforming nine baseline models with improvements of up to 8.1 and 12.4 percentage points, respectively. Additionally, it reduced peak GPU memory usage by up to 24 times compared to the strongest OCEL-native graph baseline.
Implications
The proposed EHHN model can significantly enhance predictive process monitoring in service-oriented applications, allowing for better anticipation of upcoming activities and more efficient resource allocation. Its ability to handle complex interactions among multiple objects makes it applicable in various domains where object-centric processes are prevalent.
Multi-modal Rail Crossing Safety Analysis
Multimodal
- Integration of visual and structured data improves safety assessments at railway crossings.
- Vision-Language Models (VLMs) effectively analyze complex visual scenes for risk evaluation.
- The proposed system achieves a macro F1 score of 0.757 in classifying crossing risks.
- The methodology addresses critical challenges in data preparation and learning paradigms.
Summary
This paper addresses the critical safety concerns associated with highway-rail grade crossings in the United States, where over 2,000 collisions occur annually. The authors propose a novel AI system that integrates visual cues from images of railway crossings with structured data from official accident reports to assess safety. The study explores the effectiveness of Vision-Language Models (VLMs) in evaluating crossing safety by analyzing both visual and historical data. A proof-of-concept pipeline is developed to classify crossings as high-risk or low-risk and to estimate safety scores based on Federal Railroad Administration (FRA) standards. The research identifies significant challenges in data preparation and learning paradigms, ultimately demonstrating that the proposed system achieves a macro F1 score of 0.757 for risk classification and an RMSE of 0.078 with a correlation of 0.492 for safety score estimation. The qualitative results align well with expert assessments, indicating the potential of AI-assisted tools in enhancing railway crossing safety assessments.
Methodology
The authors developed a multimodal pipeline that combines street-level imagery of railway crossings with historical accident data from FRA Form 57 records. They evaluated VLMs in two tasks: risk scoring (binary classification and continuous score prediction) and visual risk inspection. The study involved fine-tuning models and employing a routed prediction strategy to handle skewed risk score distributions.
Results
The proposed system identified high-risk and low-risk crossings with a macro F1 score of 0.757. It estimated FRA-based safety scores with an RMSE of 0.078 and a correlation of 0.492, demonstrating the effectiveness of the multimodal approach in aligning with expert assessments.
Implications
The findings suggest that AI-assisted tools can significantly enhance the scalability and reliability of railway crossing safety assessments, potentially leading to better resource allocation for safety interventions and improved public safety measures.
Frequency Shift Physics-Informed Extreme Learning Machine for Solving High-Frequency Partial Differential Equations
Theory
Efficient ML
- Introduces FS-PIELM to mitigate spectral bias in high-frequency PDE solutions.
- Utilizes a novel weight initialization mechanism that shifts the mean of Gaussian weights.
- Demonstrates improved accuracy over existing methods in multiple benchmark problems.
- Maintains computational efficiency with only a single linear solve required.
Summary
This paper addresses the challenge of solving high-frequency partial differential equations (PDEs) using a novel framework called Frequency Shift Physics-Informed Extreme Learning Machine (FS-PIELM). Traditional neural networks often exhibit spectral bias, favoring low-frequency components, which hampers their ability to accurately model high-frequency solutions. The authors propose an innovative weight initialization mechanism that shifts the mean of the Gaussian distribution for weights while maintaining a fixed variance, thus avoiding the variance amplification seen in scaling methods. Two variants of FS-PIELM are introduced: FS-PIELM-L, which assigns independent frequency magnitudes to neurons, and FS-PIELM-G, which groups neurons for enhanced robustness. Theoretical analysis demonstrates that the frequency variance remains bounded and approaches unity, contrasting with the quadratic growth observed in conventional methods. The FS-PIELM framework retains the computational efficiency of extreme learning machines, requiring only a single linear solve. Experimental results on seven benchmark problems across various PDE types (including Helmholtz, wave, Poisson, Klein-Gordon, heat, and advection-diffusion) reveal that the linear variant of FS-PIELM outperforms existing PIELM variants, achieving accuracy improvements ranging from one to nearly five orders of magnitude.
Methodology
The FS-PIELM framework employs an additive mechanism for weight initialization that shifts the mean of the Gaussian weight distribution while keeping the variance fixed. Two variants are developed: FS-PIELM-L, which assigns independent frequency magnitudes to neurons, and FS-PIELM-G, which groups neurons for robustness. The method is tested on various PDE types using benchmark problems, demonstrating its effectiveness compared to traditional approaches.
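The contrast between scaling and shifting the hidden weights can be checked numerically. The sketch below is an assumption-laden illustration, not the paper's method: it uses sine features, a plain function-fitting target instead of a PDE residual, and an arbitrary shift of 40. It shows that the variance of the frequency magnitude `|w|` grows quadratically under scaling but stays near one under a mean shift, and that the output layer is obtained from a single least-squares solve.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=200_000)
shift = 40.0

# Scaling multiplies the spread of the frequency magnitude |w|; shifting does not.
var_scaled = np.var(np.abs(shift * z))     # grows like shift**2
var_shifted = np.var(np.abs(z + shift))    # stays close to 1

def fit_error(weights, biases, x, target):
    """Random sine features with fixed hidden weights; one least-squares solve for the output layer."""
    features = np.sin(np.outer(x, weights) + biases)
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return np.sqrt(np.mean((features @ coef - target) ** 2))

x = np.linspace(0.0, 1.0, 400)
target = np.sin(40.0 * x)                  # high-frequency target
n_hidden = 100
biases = rng.uniform(0.0, 2 * np.pi, size=n_hidden)
w = rng.normal(size=n_hidden)              # standard Gaussian hidden weights

err_plain = fit_error(w, biases, x, target)            # low-frequency features only
err_shifted = fit_error(w + shift, biases, x, target)  # mean shifted to the target frequency
```

In a physics-informed setting the rows of the linear system would come from the PDE residual and boundary conditions at collocation points instead of function values, but the hidden weights would still be fixed and only one linear solve would be needed.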
Results
The FS-PIELM framework achieved the best accuracy in six out of seven benchmark problems, with improvements in accuracy ranging from one to nearly five orders of magnitude over existing PIELM variants. The theoretical analysis supports the efficiency and bounded frequency variance of the proposed method.
Implications
The FS-PIELM framework has significant implications for computational science and engineering, particularly in solving complex PDEs that exhibit high-frequency behavior. It offers a more efficient and accurate alternative to traditional methods, potentially expanding the applicability of physics-informed machine learning in various fields.
Regularized Variational and Spectral Log-Density-Ratio Estimation in the Gaussian Location Model
Theory
- Introduces ridge-regularized log-density-ratio estimation in a Gaussian location model.
- Compares variational and spectral estimators, highlighting their performance under different observation conditions.
- Derives high-dimensional asymptotic equivalents to analyze estimator behavior.
- Demonstrates that variational estimators outperform spectral estimators with many observations.
Summary
This paper investigates ridge-regularized log-density-ratio estimation within the Gaussian location model, characterized by a common covariance matrix. The authors present two estimation approaches: a variational estimator based on empirical Kullback-Leibler (KL) divergence with an ℓ2 penalty, and a spectral estimator that reformulates the problem into a continuum of ridge-regularized least-squares problems. The study derives high-dimensional deterministic asymptotic equivalents as the number of observations and dimensions grow, revealing that the variational estimator generally has lower risk with many observations, while the spectral estimator performs better with fewer observations due to its lower variance. The paper also explores the use of a nuclear penalty for feature learning, contributing to a deeper understanding of the performance of these estimators under varying conditions of signal strength and aspect ratios.
Methodology
The paper employs a Gaussian location model with a common covariance matrix, using variational and spectral methods to estimate log-density ratios. It derives asymptotic equivalents for both estimators in high-dimensional settings and conducts empirical comparisons to analyze population risks under varying signal strengths and aspect ratios.
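For reference, the target of estimation has a closed form in this model. With $P = \mathcal{N}(\mu_1, \Sigma)$ and $Q = \mathcal{N}(\mu_0, \Sigma)$ (the symbols are notation assumed here, not necessarily the paper's), the quadratic terms in $x$ cancel because the covariance is shared:

```latex
\log \frac{p(x)}{q(x)}
  \;=\; (\mu_1 - \mu_0)^\top \Sigma^{-1} x \;-\; \tfrac{1}{2}\left(\mu_1^\top \Sigma^{-1} \mu_1 - \mu_0^\top \Sigma^{-1} \mu_0\right)
  \;=\; (\mu_1 - \mu_0)^\top \Sigma^{-1} \left(x - \tfrac{\mu_0 + \mu_1}{2}\right).
```

The log-density ratio is therefore affine in $x$, which is why estimators that are linear in the features, with a ridge penalty on the coefficients, are a natural fit for this setting.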
Results
The analysis shows that the variational estimator has a smaller population risk when the number of observations is large, while the spectral estimator is favored in scenarios with fewer observations due to its lower variance. The paper provides deterministic asymptotic limits and empirical comparisons that validate these findings.
Implications
The findings have significant implications for applications in machine learning where log-density ratio estimation is crucial, such as covariate-shift correction, two-sample testing, and variational inference. Understanding the conditions under which each estimator performs optimally can guide practitioners in selecting appropriate methods based on their data characteristics.
Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL
Reinforcement Learning
Large Language Models
- Introduces a framework for analyzing policy gradient weights in RL, clarifying the effects of different advantage functions.
- Proposes FADE, a self-adapting advantage function that optimizes gradient weights based on training dynamics.
- Demonstrates that balancing positive and negative gradient masses is crucial for stable and efficient RL training.
- FADE achieves faster learning and improved accuracy-diversity trade-offs compared to static methods.
Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL
Summary
This paper addresses the challenges of training stability and diversity collapse in reinforcement learning (RL) for large language models (LLMs) by introducing a unifying framework for analyzing policy gradient weights. The authors propose a novel method called FADE (Focal Advantage with Dynamic Entropy), which adapts the gradient weights based on training dynamics. They decompose policy gradient weights into positive and negative gradient masses along two axes: the sign axis and the difficulty axis. This decomposition reveals trade-offs between exploration and exploitation during training. FADE dynamically adjusts its gradient weights to optimize learning efficiency and diversity, achieving faster convergence and better accuracy-diversity trade-offs compared to static baselines. The results demonstrate that FADE outperforms existing methods in terms of pass@k metrics on benchmark tasks, indicating its effectiveness in enhancing RL training for LLMs.
Methodology
The authors develop a framework that decomposes policy gradient weights into positive and negative gradient masses along the sign and difficulty axes. They analyze the impact of these decompositions on training dynamics and propose FADE, which adjusts gradient weights dynamically based on the model's past performance and entropy levels. The methodology involves empirical evaluations on benchmark tasks to compare FADE against existing static advantage functions.
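As a rough illustration of the decomposition described above, the sketch below splits per-sample weights into positive and negative mass, grouped by prompt difficulty. It is not FADE itself: the mean-centered group baseline, the pass-rate difficulty proxy, the "hard"/"easy" threshold, and the function name `gradient_masses` are all assumptions made for this example.

```python
from collections import defaultdict


def gradient_masses(groups):
    """Split advantage weights into positive and negative mass per difficulty bucket.

    groups: one list of 0/1 rewards per prompt (several rollouts each).
    Returns {bucket: (positive_mass, negative_mass)}.
    """
    masses = defaultdict(lambda: [0.0, 0.0])
    for rewards in groups:
        pass_rate = sum(rewards) / len(rewards)
        # Hypothetical difficulty buckets based on the group's pass rate.
        bucket = "hard" if pass_rate < 0.5 else "easy"
        for r in rewards:
            advantage = r - pass_rate  # mean-centered baseline (assumption)
            if advantage > 0:
                masses[bucket][0] += advantage
            else:
                masses[bucket][1] += -advantage
    return {k: tuple(v) for k, v in masses.items()}


groups = [[1, 0, 0, 0], [1, 1, 1, 0]]
masses = gradient_masses(groups)
```

With a mean-centered baseline the positive and negative masses of each group are equal by construction, since deviations from the mean sum to zero. Any method that shifts this balance has to reweight the advantages beyond plain centering.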
Results
FADE demonstrates superior performance, reaching peak pass@1 metrics 20,000 steps earlier than the best static baseline at the 7B scale and 2,000 steps earlier at the 32B scale. It also achieves the best accuracy-diversity trade-off across all pass@k metrics on LiveCodeBench and AIME, indicating its effectiveness in enhancing RL training for LLMs.
Implications
The findings suggest that dynamically adapting advantage functions can significantly improve the training of LLMs in RL settings, potentially leading to more robust and capable models in various applications such as code generation and complex problem-solving tasks.
Neuron-Aware Active Few-Shot Learning for LLMs
NLP
Large Language Models
- NEUFS shifts the selection paradigm from output-level signals to internal model dynamics.
- The framework utilizes neuron activation patterns for sample representation and selection.
- A dual-criteria selection strategy ensures diversity and targets informative samples prone to hallucinations.
- Extensive experiments show NEUFS outperforms existing AFSL methods in reasoning and text classification tasks.
Neuron-Aware Active Few-Shot Learning for LLMs
Summary
The paper introduces NEUFS, a Neuron-Aware Active Few-Shot Learning framework designed to enhance the adaptation of Large Language Models (LLMs) to specialized domains by efficiently selecting unlabeled samples for annotation. Traditional methods for sample selection in Active Few-Shot Learning (AFSL) rely heavily on output-level signals, such as predictive entropy and semantic similarities, which often neglect the internal dynamics of the models that can reveal specific knowledge gaps. NEUFS addresses this limitation by utilizing neuron activation patterns to represent samples and implementing a dual-criteria selection strategy. This strategy ensures diversity in the selected few-shot samples while also prioritizing those that are informative and likely to induce hallucinations in LLMs. The authors conducted extensive experiments across three datasets, demonstrating that NEUFS outperforms existing AFSL baselines in both reasoning and text classification tasks. Ablation studies further validate that leveraging internal neuron activations provides a more effective selection signal compared to external embeddings, showcasing the framework's potential for improving few-shot learning in specialized domains.
Methodology
NEUFS employs a dual-criteria selection strategy that leverages neuron activation patterns for clustering samples and quantifying neuron consensus to identify informative and diverse few-shot samples. This approach shifts focus from traditional output-level signals to the internal dynamics of LLMs, allowing for a more principled selection process.
Results
Experiments on three datasets show that NEUFS outperforms existing AFSL baselines in both reasoning and text classification tasks, ranking first or second across the reported evaluations. The results indicate strong generalizability and competitive performance. Ablation studies show that internal neuron activations provide a better selection signal than external embeddings.
Implications
The findings suggest that NEUFS can significantly reduce human annotation costs while maintaining high performance in specialized domains. This framework could be particularly beneficial in fields with limited annotated data, such as education, medicine, and law, by enabling more effective adaptation of LLMs to specific tasks.
Evolutionary Feature Engineering for Structured Data
Time Series
Optimization
Large Language Models
- Introduces EFE, a framework for evolving preprocessing transformations using LLMs.
- EFE-Time improves time-series forecasting by discovering dataset-specific normalization programs.
- EFE-Tab evolves compact feature programs that enhance interpretability and performance in tabular data.
- Demonstrates significant performance improvements across various datasets and models.
Evolutionary Feature Engineering for Structured Data
Summary
This paper introduces Evolutionary Feature Engineering (EFE), a novel framework that leverages large language models (LLMs) to discover preprocessing transformations for structured data, specifically targeting time-series forecasting and tabular prediction tasks. EFE represents transformations as Python programs with a standardized fit/transform interface, enabling seamless integration into existing machine learning pipelines. The framework evolves candidate transformations based on dataset context, summary statistics, and feedback from downstream performance on validation sets. Two instantiations of EFE are presented: EFE-Time, which focuses on evolving dataset-specific, invertible normalization programs for time-series data, and EFE-Tab, which generates compact feature programs for tabular data. Experimental results demonstrate that EFE-Time significantly reduces forecasting errors across various datasets, achieving improvements of over 3% on average and up to 19% on the COVID-Deaths dataset. EFE-Tab enhances model performance, particularly with classical decision trees, by evolving interpretable features while maintaining competitive accuracy. Overall, EFE showcases the potential of LLM-based evolution to enhance both the accuracy and interpretability of machine learning models when dealing with structured data.
Methodology
The EFE framework employs LLMs to propose transformations based on dataset metadata and past performance feedback. Each transformation is evaluated using a standard fit/transform interface and is refined through an evolutionary loop that incorporates validation performance as feedback. EFE is instantiated in two settings: EFE-Time for time-series forecasting and EFE-Tab for tabular prediction, focusing on evolving invertible transformations and compact feature programs, respectively.
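To make the fit/transform interface concrete, here is a minimal sketch of what one candidate normalization program might look like. The class name `Log1pStandardize` and the choice of transformation are hypothetical; the summary only states that programs expose a standardized fit/transform interface and that EFE-Time programs are invertible.

```python
import math


class Log1pStandardize:
    """Hypothetical candidate program: log1p followed by standardization.

    Assumes all inputs are greater than -1 so that log1p is defined.
    """

    def fit(self, series):
        logged = [math.log1p(x) for x in series]
        self.mean_ = sum(logged) / len(logged)
        var = sum((v - self.mean_) ** 2 for v in logged) / len(logged)
        self.std_ = math.sqrt(var) or 1.0  # fall back to 1.0 on a constant series
        return self

    def transform(self, series):
        return [(math.log1p(x) - self.mean_) / self.std_ for x in series]

    def inverse_transform(self, values):
        # Undo standardization, then undo log1p, so forecasts map back to raw units.
        return [math.expm1(v * self.std_ + self.mean_) for v in values]


history = [0.0, 3.0, 10.0, 42.0]
program = Log1pStandardize().fit(history)
normalized = program.transform(history)
restored = program.inverse_transform(normalized)
```

Invertibility matters in the forecasting setting because a model trained on normalized values needs its predictions mapped back to the original scale before errors are measured.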
Results
The experiments reveal that EFE-Time achieves an average reduction in forecasting errors of over 3% across datasets, with a maximum improvement of 19% on the COVID-Deaths dataset. EFE-Tab shows superior performance in feature engineering, particularly for decision trees, achieving the best mean rank among compared methods and demonstrating strong gains in predictive accuracy while preserving interpretability.
Implications
The findings suggest that LLM-based evolutionary methods can significantly enhance feature engineering processes, making them more adaptable to specific datasets and improving model performance. This approach could be applied in various domains requiring structured data analysis, such as finance, healthcare, and environmental monitoring.
NeuroBridge: Bridging Multi-Task MRI Knowledge for Neurodegenerative Disease Diagnosis
Multimodal
- NeuroBridge integrates multi-task learning with self-supervised MRI pretraining for neurodegenerative disease diagnosis.
- Achieved high classification accuracy, particularly in distinguishing AD and MCI from cognitively normal controls.
- Demonstrated strong cross-cohort generalization and effective probability-based analysis for opportunistic screening.
- Utilizes a gated fusion mechanism to combine multiple MRI-derived findings into a unified diagnostic representation.
NeuroBridge: Bridging Multi-Task MRI Knowledge for Neurodegenerative Disease Diagnosis
Summary
NeuroBridge is a novel multi-task MRI framework designed to enhance the diagnosis of neurodegenerative diseases, particularly Alzheimer's disease (AD) and mild cognitive impairment (MCI). The framework addresses the challenges of subtle and heterogeneous structural changes in MRI scans that complicate accurate diagnosis. By integrating large-scale self-supervised pretraining with multi-task learning objectives, NeuroBridge captures complementary structural cues from MRI data, including hippocampal segmentation and atrophy classification. The model employs a gated fusion mechanism to adaptively combine these representations for improved diagnostic accuracy. Evaluated on the ADNI and OASIS cohorts, NeuroBridge achieved impressive classification performance, with 88.17% accuracy for AD versus cognitively normal controls in ADNI and 82.78% in OASIS, particularly excelling in MCI-related and mixed-diagnosis scenarios. The framework demonstrated strong cross-cohort generalization and the potential for probability-based opportunistic screening, indicating its robustness and clinical applicability. Overall, NeuroBridge represents a significant advancement in MRI-based neurodegenerative disease diagnosis, moving beyond traditional single-task approaches to a more integrated, clinically guided model.
Methodology
NeuroBridge employs a transformer-based architecture that first undergoes large-scale self-supervised pretraining using masked autoencoders to learn anatomical representations from MRI data. This is followed by multi-task pretraining that captures various structural cues, including hippocampal segmentation and atrophy classification. The final stage involves fine-tuning through a gated fusion mechanism to integrate these representations into a cohesive diagnostic signal.
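The gated fusion step can be pictured with a small sketch. This is a generic softmax-gated weighted sum, assumed here for illustration; the summary does not specify the actual gate parameterization, and `gate_weights` stands in for hypothetical learned parameters.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features, gate_weights):
    """Fuse task-specific feature vectors with input-dependent gates.

    features: list of equal-length vectors, one per task branch.
    gate_weights: one weight vector per branch (hypothetical learned parameters).
    """
    # One scalar score per branch from a dot product with its gate vector.
    scores = [sum(w * x for w, x in zip(ws, f)) for ws, f in zip(gate_weights, features)]
    gates = softmax(scores)
    dim = len(features[0])
    # Weighted sum of branch features, coordinate by coordinate.
    fused = [sum(g * f[i] for g, f in zip(gates, features)) for i in range(dim)]
    return fused, gates


features = [[1.0, 0.0], [0.0, 1.0]]
gate_weights = [[0.0, 0.0], [0.0, 0.0]]
fused, gates = gated_fusion(features, gate_weights)
```

With all-zero gate weights every branch receives the same gate, so the fused vector is a plain average; non-zero weights let the input decide which branch dominates.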
Results
NeuroBridge achieved 88.17% accuracy in classifying AD versus cognitively normal controls in the ADNI cohort and 82.78% in the OASIS cohort. The model showed significant improvements in MCI-related and mixed-diagnosis settings, along with strong generalization across different cohorts.
Implications
NeuroBridge's robust performance and ability to integrate multiple diagnostic cues suggest its potential for clinical deployment in neurodegenerative disease assessment. Its capability for opportunistic screening could facilitate earlier detection and intervention in patients, ultimately improving patient outcomes.
Denser $\neq$ Better: Limits of On-Policy Self-Distillation for Continual Post-Training
NLP
Large Language Models
Reinforcement Learning
- On-policy self-distillation can accelerate specialization but is fragile in continual learning contexts.
- SDPO shows stronger forgetting and potential collapse compared to GRPO.
- Denser self-distillation can amplify noise and artifacts, complicating the learning process.
- The paper distinguishes between on-policy data and the training objective, clarifying their roles in continual learning.
Denser $\neq$ Better: Limits of On-Policy Self-Distillation for Continual Post-Training
Summary
This paper investigates the effectiveness of on-policy self-distillation for continual post-training in large language models (LLMs). The authors introduce self-distillation policy optimization (SDPO) and compare it with Group Relative Policy Optimization (GRPO), which optimizes a sequence-level reward. While SDPO can enhance in-domain specialization when teacher signals are stable, it struggles with generalization to out-of-distribution scenarios and exhibits increased forgetting and potential collapse during continual learning. The study highlights that denser self-distillation can lead to larger parameter and response drift, reinforcing artifacts rather than useful behavior. The findings suggest that on-policy data alone is insufficient for continual learning, and SDPO should not be viewed as a default stabilizer for continual post-training. The paper emphasizes the need to distinguish between the source of data and the training objective, revealing that while SDPO can improve performance, it also increases sensitivity to noise and artifacts, leading to weaker retention compared to GRPO.
Methodology
The authors conducted experiments comparing SDPO and GRPO in both single-domain and multi-domain continual settings. They varied supervision density and evaluated performance on in-distribution and out-of-distribution benchmarks to assess specialization, retention, and transfer. The analysis included diagnostics for parameter and response drift, collapse modes, and theoretical interpretations.
Results
The results indicate that while SDPO can significantly improve performance on current training domains, it also increases the risk of drift and interference, leading to greater forgetting, especially in misaligned tasks. In contrast, GRPO demonstrated better retention across domains. The study provides insights into how supervision density affects learning signals and the associated risks.
Implications
These findings have implications for the design of continual learning systems, suggesting that reliance on on-policy self-distillation may not be sufficient for effective knowledge retention. The results encourage further exploration of training objectives and data sources in continual learning frameworks.
How Should Transformers Encode Numeric Values in Electronic Health Records?
NLP
Optimization
Time Series
- Introduces a unified evaluation framework for numeric reasoning in EHR transformers.
- Systematically compares discrete, continuous, and hybrid numeric value encodings.
- Demonstrates a precision-stability trade-off in numeric reasoning approaches.
- Finds that transformers can perform approximate numeric computations reliably.
How Should Transformers Encode Numeric Values in Electronic Health Records?
Summary
This paper investigates the encoding of numeric values in transformer-based models applied to electronic health records (EHR). The authors systematically compare discrete, continuous, and hybrid encoding strategies through a unified evaluation framework that includes synthetic arithmetic tasks and real-world clinical prediction tasks. The study identifies trade-offs between numeric precision, optimization stability, and architectural flexibility. It finds that approaches modeling value-concept interactions yield better performance on precision-sensitive tasks, while hybrid token-based methods that apply binning before projection offer a robust alternative. The results indicate that models can perform reliable approximate numeric computations, suggesting that 'good enough' reasoning may suffice for EHR applications. Additionally, the clinical benefits of incorporating numeric values are modest and task-dependent, emphasizing the importance of robustness over maximal precision in practical settings.
Methodology
The authors developed a reusable test suite to evaluate numeric value encodings in transformer-based EHR models. They conducted empirical studies comparing different encoding strategies using synthetic arithmetic tasks embedded in real EHR data and real-world clinical prediction tasks, analyzing their performance in terms of numeric precision, optimization stability, and architectural flexibility.
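The difference between continuous and binned encodings can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the bin edges, embedding table, and direction vector are made-up placeholders for parameters that would normally be learned or derived from data.

```python
import bisect


def continuous_encoding(value, direction):
    """Continuous encoding: scale a (hypothetical) learned direction vector by the value."""
    return [value * d for d in direction]


def binned_encoding(value, bin_edges, bin_embeddings):
    """Binned encoding: map the value to a bin, then look up that bin's embedding."""
    index = bisect.bisect_right(bin_edges, value)
    return bin_embeddings[index]


# Two edges define three bins: below 3.5, from 3.5 up to 5.0, and 5.0 or above.
bin_edges = [3.5, 5.0]
bin_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

low = binned_encoding(3.1, bin_edges, bin_embeddings)
high = binned_encoding(5.6, bin_edges, bin_embeddings)
scaled = continuous_encoding(2.0, [0.5, -1.0])
```

The continuous form keeps full precision but ties the embedding norm to the raw magnitude, while binning discards within-bin detail in exchange for bounded inputs, which is one way to read the precision-stability trade-off the summary describes.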
Results
The study reveals that approaches explicitly modeling value-concept interactions perform best on precision-sensitive tasks, while hybrid methods with binning provide a robust alternative. All evaluated methods demonstrated reliable approximate numeric reasoning, with performance degrading smoothly as task complexity increased. Clinical gains from numeric value incorporation were found to be modest and task-dependent.
Implications
The findings suggest that hybrid token-based approaches may serve as a practical default for encoding numeric values in EHR applications, prioritizing robustness and deployability over maximal numeric precision. This has implications for the design of future transformer models in clinical settings, particularly in predictive modeling tasks.
Message Passing Based Two-Timescale Bayesian Learning for Joint Channel and Memory Hardware Impairments Tracking
Theory
Optimization
Time Series
- Introduces a two-timescale Bayesian learning framework for joint channel and hardware impairment tracking.
- Utilizes a residual gated recurrent unit (RGRU) to effectively model intra-slot memory of hardware impairments.
- Implements a message-passing algorithm that allows for efficient channel estimation and impairment calibration.
- Demonstrates superior performance in channel estimation error reduction compared to traditional compensators.
Message Passing Based Two-Timescale Bayesian Learning for Joint Channel and Memory Hardware Impairments Tracking
Summary
This paper addresses the challenge of hardware impairments in massive MIMO receivers, which lead to inter-symbol memory and inter-element coupling, adversely affecting channel estimation. The authors propose a novel framework called Message-Passing Based Two-Timescale Bayesian Deep Learning (MP-TTBDL) that utilizes a residual gated recurrent unit (RGRU) to model the intra-slot memory of hardware impairments. The framework distinguishes between the fast-varying wireless channel and the slow-varying hardware impairments by employing a fast Markov prior for the sparse channel and a slow Gaussian Markov prior for network parameters. A multi-slot factor graph formulation is introduced, for which a message-passing algorithm is developed to carry out the estimation. The algorithm features closed-form updates for inter-slot messages and partitions the intra-slot factor graph into two modules: a channel tracking module using Turbo-OAMP for sparse channel estimation and an impairments calibration module utilizing a deep approximate message passing (DAMP) procedure. These modules iteratively exchange information until convergence. Simulation results demonstrate that the proposed framework significantly reduces channel estimation error compared to conventional methods across various impairment scenarios and signal-to-noise ratio conditions.
Methodology
The methodology involves a two-timescale Bayesian approach that combines a fast-varying Markov prior for the channel and a slow-varying Gaussian Markov prior for the hardware impairments. The framework employs a multi-slot factor graph and a message-passing algorithm to iteratively update channel estimates and impairment parameters through expectation propagation, utilizing Turbo-OAMP and DAMP techniques.
Results
The simulation results indicate that the MP-TTBDL framework achieves lower channel estimation errors than conventional compensators across different online impairment scenarios and varying signal-to-noise ratios, suggesting that it is robust across the simulated conditions.
Implications
The proposed framework has significant implications for improving the performance of massive MIMO systems in practical wireless communication environments, particularly in mitigating the effects of hardware impairments. It can enhance the reliability and efficiency of channel estimation processes, leading to better overall system performance.
DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint
Large Language Models
Efficient ML
Optimization
- DEADPOOL enables hot-swapping of failed nodes in LLM training without job termination.
- The system employs an asynchronous in-memory checkpointing mechanism to achieve zero overhead during error-free execution.
- Recovery from node failures is completed in under 40 seconds, significantly reducing downtime.
- The system is evaluated on up to 512 GPUs and models with up to 65 billion parameters.
DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint
Summary
The paper presents DEADPOOL, a novel fault-tolerance mechanism designed for large language model (LLM) training that addresses the challenges of node failures in distributed GPU environments. Traditional fault-tolerance methods often incur significant overhead during normal operations or lengthy recovery times after failures. DEADPOOL introduces a hot-swapping approach that allows for the replacement of failed compute nodes with spare nodes without halting the entire training job. This is achieved through an innovative in-memory checkpointing strategy that operates off the critical path of training, enabling zero overhead during error-free execution. Additionally, a communicator reconstruction protocol facilitates the dynamic replacement of nodes at runtime. The authors evaluate DEADPOOL on high-performance computing systems, demonstrating its effectiveness across various scales and model sizes, achieving rapid recovery times of under 40 seconds after node failures while maintaining zero checkpoint overhead during normal operations.
Methodology
DEADPOOL utilizes a two-pronged approach: an asynchronous in-memory checkpointing strategy for optimizer state shards that overlaps with computation, and a distributed communicator reconstruction protocol that allows for the dynamic replacement of failed nodes with spare nodes. The implementation is built on PyTorch and Megatron-LM, leveraging generic capabilities for sharded model and optimizer states.
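The idea of checkpointing off the critical path can be illustrated with a toy sketch. It is not DEADPOOL's implementation: the class `InMemoryCheckpointer` is hypothetical, plain Python objects stand in for optimizer state shards, and a Python thread stands in for whatever asynchronous copy mechanism the real system uses. The point is only the ordering: start the copy, compute while it runs, and wait for it before mutating the state.

```python
import copy
import threading


class InMemoryCheckpointer:
    """Keeps the latest snapshot of a state dict in host memory."""

    def __init__(self):
        self.snapshot = None
        self._thread = None

    def start(self, state):
        # Copy in a background thread so the caller can keep computing.
        def _copy():
            self.snapshot = copy.deepcopy(state)

        self._thread = threading.Thread(target=_copy)
        self._thread.start()

    def wait(self):
        if self._thread is not None:
            self._thread.join()


optimizer_state = {"step": 0, "momentum": [0.0, 0.0]}
checkpointer = InMemoryCheckpointer()

for step in range(3):
    checkpointer.start(optimizer_state)  # snapshot overlaps with the compute below
    gradients = [1.0, 2.0]               # stand-in for the forward/backward pass
    checkpointer.wait()                  # finish the copy before mutating the state
    optimizer_state["momentum"] = [
        m + g for m, g in zip(optimizer_state["momentum"], gradients)
    ]
    optimizer_state["step"] = step + 1
```

After the loop, the snapshot holds the state as it was at the start of the last iteration, which is the point a recovery would resume from in this toy setup.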
Results
The evaluation of DEADPOOL on systems like NERSC Perlmutter and TACC Vista shows that it maintains zero checkpoint overhead during normal operations and achieves a consistent recovery time of approximately 40 seconds after node failures, even when scaling up to 512 GPUs and 65 billion parameter models.
Implications
The development of DEADPOOL has significant implications for the training of large-scale language models, particularly in environments where node failures are common. Its ability to minimize downtime and overhead can enhance the efficiency and reliability of LLM training, making it a valuable tool for researchers and practitioners in the field.
Efficient Temporal Point Processes via Monotone Alternating Splines
Time Series
Efficient ML
Theory
- Identifies fundamental limitations of Monotone Neural Networks in CCIF modeling.
- Introduces Monotone Alternating Splines (MAS) to enhance flexibility and computational efficiency.
- Establishes a theoretical foundation for MAS, including generalization error analysis.
- Demonstrates superior performance of MAS on synthetic and real-world datasets.
Efficient Temporal Point Processes via Monotone Alternating Splines
Summary
This paper addresses the limitations of existing Monotone Neural Networks (MNNs) in modeling Cumulative Conditional Intensity Functions (CCIFs) for Temporal Point Processes (TPPs). The authors identify three structural deadlocks in MNNs: convexity restrictions, saturation limits, and violations of CCIF requirements, which hinder their ability to capture complex temporal dynamics. To overcome these challenges, they propose a new framework called Monotone Alternating Splines (MAS). MAS utilizes piecewise monotone splines for interpolation, allowing for accurate fitting of complex TPP sequences, while a separate extrapolation component ensures global monotonicity and robust generalization. The paper establishes a theoretical foundation for MAS, analyzing its generalization error and proving its superior fitting capabilities compared to MNNs. Extensive experiments demonstrate that MAS outperforms existing methods on both synthetic and real-world datasets, highlighting its potential for efficient modeling of TPPs.
Methodology
The authors propose Monotone Alternating Splines (MAS), which consist of two components: an interpolation component that uses piecewise monotone splines to fit CCIFs accurately, and an extrapolation component that maintains global monotonicity using a simple monotonically increasing function. The framework is theoretically analyzed for its generalization capabilities and approximation errors.
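The two-component structure can be sketched with a deliberately simple stand-in. The code below uses piecewise-linear interpolation (a degree-1 spline) instead of the paper's splines, and a linear tail for extrapolation; the function name, knots, and tail slope are all assumptions chosen for illustration.

```python
import bisect


def make_cumulative_intensity(knots, values, tail_slope):
    """Monotone interpolant on [knots[0], knots[-1]] with a linear tail beyond it."""
    assert all(a < b for a, b in zip(knots, knots[1:])), "knots must increase"
    assert all(a <= b for a, b in zip(values, values[1:])), "values must not decrease"
    assert tail_slope > 0, "tail must keep increasing"

    def cumulative(t):
        if t >= knots[-1]:
            # Extrapolation component: a simple increasing function (linear here).
            return values[-1] + tail_slope * (t - knots[-1])
        # Interpolation component: locate the segment and interpolate linearly.
        i = max(bisect.bisect_right(knots, t) - 1, 0)
        frac = (t - knots[i]) / (knots[i + 1] - knots[i])
        return values[i] + frac * (values[i + 1] - values[i])

    return cumulative


cumulative = make_cumulative_intensity([0.0, 1.0, 2.0], [0.0, 0.5, 2.0], tail_slope=1.0)
inside = cumulative(1.5)
beyond = cumulative(3.0)
```

Because the knot values are non-decreasing and the tail slope is positive, the resulting function is monotone everywhere and grows without bound, which a cumulative intensity needs. The tail also joins the last knot continuously.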
Results
MAS achieves superior performance compared to existing MNN-based methods, effectively modeling complex temporal dynamics in TPPs. The theoretical analysis confirms that MAS has better fitting and generalization capabilities, supported by extensive experimental results on various datasets.
Implications
The MAS framework can significantly improve the efficiency and accuracy of modeling temporal point processes across various domains such as finance, social networks, and neuroscience, where understanding event sequences is crucial.
Decomposer: Learning to Decompile Symbolic Music to Programs
Audio & Speech
Reinforcement Learning
Generative Models
- DECOMPOSER effectively converts MIDI to Strudel, addressing the inverse problem of musical instruction recovery.
- The framework utilizes a two-stage training process: supervised fine-tuning followed by reinforcement learning.
- A synthetic dataset, STRUDEL-SYNTH, is created to facilitate the supervised learning phase.
- The model achieves superior performance in both MIDI reconstruction fidelity and code readability compared to existing methods.
Decomposer: Learning to Decompile Symbolic Music to Programs
Summary
The paper introduces DECOMPOSER, a novel framework aimed at the challenging task of decompiling symbolic music into executable and editable programs. Specifically, it focuses on converting MIDI files into Strudel, a domain-specific music programming language. The authors identify two main challenges: the scarcity of paired MIDI and Strudel data, and the risk of generating unreadable code through simple transliteration. To address these, they propose a two-stage approach. The first stage involves creating STRUDEL-SYNTH, a synthetic dataset of paired MIDI and Strudel programs for supervised fine-tuning. The second stage employs reinforcement learning to optimize the model for both MIDI reconstruction fidelity and code readability. The evaluation demonstrates that DECOMPOSER significantly outperforms existing models, achieving higher fidelity in MIDI reconstruction and producing more readable code than heuristic converters.
Methodology
The methodology consists of a two-stage training pipeline. First, supervised fine-tuning is conducted using a synthetic dataset of paired MIDI and Strudel programs (STRUDEL-SYNTH). Second, reinforcement learning is applied to optimize the decompilation objective, where the model samples candidate programs, executes them, and receives rewards based on MIDI reconstruction faithfulness and code readability.
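A reward that mixes reconstruction fidelity with readability could look roughly like the sketch below. Everything here is assumed for illustration: the paper's actual fidelity metric, readability measure, and weighting are not given in this summary, and notes are simplified to `(onset, pitch)` pairs with a length-based readability proxy.

```python
def note_f1(reference, reconstructed):
    """F1 overlap between two collections of (onset, pitch) note events."""
    ref, rec = set(reference), set(reconstructed)
    matched = len(ref & rec)
    if matched == 0:
        return 0.0
    precision = matched / len(rec)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


def reward(reference, reconstructed, program, fidelity_weight=0.8, max_chars=200):
    """Weighted sum of note-level fidelity and a crude readability score."""
    # Hypothetical readability proxy: shorter programs score higher.
    readability = max(0.0, 1.0 - len(program) / max_chars)
    fidelity = note_f1(reference, reconstructed)
    return fidelity_weight * fidelity + (1 - fidelity_weight) * readability


reference = [(0.0, 60), (1.0, 62)]
reconstructed = [(0.0, 60), (1.0, 64)]  # notes obtained by executing a candidate program
program = "x" * 100                     # placeholder for generated program text
score = reward(reference, reconstructed, program)
```

Rewarding fidelity alone would favor note-by-note transliteration, so the second term is what pushes the policy toward compact, structured programs in a setup like this.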
Results
DECOMPOSER demonstrates substantial improvements in MIDI-to-Strudel decompilation, achieving higher MIDI reconstruction fidelity and producing more readable and diverse code compared to closed-source large language models and heuristic converters. The evaluation is conducted on both synthetic and real-world MIDI benchmarks.
Implications
The findings suggest that DECOMPOSER can enhance algorithmic composition and live coding practices by providing musicians and developers with tools to easily manipulate and generate music programmatically. This work also opens avenues for further research in symbolic music processing and decompilation across various domains.