AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 48
Update frequency: 8h
Days of history: 7
DualEval: Joint Model-Item Calibration for Unified LLM Evaluation
Large Language Models
NLP
Interpretability
- DUALEVAL unifies static benchmark correctness and open-ended preference signals for LLM evaluation.
- The framework jointly estimates model abilities and item properties, enhancing evaluation stability and interpretability.
- Empirical results show balanced model rankings and robust performance across various domains.
- DUALEVAL enables diagnostic applications like benchmark compression and anomaly detection.
Summary
The paper introduces DUALEVAL, a novel framework for evaluating large language models (LLMs) that integrates two distinct evaluation signals: static benchmarks with objective correctness labels and arena-style preference data reflecting user interactions. DUALEVAL employs a latent model-item calibration approach, estimating model abilities alongside item difficulty and sharpness on a shared scale. The framework is applied across four domains—coding, math, miscellaneous domain knowledge tasks, and everyday user queries—utilizing 18 frontier LLMs. The results demonstrate that DUALEVAL produces reliable model rankings and item-level diagnostics, enhancing evaluation efficiency and interpretability. The framework also supports applications such as benchmark compression and anomaly detection, showcasing its potential to improve LLM evaluation pipelines by unifying static and preference-based assessments.
Methodology
DUALEVAL utilizes a joint calibration framework inspired by Item Response Theory (IRT), employing a two-parameter logistic model for static benchmarks and reward-model scores for arena-style evaluations. This allows for the simultaneous estimation of model abilities and item characteristics, facilitating a comprehensive evaluation of LLMs.
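The two-parameter logistic (2PL) model mentioned above can be sketched as follows. This is a minimal illustration of the standard IRT formula only, not the paper's joint calibration procedure; the function names and the use of "sharpness" as the discrimination parameter are assumptions made for this example.

```python
import math


def p_correct(theta: float, difficulty: float, sharpness: float) -> float:
    """Standard 2PL: probability that a model with ability theta solves an item."""
    return 1.0 / (1.0 + math.exp(-sharpness * (theta - difficulty)))


def log_likelihood(theta, items, responses):
    """Bernoulli log-likelihood of 0/1 responses.

    items is a list of (difficulty, sharpness) pairs (hypothetical layout).
    """
    total = 0.0
    for (difficulty, sharpness), y in zip(items, responses):
        p = p_correct(theta, difficulty, sharpness)
        total += math.log(p) if y else math.log(1.0 - p)
    return total
```

When ability equals difficulty the success probability is 0.5, and a larger sharpness makes the item separate strong and weak models more steeply around that point.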
Results
The framework achieved 88-92% accuracy in static-label domains and 68-81% agreement in arena comparisons. Compared to static-only and arena-only baselines, DUALEVAL produced more balanced rankings and demonstrated robustness to variations in reward models. Additionally, it effectively identified high-signal items and detected anomalies with high accuracy.
Implications
DUALEVAL's integration of static and preference-based evaluation signals can lead to more efficient and interpretable LLM evaluation processes. Its diagnostic capabilities may assist in identifying model weaknesses and improving future model development and evaluation methodologies.
WattLayer: Get Layers Right to Estimate Inference Energy of Neural Networks
Efficient ML
- Introduction of WattLayer, a task-independent layer-wise energy estimation model.
- Evaluation on a dataset of over 100,000 layers from 295 architectures, achieving a median error of 19.6%.
- Demonstration of zero-shot generalization to new tasks without retraining.
- Development of a comprehensive dataset and rigorous experimental protocol for energy measurement.
Summary
The paper presents WattLayer, a novel layer-wise energy estimation model for neural networks that addresses the growing concern of energy consumption in AI systems. The authors highlight the lack of standardized methodologies for estimating inference energy across various tasks and architectures. WattLayer is evaluated on a comprehensive dataset comprising over 100,000 layers from 295 neural network architectures across three tasks and three hardware platforms. The model achieves a median error of 19.6%, outperforming existing state-of-the-art methods. Importantly, WattLayer demonstrates the ability to generalize to new tasks without requiring complete retraining, leveraging shared layers across different architectures. This work provides a rigorous experimental protocol, an extensive dataset, and a Python-based framework for layer-wise decomposition, enabling stakeholders to design more energy-efficient AI systems.
Methodology
The authors developed a layer-wise energy estimation methodology called WattLayer, which decomposes neural network architectures to a granular level, allowing for accurate energy consumption predictions without task-specific dependencies. They established a rigorous experimental protocol for measuring energy consumption and created a Python framework for extracting layer-wise characteristics from PyTorch architectures.
Results
WattLayer achieved a median error of 19.6% in energy estimation, outperforming state-of-the-art models. The model also generalized to new tasks, with a Mean Absolute Percentage Error (MAPE) of at most 30% for large language models.
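The two error figures quoted here (median error and MAPE) can be computed as in the sketch below. It assumes that a network-level estimate is the sum of per-layer estimates, which the summary does not confirm; all function names are hypothetical.

```python
from statistics import mean, median


def network_energy(layer_estimates):
    """Assumption: network energy is the sum of per-layer estimates."""
    return sum(layer_estimates)


def percentage_errors(measured, predicted):
    """Absolute percentage error of each prediction against its measurement."""
    return [100.0 * abs(p - m) / abs(m) for m, p in zip(measured, predicted)]


def median_percentage_error(measured, predicted):
    return median(percentage_errors(measured, predicted))


def mape(measured, predicted):
    """Mean Absolute Percentage Error."""
    return mean(percentage_errors(measured, predicted))
```

The median is less sensitive than the mean to a few badly estimated networks, so the two numbers are not directly comparable.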
Implications
The findings suggest that WattLayer can serve as a reliable tool for estimating energy consumption in AI systems, enabling developers to make informed decisions about energy efficiency during the design and deployment phases. This could lead to more sustainable AI practices and reduced energy consumption in large-scale AI applications.
How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks
Theory
Optimization
- Introduces a novel framework for analyzing generalization scaling laws in quadratic neural networks.
- Characterizes generalization error as a function of model width, sample size, and regularization in a finite-sample setting.
- Identifies distinct scaling regimes and transitions that affect generalization performance.
- Demonstrates the influence of data structure on generalization through power-law relationships.
Summary
This paper investigates the relationship between model size, data availability, and generalization performance in quadratic neural networks. Unlike previous studies that primarily focused on fixed-feature or infinite-width regimes, this research analyzes how generalization scales with both the number of trainable parameters and the number of training samples in a feature-learning context. The authors employ ℓ2-regularized empirical risk minimization in a quadratic two-layer network, allowing for a detailed characterization of generalization error as a function of model width, sample size, and regularization. The findings reveal a phase diagram with distinct scaling regimes influenced by the spectral structure of the target data, highlighting the role of data-dependent power laws in determining generalization error. The study also identifies transitions between different regimes, including the onset of interpolation, and their implications for generalization performance.
Methodology
The authors utilize ℓ2-regularized empirical risk minimization in a quadratic two-layer neural network model. They analyze the generalization error by reformulating the problem into a matrix optimization framework, allowing for a clear examination of how model width, sample size, and regularization interact to influence performance. The model is trained on high-dimensional data with a power-law structure, enabling a theoretical analysis of the scaling behavior of the excess test error.
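One common way to see why a quadratic two-layer network admits a matrix reformulation is sketched below. It assumes the network is f(x) = Σ_j (w_j · x)², which equals xᵀ(WᵀW)x; the paper's exact parameterization may differ, and every name here is illustrative.

```python
def quadratic_net(W, x):
    """f(x) = sum_j (w_j . x)^2 for weight rows w_j (assumed architecture)."""
    return sum(sum(w * xi for w, xi in zip(row, x)) ** 2 for row in W)


def gram(W):
    """S = W^T W, a d x d positive semidefinite matrix."""
    d = len(W[0])
    return [[sum(row[i] * row[j] for row in W) for j in range(d)] for i in range(d)]


def quadratic_form(S, x):
    """x^T S x."""
    d = len(x)
    return sum(x[i] * S[i][j] * x[j] for i in range(d) for j in range(d))


def regularized_risk(W, data, lam):
    """Mean squared error plus an l2 penalty lam * ||W||_F^2 (equal to lam * trace(S))."""
    mse = sum((quadratic_net(W, x) - y) ** 2 for x, y in data) / len(data)
    penalty = sum(w * w for row in W for w in row)
    return mse + lam * penalty
```

Under this assumption the width of the network only bounds the rank of S, which is one way width can enter a matrix optimization problem.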
Results
The study presents a comprehensive phase diagram illustrating how the excess test error scales with the number of parameters and samples. The results indicate that the generalization error follows data-dependent power laws, with specific scaling behaviors identified for varying model widths and sample sizes. The optimal width for networks is shown to coincide with the Bayes optimal rate, confirming the theoretical predictions made in the paper.
Implications
The findings have significant implications for the design of neural networks, particularly in understanding how to optimize model size and data usage to enhance generalization performance. This research could inform future work in model architecture design and training strategies, particularly in contexts where data availability is limited.
Uncertainty quantification via conformal prediction in data assimilation
Theory
Time Series
Efficient ML
- Conformal prediction provides statistically rigorous uncertainty quantification with guaranteed coverage.
- The study evaluates three variants of CP in a controlled atmospheric model setting.
- CP-derived uncertainty estimates are compared with traditional ensemble-based measures.
- Results highlight the strengths and limitations of CP in the context of data assimilation.
Summary
This paper explores the application of conformal prediction (CP) for uncertainty quantification in numerical weather prediction, particularly within the context of data assimilation. The authors focus on a one-dimensional modified shallow water model to simulate convective processes and evaluate three variants of CP: Standard CP, Normalized CP, and Conformalized Quantile Regression. The study compares these CP methods against traditional ensemble-based uncertainty measures, such as standard deviation intervals and ensemble spread. The results indicate that CP can effectively provide uncertainty estimates with guaranteed coverage, addressing limitations of ensemble methods, which often require large computational resources and can suffer from sampling errors. The integration of CP-derived uncertainty into the data assimilation cycle is also examined, revealing insights into the strengths and weaknesses of each approach. Overall, the findings suggest that CP can complement existing ensemble-based methods, enhancing uncertainty quantification in atmospheric models.
Methodology
The authors employed a one-dimensional modified shallow water model to simulate atmospheric conditions and applied three variants of conformal prediction to quantify uncertainty. They compared the performance of CP methods against traditional ensemble-based uncertainty measures by evaluating metrics such as average empirical coverage, interval length, and average interval score loss (AISL). Additionally, they investigated the integration of CP-derived uncertainty within the data assimilation cycle.
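For orientation, the simplest of the three variants can be sketched as textbook split conformal prediction with absolute-residual scores. This is not the authors' code, and the calibration setup and function names are assumptions.

```python
import math


def conformal_quantile(scores, alpha):
    """Order statistic of rank ceil((n + 1) * (1 - alpha)) among calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


def standard_cp_interval(prediction, calib_truth, calib_pred, alpha=0.1):
    """Symmetric interval around a point prediction."""
    scores = [abs(y - f) for y, f in zip(calib_truth, calib_pred)]
    q = conformal_quantile(scores, alpha)
    return prediction - q, prediction + q


def empirical_coverage(intervals, truths):
    """Fraction of true values that fall inside their intervals."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, truths))
    return hits / len(truths)
```

The coverage guarantee of this construction relies on exchangeability between calibration and test points, an assumption that sequential data assimilation cycles may strain.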
Results
The results demonstrated that conformal prediction methods provided reliable uncertainty estimates with guaranteed coverage, outperforming traditional ensemble-based measures in certain aspects. The analysis revealed that while CP has strengths in providing formal reliability guarantees, it also has limitations that need to be considered when integrating it into data assimilation processes.
Implications
The findings suggest that conformal prediction could be a valuable tool for enhancing uncertainty quantification in operational meteorology and data assimilation, particularly in high-impact weather forecasting scenarios. By providing reliable uncertainty estimates, CP can improve decision-making processes in meteorological applications.
Autoencoder Architectures for Athlete Performance Scoring from Wearable Telemetry
Interpretability
- Introduces a unified evaluation framework for unsupervised athlete ranking.
- Demonstrates the effectiveness of deep autoencoders in reducing dimensionality of performance data.
- Establishes a composite criterion for model selection that incorporates both reconstruction accuracy and interpretability.
- Identifies key performance indicators such as running pace and heart rate as dominant factors in the latent score.
Summary
This paper investigates the use of various autoencoder architectures for reducing the dimensionality of athlete performance data collected from wearable devices. The authors evaluate five models, including three autoencoder variants, PCA, and a Variational Autoencoder, to compress nine sensor profiles into a single scalar performance indicator known as the latent score. The evaluation is conducted in an unsupervised manner, focusing on reconstruction error and interpretability of the latent score. The study introduces a composite selection criterion that balances reconstruction accuracy with interpretability, assessed through multiple metrics such as Spearman and Kendall rank correlations, Mutual Information, and Permutation Importance. The results indicate that the Deep Autoencoder achieved the lowest reconstruction error and the highest composite score, while the performance of PCA improved with wider hidden layers. Key drivers of the latent score were identified as running pace, aerobic decoupling, and average heart rate, aligning with established physiological principles. This work addresses the need for a unified framework that compresses multi-sensor telemetry into a single interpretable score while preserving nonlinear relationships and providing stable rankings.
Methodology
The authors evaluated five dimensionality reduction models, including three autoencoder variants (Simple AE, Medium AE, Deep AE), PCA, and a Variational Autoencoder. They assessed model quality using reconstruction error and interpretability metrics, combined into a composite selection criterion. Feature rankings were aggregated using a modified Borda count, and stability was confirmed through bootstrap validation.
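The rank-aggregation step can be illustrated with a plain Borda count. The paper uses a modified variant whose details are not given here, so the sketch below shows only the standard scheme; the feature identifiers are illustrative.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Aggregate rankings (most important feature first) with a plain Borda count.

    Each ranking awards n - 1 points to its top feature, n - 2 to the next, and so on.
    Ties in total points are broken alphabetically.
    """
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, feature in enumerate(ranking):
            points[feature] += n - 1 - position
    return sorted(points, key=lambda f: (-points[f], f))


# One ranking per interpretability metric (hypothetical values).
rankings = [
    ["running_pace", "avg_heart_rate", "aerobic_decoupling"],  # e.g. Spearman
    ["running_pace", "aerobic_decoupling", "avg_heart_rate"],  # e.g. Mutual Information
    ["avg_heart_rate", "running_pace", "aerobic_decoupling"],  # e.g. Permutation Importance
]
consensus = borda_aggregate(rankings)
```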
Results
The Deep Autoencoder achieved the lowest reconstruction error and the highest composite score for interpretability. When PCA's hidden layers were widened, its performance became competitive with the Deep AE, indicating that hidden layer capacity was a limiting factor. Key latent score drivers included running pace, aerobic decoupling, and average heart rate.
Implications
This research has potential applications in sports analytics, providing coaches with a compact and interpretable performance indicator derived from complex wearable telemetry data. It enhances the understanding of athlete performance dynamics and could inform training strategies.
Decision-Aligned Evaluation of Uncertainty Quantification
Theory
- Introduces decision-alignment as a criterion for evaluating UQ metrics.
- Demonstrates that many traditional UQ metrics are misaligned with decision-making utilities.
- Proposes prior-weighted utility metrics for better alignment with downstream decisions.
- Shows through experiments that prior-weighted metrics outperform conventional metrics in aligning with decision utility.
Summary
This paper addresses the evaluation of uncertainty quantification (UQ) in machine learning, highlighting that traditional metrics like negative log-likelihood (NLL) and expected calibration error (ECE) do not necessarily correlate with the utility of decisions made based on these uncertainties. The authors introduce a new framework called decision-alignment, which assesses how well UQ metrics align with downstream decision-making utilities. They demonstrate that many common UQ metrics are misaligned or encode flawed prior beliefs about the decision tasks. To address this, the authors propose prior-weighted utility metrics, which are designed to provide a more accurate evaluation of uncertainty in relation to decision-making. Through extensive experiments in classification and regression, the study shows that these new metrics consistently align with actual decision utilities, unlike conventional metrics. This work critiques existing UQ evaluation protocols and suggests a principled approach to enhance decision-relevant UQ assessments.
Methodology
The authors conducted a systematic analysis of common UQ metrics using the decision-alignment framework. They proposed prior-weighted utility metrics and validated their effectiveness through benchmark experiments and real-world case studies in classification and regression tasks.
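For reference, the two conventional metrics under scrutiny can be computed as in this minimal sketch. It assumes equal-width confidence bins for ECE and does not implement the proposed prior-weighted utility metrics, whose definition is not given here.

```python
import math


def negative_log_likelihood(true_class_probs):
    """Mean negative log of the probability assigned to the true class."""
    return -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece
```

Neither function takes a utility or decision threshold as input, which is consistent with the paper's point that such metrics can diverge from downstream decision quality.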
Results
The study found that prior-weighted utility metrics reliably align with real downstream utilities, while conventional metrics like NLL and ECE often do not. The results highlighted flaws in existing UQ evaluation protocols and demonstrated the superiority of the proposed metrics in reflecting decision-making utility.
Implications
This research has significant implications for the development and evaluation of UQ methods in machine learning, particularly in safety-critical applications. By aligning UQ evaluation with decision-making, it encourages the adoption of more effective metrics that can lead to better decision outcomes in practice.
What Survives When You Compress a Recursive Reasoner for the Edge?
Efficient ML
- Aggressive compression preserves local prediction accuracy but severely impacts global reasoning accuracy.
- The collapse in reasoning performance is architectural, affecting MLP-mixing recursion but not attention mechanisms.
- Carry-trajectory fidelity serves as a label-free signal to predict reasoning damage and recovery.
- A deployment recipe is proposed that allows for efficient model compression suitable for edge hardware.
Summary
This paper investigates the effects of compression on recursive reasoning models, which are capable of solving complex structured tasks with relatively few parameters by iteratively updating a latent state. The authors highlight that traditional compression techniques, which work well for conventional sequence models, fail for recursive models due to the compounding of quantization errors across recursive cycles. Through a systematic study involving various compression techniques (including INT4, pruning, and distillation) and architectures (Tiny Recursive Model and Hierarchical Recursive Model), the authors find that while local prediction accuracy remains intact, global reasoning accuracy collapses under aggressive compression. They introduce a novel metric, carry-trajectory fidelity, which predicts the degradation of reasoning performance without requiring task labels. The study culminates in a deployment strategy that enables efficient model compression while maintaining performance on edge hardware, demonstrating that calibrated INT4 can fit within a 4 MB microcontroller and INT8 can achieve full-depth accuracy with significantly reduced computational requirements.
Methodology
The authors conducted a full precision sweep across three tasks (ARC-2024, Maze-Hard, Sudoku-Extreme) and two recursive architectures (TRM and HRM). They evaluated various compression techniques, including INT4, pruning, distillation, and quantization-aware training, while measuring both local and global accuracy to understand the effects of compression on recursive reasoning models.
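To illustrate why low-bit weights can hurt a recursive model more than a single-pass one, the toy sketch below applies symmetric uniform quantization to a small weight vector and tracks how far a linear recurrence h ← w · h drifts from its full-precision counterpart. It is an assumed illustration only, unrelated to the TRM or HRM architectures or the paper's calibration method.

```python
def fake_quantize(values, bits):
    """Symmetric per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def recursive_drift(weights, state, cycles, bits):
    """Largest gap between exact and quantized states after each cycle of h <- w * h."""
    q_weights = fake_quantize(weights, bits)
    exact, approx = list(state), list(state)
    gaps = []
    for _ in range(cycles):
        exact = [w * h for w, h in zip(weights, exact)]
        approx = [w * h for w, h in zip(q_weights, approx)]
        gaps.append(max(abs(a - e) for a, e in zip(approx, exact)))
    return gaps


weights = [1.0, 0.93, 0.5]  # hypothetical toy weights
gaps_int4 = recursive_drift(weights, [1.0, 1.0, 1.0], cycles=8, bits=4)
gaps_int8 = recursive_drift(weights, [1.0, 1.0, 1.0], cycles=8, bits=8)
```

In this toy setting the per-step weight error stays below half a quantization step, yet the INT4 state gap after eight cycles is several times larger than after one, while the INT8 gap remains small.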
Results
The study revealed that under naïve INT4 compression, local accuracy remains high, but global puzzle accuracy drops to zero. The authors established a depth-precision crossover, showing that INT8 at a single recursive cycle can match full-depth FP32 accuracy with significantly fewer FLOPs. They also demonstrated that calibrated INT4 could fit a 4 MB microcontroller while maintaining performance.
Implications
The findings suggest that careful consideration of architecture and compression techniques is crucial for deploying recursive reasoning models on resource-constrained edge devices. The introduction of carry-trajectory fidelity could enhance model evaluation and optimization processes in future research.
What Was That Again? Certified Robustness for Automatic Speech Recognition
Audio & Speech
- Introduces a dual-gate certification pipeline for ASR systems to enhance robustness against perturbations.
- Achieves up to a 55% reduction in Word Error Rate (WER) across multiple ASR architectures.
- Provides both atomic and structural guarantees for sequence certification without requiring sentence alignment.
- Demonstrates significant improvements in recall and reduced correlation between confidence scores and WER.
Summary
This paper addresses the vulnerability of Automatic Speech Recognition (ASR) systems to both adversarial and benign perturbations, which can significantly degrade transcription accuracy. The authors propose a novel framework that employs a certification-inspired mechanism to enhance the robustness of ASR systems. Their approach includes a dual-gate diagnostic pipeline consisting of a Two-Sided Atomic Audit for certifying token existence and adversarial exclusion, and a Rank-Based Tournament for selecting the best transcription sequence. The proposed method demonstrates a substantial reduction in Word Error Rate (WER) and improved recall, while also decreasing the correlation between confidence scores and WER. Evaluations across four different ASR architectures show up to a 55% relative reduction in WER, alongside granular certifications at both word and sentence levels, thereby enhancing the acoustic security of ASR systems.
Methodology
The authors developed a dual-gate diagnostic pipeline that includes a Two-Sided Atomic Audit for certifying the existence of tokens and excluding adversarial examples, and a Rank-Based Tournament for selecting the optimal transcription sequence. This approach leverages E-values and Ville's Inequality to provide valid safety radii throughout the sampling process, allowing for efficient certification.
Results
The proposed framework resulted in a relative reduction of up to 55% in WER across four ASR architectures. The certification mechanism maintained a stable recall rate between 40.5% and 90.3% even under high noise levels, where traditional methods failed. Additionally, the authors demonstrated the effectiveness of their approach in auditing ASR performance on datasets like LibriSpeech and Common Voice.
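Because the headline number is a relative WER reduction, it may help to recall how WER and a relative reduction are computed. The sketch below uses the standard word-level edit distance; it is not the paper's evaluation code, and the function names are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline error removed."""
    return (baseline_wer - improved_wer) / baseline_wer
```

For example, moving from a WER of 0.20 to 0.09 is a 55% relative reduction, even though the absolute change is 11 percentage points.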
Implications
The findings suggest that the proposed certification framework can significantly enhance the reliability and security of ASR systems, making them more robust to adversarial attacks and benign perturbations. This has important implications for deploying ASR in safety-critical applications, such as voice-activated systems in healthcare and automotive industries.
Recovering Governing Equations from Solution Data: Identifiability Bounds for Linear and Nonlinear ODEs
Theory
- Introduces Hausdorff distance as a metric for comparing differential equations.
- Establishes identifiability bounds for a wide class of ODEs.
- Quantifies sample complexity required for reliable recovery of governing equations.
- Addresses theoretical gaps in the uniqueness and stability of ODE identification.
Summary
This paper addresses the challenge of recovering governing ordinary differential equations (ODEs) from observed solution data, a critical issue in scientific machine learning. The authors highlight the lack of theoretical foundations regarding the unique and stable identification of ground-truth ODEs from multiple observations. To tackle this, they introduce the Hausdorff distance as a metric for comparing differential equations, which captures the worst-case separation between solution trajectories. The paper establishes identifiability bounds for a broad class of ODEs, including linear and nonlinear equations with Lipschitz continuous vector fields. By analyzing the Hausdorff distance, the authors derive sample complexity bounds, quantifying the number of solution observations required for reliable recovery of governing equations. The results provide a theoretical foundation for future work in identifying governing equations from data, paving the way for advancements in both linear and nonlinear ODE identification.
Methodology
The authors formalize the identification problem by measuring the distance between ODEs using the Hausdorff distance of their solution sets. They derive upper and lower bounds on this distance for various classes of ODEs and analyze the implications for sample complexity and metric entropy.
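The Hausdorff distance itself is easy to state for finite sets. The sketch below computes it between two finite collections of points, for instance trajectories sampled on a common time grid and flattened into tuples. The paper works with full solution sets of ODEs, so this finite-sample version is only an assumed illustration.

```python
import math


def directed_hausdorff(A, B):
    """Largest distance from a point of A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


# Hypothetical example: B contains one point that is far from everything in A.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
distance = hausdorff(A, B)
```

The distance is driven by the single worst-matched element, which is the worst-case separation the summary refers to.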
Results
The paper presents identifiability bounds that characterize when distinct ODEs can be differentiated based on solution data. It also provides metric entropy estimates for the classes of ODEs studied, leading to quantifiable sample complexity results that indicate the number of observations needed for effective learning.
Implications
The findings have significant implications for scientific machine learning, particularly in fields where understanding governing equations is crucial, such as physics and engineering. The theoretical framework established can guide future research in both linear and nonlinear systems identification.
Boundary condition fidelity for bottom-hole pressure and CO2 plume prediction in geological carbon storage
Theory
Optimization
Time Series
- Boundary condition fidelity is critical for accurate BHP and CO2 plume predictions in GCS.
- Uniform treatments that ignore corner storage lead to substantial pressure errors and plume misrepresentation.
- Corner-adjusted boundary conditions significantly enhance prediction accuracy.
- The gradual modifier with transmissibility correction offers the best performance across different reservoir types.
Summary
This study investigates the impact of boundary condition fidelity on the prediction of bottom-hole pressure (BHP) and CO2 plume migration in geological carbon storage (GCS). The authors compare ten different boundary treatments in reduced-domain simulations against full-domain reference simulations in both homogeneous and heterogeneous reservoirs. The treatments include uniform pore-volume multipliers, transmissibility modifiers, corner-adjusted pore-volume corrections, layered corrections, and gradual modifiers. The evaluation uses metrics such as BHP root mean square error (RMSE), normalized RMSE (NRMSE), peak pressure deviation, and plume Intersection over Union (IoU). The findings reveal that preserving corner pore volume is crucial for accurate modeling, as uniform treatments that neglect corner storage lead to significant pressure errors and misrepresentation of plume areas. Corner-adjusted scenarios significantly improve accuracy, achieving higher IoU values, while the gradual modifier with transmissibility correction consistently performs well across reservoir types. This research provides essential insights for optimizing boundary conditions in GCS simulations, which are vital for regulatory compliance and operational safety.
Methodology
The authors constructed numerical models of homogeneous and heterogeneous reservoirs and evaluated ten boundary condition treatments against full-domain reference simulations. Performance was quantified using BHP time series, peak pressure deviation, NRMSE, and plume area metrics throughout the injection and post-injection monitoring periods.
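The evaluation metrics named above can be sketched as follows. The normalization used for NRMSE is not stated in this summary, so dividing by the range of the reference series is an assumption, as are the function names and the flattened-mask representation of the plume.

```python
import math


def rmse(reference, predicted):
    """Root mean square error between two equally sampled series."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def nrmse(reference, predicted):
    """RMSE divided by the reference range (assumed normalization)."""
    return rmse(reference, predicted) / (max(reference) - min(reference))


def plume_iou(mask_a, mask_b):
    """Intersection over Union of two flattened plume masks (truthy = CO2 present)."""
    intersection = sum(bool(a) and bool(b) for a, b in zip(mask_a, mask_b))
    union = sum(bool(a) or bool(b) for a, b in zip(mask_a, mask_b))
    return intersection / union if union else 1.0
```

RMSE here carries the units of the BHP series (psi in the reported results), whereas NRMSE and IoU are dimensionless.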
Results
The study found that uniform treatments resulted in BHP RMSE values of 362-382 psi for homogeneous models and 250-304 psi for heterogeneous models, with plume IoU values around 0.80-0.84. Corner-adjusted scenarios reduced pressure errors significantly, raising IoU above 0.94. The gradual modifier with transmissibility correction achieved BHP NRMSE below 3.7% and plume IoU above 0.97 in both reservoir types.
Implications
The results provide practical guidance for selecting boundary conditions in GCS simulations, which can enhance the safety and economic viability of carbon storage projects. Improved modeling fidelity can lead to better regulatory compliance and operational monitoring.
When Does Quality-Aware Multimodal Fusion Matter? A Leakage-Safe Diagnostic for Decision-Level Dependence
Multimodal
- Introduces a leakage-safe diagnostic to assess the influence of quality signals on multimodal predictions.
- Finds that permuting quality scores does not significantly degrade model performance, indicating minimal reliance on these scores.
- Demonstrates that quality-aware fusion is beneficial only when quality estimates accurately identify reliable modalities.
- Highlights the importance of distinguishing between correlation and causation in multimodal systems.
Summary
This paper investigates the role of quality-aware multimodal fusion in decision-making processes, specifically examining whether reliability scores of different modalities influence model predictions or merely correlate with performance. The authors propose a novel diagnostic method that tests the decision-level dependence on quality signals by permuting reliability scores across test examples while keeping the model and inputs fixed. The experiments conducted on the StressID dataset for stress recognition and the CMU-MOSEI dataset for sentiment analysis reveal that shuffling the quality scores does not significantly affect performance, indicating that the models do not rely on these scores for decision-making. In contrast, when quality signals are aligned with unimodal correctness, substantial performance improvements are observed. This suggests that quality-aware fusion is effective only when the quality estimates accurately predict the reliability of the corresponding modality.
Methodology
The authors developed a diagnostic framework that separates modality evidence, availability, and quality signals. They conducted experiments by fixing the model and inputs while permuting quality scores across test examples to evaluate the impact on performance. The analysis focused on fully observed instances to isolate the effects of quality from missing modalities.
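The permutation idea can be sketched in a few lines: keep the model and inputs fixed, shuffle the quality scores across test examples, and compare accuracy before and after. The `predict(x, quality)` interface and all names below are hypothetical and do not reproduce the authors' models or data handling.

```python
import random


def permutation_diagnostic(predict, inputs, qualities, labels,
                           n_permutations=20, seed=0):
    """Return (baseline accuracy, mean accuracy with quality scores permuted)."""
    rng = random.Random(seed)

    def accuracy(quality_scores):
        hits = sum(predict(x, q) == y
                   for x, q, y in zip(inputs, quality_scores, labels))
        return hits / len(labels)

    baseline = accuracy(qualities)
    permuted_scores = []
    for _ in range(n_permutations):
        shuffled = list(qualities)
        rng.shuffle(shuffled)  # break the link between example and quality score
        permuted_scores.append(accuracy(shuffled))
    return baseline, sum(permuted_scores) / n_permutations
```

A gap near zero suggests that the decision does not depend on the quality signal; a large drop suggests that it does.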
Results
The results showed that shuffling the native quality signals led to negligible changes in performance on both StressID and CMU-MOSEI datasets, despite the potential for improved routing. In positive control scenarios where quality signals were aligned with unimodal correctness, significant performance improvements were observed, confirming that the effectiveness of quality-aware fusion depends on the reliability of the quality estimates.
Implications
The findings suggest that multimodal systems should focus on improving the accuracy of quality estimates to enhance decision-making. This research could inform the design of more robust multimodal systems, particularly in applications such as affect recognition and sentiment analysis, where the reliability of different modalities can vary significantly.
Finding Stationary Points by Comparisons
Optimization
Theory
- Developed an algorithm for finding ϵ-stationary points using a comparison oracle with Õ(n²/ϵ^1.5) queries.
- Introduced a quantum algorithm that finds ϵ-stationary points with Õ(n/ϵ^1.5) queries.
- Improved dependence on ϵ compared to previous methods while incurring a higher cost in terms of dimensionality.
- Identified the limitations of the comparison oracle model in accessing gradient information.
Summary
This paper addresses the challenge of finding stationary points of non-convex functions using a comparison oracle, which only provides information on the relative values of function evaluations at two points. The authors propose an algorithm that requires Õ(n²/ϵ^1.5) queries to identify an ϵ-stationary point for functions with Lipschitz gradients and Hessians. The algorithm includes a subroutine for estimating the normalized Hessian with Õ(n² log(1/δ)) queries. Additionally, the authors explore a quantum comparison oracle model, presenting the first quantum algorithm that achieves the same goal with Õ(n/ϵ^1.5) queries. The results indicate that their approach improves the dependence on ϵ compared to existing methods, although it incurs a higher dimensionality cost. The paper also highlights the limitations of the comparison oracle model, particularly in accessing gradient information directly, which poses challenges for confirming the exact nature of stationary points.
Methodology
The authors utilized a comparison oracle that outputs which of two points has a larger function value to develop their algorithm. They also incorporated a subroutine for estimating the normalized Hessian and explored a quantum model for making queries in superpositions. The algorithm's performance was analyzed in terms of query complexity relative to the parameters of the function being optimized.
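To make the oracle model concrete, the following is a minimal sketch, not the authors' algorithm. It wraps a function so that callers only learn which of two points has the larger value, then uses one comparison per coordinate to guess the sign of each partial derivative. The names `make_comparison_oracle`, `gradient_signs`, the step `h`, and the test function `quadratic` are all assumptions introduced for illustration.

```python
from typing import Callable, List

Vector = List[float]


def make_comparison_oracle(f: Callable[[Vector], float]) -> Callable[[Vector, Vector], bool]:
    """Hide f behind an oracle that only reports whether f(x) > f(y)."""
    def oracle(x: Vector, y: Vector) -> bool:
        return f(x) > f(y)
    return oracle


def gradient_signs(oracle: Callable[[Vector, Vector], bool], x: Vector, h: float = 1e-3) -> List[int]:
    """Guess the sign of each partial derivative with one comparison per coordinate.

    Compares f(x + h e_i) against f(x - h e_i). A tie is reported as -1,
    so the result is only meaningful where the derivative is not near zero.
    """
    signs = []
    for i in range(len(x)):
        plus = list(x)
        plus[i] += h
        minus = list(x)
        minus[i] -= h
        signs.append(1 if oracle(plus, minus) else -1)
    return signs


def quadratic(x: Vector) -> float:
    # Hypothetical smooth test function with minimizer (1, -2).
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


oracle = make_comparison_oracle(quadratic)
signs = gradient_signs(oracle, [3.0, -5.0])
```

The sketch recovers only sign information, which illustrates why gradient magnitudes are hard to access in this model.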
Results
The proposed algorithm guarantees that one of the queried points is an ϵ-stationary point with high probability, achieving a query complexity of Õ(n²/ϵ^1.5). The quantum algorithm further reduces this complexity to Õ(n/ϵ^1.5). The results demonstrate improved efficiency in finding stationary points compared to existing methods, although the n² dependence raises questions about optimality in certain regimes.
Implications
The findings have significant implications for optimization problems in machine learning, particularly in scenarios where only relative comparisons of function values are available, such as preference-based reinforcement learning. The results could enhance optimization techniques in various applications, including tensor decomposition and matrix completion.
A Comparison of Fusion Techniques for Multi-Modal Human Activity Recognition on the HARMES Dataset
Multimodal
Time Series
- Systematic comparison of seven fusion techniques for multi-modal HAR on a common dataset.
- Gated Multi-modal Fusion achieved the highest macro F1-score of 0.82.
- Identified the contribution of each modality to overall performance.
- Demonstrated that multi-modal fusion can reduce the impact of handedness on recognition accuracy.
Summary
This paper addresses the gap in empirical comparisons of multi-modal sensor fusion techniques for Human Activity Recognition (HAR) by systematically evaluating seven state-of-the-art fusion methods on the HARMES dataset. The dataset includes 61 hours of labeled data from IMUs, audio, and ambient humidity sensors, focusing on 15 activities of daily living. The authors apply these fusion techniques to a consistent multi-modal model architecture, revealing that Gated Multi-modal Fusion outperforms other methods, achieving a macro F1-score of 0.82, which is a significant improvement over the baseline score of 0.76. The study also explores the contribution of each modality to performance and demonstrates that multi-modal fusion can help mitigate issues related to handedness in activity recognition. The findings provide valuable insights for researchers selecting fusion strategies for multi-modal HAR systems.
Methodology
The authors conducted a systematic evaluation of seven fusion strategies under identical conditions using the HARMES dataset. They employed three encoder backbones, maintained consistent training hyperparameters, and utilized a 10-second window size for data processing. The evaluation was performed using 3-fold group cross-validation and 20-fold leave-one-participant-out (LOPO) methods to ensure robustness and comparability.
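As a rough illustration of what a gated fusion layer computes, here is a minimal sketch for two modalities. The summary does not give the exact formulation used in the paper, so the element-wise gate, the parameter names (`gate_w_a`, `gate_w_b`, `gate_bias`) and the example feature values are assumptions.

```python
import math
from typing import List


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def gated_fusion(feat_a: List[float], feat_b: List[float],
                 gate_w_a: List[float], gate_w_b: List[float],
                 gate_bias: List[float]) -> List[float]:
    """Blend two modality feature vectors with a per-dimension gate.

    z close to 1 trusts modality A, z close to 0 trusts modality B.
    """
    fused = []
    for a, b, wa, wb, bias in zip(feat_a, feat_b, gate_w_a, gate_w_b, gate_bias):
        z = sigmoid(wa * a + wb * b + bias)
        fused.append(z * math.tanh(a) + (1.0 - z) * math.tanh(b))
    return fused


# Hypothetical 2-dimensional features from an IMU encoder and an audio encoder.
imu = [0.5, -1.0]
audio = [2.0, 0.3]
zeros = [0.0, 0.0]

balanced = gated_fusion(imu, audio, zeros, zeros, zeros)            # gate = 0.5
imu_only = gated_fusion(imu, audio, zeros, zeros, [50.0, 50.0])     # gate ~ 1
```

With all gate parameters at zero the layer averages the two modalities, and a large positive bias makes it ignore the second modality. A trained gate sits between these extremes for each input.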
Results
The Gated Multi-modal Fusion method achieved the highest macro F1-score of 0.82, surpassing the baseline score of 0.76 from concatenation-based late fusion. The study also provided insights into the performance contributions of different modalities and showed that multi-modal approaches can help address challenges related to handedness in activity recognition.
Implications
The findings suggest that researchers and practitioners in the field of HAR can benefit from using multi-modal fusion techniques to improve recognition accuracy, especially in real-world applications where sensor variability and environmental factors play a significant role. The study also highlights the importance of selecting appropriate fusion strategies based on empirical evidence.
Training Observable Control Policies to Expose Agent State Through Actions
Reinforcement Learning
Robotics
Theory
- Introduces a method for estimating agent states using only observable actions.
- Implements a reinforcement learning framework to enhance policy observability.
- Demonstrates improved estimator performance with minimal impact on task performance.
- Focuses on applications in autonomous agent coordination under communication limitations.
Summary
This paper addresses the challenge of coordinating autonomous agents in scenarios with limited or absent communication. The authors propose a novel approach that leverages the actions taken by agents as a source of information to estimate their states, thereby enhancing observability. By formulating the problem as a partially observable Markov Decision Process (POMDP), the authors develop a reinforcement learning framework that trains control policies to improve the estimation of an agent's state based solely on its actions. The study focuses on an aircraft tracking problem where the trained policy is evaluated for its ability to maintain nominal task performance while simultaneously improving the quality of state estimation. The results demonstrate that the observable control policies significantly enhance estimator performance with minimal impact on the primary task, suggesting that implicit communication through observable actions can facilitate better coordination among agents in communication-constrained environments.
Methodology
The authors utilize a reinforcement learning approach to train control policies that reward observable actions, thereby improving the estimation of an agent's state. The methodology includes formulating the problem as a POMDP and employing Monte Carlo simulations to analyze the performance of the trained policies.
Results
The study finds that the observable control policies lead to a significant improvement in the performance of the state estimator, as evidenced by singular value decomposition analysis of the observability matrix. The trained policies maintain nominal task performance while enhancing the quality of state estimation.
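The singular values of an observability matrix can be illustrated on a toy linear system. The sketch below assumes a 2-state discrete-time system x_{k+1} = A x_k with a single scalar measurement y_k = C x_k, which is a simplification chosen for illustration and not the aircraft model from the paper. The function names and matrices are hypothetical.

```python
import math
from typing import List, Tuple


def observability_matrix(A: List[List[float]], C: List[float]) -> List[List[float]]:
    """Stack C and C @ A for a 2-state system with one scalar measurement."""
    CA = [C[0] * A[0][0] + C[1] * A[1][0],
          C[0] * A[0][1] + C[1] * A[1][1]]
    return [list(C), CA]


def singular_values_2x2(M: List[List[float]]) -> Tuple[float, float]:
    """Singular values of a 2x2 matrix from the eigenvalues of M^T M."""
    a = M[0][0] ** 2 + M[1][0] ** 2
    b = M[0][0] * M[0][1] + M[1][0] * M[1][1]
    d = M[0][1] ** 2 + M[1][1] ** 2
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return math.sqrt(mean + radius), math.sqrt(max(mean - radius, 0.0))


C = [1.0, 0.0]                               # only the first state is measured
A_moving = [[1.0, 1.0], [0.0, 1.0]]          # second state feeds into the first
A_static = [[1.0, 0.0], [0.0, 1.0]]          # second state never shows up

_, sigma_min_moving = singular_values_2x2(observability_matrix(A_moving, C))
_, sigma_min_static = singular_values_2x2(observability_matrix(A_static, C))
```

A smallest singular value of zero means some state direction cannot be inferred from the measurements, while a larger value indicates a better-conditioned estimation problem.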
Implications
This work has potential applications in autonomous systems where communication is limited, such as underwater vehicles or aerial drones. By enabling agents to infer states from observable actions, the approach can enhance safety and coordination in mixed human-autonomous operations.
Difference of Convex Programming in the Wasserstein Space with Applications to MMD Optimization
Optimization
Theory
Generative Models
- Introduction of the Wasserstein Convex-Concave Procedure (WCCCP) for optimizing non-convex functionals.
- Theoretical guarantees of almost stationarity for the proposed optimization scheme.
- Empirical results show faster and more stable convergence compared to Wasserstein gradient descent.
- Focus on Maximum Mean Discrepancy (MMD) and Energy Distance (ED) functionals.
Summary
This paper addresses the optimization of non-convex functionals over the Wasserstein space, particularly focusing on the Maximum Mean Discrepancy (MMD) and Energy Distance (ED) functionals. The authors introduce a new optimization scheme called the Wasserstein Convex-Concave Procedure (WCCCP), which extends the classical convex-concave procedure to the Wasserstein space. They demonstrate that many practical objectives can be decomposed into a difference of convex functions, allowing for a more effective optimization approach. The theoretical analysis shows that under certain smoothness and strong convexity conditions, the algorithm exhibits almost stationary behavior along its iterates. Empirical results indicate that the proposed WCCCP outperforms traditional Wasserstein gradient descent methods in terms of convergence speed and stability when applied to MMD objectives. This work contributes to the broader field of optimization in machine learning by providing a robust framework for handling non-convex problems in the context of probability measures.
Methodology
The authors develop the WCCCP by analyzing the difference of convex (DC) decomposition of objectives in the Wasserstein space. They prove theoretical convergence results under specific conditions and conduct empirical evaluations to compare the performance of WCCCP with traditional optimization methods like Wasserstein gradient descent.
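For readers unfamiliar with the classical convex-concave procedure that WCCCP extends, the following sketch runs it on a one-dimensional Euclidean example. It is not the Wasserstein-space algorithm from the paper. The objective f(x) = x⁴ - 2x² is an assumed toy problem, split as g(x) = x⁴ minus h(x) = 2x², both convex. Each step linearizes h at the current iterate and minimizes the resulting convex surrogate in closed form.

```python
import math
from typing import List


def objective(x: float) -> float:
    # f = g - h with g(x) = x^4 and h(x) = 2 x^2, both convex.
    return x ** 4 - 2.0 * x ** 2


def ccp_step(x_k: float) -> float:
    """Minimize the surrogate x^4 - h'(x_k) * x, where h'(x_k) = 4 x_k.

    Setting the derivative to zero gives 4 x^3 = 4 x_k, so x = cbrt(x_k).
    """
    target = (4.0 * x_k) / 4.0
    return math.copysign(abs(target) ** (1.0 / 3.0), target)


def run_ccp(x0: float, num_steps: int = 60) -> List[float]:
    path = [x0]
    for _ in range(num_steps):
        path.append(ccp_step(path[-1]))
    return path


path = run_ccp(0.2)
values = [objective(x) for x in path]
```

Starting from 0.2, the iterates move toward the stationary point x = 1 and the objective never increases, which is the descent property that makes the procedure attractive for difference-of-convex objectives.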
Results
The paper establishes that the WCCCP can achieve almost stationary iterates under certain assumptions and demonstrates through experiments that it provides faster and more stable convergence for MMD objectives compared to existing methods.
Implications
The findings suggest that the WCCCP can be a valuable tool for optimizing complex non-convex functionals in various machine learning applications, particularly in generative modeling and statistical inference, where Wasserstein distances are commonly employed.
The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching
Interpretability
Theory
Large Language Models
- Activation patching's natural indirect effect (NIE) includes hidden interaction effects (INT) that can misrepresent component importance.
- INT varies with the distance between clean and patched activations and is negligible in locally affine models.
- The presence of INT explains known failures in activation patching, particularly in the GPT-2 IOI circuit.
- Ranking components solely by pure indirect effect (PIE) can lead to significant inaccuracies.
Summary
This paper investigates the limitations of activation patching, a method used in mechanistic interpretability to attribute causal responsibility in neural networks. The authors re-derive the natural indirect effect (NIE) from causal mediation analysis and reveal that it not only captures the causal effect through specific components but also includes interaction effects (INT) that depend on the states of other components. The study demonstrates that these interaction effects can lead to misleading conclusions about component importance, particularly in transformer models like GPT-2. The authors provide theoretical proofs and empirical evidence showing that INT scales with the distance between clean and patched activations and decomposes into individual and group interactions. They argue that INT is a fundamental aspect of causal mediation that should be embraced rather than eliminated, as it offers diagnostic insights into interpretability studies. The findings highlight the necessity of combinatorial search for accurate causal attribution and the limitations of relying solely on NIE for ranking component importance.
Methodology
The authors conducted a theoretical analysis of the natural indirect effect (NIE) and interaction effects (INT) in the context of activation patching, supported by empirical evaluations on the GPT-2 IOI circuit. They derived mathematical proofs and analyzed the scaling behavior of INT in relation to component activations.
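A toy model shows how an interaction term can make the measured effect of patching one component depend on the state of another. The sketch below is illustrative only: the two-mediator models, the helper `patch_effect`, and the activation values are assumptions, and the quantities computed are not claimed to match the paper's exact definitions of NIE, PIE, or INT.

```python
from typing import Callable, Sequence


def multiplicative_model(m1: float, m2: float) -> float:
    # Output depends on each mediator and on their product.
    return m1 + m2 + m1 * m2


def affine_model(m1: float, m2: float) -> float:
    # No cross term, so the mediators contribute independently.
    return 2.0 * m1 - 3.0 * m2 + 1.0


def patch_effect(model: Callable[[float, float], float],
                 clean: Sequence[float], patched: Sequence[float],
                 index: int, background: Sequence[float]) -> float:
    """Output change when mediator `index` is switched from its clean to its
    patched value while the other mediator is held at `background`."""
    before = list(background)
    before[index] = clean[index]
    after = list(background)
    after[index] = patched[index]
    return model(*after) - model(*before)


clean = (1.0, 2.0)
patched = (3.0, 5.0)

effect_clean_bg = patch_effect(multiplicative_model, clean, patched, 0, clean)
effect_patched_bg = patch_effect(multiplicative_model, clean, patched, 0, patched)
interaction_gap = effect_patched_bg - effect_clean_bg

affine_gap = (patch_effect(affine_model, clean, patched, 0, patched)
              - patch_effect(affine_model, clean, patched, 0, clean))
```

In the multiplicative model the gap equals the product of the two activation differences, so it grows with the distance between clean and patched activations. In the affine model the gap vanishes.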
Results
The study found that NIE and PIE yield different rankings of component importance, with correlation coefficients as low as 0.51. INT was shown to scale with activation differences and contributed to the instability of faithfulness scores. The authors also identified that interaction effects explain the low NIE scores of certain components that are conditionally important, revealing a complex interplay in causal attribution.
Implications
The findings suggest that practitioners in mechanistic interpretability should account for interaction effects when assessing component importance in neural networks. This could lead to more accurate models of causal responsibility and better understanding of model behavior, particularly in complex architectures like transformers.
Zero-Shot Size Transfer for Neural ODEs on Sparse Random Graphs: Graphon Limits and Adjoint Convergence
Graph Learning
Theory
Efficient ML
- Establishes a quantitative theory for zero-shot size transfer in GNDEs on sparse random graphs.
- Proves trajectory-wise convergence of GNDE solutions to Graphon-NDE solutions with a specific convergence rate.
- Demonstrates asymptotic consistency of DTO and OTD training paradigms for GNDEs.
- Validates theoretical findings through experiments on various graphon classes.
Summary
This paper investigates the zero-shot size transfer principle for Graph Neural Differential Equations (GNDEs), which model continuous-time dynamics on graphs using Neural ODEs parameterized by Graph Neural Networks (GNNs). The authors develop a quantitative theory for this principle, focusing on sparse random graphs sampled from graphons. They introduce Graphon Neural Differential Equations (Graphon-NDEs) and adjoint Graphon-NDEs as the infinite-node limits of GNDE systems, establishing their well-posedness. The paper proves that for an n-node random graph with sparsity parameter α_n, the trajectory-wise convergence of GNDE solutions to Graphon-NDE solutions occurs at a rate of O((α_n n)^(−1/2)) with high probability. Additionally, uniform-in-time convergence bounds for adjoint systems governing hidden-state and parameter gradients are established. The authors analyze two training paradigms: discretize-then-optimize (DTO) and optimize-then-discretize (OTD), demonstrating their asymptotic consistency under explicit Euler discretization. Experimental results on HSBM and tent graphons validate the theoretical rates, and zero-shot transfer experiments across various graphon classes show successful deployment of learned GNDEs on larger, independently sampled graphs.
Methodology
The authors utilize a theoretical framework based on graphon limits to analyze the convergence of GNDEs on sparse random graphs. They establish well-posedness for Graphon-NDEs and derive convergence rates for both forward trajectories and adjoint systems. The study also examines the consistency of training paradigms (DTO and OTD) under specific discretization methods.
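The setting can be sketched in a few lines: sample a sparse random graph from a graphon, then integrate a small graph ODE with explicit Euler steps. Everything specific here is an assumption for illustration, including the graphon W(u, v) = 1 - max(u, v), the sparsity choice α_n = n^(-1/2), the single scalar weight, the tanh nonlinearity, and the 1/(n α_n) normalization of the neighbor sum. The paper's actual architecture and graphon families are not specified in this summary.

```python
import math
import random
from typing import Callable, List


def sample_sparse_graph(n: int, alpha_n: float,
                        graphon: Callable[[float, float], float],
                        rng: random.Random) -> List[List[int]]:
    """Draw latent positions u_i ~ U[0, 1], then connect i and j with
    probability alpha_n * W(u_i, u_j). Returns a symmetric 0/1 adjacency."""
    latents = [rng.random() for _ in range(n)]
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < alpha_n * graphon(latents[i], latents[j]):
                adj[i][j] = adj[j][i] = 1
    return adj


def euler_gnde(adj: List[List[int]], x0: List[float], alpha_n: float,
               weight: float, step: float, num_steps: int) -> List[float]:
    """Explicit Euler integration of dx_i/dt = tanh(weight * mean-field sum),
    with the neighbor sum scaled by 1 / (n * alpha_n)."""
    n = len(adj)
    x = list(x0)
    for _ in range(num_steps):
        agg = [sum(adj[i][j] * x[j] for j in range(n)) / (n * alpha_n)
               for i in range(n)]
        x = [x[i] + step * math.tanh(weight * agg[i]) for i in range(n)]
    return x


rng = random.Random(0)
n = 40
alpha_n = n ** -0.5
adj = sample_sparse_graph(n, alpha_n, lambda u, v: 1.0 - max(u, v), rng)
x0 = [1.0] * n
final = euler_gnde(adj, x0, alpha_n, weight=1.0, step=0.1, num_steps=10)
```

Because the weight does not depend on n, the same parameter can be reused on a larger graph sampled from the same graphon, which is the zero-shot transfer scenario the paper analyzes.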
Results
The paper demonstrates that GNDE solutions converge to their Graphon-NDE counterparts at a rate of O((α_n n)^(−1/2)) for sparse random graphs. It also establishes uniform convergence bounds for adjoint systems and shows that both DTO and OTD training methods are asymptotically consistent, with discrepancies decreasing as the number of discretization steps increases.
Implications
The findings suggest that GNDEs can be effectively trained on smaller graphs and deployed on larger, structurally similar graphs without retraining, which can significantly reduce computational costs in applications involving large-scale graph dynamics. This has potential applications in fields such as social network analysis, epidemiology, and traffic forecasting.
TeRoR: Decoupled Temporal Rotation with Relational Circular Region for Temporal Knowledge Graph Embedding
Graph Learning
Time Series
- TeRoR enhances temporal information representation by decoupling the temporal influence on subject and object entities.
- The model introduces a relation-aware circular region to effectively capture complex multi-relational interactions.
- Experimental results show significant performance improvements over existing state-of-the-art temporal knowledge graph embedding models.
- TeRoR addresses limitations in existing models regarding the mapping properties of various relations.
Summary
The paper introduces TeRoR, a novel temporal knowledge graph embedding method that improves upon the existing TeRo model by addressing its limitations in modeling complex relational mappings and temporal information. TeRoR decouples the temporal evolution of entity embeddings, allowing for independent rotation transformations on head and tail entities in a complex vector space. This approach enhances the model's ability to represent temporal information effectively. Additionally, TeRoR incorporates a radius parameter to constrain the rotated head entities within a circular region centered on the tail entity, which captures diverse relational mappings such as one-to-many and many-to-one interactions. The authors conducted extensive experiments on four distinct temporal knowledge graph datasets, demonstrating that TeRoR outperforms state-of-the-art models in link prediction tasks, showcasing its potential for improving the completeness and accuracy of temporal knowledge graphs.
Methodology
TeRoR employs a decoupled approach for temporal evolution of entity embeddings, applying independent rotation transformations in a complex vector space. It also utilizes a radius parameter to define a circular region for valid quadruples, enhancing the model's ability to capture diverse relational mappings.
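The two ideas described here, separate time rotations for head and tail and a radius that defines an acceptance region, can be sketched with Python's built-in complex numbers. This is a loose illustration under strong assumptions: the scoring function, the use of a plain L1 distance, the omission of any relation embedding beyond its radius, and all names and values are hypothetical and should not be read as TeRoR's actual formulation.

```python
import cmath
import math
from typing import List


def rotate(embedding: List[complex], phases: List[float]) -> List[complex]:
    """Rotate each complex coordinate by its own time-dependent phase."""
    return [z * cmath.exp(1j * p) for z, p in zip(embedding, phases)]


def region_violation(head: List[complex], tail: List[complex],
                     head_phases: List[float], tail_phases: List[float],
                     radius: float) -> float:
    """Distance by which the rotated head falls outside a circular region
    of the given radius around the rotated tail (0.0 if inside)."""
    h = rotate(head, head_phases)
    t = rotate(tail, tail_phases)
    distance = sum(abs(a - b) for a, b in zip(h, t))
    return max(0.0, distance - radius)


head = [1 + 0j]
tail = [0 + 1j]

# Head rotated by a quarter turn lands on the tail: inside the region.
inside = region_violation(head, tail, [math.pi / 2], [0.0], radius=0.5)
# Without rotation the head sits sqrt(2) away from the tail.
outside = region_violation(head, tail, [0.0], [0.0], radius=1.0)
```

A region rather than a single target point lets several heads be valid for the same tail, which is one way to accommodate many-to-one relations.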
Results
TeRoR outperformed state-of-the-art models on four distinct temporal knowledge graph (TKG) datasets in key evaluation metrics such as Mean Reciprocal Rank (MRR) and Hits@K, indicating its effectiveness in temporal link prediction tasks.
Implications
The advancements presented in TeRoR have significant implications for improving the accuracy and completeness of temporal knowledge graphs, which are crucial for various AI applications such as question answering, information retrieval, and recommendation systems.
Retroactive Advantage Correction: Closed-Form V-Trace Bias Correction for Delay-Aware RLHF
Reinforcement Learning
Theory
Optimization
- Introduces Retroactive Advantage Correction (RAC) to handle delayed rewards in RLHF.
- Proves that RAC provides an unbiased correction under specific conditions.
- Demonstrates a significant reduction in policy bias (up to 47.9×) in a tabular MDP setting.
- Integrates seamlessly with existing reinforcement learning algorithms like PPO and GRPO.
Summary
This paper addresses the challenges of reinforcement learning from human feedback (RLHF) in production environments where rewards are not always received synchronously. The proposed method, Retroactive Advantage Correction (RAC), allows for the correction of biases that arise when rewards are delayed. By queuing slow rewards and reinjecting them into the next optimization step using a non-negative kernel, RAC effectively mitigates the bias introduced by delayed signals. The authors demonstrate that under certain conditions, the cumulative correction is unbiased, and they provide a closed-form expression for the bias that is linear in the unreinjected fraction of rewards. The effectiveness of RAC is validated through a proof-of-concept on a tabular Markov decision process (MDP), where it significantly reduces policy bias compared to traditional methods. The paper also integrates RAC with existing algorithms like PPO and GRPO, showcasing its practical applicability in real-world scenarios.
Methodology
The methodology involves queuing slow rewards that arrive after a delay and reinjecting them into the next optimization step's advantage calculation. This is done using a non-negative kernel to age the rewards and a clipped importance sampling ratio to ensure unbiasedness. The paper derives theoretical results regarding the bias introduced by this method and validates it through empirical tests on a tabular MDP.
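The reinjection step can be pictured as a weighted sum over queued rewards. The sketch below is a guess at the general shape only: the exponential aging kernel, the dictionary layout of a queued item, the function names, and the way the clipped ratio multiplies the reward are all assumptions, and the paper's closed-form correction may differ.

```python
import math
from typing import Dict, List


def aging_kernel(age: int, decay_rate: float) -> float:
    """Non-negative weight that shrinks as a queued reward gets older."""
    return math.exp(-decay_rate * age)


def retroactive_correction(pending: List[Dict[str, float]], current_step: int,
                           decay_rate: float, clip: float) -> float:
    """Sum of late rewards, each aged by the kernel and scaled by an
    importance ratio clipped from above (as in V-trace style truncation)."""
    total = 0.0
    for item in pending:
        age = current_step - int(item["step"])
        ratio = math.exp(item["logp_new"] - item["logp_old"])
        total += aging_kernel(age, decay_rate) * min(ratio, clip) * item["reward"]
    return total


# One hypothetical slow reward generated at step 4 and arriving at step 5.
pending = [{"reward": 2.0, "step": 4,
            "logp_old": math.log(0.1), "logp_new": math.log(0.3)}]
correction = retroactive_correction(pending, current_step=5,
                                    decay_rate=math.log(2.0), clip=1.0)
```

In this example the raw ratio is about 3 and is clipped to 1, and the kernel halves the reward after one step, so the term added to the next advantage estimate is about 1.0.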
Results
The implementation of RAC on a 3×2 tabular MDP resulted in a reduction of closed-form policy bias by up to 47.9× in configurations with two slow channels. The method was also verified against machine-precision checks at 7B scale, confirming the theoretical predictions regarding bias scaling and equivalence to V-trace under certain conditions.
Implications
The findings suggest that RAC can significantly improve the performance of RLHF systems in production by addressing the common issue of delayed rewards. This has potential applications in various domains where RLHF is employed, such as robotics, game playing, and interactive AI systems, enhancing their reliability and efficiency.
Towards Automating Scientific Review with Google's Paper Assistant Tool
Theory
Large Language Models
Efficient ML
- Introduction of the Paper Assistant Tool (PAT) for automating scientific review.
- PAT improves error detection in mathematical proofs by 34% over traditional methods.
- Pilot deployments at major conferences showed positive community feedback.
- A proposed taxonomy outlines levels of AI-human collaboration in peer review.
Summary
The paper addresses the growing challenge of scientific peer review in the face of an increasing volume of AI-assisted research outputs. The authors propose the Paper Assistant Tool (PAT), an AI framework designed to automate the review and verification process of scientific manuscripts. PAT is capable of ingesting full papers, evaluating theoretical results, validating experiments, suggesting improvements, and identifying flaws. By employing inference scaling techniques, PAT achieves a 34% improvement in detecting mathematical errors compared to traditional methods. Pilot tests at two major computer science conferences, STOC and ICML, demonstrated PAT's effectiveness in identifying critical errors and providing constructive feedback to authors. The authors also propose a taxonomy of AI-human collaboration levels in scientific evaluation, discussing the trade-offs associated with each level. This work highlights the potential of AI tools like PAT to alleviate the cognitive burden on human reviewers while maintaining their authority in the review process.
Methodology
The Paper Assistant Tool (PAT) utilizes deep inference scaling techniques to analyze scientific manuscripts. It specializes in detecting mathematical and logical errors and providing comprehensive feedback. The design of PAT allows for multiple inference calls to overcome the limitations of single model context windows, enabling deeper analysis of complex papers.
Results
PAT demonstrated a 34% improvement in zero-shot recall for identifying mathematical errors in the SPOT benchmark. Pilot tests indicated that PAT effectively identified critical errors and suggested substantial improvements in research papers submitted to STOC and ICML.
Implications
The development of PAT suggests a significant shift in the peer review process, potentially allowing for more efficient handling of the increasing volume of scientific submissions. By integrating AI tools, the academic community can enhance the quality of research outputs while reducing the burden on human reviewers. This could lead to policy discussions about the role of AI in scientific evaluation and publication decisions.
Disentangling Continuous-Time Latent Dynamics: Identifiability of Latent SDEs via Diffusion Shifts
Time Series
Theory
- Identifiability of continuous-time latent SDEs is achieved using diffusion shifts.
- Two diagonal diffusion regimes with distinct variance ratios can identify latent coordinates.
- The proposed method does not require sparsity assumptions on the drift.
- A practical two-stage estimator is developed for latent disentanglement and graph recovery.
Summary
This paper addresses the challenge of identifiability in continuous-time latent stochastic differential equation (SDE) models, a gap in causal representation learning (CRL) for time series. The authors focus on additive-noise latent SDEs observed through an unknown nonlinear diffeomorphism, where the drift is shared across environments but the diffusion covariance varies. They demonstrate that two diagonal diffusion regimes with distinct coordinate-wise variance ratios can identify the latent coordinates up to permutation and scaling, without requiring sparsity assumptions on the drift. The authors first establish this result for linear Ornstein–Uhlenbeck systems and extend it to general additive-noise latent SDEs. They propose a two-stage estimator for latent disentanglement and optional graph recovery, validated through synthetic experiments and applied to real sensor data from the Hardanger Bridge monitoring study.
Methodology
The authors formulate the problem of continuous-time CRL for latent SDEs and analyze the identifiability of latent coordinates using diffusion shifts. They prove identifiability for linear Ornstein–Uhlenbeck systems and generalize the result to additive-noise latent SDEs. A two-stage estimator is proposed, first performing diffusion-based disentanglement followed by sparse drift regression for graph recovery.
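The role of distinct variance ratios can be seen in a linear two-dimensional special case. Assume, for illustration only, that observations are x = A z with a fixed mixing matrix A, and that the latent diffusion is diagonal with variances D1 in one regime and D2 in the other. The observed diffusion covariances are then A D1 Aᵀ and A D2 Aᵀ, and the eigenvectors of their "ratio" recover the columns of A up to scaling and order. The matrices and helper names below are hypothetical, and the paper's result covers nonlinear mixing, which this sketch does not.

```python
import math
from typing import List, Tuple

Mat = List[List[float]]


def mul(X: Mat, Y: Mat) -> Mat:
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(X: Mat) -> Mat:
    return [[X[0][0], X[1][0]], [X[0][1], X[1][1]]]


def inverse(X: Mat) -> Mat:
    det = X[0][0] * X[1][1] - X[0][1] * X[1][0]
    return [[X[1][1] / det, -X[0][1] / det], [-X[1][0] / det, X[0][0] / det]]


def observed_cov(A: Mat, variances: Tuple[float, float]) -> Mat:
    D = [[variances[0], 0.0], [0.0, variances[1]]]
    return mul(mul(A, D), transpose(A))


def recover_directions(S1: Mat, S2: Mat) -> List[Tuple[float, Tuple[float, float]]]:
    """Eigen-decompose S2 @ inv(S1) = A (D2 / D1) inv(A).

    Returns (variance ratio, unit eigenvector) pairs, smallest ratio first.
    Assumes the off-diagonal entry M[0][1] is non-zero and the ratios differ.
    """
    M = mul(S2, inverse(S1))
    half_trace = (M[0][0] + M[1][1]) / 2.0
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    gap = math.sqrt(half_trace ** 2 - det)
    pairs = []
    for lam in (half_trace - gap, half_trace + gap):
        v = (M[0][1], lam - M[0][0])
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs


A = [[1.0, 2.0], [3.0, 4.0]]
S1 = observed_cov(A, (1.0, 1.0))
S2 = observed_cov(A, (2.0, 5.0))
recovered = recover_directions(S1, S2)
```

If both coordinates had the same variance ratio, the two eigenvalues would coincide and the eigenvectors would no longer be unique, which is why distinct ratios are needed.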
Results
The paper establishes that the latent coordinate system can be identified up to permutation and scaling under the conditions of distinct variance ratios in two diagonal diffusion regimes. The proposed estimator successfully recovers the latent structure in both synthetic and real-world scenarios, confirming the theoretical identifiability boundary.
Implications
The findings have significant implications for causal representation learning in continuous-time systems, particularly in fields such as climate science, biology, and healthcare, where understanding latent dynamics is crucial for accurate modeling and prediction.
EVOM: Agentic Meta-Evolution of Actor-Critic Architectures for Reinforcement Learning
Reinforcement Learning
Large Language Models
Optimization
- EVOM automates the design of actor-critic architectures in reinforcement learning.
- The framework uses a bi-level optimization approach combining low-fidelity PPO and an LLM-based design agent.
- Experimental results show significant performance improvements over traditional and state-of-the-art methods.
- Ablation studies validate the importance of both the meta-evolution loop and the LLM design agent.
Summary
The paper introduces EVOM, an innovative framework for automating the design of actor-critic architectures in reinforcement learning (RL). Traditional methods rely on manually designed architectures, which can be inefficient and overlook potential improvements. EVOM addresses two main challenges: the high computational cost of evaluating candidate architectures and the open-ended nature of architecture design. It employs a bi-level optimization approach, where an inner loop utilizes low-fidelity proximal policy optimization (PPO) to train weights, while an outer loop employs a large language model (LLM)-based design agent to evolve architecture programs. This decoupling allows for efficient architecture refinement without direct policy execution. Experimental results demonstrate that EVOM significantly outperforms manually designed architectures, LLM-guided random searches, and the state-of-the-art MLES method on benchmark environments Ant-v4 and HalfCheetah-v4. Ablation studies confirm the necessity of both the meta-evolution loop and the LLM design agent for achieving superior performance.
Methodology
EVOM employs a bi-level optimization framework where the inner loop trains candidate architectures using low-fidelity PPO, while the outer loop evolves architecture programs through an LLM-based design agent. This approach allows for efficient exploration of the architecture space without the need for exhaustive training of each candidate.
Results
The experiments conducted on Ant-v4 and HalfCheetah-v4 demonstrate that EVOM outperforms manually designed architectures, LLM-guided random searches, and the MLES programmatic policy search method. The results indicate that EVOM achieves superior performance metrics, confirming the effectiveness of the proposed framework.
Implications
The EVOM framework could reduce the manual effort involved in designing neural architectures for reinforcement learning, leading to more efficient and effective learning agents. Its approach could be applied to other areas of machine learning where architecture design plays a critical role.
Physics-guided Convolutional Neural Network for Domain Growth Prediction in Systems with Conserved Kinetics
Theory
Efficient ML
- Introduction of an attention-based, physics-guided CNN for modeling phase separation.
- Incorporation of conservation constraints in the loss function to maintain order parameter consistency.
- Demonstration of the model's ability to predict long-term dynamics accurately.
- Validation of the model against known growth laws, confirming its physical relevance.
Summary
This paper presents a novel approach to modeling the spatiotemporal evolution of phase separation in binary mixtures governed by the Cahn–Hilliard equation using an attention-based, physics-guided convolutional neural network (CNN). The authors highlight the limitations of traditional numerical solvers and previous machine learning models in preserving the conservation of order parameters during long-term predictions. The proposed model incorporates a conservation constraint into its loss function and utilizes an attention mechanism to enhance the capture of global patterns in the evolving microstructure. The model is trained to predict the full time-evolution of phase separation, demonstrating stability and accuracy over extended time frames for both critical and off-critical mixtures. The results indicate that the model effectively captures domain growth consistent with the Lifshitz–Slyozov law, showcasing its potential as a surrogate model for complex dynamical systems governed by conserved kinetics.
Methodology
The authors developed a physics-guided convolutional neural network inspired by the residual U-Net architecture. The model integrates a conservation constraint directly into the loss function and employs an attention mechanism to capture global microstructural patterns. The training process involved generating datasets based on the Cahn–Hilliard equation and evaluating the model's performance on both training and validation datasets.
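A conservation constraint in the loss can be as simple as penalizing any drift in the spatial mean of the order parameter. The sketch below assumes that form, with fields flattened to plain lists and a hypothetical penalty weight. The summary does not state the exact constraint used in the paper.

```python
from typing import List


def conservation_aware_loss(pred: List[float], target: List[float],
                            weight: float = 1.0) -> float:
    """Mean squared error plus a penalty on the change in the field's mean.

    For conserved (Cahn-Hilliard type) dynamics the spatial average of the
    order parameter should stay constant, so a prediction whose mean drifts
    from the reference is penalized even if the pointwise error is small.
    """
    n = len(pred)
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    mean_gap = sum(pred) / n - sum(target) / n
    return mse + weight * mean_gap ** 2


target_field = [0.5, -0.5, 0.25, -0.25]
shifted_field = [v + 0.1 for v in target_field]   # uniform drift in composition

loss_exact = conservation_aware_loss(target_field, target_field, weight=10.0)
loss_shifted = conservation_aware_loss(shifted_field, target_field, weight=10.0)
```

A uniform shift of 0.1 contributes 0.01 through the pointwise error and ten times that through the conservation term, so the penalty weight controls how strongly composition drift is discouraged relative to local accuracy.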
Results
The trained model demonstrated stable and accurate predictions of phase separation over long time periods, preserving the mixture composition throughout evolution. It effectively captured the growth of domain size and was consistent with the Lifshitz–Slyozov domain-growth law, validating its effectiveness as a surrogate model for systems with conserved kinetics.
Implications
The proposed framework has significant implications for modeling complex dynamical systems in physics, chemistry, and biology, where traditional numerical methods are computationally expensive. It opens avenues for further research into machine learning applications in phase-field modeling and other areas requiring conservation of quantities.
COCOLogic-V2: Identifying Logical Inconsistencies via Truly Hard-Negatives
Computer Vision
Interpretability
- Introduction of COCOLogic-V2, enhancing the scope of visual inductive reasoning tasks.
- Dataset categorization into positive variants, near-boundary, and far-from-boundary negatives for better model evaluation.
- Current interpretable models perform well on clear cases but fail on near-boundary samples, indicating a lack of true logical understanding.
- COCOLogic-V2-FS provides a resource for few-shot learning in complex reasoning tasks.
Summary
The paper introduces COCOLogic-V2, an object-centric dataset designed for visual inductive reasoning on real-world images, expanding upon the limitations of its predecessor, COCOLogic. The new dataset encompasses a broader range of first-order logical operations, including object counting and comparisons, and reframes the task as multilabel classification to mitigate class imbalance and shortcut learning. COCOLogic-V2 categorizes samples into positive variants, near-boundary (NB) negatives, and far-from-boundary (FB) negatives, which serve as diagnostic tools for assessing model accountability. Evaluations reveal that while models can effectively separate positive and FB samples, they struggle with NB samples, indicating a reliance on statistical patterns rather than a true understanding of the underlying logical rules. The paper also presents COCOLogic-V2-FS, a smaller version tailored for few-shot learning scenarios. Overall, the findings underscore the ongoing challenges in visual inductive reasoning and highlight the need for improved methods in this area.
Methodology
The authors developed COCOLogic-V2 by reformulating the task as multilabel classification and introducing a new sampling procedure that categorizes samples into positive variants, near-boundary (NB) negatives, and far-from-boundary (FB) negatives. They evaluated various model families on both COCOLogic-V2 and its few-shot version, COCOLogic-V2-FS, to assess their performance in visual inductive reasoning tasks.
Results
The evaluations demonstrated that while models could reliably distinguish between positive and FB samples, they consistently struggled with NB samples. This suggests that the models are leveraging statistical correlations rather than comprehending the logical rules required for accurate reasoning. The challenges posed by perceptual noise and large search spaces were also noted in few-shot learning contexts.
Implications
The findings from this research have significant implications for the development of interpretable machine learning models, particularly in high-stakes domains where accountability and transparency are crucial. COCOLogic-V2 serves as a foundational dataset for future research aimed at enhancing visual inductive reasoning capabilities in AI systems.
Flexformer: Flexible Linear Transformer with Learnable Attention Kernel
NLP
Efficient ML
Theory
- Flexformer utilizes learnable attention kernels to achieve linear complexity in attention mechanisms.
- The model can learn a wide variety of attention patterns, including softmax attention.
- Flexformer demonstrates superior performance in language modeling and sequence classification tasks.
- It shows strong kernel transferability across different domains.
Summary
Flexformer introduces a novel approach to the Transformer architecture by leveraging learnable attention kernels constructed through random Fourier features. Traditional Transformer models face scalability issues due to the quadratic complexity of their attention mechanisms, which limits their application to long sequences. Flexformer addresses this challenge by employing kernel-based linear attention that allows for linear time and space complexity while maintaining high expressiveness. The model treats spectral frequencies as trainable parameters, enabling it to learn a diverse range of attention kernels in a fully data-driven manner. Flexformer includes both stationary and nonstationary variants, with the latter providing enhanced expressiveness. The paper demonstrates that Flexformer can effectively recover softmax attention through distillation from pretrained Transformers and exhibits strong kernel transferability across different domains. Extensive experiments on language modeling and sequence classification tasks reveal that Flexformer consistently outperforms existing linear attention baselines, showcasing its efficiency and competitive performance on long-sequence tasks.
Methodology
Flexformer employs random Fourier features to construct learnable attention kernels. It optimizes spectral frequencies as parameters directly from data, allowing for a flexible and expressive family of kernels. The model includes both stationary and nonstationary variants, enhancing its capability to model complex attention patterns.
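The linear-complexity claim rests on a reordering of the attention product, which the following minimal sketch illustrates. It is not the paper's implementation: the frequencies in `freqs` are fixed by hand here (in Flexformer they would be trained), the attention is left unnormalized for simplicity, and all function names are hypothetical.

```python
import math

def rff_features(x, freqs):
    """Map x to [cos(w.x), sin(w.x)] / sqrt(m) over the m frequency vectors w."""
    m = len(freqs)
    proj = [sum(wi * xi for wi, xi in zip(w, x)) for w in freqs]
    scale = 1.0 / math.sqrt(m)
    return [scale * math.cos(p) for p in proj] + [scale * math.sin(p) for p in proj]

def linear_attention(queries, keys, values, freqs):
    """Unnormalized kernel attention in O(n): build S = sum_j phi(k_j) v_j^T once."""
    phi_k = [rff_features(k, freqs) for k in keys]
    d_feat, d_val = len(phi_k[0]), len(values[0])
    summary = [[0.0] * d_val for _ in range(d_feat)]
    for pk, v in zip(phi_k, values):
        for a in range(d_feat):
            for b in range(d_val):
                summary[a][b] += pk[a] * v[b]
    out = []
    for q in queries:
        pq = rff_features(q, freqs)
        out.append([sum(pq[a] * summary[a][b] for a in range(d_feat))
                    for b in range(d_val)])
    return out

def quadratic_attention(queries, keys, values, freqs):
    """Same result computed the O(n^2) way: one kernel weight per (query, key) pair."""
    out = []
    for q in queries:
        pq = rff_features(q, freqs)
        weights = [sum(a * b for a, b in zip(pq, rff_features(k, freqs)))
                   for k in keys]
        out.append([sum(w * v[b] for w, v in zip(weights, values))
                    for b in range(len(values[0]))])
    return out

# Hand-picked frequencies stand in for the learned spectral parameters.
freqs = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.9]]
Q = [[0.1, 0.2], [0.4, -0.3]]
K = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(linear_attention(Q, K, V, freqs))
```

Both functions return the same values; only the order of multiplication differs, so the per-query cost in `linear_attention` no longer grows with the number of keys. With this feature map, the inner product of two feature vectors equals the average of cos(w·(x − y)) over the frequencies, which is why changing the frequencies changes the stationary kernel.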
Results
Flexformer outperforms existing linear attention baselines in extensive experiments on language modeling and sequence classification tasks. It effectively approximates softmax attention through distillation and demonstrates strong performance in long-sequence tasks, achieving significant improvements in efficiency and memory consumption.
Implications
The development of Flexformer has significant implications for tasks requiring the processing of long sequences, such as natural language processing and time series analysis. Its ability to learn flexible attention patterns can enhance model performance across various domains, making it a valuable tool for researchers and practitioners in machine learning.
RecallRisk-BERT: A Multi-Task Framework for Post-Report Medical Device Recall Triage
NLP
Multimodal
- Introduces RecallRisk-BERT, a multi-task learning framework for medical device recall triage.
- Utilizes a large dataset of FDA recall records to improve prediction accuracy for recall severity and root causes.
- Demonstrates that joint modeling of severity and root-cause categories enhances performance compared to single-task models.
- Achieves high accuracy and strong consistency with observed data, indicating the effectiveness of text–tabular learning.
Summary
The paper addresses the challenges associated with medical device recalls, which are crucial for patient safety but have become increasingly complex due to the growing volume of FDA recall records. Existing research has primarily focused on either predicting recall occurrences or analyzing root causes, often neglecting the joint modeling of recall severity and root-cause categories. To tackle this, the authors propose RecallRisk-BERT, a multi-task framework that utilizes 54,165 FDA medical device recall records from openFDA, spanning from 2002 to October 2025. The framework integrates textual representations from PubMedBERT with structured categorical features such as product code and medical specialty to predict both recall severity (Class I/II/III) and root-cause categories (9 classes) simultaneously. The study evaluates classical machine learning models and boosting-based approaches, finding that the LightGBM-based text–tabular configuration achieved the highest performance in single-task severity prediction. In the multi-task setting, RecallRisk-BERT significantly outperformed the baseline model, demonstrating strong consistency with observed root-cause severity patterns. These findings suggest that the proposed framework can enhance post-report recall triage, support regulatory decision-making, and facilitate model-based root-cause risk analysis.
Methodology
The authors developed RecallRisk-BERT, which combines textual representations from PubMedBERT with structured categorical features. They evaluated classical machine learning models and boosting-based methods for predicting recall severity and root causes, comparing single-task and multi-task learning approaches.
Results
The LightGBM-based text–tabular configuration achieved an accuracy of 0.963, macro-F1 of 0.856, and ROC-AUC of 0.974 in single-task severity prediction. In the multi-task setting, RecallRisk-BERT outperformed the PubMedBERT baseline, with model-derived risk rankings showing strong correlation with observed root-cause severity patterns (ρ = 0.983, p = 1.936 × 10⁻⁶).
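A gap between accuracy and macro-F1, as in the figures above, typically arises when classes are imbalanced, because macro-F1 weights every class equally. The sketch below shows the arithmetic on invented toy labels; the data and function names are illustrative assumptions and do not come from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of exactly matching labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class that is never hit scores 0."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy, imbalanced severity labels (not real recall data).
y_true = ["II"] * 18 + ["I", "III"]
y_pred = ["II"] * 18 + ["II", "III"]  # the single Class I case is missed

print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Here one missed minority-class case barely moves accuracy but pulls macro-F1 down sharply, since the missed class contributes an F1 of zero to the average.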
Implications
The findings suggest that RecallRisk-BERT can significantly improve the efficiency and accuracy of post-report medical device recall triage, aiding regulatory bodies in decision-making and enhancing patient safety through better risk assessment.
TerraProbe: A Layered-Oracle Framework for Detecting Deceptive Fixes in LLM-Assisted Terraform
Large Language Models
- Introduction of TerraProbe, a five-layer oracle framework for evaluating LLM-generated Terraform repairs.
- Demonstration that deceptive fixes are systemic across multiple LLMs, with rates ranging from 57.1% to 71.4%.
- Development of a formal taxonomy for deceptive fixes, validated with high inter-rater reliability.
- Statistical analysis revealing significant discrepancies between superficial success metrics and deeper evaluation criteria.
Summary
The paper introduces TerraProbe, a five-layer oracle evaluation framework designed to detect deceptive fixes in security repairs generated by large language models (LLMs) in Terraform Infrastructure-as-Code (IaC). The authors argue that existing evaluations of LLM-assisted repairs are inadequate as they often consider a repair successful merely when a targeted static-analysis finding is removed, neglecting the importance of planning validity, behavioral comparison, and alignment with security intent. TerraProbe evaluates 288 first-pass LLM-generated repairs from three models (gemini-2.5-flash-lite, GPT-4o, and Claude 3.5 Sonnet) across real-world and controlled defect modules. Statistical analyses reveal significant differences in plan-comparison reachability and demonstrate that deceptive-fix rates are consistent across models, indicating a systemic issue rather than model-specific failures. The paper also proposes a taxonomy of deceptive fixes and analyzes potential mechanisms behind their occurrence. The findings highlight the inadequacy of relying solely on targeted finding removal as a success criterion, emphasizing the need for a more robust evaluation framework in IaC security repair.
Methodology
The authors employed a five-layer oracle evaluation framework, conducting statistical comparisons using chi-square tests and Fisher's exact tests on LLM-generated repairs. They analyzed the repairs across real-world and controlled defect modules, focusing on the validity of plans and alignment with security intent.
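One way to picture a layered oracle is as a record of independent checks, where a fix counts as deceptive if it passes the shallow check but fails a deeper one. The sketch below is an assumption-laden illustration: the five field names are paraphrased from this summary rather than taken from the paper, and the `is_deceptive` rule is a guess at the spirit of the taxonomy, not its formal definition.

```python
from dataclasses import dataclass

@dataclass
class RepairVerdict:
    # Hypothetical layer names paraphrased from the summary; not TerraProbe's API.
    targeted_finding_cleared: bool  # the flagged static-analysis finding is gone
    full_scan_clean: bool           # no findings remain in a full scanner run
    plan_valid: bool                # the repaired module still produces a valid plan
    behavior_preserved: bool        # plan comparison shows no unintended changes
    security_intent_met: bool       # the fix satisfies the underlying security goal

def is_deceptive(verdict: RepairVerdict) -> bool:
    """Assumed rule: looks fixed at the surface, fails at least one deeper layer."""
    deeper = (
        verdict.full_scan_clean,
        verdict.plan_valid,
        verdict.behavior_preserved,
        verdict.security_intent_met,
    )
    return verdict.targeted_finding_cleared and not all(deeper)

# A repair that silences the finding but breaks planning.
suppressed = RepairVerdict(True, False, False, False, False)
genuine = RepairVerdict(True, True, True, True, True)
print(is_deceptive(suppressed), is_deceptive(genuine))
```

Under this reading, a repair that never clears the targeted finding is simply a failed repair, not a deceptive one, which is why the surface check gates the result.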
Results
The study found that while 83.3% of repairs cleared targeted Checkov findings, only 10.4% achieved full-scanner cleanliness, and valid plans were produced in 39.6% of cases. Among real-world repairs, 71.4% were identified as deceptive fixes, indicating a significant gap between superficial success and actual security compliance.
Implications
The findings suggest that relying solely on targeted finding removal as a measure of success in IaC security repairs is inadequate. The TerraProbe framework can be applied to enhance the evaluation of automated repair systems, ensuring that security intent is satisfied and reducing the risk of deceptive fixes in cloud deployments.
Deterministic Pareto-Optimal Policy Synthesis for Multi-Objective Reinforcement Learning
Reinforcement Learning
Optimization
Theory
- Introduction of a preference-conditioned Bellman operator for MOMDPs.
- Proof of convergence to the Pareto-optimal values and coverage of the Pareto frontier.
- Extraction of deterministic policies from converged Q-estimates with single-step transition memory.
- Empirical validation demonstrating the algorithm's ability to recover complex trade-offs.
Summary
This paper addresses the challenge of balancing multiple conflicting objectives in real-world decision-making using Multi-Objective Reinforcement Learning (MORL). Traditional reinforcement learning methods often simplify this problem by aggregating rewards into a single scalar signal, which can overlook important trade-offs. The authors propose a novel preference-conditioned Bellman operator based on Chebyshev scalarization to compute deterministic Pareto-optimal policies for Multi-Objective Markov Decision Processes (MOMDPs). They demonstrate that this operator satisfies an enveloping property, ensuring that the estimated value functions upper-bound the true Pareto frontier and converge to a coverage set of this frontier. The paper details how to extract deterministic policies from the converged Q-estimates, allowing for the recovery of policies that correspond to any given preference while maintaining approximate Pareto-optimality. Experimental results validate the effectiveness of the proposed algorithm in capturing complex trade-offs and recovering a comprehensive set of deterministic Pareto-optimal policies.
Methodology
The authors develop a model-based Bellman operator parameterized by preference weights, leveraging Chebyshev scalarization. They provide theoretical proofs for the operator's convergence and derive error bounds. The methodology includes extracting deterministic policies from the converged Q-estimates, ensuring that these policies cover the entire Pareto frontier.
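To see why Chebyshev scalarization matters here, consider a Pareto-optimal action that no linear weighting can ever select. The sketch below is a toy illustration, not the paper's operator: the Q-vectors, the utopia point, and the function names are all assumptions chosen for the example.

```python
def chebyshev_score(value, weights, utopia):
    """Weighted Chebyshev distance to the utopia point (lower is better)."""
    return max(w * (z - v) for w, z, v in zip(weights, utopia, value))

def chebyshev_greedy(q_vectors, weights, utopia):
    """Pick the action whose value vector is closest to utopia in Chebyshev terms."""
    return min(q_vectors, key=lambda a: chebyshev_score(q_vectors[a], weights, utopia))

def linear_greedy(q_vectors, weights):
    """Pick the action that maximizes the weighted sum of objectives."""
    return max(q_vectors, key=lambda a: sum(w * v for w, v in zip(weights, q_vectors[a])))

# Toy two-objective Q-vectors. "C" is non-dominated but lies in a concave
# region of the frontier, below the line joining "A" and "B".
q_vectors = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.4, 0.4)}
utopia = (1.0, 1.0)  # assumed best achievable value per objective

print(chebyshev_greedy(q_vectors, (0.5, 0.5), utopia))
print({linear_greedy(q_vectors, (w / 10, 1 - w / 10)) for w in range(11)})
```

With equal weights the Chebyshev rule selects the balanced action "C", whereas the weighted sum only ever returns "A" or "B" across the whole weight grid. This is the standard motivation for Chebyshev-style scalarization when the frontier of deterministic policies is not convex.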
Results
The experimental results show that the proposed algorithm successfully converges to the Pareto frontier, effectively capturing all trade-offs and recovering a set of deterministic Pareto-optimal policies for various preference scenarios. The results confirm the algorithm's capability to synthesize policies that are approximately non-dominated.
Implications
This work has significant implications for various domains requiring multi-objective decision-making, such as robotics, circuit design, and recommender systems. The ability to synthesize deterministic Pareto-optimal policies can enhance decision-making processes by providing a comprehensive understanding of trade-offs and preferences.
Localizing RL-Induced Tool Use to a Single Crosscoder Feature
NLP
Large Language Models
Reinforcement Learning
- Introduces Dedicated Feature Crosscoders (DFC) to isolate RL-specific features in language models.
- Demonstrates a +31.1% improvement in tool correctness through encode-decode reconstruction.
- Identifies capability spillover, achieving a +6.8% increase in tool-correctness in a frozen model.
- Shows that steering a single A-exclusive neuron can lead to a +65.0% improvement in tool-correctness.
Summary
This paper investigates how reinforcement learning (RL) fine-tuning modifies the internal representations of language models, specifically focusing on tool use capabilities. The authors introduce Dedicated Feature Crosscoders (DFC) to isolate a compact set of RL-specific features that enhance tool-calling abilities in the Qwen2.5-3B model. Through a systematic hyperparameter sweep of 48 crosscoder variants, they demonstrate that the DFC architecture effectively concentrates RL-induced capabilities into a minimal feature set, allowing for runtime behavioral control of agentic language models. The study reveals a significant improvement in tool correctness and identifies a phenomenon termed 'capability spillover,' where the frozen base model benefits from the RL fine-tuning without additional retraining. The findings underscore the potential of DFC-based model diffing as a tool for mechanistic interpretability and behavioral modulation in language models.
Methodology
The authors trained and evaluated 48 crosscoder variants, including DFC and standard crosscoders, through a hyperparameter sweep. They utilized a systematic approach to partition the model's features into exclusive and shared categories, applying gradient masking to enforce exclusivity. The training involved a combination of general-domain samples and specific instruction-output pairs for tool invocation.
Results
The study found that DFC architecture significantly enhances the tool-calling capabilities of the Qwen2.5-3B model, with notable improvements in tool correctness metrics. The capability spillover effect was observed, allowing the frozen base model to benefit from the RL fine-tuning without additional training. Steering a single neuron within the A-exclusive partition led to substantial improvements in performance, indicating the effectiveness of feature-level interventions.
Implications
The findings suggest that DFC-based model diffing could be a valuable approach for understanding and modulating the internal representations of language models, particularly in enhancing their agentic behaviors. This could have applications in developing more interpretable and controllable AI systems that can perform complex tasks involving tool use.
RSPC: A Benchmark for Modeling Stress and Psychiatric Conditions in Digitally Mediated Relationships using Psychiatrist Annotations
NLP
Large Language Models
- Introduction of RSPC, the first benchmark linking psychiatric conditions with relational stressors in digital communication.
- Utilization of psychiatrist annotations for a clinically grounded approach to mental health modeling.
- Benchmarking of various transformer models and LLMs reveals distinct capabilities in handling relationally contextualized mental health tasks.
- Strong associations found between anxiety disorders and relational uncertainties, emphasizing the importance of context in psychiatric inference.
Summary
The paper introduces the Relational Stress and Psychiatry Corpus (RSPC), a novel benchmark aimed at modeling stress and psychiatric conditions within digitally mediated relationships, specifically focusing on long-distance relationships (LDRs). Unlike previous NLP approaches that treat mental health issues as isolated phenomena, RSPC incorporates relational contexts by analyzing 1,799 Reddit posts annotated by psychiatrists for various diagnostic categories, including conditions such as anxiety and depression, relational stressors, and relationship phases. The authors benchmark seven fine-tuned transformer models and five large language models (LLMs) on tasks such as multi-label disorder classification, relational trigger detection, and temporal phase prediction. The findings reveal significant differences in model performance across tasks, with Claude-3-Haiku achieving the highest disorder classification performance (Macro-F1 = 0.538) and GPT-4o excelling in relational trigger detection (Macro-F1 = 0.519). The study highlights strong correlations between anxiety disorders and chronic relational uncertainty, advocating for a shift from individual-centric to context-aware mental health modeling that captures the social and temporal dynamics of distress.
Methodology
The study employs a clinically informed annotation framework developed in collaboration with licensed psychiatrists, resulting in a dataset of 1,799 Reddit posts annotated for DSM-5-TR and ICD-11 aligned psychiatric categories, relational stressors, and relationship phases. The authors benchmark multiple transformer models and LLMs across various tasks to evaluate their performance in context-aware mental health modeling.
Results
The results indicate that model performance varies significantly across tasks, with Claude-3-Haiku achieving the best performance in disorder classification (Macro-F1 = 0.538) and GPT-4o showing the strongest results in relational trigger detection (Macro-F1 = 0.519). The study also finds a notable correlation between anxiety disorders and chronic relational uncertainty.
Implications
The RSPC benchmark supports the development of more nuanced NLP models that consider relational contexts in mental health, potentially leading to improved diagnostic tools and therapeutic interventions in digitally mediated relationships. It encourages a paradigm shift in mental health research from individual-centric approaches to those that account for interpersonal dynamics.
HybridCodec: Modeling Discrete and Continuous Representations for Efficient Speech Language Models
Audio & Speech
Multimodal
Large Language Models
- HybridCodec combines discrete and continuous audio representations to mitigate information loss.
- The architecture includes a hybrid Transformer that supports both autoregressive and non-autoregressive predictions.
- Experimental results show significant improvements in speaker characteristic retention and reduced autoregressive steps.
- The framework effectively handles multiple speech tasks within a single model.
Summary
The paper introduces HybridCodec, a novel framework that combines discrete audio representations with continuous residuals to enhance the performance of speech language models. Traditional discrete audio representations often lead to performance degradation in downstream tasks due to information loss during discretization. The proposed HybridCodec addresses this issue by integrating temporally compressed discrete tokens with dimensionality-reduced continuous residuals. The architecture includes a hybridized discrete-continuous focal modulation codec and a hybrid Transformer model, which performs autoregressive inference in the discrete domain while allowing for non-autoregressive prediction and continuous residual upsampling. The results demonstrate that HybridCodec significantly improves the retention of speaker characteristics compared to discrete-only methods and reduces the number of autoregressive steps required during inference. This framework enables efficient processing of various speech tasks, including automatic speech recognition (ASR) and text-to-speech (TTS), within a unified model, thus restoring fine-grained information lost in traditional discrete language models.
Methodology
The methodology involves the development of HybridCodec, which extends the FocalCodec architecture. It utilizes a combination of discrete tokens and continuous residuals to model audio data. The hybrid Transformer, HybridLM, processes these representations through interleaved autoregressive and non-autoregressive decoding, allowing for efficient inference and improved performance in speech tasks.
Results
Experimental evaluations on the LibriTTS dataset indicate that HybridCodec outperforms discrete-only baselines, particularly at low frame rates (e.g., 6.25 Hz), while significantly reducing the number of autoregressive steps needed for inference.
Implications
The proposed HybridCodec framework has the potential to enhance various applications in speech processing, including ASR and TTS, by providing a more efficient and effective means of integrating discrete and continuous audio representations. This could lead to advancements in multimodal systems and improved user experiences in voice-based applications.
CPAgents: Agentic Composite Phenotype Generation for Cardiac Disease Association
Interpretability
- CPAgents automates the discovery of composite phenotypes, enhancing the ability to capture non-linear effects and interactions.
- The framework consists of three agents: Analyst, Proposer, and Verifier, which work iteratively to construct and validate phenotypes.
- Evaluation on a large cardiac imaging dataset showed that CPAgents outperformed baseline methods in disease discrimination.
- The discovered phenotypes are interpretable and reproducible, providing clear evidence trails for clinical application.
Summary
The paper introduces CPAgents, a novel framework designed to enhance the identification of associations between cardiac imaging phenotypes and clinical diseases. Traditional phenome-wide association studies (PheWAS) often rely on predefined single-variable phenotypes, which limits their ability to capture complex, non-linear relationships and interactions among phenotypes. CPAgents addresses these limitations by employing an iterative process involving three specialized agents: an Analyst that identifies statistical patterns and suggests transformations, a Proposer that generates medically and statistically motivated composite phenotypes, and a Verifier that evaluates these candidates based on multi-stage criteria. The framework was evaluated on a large cardiac imaging cohort, demonstrating significant improvements in disease discrimination across various classifier-disease-metric combinations. The results indicate that CPAgents can produce compact, interpretable phenotype formulas with transparent evidence trails, facilitating more robust phenotype-disease association discoveries beyond traditional expert-driven methods.
Methodology
The CPAgents framework utilizes an iterative process involving three agents: the Analyst profiles statistical properties of the data, the Proposer generates candidate composite phenotypes based on these insights, and the Verifier assesses the candidates using cross-validation and statistical tests to ensure robustness and interpretability. This approach allows for the automatic construction of composite phenotypes while maintaining numerical safety and clinical relevance.
Results
In the evaluation of CPAgents on a population-scale cardiac imaging cohort, the framework achieved the top rank in 56 out of 72 classifier-disease-metric combinations, significantly outperforming baseline methods. Gains were observed across all nine clinical disease categories, indicating that the composite phenotypes generated by CPAgents markedly improve disease discrimination.
Implications
The CPAgents framework could support cardiovascular research by enabling scalable and interpretable phenotype discovery, which may lead to improved risk stratification and a better understanding of disease mechanisms. Its application could extend beyond cardiology to other areas of medical research where complex phenotype interactions are critical.
fTNN: a tensor neural network for fractional PDEs
Theory
Optimization
- Introduction of fTNN, a novel tensor neural network for fractional PDEs.
- Development of a deterministic integration framework for the fractional Laplacian.
- Construction of boundary-singularity-aware trial functions to enhance solution accuracy.
- Design of a spatiotemporally separable neural network for time-dependent fractional PDEs.
Summary
The paper introduces fTNN, a deterministic tensor neural network method designed to solve fractional partial differential equations (PDEs) involving the fractional Laplacian on bounded domains. The authors focus on the fractional Poisson equation and time-dependent fractional advection-diffusion equations. The fTNN method employs a geometry-adapted integration split that divides the fractional Laplacian into three components: a singular near field, a regular interior far field, and an analytical exterior far field. Each component is treated with specific quadrature techniques, leading to a fully deterministic integration framework. To handle low-regularity solutions and associated loss functions, the authors develop boundary-singularity-aware trial functions and propose strategies for selecting leading exponents and evaluating loss functions based on the singularity structure. For time-dependent PDEs, a spatiotemporally separable neural network is designed to factorize the residuals into low-dimensional integrals, integrated with an alternating neural network subspace optimization strategy for efficient training. Numerical experiments demonstrate that fTNN achieves high accuracy on various benchmarks, outperforming existing methods like fPINN and Monte Carlo baselines, especially in scenarios with strong boundary singularities and long-time simulations.
Methodology
The fTNN method employs a geometry-adapted integration split to decompose the fractional Laplacian into singular, regular, and analytical components. Specific quadrature methods are applied to each component, and boundary-singularity-aware trial functions are constructed to address low-regularity solutions. For time-dependent PDEs, a separable neural network structure is utilized, combined with an alternating optimization strategy for training.
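For orientation, a three-way split of this kind can be written down from the standard singular-integral definition of the fractional Laplacian. The form below is a generic sketch under two assumptions that the summary does not state: a homogeneous exterior condition (u = 0 outside Ω) and a ball B_r(x) contained in Ω. The paper's geometry-adapted split may differ in its details.

```latex
\begin{aligned}
(-\Delta)^{s} u(x)
  &= C_{d,s}\,\mathrm{P.V.}\!\int_{\mathbb{R}^{d}}
     \frac{u(x)-u(y)}{|x-y|^{d+2s}}\,dy \\
  &= C_{d,s}\Bigg[
     \underbrace{\mathrm{P.V.}\!\int_{B_r(x)}
       \frac{u(x)-u(y)}{|x-y|^{d+2s}}\,dy}_{\text{singular near field}}
   + \underbrace{\int_{\Omega\setminus B_r(x)}
       \frac{u(x)-u(y)}{|x-y|^{d+2s}}\,dy}_{\text{regular interior far field}}
   + \underbrace{u(x)\int_{\mathbb{R}^{d}\setminus\Omega}
       \frac{dy}{|x-y|^{d+2s}}}_{\text{exterior far field}}
     \Bigg]
\end{aligned}
```

The third term loses its dependence on u(y) because u vanishes outside Ω, so only a geometric integral of the kernel remains, which is what makes an analytical treatment plausible for simple domains.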
Results
The numerical experiments indicate that the fTNN framework achieves high accuracy across various benchmark problems, significantly outperforming existing methods such as fPINN and Monte Carlo approaches, particularly in cases with strong boundary singularities and during long-time simulations.
Implications
The fTNN method has the potential to advance the numerical treatment of fractional PDEs, which are prevalent in modeling anomalous transport and nonlocal diffusion phenomena. Its deterministic approach may lead to more reliable and efficient solutions in scientific computing and engineering applications.
Effective Covariance Dynamics in Solvable High-Dimensional GANs
Generative Models
Theory
Optimization
- Derivation of high-dimensional effective covariance dynamics for multi-feature GANs.
- Identification of a spectral solvable region that governs learning stability and recovery.
- Demonstration of a signal-boosting mechanism through low-rank correlations.
- Empirical validation showing improved recovery of data-driven subspaces with informed generator covariance.
Summary
This paper investigates a solvable high-dimensional model of Generative Adversarial Networks (GANs) where a linear generator learns from data characterized by structured latent covariance. Previous analyses of solvable GANs typically assumed unconditional signals with diagonal latent covariance. The authors extend this framework to accommodate class-dependent, correlated, and non-zero-mean latent structures. They demonstrate that the dynamics of the training process can be captured by a probability-weighted effective second moment, leading to deterministic ordinary differential equations (ODEs) in the high-dimensional limit. The study reveals a stability analysis that identifies a mode-wise solvable interval determined by learning rates and noise levels. The authors highlight a signal-boosting mechanism where low-rank correlations can enhance weak directions above the learnability threshold, while excessive correlations can destabilize recovery. Numerical simulations validate the proposed ODE and phase boundaries, and experiments on datasets such as MNIST, FashionMNIST, and CIFAR-10 show that informed generator covariance significantly improves alignment with the data-driven reference subspace.
Methodology
The authors utilize a theoretical framework to derive the macroscopic dynamics of GAN training in high dimensions, focusing on the effective covariance derived from class-dependent latent structures. They conduct stability analyses and numerical simulations to explore the dynamics and validate their findings against empirical data from standard datasets.
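For a class-conditional latent variable, the probability-weighted second moment of the mixture is the sum over classes of p_c (Σ_c + μ_c μ_cᵀ). The sketch below computes that quantity for a toy two-class example; treating this as the paper's "effective covariance" is an assumption, and the function name is hypothetical.

```python
def effective_second_moment(probs, means, covs):
    """Mixture second moment E[z z^T] = sum_c p_c * (Sigma_c + mu_c mu_c^T)."""
    d = len(means[0])
    eff = [[0.0] * d for _ in range(d)]
    for p, mu, cov in zip(probs, means, covs):
        for i in range(d):
            for j in range(d):
                eff[i][j] += p * (cov[i][j] + mu[i] * mu[j])
    return eff

# Two equally likely classes with identity covariance and opposite means.
identity = [[1.0, 0.0], [0.0, 1.0]]
probs = [0.5, 0.5]
means = [[1.0, 0.0], [-1.0, 0.0]]
covs = [identity, identity]

print(effective_second_moment(probs, means, covs))
```

In this toy case the class means cancel in the overall mean, yet they still raise the second moment along the first axis from 1 to 2. That is a small instance of non-zero-mean structure strengthening a direction, in the spirit of the signal-boosting mechanism described above.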
Results
The study establishes that the training dynamics of GANs with structured latent covariance can be described by a deterministic ODE, with the effective covariance playing a critical role in learning stability. The results indicate that learning initiation is governed by the leading effective eigenvalue, and full recovery requires all effective modes to remain within a specific solvable interval. The experiments confirm that using informed generator covariance leads to better alignment with the reference subspace and coherent class-conditional outputs.
Implications
The findings suggest that incorporating structured covariance into GAN training can enhance model performance, particularly in applications requiring nuanced data representation. This work could influence future GAN architectures and training methodologies, especially in fields like image generation and conditional modeling.
PEBS: Per-rater Empirical-Bayes Shrinkage for RLHF Reward-Model Calibration
Reinforcement Learning
- PEBS provides a solution to the calibration issues in RLHF by allowing for per-rater adjustments.
- The method utilizes empirical-Bayes shrinkage to improve the accuracy of individual annotator calibrations.
- PEBS significantly reduces RMSE in reward model predictions compared to traditional pooled approaches.
- The approach is validated on multiple datasets, demonstrating its robustness and applicability.
Summary
This paper introduces PEBS, a novel per-rater empirical-Bayes shrinkage estimator designed to improve the calibration of reward models in Reinforcement Learning from Human Feedback (RLHF). Traditional reward models aggregate preferences from numerous annotators into a single global model, which can obscure individual differences in rating scales and biases. PEBS addresses this by fitting individual affine calibrators for each annotator based on a subset of their ratings and applying empirical-Bayes shrinkage towards the population mean. This approach allows for the estimation of annotator-specific calibration without the need to retrain the underlying reward model. The effectiveness of PEBS is demonstrated through experiments on the PRISM dataset, where it achieved an 8.58% reduction in within-user held-out RMSE compared to the pooled population-slope baseline. Additionally, similar results were replicated on the PluriHarms harm ratings dataset, showing a 9.66% RMSE reduction. The method highlights the importance of recognizing and addressing calibration heterogeneity among annotators in RLHF reward modeling.
Methodology
PEBS employs a per-rater empirical-Bayes shrinkage estimator that fits individual affine calibrators for each annotator based on a held-out slice of their ratings. It applies the Morris–James–Stein empirical-Bayes shrinkage technique to adjust these calibrators towards the population mean, all in closed form without retraining the reward model.
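The shrinkage step can be sketched roughly as follows. This is a simplified illustration, not the paper's estimator: it assumes an ordinary least-squares affine fit per rater and a method-of-moments estimate of the between-rater variance, and it shrinks only the slopes. The function names and the toy data are made up for the example.

```python
from statistics import mean

def fit_affine(scores, ratings):
    """Least-squares fit ratings ~ a + b * scores; returns (a, b, var_b)."""
    n = len(scores)
    sx, sy = mean(scores), mean(ratings)
    sxx = sum((x - sx) ** 2 for x in scores)
    b = sum((x - sx) * (y - sy) for x, y in zip(scores, ratings)) / sxx
    a = sy - b * sx
    resid = [y - (a + b * x) for x, y in zip(scores, ratings)]
    sigma2 = sum(r * r for r in resid) / max(n - 2, 1)
    return a, b, sigma2 / sxx  # var_b: sampling variance of the slope

def shrink_slopes(slopes, variances):
    """Pull each slope toward the mean slope; noisier fits are pulled harder."""
    mu = mean(slopes)
    total_var = sum((b - mu) ** 2 for b in slopes) / (len(slopes) - 1)
    tau2 = max(total_var - mean(variances), 0.0)  # between-rater variance estimate
    shrunk = []
    for b, v in zip(slopes, variances):
        weight = tau2 / (tau2 + v) if (tau2 + v) > 0 else 0.0
        shrunk.append(mu + weight * (b - mu))
    return shrunk

# Toy data (hypothetical): reward-model scores and each rater's ratings.
raters = {
    "r1": ([0, 1, 2, 3], [0.1, 0.9, 2.2, 2.9]),
    "r2": ([0, 1, 2, 3], [0.0, 2.1, 3.9, 6.2]),
    "r3": ([0, 1, 2, 3], [1.0, 1.4, 2.1, 2.4]),
}
fits = [fit_affine(s, r) for s, r in raters.values()]
slopes = [b for _, b, _ in fits]
variances = [v for _, _, v in fits]
shrunk = shrink_slopes(slopes, variances)
```

Everything here is closed form, so the reward model itself is never retrained; the same shrinkage could be applied to the intercepts.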
Results
The implementation of PEBS resulted in an 8.58% reduction in RMSE on the PRISM dataset and a 9.66% reduction on the PluriHarms harm ratings dataset when compared to the pooled population-slope baseline, indicating significant improvements in reward model calibration.
Implications
The findings suggest that recognizing and adjusting for individual annotator biases can lead to more accurate and reliable reward models in RLHF applications. This could enhance the performance of RL systems that rely on human feedback, ultimately leading to better alignment with user preferences.
Dangerous Liaisons of Convex Learning and Non-Affine Aggregation
Theory
Optimization
- Monotonicity of aggregated gradients is preserved only by positively affine aggregation rules.
- Non-affine aggregation leads to failures in last-iterate convergence and increased instability in algorithms.
- The paper provides a unified theoretical framework for understanding the limitations of non-affine aggregation in convex learning.
- Sufficient conditions for restoring monotonicity in non-affine aggregation are identified.
Summary
This paper investigates the relationship between convex learning and non-affine aggregation in first-order optimization algorithms. The authors demonstrate that the monotonicity of the update operator, which is crucial for last-iterate convergence and generalization guarantees, is preserved only under positively affine aggregation rules. They prove that non-affine aggregation leads to a violation of monotonicity, resulting in failures of steady convergence and increased algorithmic instability. The paper provides a theoretical framework that explains the limitations of non-affine aggregation in modern learning systems, which often prioritize constraints like adaptivity, privacy, and fairness. The authors also identify sufficient conditions under which monotonicity can be restored, offering insights into the design of aggregation rules that can maintain desirable properties in optimization processes.
Methodology
The authors employ theoretical analysis to establish the relationship between aggregation rules and monotonicity in gradient updates. They formulate an impossibility theorem that characterizes the conditions under which monotonicity is preserved and analyze the implications of non-affine aggregation on convergence and stability.
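A small numerical check can make the monotonicity property concrete. The sketch below is illustrative only: the two convex quadratics, the test points, and the choice of per-gradient norm clipping as the non-affine rule are assumptions made for this example, not constructions from the paper. An operator g is monotone when the inner product of g(x) - g(y) with x - y is non-negative for all x and y.

```python
import math

def grad_f1(x):
    # f1(x) = 0.5 * (x1^2 + 100 * x2^2), a convex quadratic
    return (x[0], 100.0 * x[1])

def grad_f2(x):
    # f2(x) = 0.5 * (x1^2 + x2^2), a convex quadratic
    return (x[0], x[1])

def clip(g, c=1.0):
    """Rescale g so that its Euclidean norm is at most c (non-affine)."""
    norm = math.hypot(g[0], g[1])
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return (g[0] * scale, g[1] * scale)

def mean_agg(grads):
    """Plain average: a positively affine aggregation rule."""
    n = len(grads)
    return (sum(g[0] for g in grads) / n, sum(g[1] for g in grads) / n)

def clipped_mean_agg(grads):
    """Average of clipped gradients: a non-affine aggregation rule."""
    return mean_agg([clip(g) for g in grads])

def monotonicity_gap(agg, x, y):
    """Inner product <agg(x) - agg(y), x - y>; negative means non-monotone."""
    gx = agg([grad_f1(x), grad_f2(x)])
    gy = agg([grad_f1(y), grad_f2(y)])
    return (gx[0] - gy[0]) * (x[0] - y[0]) + (gx[1] - gy[1]) * (x[1] - y[1])

x, y = (1.0, 0.1), (1.5, 0.2)
gap_mean = monotonicity_gap(mean_agg, x, y)          # positive for this pair
gap_clip = monotonicity_gap(clipped_mean_agg, x, y)  # negative for this pair
```

For this pair of points the averaged gradient satisfies the inequality, while the clipped average violates it even though both underlying functions are convex.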
Results
The main result is an impossibility theorem stating that only positively affine aggregation universally preserves monotonicity. This finding indicates that non-affine aggregation can lead to convergence failures and instability in first-order optimization algorithms, which are commonly used in machine learning.
Implications
The findings suggest that while non-affine aggregation can be beneficial for enforcing specific constraints in learning systems, it may compromise the theoretical guarantees of convergence and stability. This has implications for the design of future optimization algorithms, particularly in contexts where robustness and fairness are prioritized.
OverFlowLight: Real-Time Gridlock Prevention and Traffic Signal Optimization for Urban Intersections
Reinforcement Learning
Optimization
Computer Vision
- OverFlowLight effectively detects and mitigates traffic overflow in real-time.
- The framework integrates multi-modal sensing for accurate overflow detection.
- Dynamic overflow phases are generated to clear blocking queues, improving traffic flow.
- Deployment across 43 intersections in three cities reduced overflow incidents by 60.4% and increased network throughput by 18.2% relative to baseline systems.
Summary
The paper introduces OverFlowLight, a novel framework aimed at preventing traffic gridlock at urban intersections by addressing the issue of queue overflow. Traditional traffic signal control (TSC) algorithms often prioritize throughput but fail to manage overflow during peak traffic hours, leading to severe congestion and safety risks. OverFlowLight employs a real-time detection mechanism that utilizes multi-modal sensing from cameras and radars to identify overflow conditions. Upon detection, it dynamically generates dedicated overflow phases to clear blocking queues, integrating a hybrid control design that combines rule-based interventions with reinforcement learning (RL) for long-term efficiency. The framework was deployed across 43 intersections in three major cities, demonstrating its compatibility with existing RL-based TSC agents. Empirical results indicate a 60.4% reduction in overflow incidents and an 18.2% increase in network throughput compared to baseline systems. This work represents a significant advancement in traffic management, providing a scalable and data-driven solution to enhance urban transportation systems.
Methodology
OverFlowLight employs a three-stage pipeline: (1) real-time overflow detection using radar and camera data, (2) construction of overflow-clearing phases based on detected overflow directions, and (3) selection of safe phases for intervention using either traditional or RL-based controllers.
Results
The deployment of OverFlowLight resulted in a 60.4% decrease in overflow incidents and an 18.2% increase in network throughput compared to existing traffic signal control systems. The framework also minimized the need for manual traffic management interventions.
Implications
OverFlowLight has the potential to significantly improve urban traffic management by providing a scalable solution to prevent gridlock, thus enhancing safety and efficiency in urban transportation systems. Its modular design allows for integration with existing traffic control systems, making it a practical choice for cities facing congestion challenges.
Stochastic Gradient Optimization with Model-Assisted Sampling
Optimization
Efficient ML
Theory
- Introduces a model-assisted sampling framework to reduce variance in stochastic gradient estimation.
- Bridges concepts from machine learning optimization and survey sampling theory.
- Empirical results show gains in 71-86% of experiments, particularly for medium-sized input spaces.
- The proposed method integrates with existing optimizers, enhancing efficiency without altering their dynamics.
Summary
This paper addresses the challenge of variance in stochastic gradient estimation during machine learning optimization, particularly in deep learning where mini-batch methods like stochastic gradient descent (SGD) are commonly used. The authors propose a model-assisted sampling framework that leverages survey sampling theory to interpret mini-batch gradients as sample-based estimates from a finite population. By integrating auxiliary gradient-prediction models, the proposed method aims to create more efficient gradient estimators, reducing the inherent noise associated with stochastic gradient estimates. The framework is designed to work seamlessly with existing optimizers, enhancing their efficiency without altering their fundamental dynamics. Empirical evaluations on synthetic and benchmark datasets demonstrate significant performance improvements in 71-86% of the experiments, especially in medium-sized input spaces. Notably, when combined with momentum-based optimizers like AdamW, the new estimator shows improved generalization within fewer training epochs compared to traditional estimators.
Methodology
The authors develop a model-assisted sampling framework that interprets mini-batch gradients through survey sampling theory. This involves treating the dataset as a finite population and using auxiliary gradient-prediction models to construct more efficient gradient estimators. The framework allows for uniform sampling as a special case when no auxiliary information is utilized.
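One classical model-assisted estimator from survey sampling is the difference estimator, sketched below with scalar "gradients" for brevity. Whether the paper uses exactly this form is an assumption; the function name and the toy population are hypothetical. The estimate is the population mean of the auxiliary predictions plus the sample mean of the residuals, which stays unbiased under simple random sampling whatever the quality of the predictions.

```python
import random

def difference_estimator(values, predictions, sample_idx):
    """Estimate mean(values) from a sample, assisted by predictions for every unit."""
    n_pop = len(values)
    pred_mean = sum(predictions) / n_pop  # predictions are available for all units
    resid_mean = sum(values[i] - predictions[i] for i in sample_idx) / len(sample_idx)
    return pred_mean + resid_mean

rng = random.Random(0)
values = [float(i % 7) for i in range(50)]  # stand-in for per-example gradients
sample_idx = rng.sample(range(50), 10)      # a mini-batch drawn uniformly
pop_mean = sum(values) / len(values)

# No auxiliary information: the estimator reduces to the plain mini-batch mean.
est_plain = difference_estimator(values, [0.0] * 50, sample_idx)

# Perfect predictions: residuals vanish and the estimate equals the population mean.
est_perfect = difference_estimator(values, values, sample_idx)
```

The closer the predictions track the true per-example values, the smaller the residual variance and hence the noise in the estimate.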
Results
The empirical results indicate that the proposed method outperforms traditional stochastic gradient estimators in 71-86% of the experiments conducted on synthetic data and six benchmark datasets. The method particularly excels in medium-sized input spaces and generalizes better with momentum-based optimizers like AdamW, reaching superior performance in fewer training epochs than baseline estimators.
Implications
The findings suggest that integrating model-assisted sampling techniques into existing optimization frameworks can significantly enhance the efficiency and effectiveness of training deep learning models. This approach may lead to faster convergence and improved generalization, making it a valuable contribution to the field of machine learning optimization.
Escaping Iterative Parameter-Space Noise: Differentially Private Learning with a Hypernetwork
Theory
Efficient ML
Generative Models
- Introduces a hypernetwork-based framework for differentially private learning that reduces noise impact.
- DP-DeepSets architecture generates model parameters from a low-dimensional dataset embedding with a single noise injection.
- Theoretical analysis shows improved utility compared to traditional DP-SGD methods.
- Demonstrates superior performance in LoRA fine-tuning of diffusion models using limited private data.
Summary
This paper addresses the challenges of differentially private (DP) training of neural networks, particularly the excessive noise introduced by gradient-based methods like DP-SGD. The authors propose a novel framework that utilizes a hypernetwork to generate model parameters from a private dataset without iterative optimization in parameter space. Instead of repeatedly injecting noise into high-dimensional gradients, their approach involves embedding each data point into a low-dimensional representation, aggregating these embeddings, and adding noise only once to create a differentially private dataset embedding. The hypernetwork, trained on public datasets, then generates the target model parameters from this noisy embedding. Theoretical comparisons demonstrate that this method, termed DP-DeepSets, achieves higher utility than traditional DP-SGD under a fixed privacy budget. Additionally, the authors apply their framework to LoRA fine-tuning of diffusion models, showing improved performance over DP-SGD and other public-data-guided methods, particularly in generating high-quality outputs from limited private data.
Methodology
The proposed method, DP-DeepSets, involves embedding private dataset examples into low-dimensional representations, aggregating these embeddings, and adding differential privacy noise to create a DP dataset embedding. A hypernetwork, trained on public datasets, then generates the model parameters from this noisy embedding, avoiding iterative updates and high-dimensional noise injection.
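The single noise injection can be sketched as follows. This is a hedged illustration rather than the paper's procedure: it assumes per-example norm clipping, sum aggregation, and the classical Gaussian mechanism calibration (which is stated for epsilon below 1). The per-example embedding network and the hypernetwork are not shown, and all names are hypothetical.

```python
import math
import random

def clip_to_norm(vec, max_norm):
    """Rescale vec so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]

def dp_dataset_embedding(embeddings, max_norm, epsilon, delta, rng):
    """Clip, sum, then add Gaussian noise once; returns (noisy_sum, sigma)."""
    clipped = [clip_to_norm(e, max_norm) for e in embeddings]
    dim = len(clipped[0])
    total = [sum(e[j] for e in clipped) for j in range(dim)]
    # Adding or removing one example changes the sum by at most max_norm (L2).
    sigma = max_norm * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    return [t + rng.gauss(0.0, sigma) for t in total], sigma

rng = random.Random(0)
# Stand-in for the low-dimensional per-example embeddings of a private dataset.
embeddings = [[rng.uniform(-2, 2) for _ in range(4)] for _ in range(100)]
noisy, sigma = dp_dataset_embedding(
    embeddings, max_norm=1.0, epsilon=0.5, delta=1e-5, rng=rng
)
```

Because noise enters once, in the embedding dimension, its scale does not grow with the number of model parameters or optimization steps, which is the contrast with DP-SGD that the summary draws.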
Results
The results indicate that models generated using DP-DeepSets outperform those trained with DP-SGD in terms of utility, particularly in scenarios with limited private data. The method also achieves lower Fréchet Inception Distance (FID) scores in LoRA fine-tuning of diffusion models compared to DP-SGD and other public-data-guided methods.
Implications
This framework has significant implications for privacy-preserving machine learning, particularly in scenarios where data is scarce. It allows for effective model training while maintaining privacy, making it suitable for applications in sensitive domains such as healthcare and finance.
PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs
Large Language Models
Efficient ML
Optimization
- PersistentKV improves long-context LLM serving by optimizing KV-cache management.
- The system employs a native block-table decode engine to enhance efficiency.
- An adaptive scheduling policy selects between PersistentKV and FlashInfer based on workload characteristics.
- Workqueue scheduling significantly reduces launch fan-out, improving throughput.
Summary
The paper presents PersistentKV, a decode scheduling engine for serving autoregressive large language models (LLMs) on commodity GPUs that targets inefficiencies in key-value (KV) cache management. LLM decoding is limited by KV-cache movement rather than by dense matrix multiplication performance. Existing systems such as FlashInfer provide optimized paged decode attention, but they do not always yield the best performance across workloads. PersistentKV introduces a block-table decode engine that maps work by KV-head group and reuses KV tiles across grouped query heads. A compact workqueue schedule executes only non-empty tasks, which reduces launch fan-out during decode steps. The paper shows that an adaptive policy that selects between PersistentKV and FlashInfer based on the active batch size improves throughput across workload scenarios. The results indicate that work assignment plays a crucial role in serving-system performance and that adaptive scheduling matters for LLM inference.
Methodology
The methodology involves developing a native block-table decode engine that supports grouped-query attention (GQA) and implementing a compact workqueue for paged decode. The system is evaluated against FlashInfer using synchronized wall timing, CUDA-event timing, and workload counters to assess performance across different workload scenarios.
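The idea of a compact workqueue can be illustrated on the host side. The sketch below is a guess at the general shape, not PersistentKV's actual data structures: the `WorkItem` layout, the chunking of pages, and the dense-grid comparison are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkItem:
    seq: int         # sequence index in the batch
    kv_head: int     # one item serves every query head sharing this KV head
    page_ids: tuple  # physical KV pages read by this item

def build_workqueue(block_tables, num_kv_heads, pages_per_item):
    """Enumerate only the (sequence, KV head, page chunk) tasks that have data."""
    queue = []
    for seq, pages in enumerate(block_tables):
        for start in range(0, len(pages), pages_per_item):
            chunk = tuple(pages[start:start + pages_per_item])
            for kv_head in range(num_kv_heads):
                queue.append(WorkItem(seq, kv_head, chunk))
    return queue

def dense_grid_size(block_tables, num_kv_heads, pages_per_item):
    """Task count when every sequence is padded to the longest one."""
    max_pages = max(len(p) for p in block_tables)
    chunks = -(-max_pages // pages_per_item)  # ceiling division
    return len(block_tables) * num_kv_heads * chunks

# Hypothetical block tables: one long sequence, two short ones, one empty slot.
block_tables = [[3, 7, 1, 9, 4, 2, 8, 5], [6], [], [0, 10]]
queue = build_workqueue(block_tables, num_kv_heads=2, pages_per_item=2)
dense = dense_grid_size(block_tables, num_kv_heads=2, pages_per_item=2)
```

With uneven sequence lengths the compact queue holds 12 items where a padded grid would launch 32, which is the kind of fan-out reduction the summary describes.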
Results
The implementation of PersistentKV resulted in a throughput improvement of 1.063–1.265× on B8 workloads and 1.399× on B1 workloads compared to FlashInfer. The adaptive policy effectively selects the optimal kernel schedule based on batch size, avoiding performance regressions in certain scenarios.
Implications
The findings suggest that adaptive page-aware decode scheduling can significantly enhance the efficiency of LLM serving on commodity hardware, making it more feasible for real-world applications. This approach could lead to better resource utilization and lower latency in LLM inference systems.
Implementation of reinforcement learning in chemical reaction networks: application to phototaxis as curiosity-driven exploration
Reinforcement Learning
Robotics
Theory
- Introduces a framework combining partially observable Markov decision processes (POMDPs) with chemical reaction network (CRN) dynamics for modeling phototaxis.
- Demonstrates that tumbling behavior in algae is an adaptive strategy for information acquisition.
- Uses Inverse Reinforcement Learning (IRL) to derive a phototactic policy from experimental trajectory data.
- Establishes a connection between sensory geometry, subjective inference, and biochemical implementation.
Summary
This paper explores the integration of reinforcement learning (RL) within chemical reaction networks (CRNs) to model phototaxis in unicellular algae, specifically focusing on how organisms navigate using incomplete sensory information. The authors propose a framework that connects a Partially Observable Markov Decision Process (POMDP) with biochemical reaction dynamics, allowing for the modeling of internal states that balance light orientation and exploratory behavior. By employing Inverse Reinforcement Learning (IRL) on recorded trajectories of Chlamydomonas, the study infers a behavioral objective consistent with observed phototactic motion. The model demonstrates that run-tumble behavior serves as an information-acquisition strategy, enabling cells to resolve sensory ambiguity and adaptively navigate their environment. The findings suggest that intracellular biochemical networks can facilitate adaptive information-seeking behaviors, linking cellular chemistry with minimal cognition in microbial navigation.
Methodology
The authors formulated the phototaxis problem as a subjective POMDP, utilizing experimental data from Chlamydomonas trajectories to perform Inverse Reinforcement Learning. They implemented the model using Chemical-Reaction-Network Ordinary Differential Equations (CRN–ODEs) to simulate internal dynamics and decision-making processes.
Results
The model successfully reproduced the empirical alignment-to-light distribution observed in Chlamydomonas, showing comparable performance to standard Stochastic Simulation Algorithm (SSA) baselines. The findings indicate that tumbling is not merely stochastic noise but a strategic behavior for resolving sensory ambiguity and optimizing navigation.
Implications
This research has implications for understanding how simple biochemical processes can lead to complex adaptive behaviors in living organisms. It bridges the gap between reinforcement learning theories and biological navigation, potentially influencing the design of bio-inspired algorithms and systems in robotics and artificial life.
Dual-Learning based Penalized Multi-Align Clustering for Multi-View Incomplete and Disorderly Data
Multimodal
- Introduces DLPMAC, a model for aligning and fusing incomplete multimodal data.
- Utilizes Dual-Learning to maintain semantic and structural consistency across modalities.
- Employs a penalty mechanism to improve alignment accuracy and prevent excessive sample aggregation.
- Demonstrates effectiveness experimentally, with real-world applications such as boiler combustion monitoring in view.
Summary
This paper addresses the challenges of multimodal feature fusion in scenarios where data is incomplete and disordered due to factors like equipment failures and inconsistent sensor sampling. The authors propose a novel model called Dual-Learning based Penalized Multi-Align Clustering (DLPMAC) that improves data alignment and fusion accuracy. The model incorporates a Dual-Learning mechanism to capture both semantic and structural information from different modalities, ensuring consistency across local and global levels. Additionally, the Penalized Multi-Align module allows for multi-to-multi data alignment through a penalty mechanism, which enhances the accuracy of data pair alignments and mitigates the risk of excessive sample aggregation. Experimental results demonstrate the effectiveness of DLPMAC in overcoming alignment and fusion challenges, validating its potential for real-world applications such as boiler combustion monitoring.
Methodology
The DLPMAC model integrates a Dual-Learning mechanism to leverage the inherent knowledge of each modality's data, focusing on both semantic and structural aspects. The Penalized Multi-Align module facilitates multi-to-multi data alignment using a penalty approach, allowing for more accurate pairings of samples across modalities and enhancing overall data fusion performance.
Results
The experimental results indicate that DLPMAC significantly improves alignment accuracy and fusion performance compared to existing methods, effectively addressing the challenges posed by incomplete and disordered multimodal data. The model's ability to maintain consistency and enhance data pairing accuracy was validated through various tests.
Implications
The proposed model has significant implications for industries relying on multimodal data, such as monitoring systems in manufacturing and environmental applications. By improving data alignment and fusion, DLPMAC can enhance decision-making processes and operational efficiency in real-time monitoring scenarios.
Halt Fast! Early Stopping for Certified Robustness
Theory
Efficient ML
Computer Vision
- Introduces a meta-learning framework for anytime-valid certified robustness.
- Achieves a 20-fold reduction in sample complexity compared to traditional Randomized Smoothing (RS) methods.
- Enables adaptive termination conditions based on application-specific risk thresholds.
- Demonstrates potential for real-time applications in safety-critical environments.
Summary
This paper addresses the computational inefficiencies associated with Randomized Smoothing (RS) for certified robustness in neural networks. While RS offers rigorous guarantees against adversarial examples, its high sample complexity has limited its practical application. The authors propose a meta-learning framework that allows for anytime-valid certified robustness, significantly reducing the number of required model evaluations. By employing a lightweight meta-learner to predict image-specific priors, the proposed method achieves a 20-fold reduction in sample complexity while maintaining statistical guarantees. The framework also introduces task-adaptive termination conditions, enabling dynamic allocation of computational resources based on application-specific risk thresholds. Experimental results demonstrate that the new approach can construct certifications in under 500 samples, making it viable for real-time safety-critical applications.
Methodology
The authors extend E-value certifications to a mixture-based multiple-hypothesis framework and introduce a sample-adaptive meta-learning approach. A lightweight meta-learner predicts the prior distribution of success rates for the smoothed model, optimizing the efficiency of the certification process. Additionally, they implement task-adaptive termination conditions that allow for early stopping based on predefined domain tasks.
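A generic mixture E-value test with early stopping can be sketched as below. It is not the paper's certification procedure: the fixed uniform weights stand in for the prior that the meta-learner would predict, and the alternatives, threshold, and noise level are made-up values. The test rejects H0: p <= p0 as soon as the mixture of likelihood ratios reaches 1/alpha, which is valid at any stopping time by Ville's inequality; the radius uses the standard randomized-smoothing formula sigma * Phi^{-1}(p0).

```python
from statistics import NormalDist

def certify_early(outcomes, p0, alpha, alternatives, weights):
    """Sequentially test H0: p <= p0. Returns (certified, samples_used)."""
    e_values = [1.0] * len(alternatives)
    threshold = 1.0 / alpha
    used = 0
    for used, hit in enumerate(outcomes, start=1):
        for k, q in enumerate(alternatives):
            # Likelihood ratio of alternative q (> p0) against the boundary p0.
            e_values[k] *= (q / p0) if hit else ((1.0 - q) / (1.0 - p0))
        mixture = sum(w * e for w, e in zip(weights, e_values))
        if mixture >= threshold:
            return True, used  # stop as soon as the evidence suffices
    return False, used

p0, alpha, sigma = 0.8, 0.001, 0.25
alternatives = [0.9, 0.95, 0.99]  # candidate success rates above p0
weights = [1 / 3, 1 / 3, 1 / 3]   # assumed prior; must be fixed before sampling

# Each outcome: did the base classifier return the top class under noise?
certified, n_used = certify_early([True] * 100, p0, alpha, alternatives, weights)
radius = sigma * NormalDist().inv_cdf(p0) if certified else 0.0

rejected, n_all = certify_early([False] * 100, p0, alpha, alternatives, weights)
```

A prior concentrated near the true success rate makes the mixture grow faster, which is where a learned, input-specific prior would cut the number of samples.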
Results
The proposed method constructs certifications in fewer than 500 samples, a significant improvement over previous methods that required tens of thousands of evaluations. The approach maintains rigorous statistical guarantees while enabling adaptive resource allocation based on application-specific needs.
Implications
This research has significant implications for deploying certified robustness in real-time applications, such as autonomous driving and large-scale systems, where computational efficiency and adaptability are critical. The ability to dynamically allocate resources based on risk thresholds can enhance the safety and reliability of machine learning systems in practice.
GEOALIGN: Geometric Rollout Curation for Robust LLM Reinforcement Learning
Reinforcement Learning
Large Language Models
- Identification of directional inconsistency as a significant failure mode in online RL for LLMs.
- Introduction of GEOALIGN, a lightweight module for effective rollout curation.
- Demonstrated improvements in performance and stability over existing robust RL methods.
- GEOALIGN operates without requiring per-rollout policy gradients, ensuring efficiency.
Summary
The paper introduces GEOALIGN, a novel approach to stabilize online reinforcement learning (RL) for large language models (LLMs) by addressing a specific failure mode termed 'directional inconsistency.' This issue arises when a small number of high-reward rollouts produce conflicting update directions that destabilize training. GEOALIGN operates as a lightweight plug-in for rollout curation during iterative policy optimization. It forms preference pairs within prompts, learns to project hidden states to concentrate reward-ordered directions, and identifies directionally inconsistent rollouts to rectify them with stable alternatives. The method is designed to be efficient, requiring only forward passes and adding minimal overhead. The authors evaluate GEOALIGN on tasks such as dialogue alignment and mathematical reasoning, demonstrating its effectiveness in improving performance and reducing training oscillation compared to existing robust RL methods. The results indicate that leveraging latent directional consensus can serve as a reliable signal for enhancing online LLM RL stability.
Methodology
GEOALIGN employs a series of steps to curate rollouts: it forms within-prompt preference pairs, learns a projector to distill reward-ordered directions, builds a consensus prototype for the batch, and identifies directionally inconsistent rollouts for rectification using stable alternatives. This process is performed in a forward-pass manner, minimizing computational overhead.
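The consensus-and-flag step might look roughly like the sketch below. It is a loose illustration under strong assumptions: the learned projector is replaced by the identity, the consensus prototype is taken to be the normalized mean of unit directions, and a rollout pair is flagged when its cosine similarity to the prototype falls below a hypothetical threshold. None of these choices are confirmed by the summary.

```python
import math

def normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v] if n > 0 else list(v)

def cosine(u, v):
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

def flag_inconsistent(directions, threshold=0.0):
    """directions: one vector per preference pair (preferred minus dispreferred)."""
    unit = [normalize(d) for d in directions]
    dim = len(unit[0])
    # Consensus prototype for the batch: normalized mean of the unit directions.
    prototype = normalize([sum(u[j] for u in unit) / len(unit) for j in range(dim)])
    return [i for i, u in enumerate(unit) if cosine(u, prototype) < threshold]

# Toy 2-D directions: three pairs agree, the fourth points the opposite way.
directions = [[1.0, 0.1], [0.9, -0.2], [1.1, 0.3], [-1.0, 0.2]]
flagged = flag_inconsistent(directions)
```

Only vector arithmetic on hidden states is involved, consistent with the claim that no per-rollout policy gradients are needed.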
Results
GEOALIGN significantly enhances final performance and training stability in both dialogue alignment and mathematical reasoning tasks. It outperforms strong baselines such as PF-PPO, PAR, PODS, and Seed-GRPO, particularly under conditions of controlled reward corruption, indicating its robustness and effectiveness.
Implications
The findings suggest that GEOALIGN could be applied to various online RL scenarios involving LLMs, particularly in environments with noisy or misspecified rewards. Its lightweight nature makes it suitable for real-time applications where efficiency is critical.
Neural Architecture Search for Generative Adversarial Networks: A Comprehensive Review and Critical Analysis
Generative Models
Optimization
- NAS significantly improves the design and performance of GANs.
- Evolutionary algorithms and gradient-based methods show superior performance in certain contexts.
- Robust evaluation metrics are essential for accurately assessing GAN performance.
- Diverse datasets are crucial for evaluating the effectiveness of GAN architectures.
Summary
This paper provides a comprehensive review of Neural Architecture Search (NAS) techniques applied to Generative Adversarial Networks (GANs). It categorizes and compares various NAS methods based on search strategies, evaluation metrics, and performance outcomes. The authors highlight the advantages of NAS in enhancing GAN performance, stability, and efficiency while identifying limitations and future research directions. Key findings indicate that evolutionary algorithms and gradient-based methods often outperform others in specific contexts. The paper emphasizes the need for robust evaluation metrics beyond traditional scores like Inception Score (IS) and Fréchet Inception Distance (FID), as well as the importance of diverse datasets for assessing GAN performance. By structuring a comparison of existing NAS-GAN techniques, the authors aim to guide researchers in developing more effective NAS methods and advancing the field of GANs.
Methodology
The authors developed a framework to categorize and compare different NAS techniques based on key criteria identified through an extensive literature review. They conducted a critical analysis of existing NAS-GAN techniques, evaluating their approaches and outcomes.
Results
The review revealed that while NAS can enhance GAN performance, challenges remain, particularly in achieving stability during training. The analysis highlighted the effectiveness of certain search strategies and the necessity for improved evaluation metrics.
Implications
The findings of this review can guide future research in NAS for GANs, leading to the development of more effective architectures and methodologies. This could have significant applications in various fields, including image generation, medical imaging, and data augmentation.
Explaining Temporal Graph Neural Networks via Feature-induced Information Flow
Graph Learning
Interpretability
Time Series
- Introduction of a novel Event Relevance (ER) method that captures the entire information flow in Event-based Temporal Graph Neural Networks (ETGNNs).
- Extension of the Normalized Relevance Measure (NRM) framework to facilitate modular decomposition for complex networks.
- Demonstration of superior performance in qualitative and quantitative evaluations compared to existing explanation methods.
- Ability to analyze higher-order interactions among events, enhancing the interpretability of model predictions.
Summary
This paper addresses the challenge of explainability in Event-based Temporal Graph Neural Networks (ETGNNs), which are increasingly used in various applications such as social network analysis and epidemic tracing. Existing explanation methods typically focus on a limited subset of information flow, primarily tracing contributions from event-related embeddings to outputs, and neglect the significant pathways through event-induced variables that mediate interactions between nodes. To overcome this limitation, the authors propose a novel attribution method that analyzes the entire information flow through all event-associated variables. This method is built on the Normalized Relevance Measure (NRM) framework, allowing for explicit quantification of information flow from event embeddings and through event-induced variables. The authors extend the NRM framework with a modular decomposition procedure to systematically construct relevance structures for complex neural architectures. The proposed method is evaluated on synthetic datasets for epidemic tracing and social dynamics, as well as a real-world dataset of political event networks. The results demonstrate that the new method consistently outperforms existing approaches, providing more human-interpretable explanations and capturing long-range temporal dependencies effectively.
Methodology
The authors propose an Event Relevance (ER) framework that utilizes the Normalized Relevance Measure (NRM) to analyze information flow throughout the ETGNN architecture. This includes a modular decomposition procedure that simplifies the application of NRM to complex networks, allowing for the tracing of information flow through event-induced variables and the identification of higher-order interactions among events.
Results
The proposed ER method outperformed existing explanation methods in both qualitative and quantitative evaluations across synthetic datasets for epidemic tracing and social dynamics, as well as a real-world dataset of political events. The results indicate that the new method provides more accurate and interpretable explanations of model predictions.
Implications
The findings suggest that the proposed ER method can significantly enhance the interpretability of ETGNNs in high-stakes applications, where understanding the rationale behind predictions is crucial. This could lead to improved trust and transparency in AI systems used for social network analysis, epidemic modeling, and other dynamic relational data applications.
PerturbCellRL: Verifier-Guided Reinforcement Learning for Single-Cell Perturbation Prediction
Reinforcement Learning
Generative Models
- PerturbCellRL incorporates biological verifiers as reward functions to enhance single-cell perturbation predictions.
- The framework improves individual cell response plausibility while maintaining population-level distributional quality.
- Evaluation on genetic and chemical perturbation benchmarks shows significant improvements in reward-aligned metrics.
- The approach emphasizes the importance of biological consistency in generative modeling for computational biology.
Summary
The paper introduces PerturbCellRL, a novel reinforcement learning framework designed to enhance single-cell perturbation prediction by ensuring biological consistency in generated cell responses. Traditional generative models focus on population-level predictions but often fail to validate individual cell outputs against known biological behaviors. PerturbCellRL addresses this gap by employing a suite of biological verifiers that serve as reward functions during the post-training of a pretrained single-cell transcriptomic generator. These verifiers assess generated cells based on four criteria: Pearson top-k similarity, RMSE top-k proximity, DE Spearman, and Pathway activity, which collectively ensure that the generated responses align with expected biological outcomes. The authors evaluate PerturbCellRL on various genetic and chemical perturbation benchmarks, demonstrating significant improvements in reward-aligned metrics and maintaining competitive performance on population-level evaluations. The results highlight the importance of integrating biological checks into generative models, moving the field towards more reliable predictions that can support drug discovery and personalized medicine while reducing reliance on costly wet-lab experiments.
Methodology
PerturbCellRL utilizes a pretrained flow-matching generator and applies reinforcement learning to post-train the model using a suite of biological verifiers as reward signals. The model generates multiple candidate responses for a given perturbation, scores them based on the verifiers, and updates the generator to favor high-reward outputs. At inference, the pathway activity verifier is used for selecting the most biologically plausible predictions from the generated candidates.
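The generate, score, and select loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The function names `pearson_topk_reward` and `best_of_n` are hypothetical, and the Pearson top-k score is used here only as a stand-in reward, since the summary does not define the pathway activity verifier that the paper uses for selection at inference.

```python
import numpy as np

def pearson_topk_reward(candidate, reference, k):
    """Hypothetical verifier: Pearson correlation between candidate and
    reference, restricted to the k genes with the largest absolute
    reference value."""
    top = np.argsort(-np.abs(reference))[:k]
    return float(np.corrcoef(candidate[top], reference[top])[0, 1])

def best_of_n(candidates, reward_fn):
    """Score every candidate with reward_fn and return the index of the
    highest-reward candidate together with all rewards."""
    rewards = np.array([reward_fn(c) for c in candidates])
    return int(np.argmax(rewards)), rewards

# Toy data (assumed, not from the paper): one reference expression shift
# and two candidate responses sampled from some generator.
reference = np.array([2.0, -1.0, 0.1, 0.0, 3.0])
candidates = [-reference, 2.0 * reference]

best, rewards = best_of_n(
    candidates, lambda c: pearson_topk_reward(c, reference, k=3)
)
print(best, rewards)
```

During post-training, the same per-candidate rewards would drive the generator update. The summary does not name the specific reinforcement learning algorithm, so the sketch stops at scoring and selection.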
Results
PerturbCellRL demonstrates improved performance over the baseline flow-matching generator on reward-aligned evaluation metrics and a held-out evaluation metric across multiple genetic and chemical perturbation benchmarks. The best-of-N selection process further enhances the biological consistency of predictions without sacrificing distributional quality, keeping PerturbCellRL competitive with state-of-the-art methods.
Implications
The findings suggest that incorporating biological verifiers into generative models can significantly enhance the reliability of single-cell perturbation predictions, which is crucial for applications in drug discovery and personalized medicine. This approach may reduce the need for expensive wet-lab experiments by providing more accurate in silico predictions.
Learning to Reason with Curriculum II: Compositional Generalization
Theory
Reinforcement Learning
Large Language Models
- Compositional generalization is essential for effective reasoning in AI.
- An autocurriculum approach significantly reduces the learning complexity compared to direct methods.
- In supervised fine-tuning, the curriculum allows learning from only 2 e^O(√log T) tokens, which is subpolynomial in the sequence length T.
- In reinforcement learning, the curriculum relaxes the coverage requirement on the reference model.
Summary
This paper investigates compositional generalization, which is the ability to solve complex problems by combining solutions to simpler sub-problems, a crucial aspect of both natural and artificial intelligence. The authors explore the theoretical foundations of this capability, particularly in the context of learning to simulate semiautomata, a model that encompasses state tracking, regular language recognition, and modular arithmetic. They propose an autocurriculum-based approach that recursively decomposes longer sequences into shorter sub-problems, facilitating more efficient learning compared to direct methods. The study reveals that in a supervised fine-tuning setting, the curriculum allows learning from a significantly reduced number of tokens, achieving subpolynomial complexity in the sequence length. In a reinforcement learning context, the curriculum relaxes the coverage requirement on the reference model, demonstrating that compositional structure can guide learners in solving tasks beyond their initial capabilities. Overall, the findings highlight the efficiency of curriculum learning in enhancing compositional generalization and provide insights into the design of self-generated curricula.
Methodology
The authors employ an autocurriculum-based approach to decompose complex problems into simpler sub-problems. They analyze two settings: one inspired by supervised fine-tuning (iSFT) and another by reinforcement learning with verifiable rewards (RLVR). The methodology includes recursive problem decomposition, learning from interactive feedback, and leveraging compositional structures to enhance learning efficiency.
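The compositional structure behind the recursive decomposition can be illustrated on the simulation task itself. The sketch below is an assumption-laden toy, not the authors' learning algorithm: it shows only that the effect of a long input word on a semiautomaton equals the composition of the effects of its two halves, which is the property a curriculum over shorter sub-problems can exploit. The names `block_map` and `simulate_direct` and the mod-5 counter are hypothetical choices for illustration.

```python
def simulate_direct(delta, start, word):
    """Run the semiautomaton one symbol at a time."""
    state = start
    for symbol in word:
        state = delta(state, symbol)
    return state

def block_map(delta, n_states, word):
    """Return a tuple f where f[q] is the state reached from q after
    reading word, built by recursively splitting word in half and
    composing the maps of the two halves."""
    if len(word) == 0:
        return tuple(range(n_states))  # identity map
    if len(word) == 1:
        return tuple(delta(q, word[0]) for q in range(n_states))
    mid = len(word) // 2
    left = block_map(delta, n_states, word[:mid])
    right = block_map(delta, n_states, word[mid:])
    # Apply the left half first, then the right half.
    return tuple(right[left[q]] for q in range(n_states))

# Toy semiautomaton (assumed): modular addition over 5 states.
def add_mod_5(state, symbol):
    return (state + symbol) % 5

word = [3, 4, 1, 2, 4, 0, 3]
composed = block_map(add_mod_5, 5, word)
print(composed[0], simulate_direct(add_mod_5, 0, word))
```

Both calls agree on the final state for every start state, because each half is summarized by a map over states that is independent of what came before it.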
Results
The results demonstrate that the autocurriculum approach achieves dramatically better statistical complexity than direct methods. Specifically, in the iSFT setting, it allows for learning with significantly fewer tokens than required by direct simulation. In the RLVR setting, it reduces the coverage requirement on the reference model from the full sequence length to a much shorter block length, indicating a substantial improvement in learning efficiency.
Implications
The findings suggest that curriculum learning can be a powerful tool in enhancing the capabilities of AI models, particularly in tasks requiring complex reasoning. This has implications for the design of training protocols in various AI applications, potentially leading to more efficient learning systems that can tackle complex problems more effectively.