AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
45 papers today | updated every 8 hours | 7 days of history
Liquid Latent State Dynamics for Interpretable Turbofan Degradation Modeling
Time Series
Interpretability
- Introduces a liquid latent dynamics model for turbofan sensor forecasting.
- Factorizes latent state into degradation and condition components for better interpretability.
- Achieves improved sensor forecasting RMSE compared to GRU baseline, especially in complex conditions.
- Provides a clearer temporal degradation axis, enhancing interpretability of health states.
Summary
This paper presents a novel approach to prognostics in aircraft engine health monitoring using liquid neural networks as latent dynamics models. The authors address the challenge of accurately modeling the degradation process of turbofan engines while distinguishing it from variations in operating conditions. The proposed model encodes historical sensor data into a latent state, which is then evolved using a liquid transition model to predict future sensor observations. A key innovation is the factorization of the latent state into degradation and condition components, allowing for more interpretable health state trajectories. The model is evaluated on the C-MAPSS benchmark, demonstrating improved sensor forecasting accuracy compared to a GRU baseline, particularly in complex operating conditions. While the model provides interpretable degradation dynamics, it does not yet surpass the GRU in direct remaining useful life (RUL) regression, so it currently serves better as an interpretable model than as a calibrated lifetime predictor.
Methodology
The authors utilize liquid neural networks to create a latent dynamics model that encodes historical sensor data into a latent state. This state is evolved using a liquid transition model, and future observations are decoded from this state. The latent state is factorized into degradation and condition components, with specific losses applied to each to ensure meaningful separation and interpretation.
Results
The proposed model improved overall sensor forecasting RMSE from 0.2438 (GRU baseline) to 0.2266, with significant gains observed in the more complex subsets FD002 and FD004. The learned degradation state exhibited a Spearman correlation of 0.5960, indicating a clearer temporal degradation axis. However, the model did not outperform the GRU baseline in direct RUL regression.
Implications
This research suggests that liquid latent dynamics can enhance predictive maintenance by providing interpretable models of degradation dynamics, which are crucial for understanding the health trajectory of complex engineering systems. The findings may influence future work in prognostics and health management, particularly in industries reliant on predictive maintenance.
Black-Box Inference of LLM Architectural Properties with Restrictive API Access
Large Language Models
NLP
Theory
- NightVision can infer architectural properties of LLMs even with restrictive API access.
- The method combines common-set prompting and timing analysis to recover hidden dimension, depth, and parameter count.
- Empirical results show a mean relative error of 23% for hidden dimension and roughly 53% for depth and parameter count on models with more than three billion parameters.
- Current API restrictions are insufficient to fully obfuscate LLM architectural details.
Summary
This paper addresses the challenge of inferring architectural properties of large language models (LLMs) when access to their APIs is restricted. Despite commercial LLM providers limiting the information available through their APIs, the authors demonstrate that it is still possible to recover key architectural parameters such as hidden dimension, depth, and parameter count. They introduce an attack method called NightVision, which utilizes a novel prompting technique to gather log probabilities for a common set of output tokens, allowing for spectral analysis to estimate the hidden dimension. Additionally, the method incorporates timing measurements to infer depth and parameter count. The authors empirically validate NightVision on 32 open-source LLMs, achieving a mean relative error of 23% for hidden dimension estimation and 53% for depth and parameter count in larger models. The findings suggest that current API restrictions do not fully protect sensitive architectural details, indicating a need for enhanced security measures.
Methodology
The authors developed NightVision, a two-pronged inference algorithm. The first prong employs common-set prompting to recover the hidden dimension from single-logit API access. The second prong uses timing measurements to estimate depth and parameter count, leveraging the relationship between inference time and model architecture parameters.
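The spectral idea behind the first prong can be illustrated with a toy model. This is a minimal sketch, not the NightVision algorithm itself: it assumes the log-probabilities for a common token set have already been collected into a matrix, and it relies on the standard observation that logits are a linear map of a `d`-dimensional hidden state, so the matrix has numerical rank close to `d`. All sizes and the name `estimate_hidden_dim` are hypothetical.

```python
import numpy as np

def estimate_hidden_dim(logprobs, rel_tol=1e-8):
    """Estimate hidden size as the numerical rank of row-centered log-probs."""
    # Log-softmax subtracts one constant per prompt; centering each row over
    # the common token set removes that constant again.
    centered = logprobs - logprobs.mean(axis=1, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
d, vocab, n_prompts, n_common = 16, 200, 64, 48  # toy sizes (assumed)
H = rng.normal(size=(n_prompts, d))  # final hidden state per prompt
W = rng.normal(size=(vocab, d))      # unembedding matrix
logits = H @ W.T

# Numerically stable log-softmax over the full vocabulary.
m = logits.max(axis=1, keepdims=True)
logprobs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))

# Keep only the columns for the common token set, as a restricted API would.
common = rng.choice(vocab, size=n_common, replace=False)
estimated = estimate_hidden_dim(logprobs[:, common])
print(estimated)
```

In this noiseless toy the rank is recovered cleanly. Real API outputs are quantized and noisy, which is presumably one reason the reported error on real models is far from zero.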
Results
NightVision was evaluated on 32 open-source LLMs, successfully recovering the hidden dimension with an average relative error of 23% and achieving exact recovery in 4 cases. For models with over three billion parameters, depth and parameter count were estimated with a mean relative error of approximately 53%. The accuracy of these estimates varied based on the token budget provided to the algorithm.
Implications
The findings have significant implications for LLM API design and security, suggesting that simply restricting access to logit information is not enough to protect sensitive architectural details. This research may influence future strategies for model auditing and the development of more secure API frameworks.
BOUNDARY_SYNC: Measuring Communication-Induced Representational Coupling in Multi-Agent LLM Systems
Large Language Models
NLP
Multimodal
- Boundary_Sync provides a standardized measurement protocol for communication-induced coupling in LLMs.
- Text communication significantly homogenizes outputs (CAF=0.803), while image communication shows similar effects (CAF=0.834).
- Group size influences the direction of coupling, with smaller groups potentially leading to diversification.
- Coupling is stateless, dependent on immediate peer information, and does not accumulate over time.
Summary
This paper introduces Boundary_Sync, a measurement protocol designed to quantify the representational coupling induced by communication among large language models (LLMs) in multi-agent systems. The study investigates whether inter-agent communication leads to homogenization or diversification of outputs. The Coupling Amplification Factor (CAF) is introduced as a metric to measure this coupling, where a CAF value less than 1 indicates homogenization and greater than 1 indicates diversification. Controlled experiments using GPT-4o across text and image communication scenarios reveal significant findings: text communication leads to substantial homogenization (CAF=0.803), while image communication also results in homogenization (CAF=0.834) when compared to appropriate baselines. The study further explores how group size affects coupling direction, noting that with three agents, communication can lead to diversification (CAF > 1). The results suggest that coupling is stateless and driven by immediate peer information, with implications for the design of multi-agent LLM systems. Overall, the findings highlight the importance of understanding communication dynamics in multi-agent architectures to maintain representational diversity.
Methodology
The authors conducted controlled experiments using GPT-4o, measuring the Coupling Amplification Factor (CAF) across text and image communication scenarios. They employed a no-communication ablation and prompt perturbation controls to validate their findings. The study involved 30 agents per condition and approximately 9,900 API calls across three experimental runs.
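The summary does not state how CAF is computed. One reading consistent with "below 1 means homogenization, above 1 means diversification" is a ratio of output diversity with communication to output diversity in the no-communication baseline. The sketch below assumes that reading, using mean pairwise Euclidean distance between output embeddings as the diversity measure; both choices are assumptions, and the paper's definition may differ.

```python
import numpy as np

def mean_pairwise_distance(x):
    """Average Euclidean distance over all unordered pairs of rows."""
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    upper = np.triu_indices(len(x), k=1)
    return float(dists[upper].mean())

def coupling_amplification_factor(with_comm, baseline):
    """Assumed CAF: diversity with communication / diversity without."""
    return mean_pairwise_distance(with_comm) / mean_pairwise_distance(baseline)

rng = np.random.default_rng(0)
baseline = rng.normal(size=(30, 8))  # 30 agents, toy 8-d output embeddings
centroid = baseline.mean(axis=0)

# Pull every agent 20% of the way toward the group centroid.
homogenized = centroid + 0.8 * (baseline - centroid)
caf = coupling_amplification_factor(homogenized, baseline)
print(round(caf, 3))
```

Under this definition, shrinking all outputs toward the centroid scales every pairwise distance by the same factor, so the ratio directly reports the amount of homogenization.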
Results
The experiments demonstrated that text communication leads to significant homogenization (CAF=0.803), while image communication also results in homogenization (CAF=0.834) when compared to modality-specific baselines. Group size was found to be a critical moderator, with smaller groups showing a shift towards diversification (CAF > 1). The coupling was determined to be stateless, with no evidence of cumulative convergence across communication rounds.
Implications
The findings suggest that careful consideration of communication dynamics is essential in the design of multi-agent LLM systems to ensure that diversity is maintained. The results can inform strategies for optimizing agent interactions and managing representational diversity in collaborative tasks.
Regularized Variational and Spectral Log-Density-Ratio Estimation in the Gaussian Location Model
Theory
- Introduces ridge-regularized log-density-ratio estimation in a Gaussian location model.
- Derives high-dimensional asymptotic equivalents for variational and spectral estimators.
- Demonstrates that the variational estimator outperforms the spectral estimator with many observations.
- Identifies conditions under which the spectral estimator is favored due to lower variance.
Summary
This paper investigates ridge-regularized log-density-ratio estimation within the Gaussian location model, where two distributions are compared: one centered at zero and the other at a fixed signal strength. The authors present two estimation approaches: a variational estimator that employs empirical Kullback-Leibler (KL) divergence with a squared ℓ2 penalty, and a spectral estimator that reformulates the problem into a continuum of ridge-regularized least-squares problems. The study derives high-dimensional deterministic asymptotic equivalents as the number of observations and dimensions grow, revealing that the variational estimator generally has lower population risk with many observations, while the spectral estimator is preferred with fewer observations due to its lower variance. The paper also explores the use of a nuclear penalty for feature learning, providing a comprehensive analysis of the performance of both estimators under varying conditions.
Methodology
The paper employs a theoretical framework that includes the derivation of finite-sample identities for both regularized and unregularized estimators, alongside a population analysis to assess shrinkage bias. It utilizes the convex-Gaussian-min-max theorem (CGMT) for variational limits and deterministic equivalents for the spectral estimator, focusing on a Gaussian location model with defined aspect ratios.
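For orientation, the quantity being estimated has a simple closed form in this model. Assuming identity covariance and taking the shifted Gaussian as the numerator (both assumptions, since the summary does not fix them), the log-density ratio between N(mu, I) and N(0, I) is the linear function mu·x - ||mu||^2 / 2. The check below compares that closed form against a direct evaluation of the two log-densities.

```python
import numpy as np

def log_gaussian_density(x, mean):
    """Log-density of N(mean, I) evaluated at each row of x."""
    d = x.shape[-1]
    return -0.5 * np.sum((x - mean) ** 2, axis=-1) - 0.5 * d * np.log(2 * np.pi)

def true_log_ratio(x, mu):
    """Closed form of log N(x; mu, I) - log N(x; 0, I)."""
    return x @ mu - 0.5 * mu @ mu

rng = np.random.default_rng(0)
d = 5
mu = rng.normal(size=d)
x = rng.normal(size=(10, d))

direct = log_gaussian_density(x, mu) - log_gaussian_density(x, np.zeros(d))
closed_form = true_log_ratio(x, mu)
max_gap = float(np.max(np.abs(direct - closed_form)))
print(max_gap)
```

Because the target is linear in x, linear estimators with a ridge penalty are a natural fit for this setting.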
Results
The analysis indicates that the variational estimator has a smaller population risk in scenarios with a high number of observations, while the spectral estimator shows advantages in low-observation contexts due to its variance-reducing properties. The paper provides empirical comparisons and asymptotic expansions to support these findings.
Implications
The findings suggest that in practical applications involving log-density ratio estimation, the choice between variational and spectral methods should consider the number of observations available. The insights into feature learning through nuclear penalties may also enhance model performance in high-dimensional settings.
Geometric Signatures of Reasoning: A Spectral Perspective on Task Hardness
NLP
Large Language Models
Theory
- Introduces a formal framework for analyzing the geometry of CoT reasoning in LLMs.
- Defines an effective dimension as a measure of task complexity related to reasoning trajectories.
- Demonstrates that kinematic features can predict solution correctness early in the reasoning process.
- Achieves high accuracy in distinguishing between easy and hard problems using geometric measures.
Summary
This paper investigates the internal geometry of chain-of-thought (CoT) reasoning in large language models (LLMs) by formalizing reasoning chains as discrete curves in a hidden state space. The authors introduce the effective dimension as a measure of trajectory complexity, demonstrating that trajectories with flatter eigenvalue spectra correspond to harder tasks, indicating a more extensive exploration of hidden dimensions. The study also examines kinematic features of the trajectories, such as mean position and velocity, to predict solution correctness early in the reasoning process. Experimental results on the MATH500 dataset show that the effective dimension achieves an AUC of 0.93 in distinguishing easy from hard problems, while kinematic features can predict correctness with high accuracy from just the first 20% of generated tokens. This research provides insights into how the geometric properties of reasoning trajectories can inform task hardness and solution quality, suggesting potential strategies for early stopping in reasoning tasks.
Methodology
The authors formalize CoT reasoning as discrete curves in R^d, analyzing their geometric properties through spectral, positional, and kinematic functionals. They introduce the effective dimension as a measure of trajectory complexity and use kinematic features to predict correctness based on early trajectory geometry.
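The link between a flat eigenvalue spectrum and a high effective dimension can be made concrete. The summary does not give the paper's formula, so this sketch substitutes the participation ratio, (sum of eigenvalues)^2 / (sum of squared eigenvalues), a common effective-dimension measure. That choice, the function name, and the synthetic trajectories are all assumptions.

```python
import numpy as np

def participation_ratio(trajectory):
    """Effective dimension of a (steps, dim) trajectory via its covariance spectrum."""
    centered = trajectory - trajectory.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(trajectory) - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard tiny negatives
    return float(eig.sum() ** 2 / np.sum(eig ** 2))

rng = np.random.default_rng(0)
steps, dim = 400, 32

# Flat spectrum: every hidden direction is explored equally.
flat = rng.normal(size=(steps, dim))
# Peaked spectrum: the scale of each direction halves relative to the previous.
peaked = flat * (0.5 ** np.arange(dim))

pr_flat = participation_ratio(flat)
pr_peaked = participation_ratio(peaked)
print(pr_flat, pr_peaked)
```

The participation ratio ranges from 1 (all variance in one direction) to `dim` (perfectly flat spectrum), so the flat trajectory scores much higher than the peaked one.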
Results
The effective dimension successfully distinguishes between easy and hard problems with an AUC of 0.93. Additionally, kinematic features extracted from the first 20% of generated tokens can predict correctness with an AUC of 0.806, indicating the potential for early-exit strategies in reasoning tasks.
Implications
The findings suggest that understanding the geometric properties of reasoning trajectories can enhance the performance of LLMs on complex tasks. This could lead to improved early stopping strategies and better prioritization of reasoning paths, ultimately enhancing the efficiency and effectiveness of LLMs in problem-solving scenarios.
kNNGuard: Turning LLM Hidden Activations into a Training-Free Configurable Guardrail
Large Language Models
NLP
- kNNGuard is a training-free guardrail framework that utilizes hidden activations from LLMs.
- It achieves competitive F1 scores while being 2.7× faster than the best existing guardrails.
- Domain adaptation is simplified, requiring only a small labeled bank and minimal setup time.
- The framework combines activation-space and embedding-space scores for improved robustness.
Summary
The paper introduces kNNGuard, a novel guardrail framework designed for large language models (LLMs) that operates without the need for training or fine-tuning. Traditional guardrails often rely on classifiers that require extensive training on curated datasets, leading to issues with generalization and increased inference latency. In contrast, kNNGuard leverages the hidden activations of a frozen LLM, utilizing a small labeled bank of safe and unsafe prompts to classify incoming prompts. The framework employs a multi-layer k-nearest neighbors (kNN) approach, fusing activation-space and embedding-space scores to enhance classification accuracy. The authors demonstrate that kNNGuard achieves competitive or superior F1 scores across six diverse domains, including code instructions and security prompts, while significantly reducing inference latency compared to existing guardrails. Additionally, kNNGuard allows for rapid domain adaptation by simply updating the labeled bank, making it practical for real-time deployment. The paper also discusses the impact of system prompts and layer selection on performance, providing insights for integrating kNNGuard into production LLM pipelines.
Methodology
kNNGuard operates by extracting hidden activations from a frozen LLM using a small bank of labeled prompts. It employs a multi-layer kNN approach, comparing incoming prompt activations against the cached activations of the labeled bank. The framework uses Fisher-discriminant-based weighting to enhance the influence of layers that best separate classes and includes a fused variant that combines activation-space and embedding-space scores.
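A rough sketch of the multi-layer kNN scoring described above, on synthetic activations. It assumes cosine similarity, a per-layer Fisher ratio (between-class over within-class scatter) normalized to sum to one, and a score equal to the weighted fraction of unsafe neighbours. These details, and every name below, are illustrative assumptions rather than the paper's implementation. The fused embedding-space variant is not shown.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fisher_weights(bank, labels):
    """One weight per layer: between-class over within-class scatter."""
    weights = []
    for layer in bank:  # layer has shape (n_bank, dim)
        safe, unsafe = layer[labels == 0], layer[labels == 1]
        between = np.sum((safe.mean(axis=0) - unsafe.mean(axis=0)) ** 2)
        within = safe.var(axis=0).sum() + unsafe.var(axis=0).sum()
        weights.append(between / (within + 1e-12))
    weights = np.array(weights)
    return weights / weights.sum()

def knn_unsafe_score(query, bank, labels, k=5):
    """Weighted fraction of unsafe prompts among the k nearest bank entries."""
    score = 0.0
    for w, q, layer in zip(fisher_weights(bank, labels), query, bank):
        sims = l2_normalize(layer) @ l2_normalize(q)  # cosine similarity
        top = np.argsort(sims)[-k:]
        score += w * labels[top].mean()
    return float(score)

rng = np.random.default_rng(0)
n_layers, n_per_class, dim = 3, 25, 16
labels = np.array([0] * n_per_class + [1] * n_per_class)  # 0 = safe, 1 = unsafe

# Synthetic stand-in for cached activations: one cluster per class and layer.
centers = rng.normal(size=(n_layers, 2, dim)) * 3.0
bank = centers[:, labels, :] + rng.normal(size=(n_layers, 2 * n_per_class, dim))

unsafe_query = centers[:, 1, :] + 0.1 * rng.normal(size=(n_layers, dim))
safe_query = centers[:, 0, :] + 0.1 * rng.normal(size=(n_layers, dim))
unsafe_score = knn_unsafe_score(unsafe_query, bank, labels)
safe_score = knn_unsafe_score(safe_query, bank, labels)
print(unsafe_score, safe_score)
```

Adapting to a new domain in this sketch means replacing `bank` and `labels`; nothing is trained, which matches the training-free framing of the method.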
Results
kNNGuard achieved an average F1 score of 87.4% with a false positive rate of 12.9% across six domains, demonstrating competitive performance against fine-tuned guardrails. The framework operates with a per-prompt latency of 45.9 ms, significantly faster than traditional methods, and allows for rapid domain adaptation with construction times under 10 seconds for a 50-sample bank.
Implications
The kNNGuard framework has significant implications for the deployment of LLMs in sensitive applications, providing a robust and efficient method for detecting unsafe prompts without the overhead of training. Its rapid adaptability makes it suitable for dynamic environments where prompt safety is critical.
Hybrid quantum-classical neural network for sentiment analysis
NLP
- Hybrid quantum-classical neural networks can effectively perform sentiment analysis on COVID-19-related tweets.
- The study shows comparable accuracy between hybrid models and classical baselines, with improved learning dynamics.
- Transfer learning experiments indicate a significant performance boost for hybrid models in spam classification tasks.
- The research highlights the potential advantages of quantum machine learning in natural language processing.
Summary
This paper explores the application of hybrid quantum-classical neural networks (HNNs) for sentiment analysis, specifically focusing on a dataset of tweets related to COVID-19. The authors utilize a combination of classical feedforward neural networks and parameterized quantum circuits to analyze sentiment in tweets, which are vectorized using the Term Frequency-Inverse Document Frequency (TF-IDF) method. The study demonstrates that hybrid models can achieve accuracy levels comparable to classical models while exhibiting distinct learning dynamics, particularly in validation loss and accuracy. Additionally, the authors conduct transfer learning experiments on an SMS spam classification task, revealing that hybrid models outperform classical counterparts by a significant margin, achieving a 15 percentage point increase in accuracy for the spam class. These findings suggest that hybrid quantum-classical approaches can enhance generalization and representational capacity in natural language processing tasks, paving the way for future research in quantum machine learning applications.
Methodology
The authors employed a dataset of tweets annotated with sentiment labels, utilizing TF-IDF for feature extraction. They compared classical feedforward neural networks with hybrid architectures that integrate parameterized quantum circuits. The quantum components were simulated classically, and the models were evaluated on sentiment classification and transfer learning tasks.
Results
The hybrid models achieved accuracy comparable to classical models in sentiment analysis, with distinct learning dynamics. In transfer learning on SMS spam classification, the hybrid models improved accuracy from 66% to 81% for the spam class, demonstrating enhanced generalization.
Implications
The results indicate that hybrid quantum-classical models could significantly improve sentiment analysis and other NLP tasks, particularly as quantum computing technology evolves. This research may guide future developments in quantum machine learning algorithms and their practical applications in real-world scenarios.
Do LLMs Truly Generalize in the Molecular Domain? A Perturbation-Based Analysis
Large Language Models
Graph Learning
- LLMs show fragility in generalizing molecular properties due to their reliance on local training distributions.
- The Molecular Perturbation framework reveals that small structural changes can significantly degrade model performance.
- In-Context Tuning (ICT) can improve robustness by anchoring predictions to structurally similar molecules.
- The study emphasizes the need for models to align structural variations with chemically meaningful similarities.
Summary
This paper investigates the generalization capabilities of Large Language Models (LLMs) in the molecular domain, particularly their ability to handle structural variations in chemical compounds. The authors introduce a Molecular Perturbation framework that generates syntax-valid structural variants of training molecules using controlled Graph Edit Distance (GED). Their analysis reveals that even minor structural edits can lead to significant performance drops in molecular tasks, indicating a narrow local trust region and fragility in the models' sensitivity to structural changes. To address this issue, the authors explore In-Context Tuning (ICT), which conditions predictions on structurally similar molecules. The results suggest that ICT can partially expand the local trust region, improving robustness against structural perturbations. This study highlights the limitations of current LLMs in modeling the complexities of chemical space and proposes a direction for enhancing their stability through similarity-based inference.
Methodology
The authors developed a Molecular Perturbation framework that generates structural variants of molecules using controlled Graph Edit Distance (GED). They conducted experiments to evaluate model performance under various perturbations and examined the effects of In-Context Tuning (ICT) on robustness against structural changes.
Results
The analysis demonstrated that even a single structural edit could lead to substantial performance drops, indicating a narrow local trust region for LLMs in molecular tasks. However, the introduction of ICT showed promise in partially expanding this region, allowing models to retain better performance under controlled structural perturbations.
Implications
The findings suggest that enhancing LLMs with similarity-based inference mechanisms could improve their applicability in molecular discovery and other chemical tasks. This approach may help bridge the gap between probabilistic modeling and the rigid constraints of chemical structures.
Fourier Neural Operators for Rayleigh-Bénard Convection
Theory
Efficient ML
Time Series
- Introduction of a lean FNO architecture that predicts increments for improved accuracy in modeling RBC.
- Demonstrated faster inference times and reduced parameter count compared to existing models.
- Ablation studies indicate that multi-layer 1D convolutional scaling operators outperform linear layers in accuracy.
- Model generalizes well across spatial and temporal resolutions but is limited by training data resolution.
Summary
This paper presents an enhanced Fourier Neural Operator (FNO) designed for modeling two-dimensional Rayleigh-Bénard convection (RBC) by predicting time increments rather than full solutions. This approach leads to improved accuracy compared to a standard FNO baseline while maintaining a compact architecture with 314k parameters and fast inference times of 7 ms. The study emphasizes the challenges of modeling turbulent convection, particularly at high Rayleigh numbers, where traditional numerical simulations are computationally expensive. The authors demonstrate that their lean FNO model can effectively generalize across different spatial and temporal resolutions, although its accuracy is ultimately constrained by the resolution of the training data. The proposed model is particularly suitable for integration with iterative numerical methods, enhancing its practical applicability in various fields, including atmospheric science and industrial processes.
Methodology
The authors developed a lean FNO architecture that predicts state increments over time, akin to time-stepping algorithms in mesh-based simulations. They conducted an ablation study comparing the performance of linear layers versus multi-layer 1D convolutional layers. Training data was generated using high-resolution simulations with Dedalus, and the model was trained using a relative L2 loss function. The performance was evaluated based on the accuracy of the reconstructed solutions.
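The increment formulation and the relative L2 metric can be sketched in one dimension. The example below uses a single spectral multiplication as a stand-in for the learned operator. A real FNO adds lifting layers, channel mixing, and nonlinearities, and the paper works in 2D, so the function names, the 1D setting, and the hand-picked weights are all illustrative assumptions.

```python
import numpy as np

def spectral_conv_1d(u, weights):
    """Multiply the lowest Fourier modes of u by complex weights, drop the rest."""
    u_hat = np.fft.rfft(u)
    out_hat = np.zeros_like(u_hat)
    n_modes = len(weights)
    out_hat[:n_modes] = u_hat[:n_modes] * weights
    return np.fft.irfft(out_hat, n=len(u))

def rollout(u0, weights, n_steps):
    """Auto-regressive rollout where the model predicts the increment, not the state."""
    states = [u0]
    for _ in range(n_steps):
        increment = spectral_conv_1d(states[-1], weights)
        states.append(states[-1] + increment)  # u_{t+1} = u_t + model(u_t)
    return np.stack(states)

def relative_l2(pred, target):
    return float(np.linalg.norm(pred - target) / np.linalg.norm(target))

x = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
u0 = np.sin(x)
# Toy weights: damp each retained mode by 10% per step (not learned).
weights = np.full(8, -0.1 + 0.0j)

states = rollout(u0, weights, n_steps=3)
err = relative_l2(states[3], u0)
print(err)
```

With these weights each step multiplies the sine mode by 0.9, so three steps give 0.729 times the initial state. Because every step feeds on the previous prediction, model errors compound in the same way over long rollouts.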
Results
The proposed lean FNO model achieved significantly higher accuracy in predicting future states compared to traditional approaches. It maintained a compact parameter size and fast inference times, with a demonstrated ability to generalize across different mesh resolutions. However, the model exhibited faster error accumulation during longer auto-regressive rollouts, indicating a trade-off between accuracy and prediction length.
Implications
The findings suggest that the lean FNO model can serve as an efficient alternative to traditional numerical solvers for turbulent convection problems, with potential applications in various scientific and industrial domains. Its ability to predict increments rather than full solutions may lead to more efficient simulations in real-time applications.
Beyond the Performance Illusion: Structure-Aware Stratified Partitioning and Curriculum Distributionally Robust Optimization for Spatially Correlated Domains
Computer Vision
Optimization
- Identification of spatiotemporal leakage and hidden stratification as critical issues in standard evaluation methods.
- Introduction of Structure-Aware Stratified Partitioning (SASP) to create more reliable validation splits.
- Development of Curriculum Distributionally Robust Optimization (CDRO) to stabilize training under rigorous evaluation conditions.
- Demonstration of improved generalization and confidence calibration across multiple domains.
Summary
This paper addresses the limitations of traditional performance evaluation methods in machine learning, particularly in spatiotemporally correlated domains such as aerial surveillance, precision agriculture, and medical imaging. The authors highlight two critical issues: spatiotemporal leakage, where correlated samples contaminate training and validation splits, leading to inflated performance estimates, and hidden stratification, where errors in minority subpopulations are masked by overall performance metrics. To tackle these challenges, the authors propose a unified framework that includes Structure-Aware Stratified Partitioning (SASP) and Curriculum Distributionally Robust Optimization (CDRO). SASP ensures that validation splits are semantically disjoint and class-balanced, significantly reducing leakage while maintaining representation of minority groups. CDRO enhances training stability by progressively focusing on difficult subgroups, thus improving model reliability under the new evaluation protocols. The proposed methods demonstrate improved generalization and more accurate confidence calibration across various benchmarks, revealing failure modes that traditional random-split evaluations overlook.
Methodology
The authors propose a two-part framework: SASP for dataset partitioning that ensures semantic disjointness and class balance, and CDRO for training that emphasizes difficult subgroups progressively. This approach is designed to mitigate the issues of data leakage and hidden stratification in spatially correlated datasets.
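The partitioning half of the framework can be approximated by a group-aware, class-stratified split: correlated samples share a group ID (for example a scene, field, or patient), whole groups are assigned to one side, and groups are drawn per class so minority classes stay represented. This is a simplified sketch under those assumptions, not the SASP algorithm, and the function and field names are hypothetical.

```python
import random
from collections import defaultdict

def group_stratified_split(samples, val_fraction=0.25, seed=0):
    """samples: list of (sample_id, group_id, label). Returns (train, val)."""
    groups_by_label = defaultdict(set)
    for _, group_id, label in samples:
        groups_by_label[label].add(group_id)

    rng = random.Random(seed)
    val_groups = set()
    for label in sorted(groups_by_label):
        groups = sorted(groups_by_label[label])
        rng.shuffle(groups)
        # At least one whole group per class goes to validation.
        n_val = max(1, round(val_fraction * len(groups)))
        val_groups.update(groups[:n_val])

    train = [s for s in samples if s[1] not in val_groups]
    val = [s for s in samples if s[1] in val_groups]
    return train, val

# Toy data: 12 groups of 5 correlated samples; groups 8-11 hold the rare class.
samples = [
    (f"g{g}_s{i}", g, "rare" if g >= 8 else "common")
    for g in range(12)
    for i in range(5)
]
train, val = group_stratified_split(samples)
print(len(train), len(val))
```

A random per-sample split of the same data would almost surely place samples from one group on both sides, which is the leakage the summary describes.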
Results
The combination of SASP and CDRO leads to consistently improved generalization performance, more reliable confidence calibration, and the identification of failure modes that are typically obscured in conventional evaluation methods. The methods were validated across several benchmarks in aerial surveillance, precision agriculture, and medical imaging.
Implications
The proposed framework has significant implications for the evaluation and training of machine learning models in high-stakes environments, where accurate performance assessment is critical. It can be applied to various domains that involve spatially correlated data, ensuring more reliable model deployment.
Frequency Shift Physics-Informed Extreme Learning Machine for Solving High-Frequency Partial Differential Equations
Theory
Efficient ML
- Introduces FS-PIELM to mitigate spectral bias in high-frequency PDE solutions.
- Utilizes a novel weight initialization mechanism that shifts the mean of the weight distribution.
- Demonstrates significant accuracy improvements over existing physics-informed extreme learning machine variants.
- Maintains computational efficiency with only a single linear solve required.
Summary
This paper addresses the challenge of solving high-frequency partial differential equations (PDEs) using a novel framework called Frequency Shift Physics-Informed Extreme Learning Machine (FS-PIELM). Traditional neural networks often exhibit spectral bias, favoring low-frequency components, which hampers their ability to accurately model high-frequency solutions. The FS-PIELM framework introduces an innovative weight initialization mechanism that shifts the mean of the Gaussian weight distribution while maintaining a fixed variance, thus avoiding the variance amplification seen in conventional scaling methods. Two variants of FS-PIELM are proposed: FS-PIELM-L, which assigns independent frequency magnitudes to neurons, and FS-PIELM-G, which groups neurons for enhanced robustness. Theoretical analysis confirms that the frequency variance remains bounded and approaches unity, contrasting with the quadratic growth observed in traditional methods. The computational efficiency of extreme learning machines is preserved, requiring only a single linear solve. Experimental results on seven benchmark problems across six types of equations demonstrate that FS-PIELM significantly outperforms existing PIELM variants, achieving accuracy improvements ranging from one to nearly five orders of magnitude in six out of seven cases.
Methodology
The FS-PIELM framework employs a unique weight initialization strategy that shifts the mean of the Gaussian distribution of weights, rather than scaling them, to address spectral bias. Two variants are developed: FS-PIELM-L for independent frequency magnitudes and FS-PIELM-G for grouped neurons. The method retains the computational efficiency of extreme learning machines, requiring only a single linear solve for training.
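The difference between scaling and shifting the weight distribution shows up directly in sample statistics. The sketch below draws weights both ways, then fits a random-feature model with one least-squares solve on a plain regression target. The regression stand-in (instead of a PDE residual system), the sine activation, and all sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, target_freq = 200_000, 20.0

# Conventional scaling: w ~ N(0, s^2), so the variance grows like s^2.
scaled = target_freq * rng.normal(size=n_samples)
# Mean shift: w ~ N(mu, 1), so magnitudes grow while the variance stays near 1.
shifted = target_freq + rng.normal(size=n_samples)
var_scaled, var_shifted = float(np.var(scaled)), float(np.var(shifted))

def elm_fit(x, y, w, b):
    """Extreme learning machine: fixed random hidden layer, one linear solve."""
    features = np.sin(np.outer(x, w) + b)  # hidden weights are never trained
    beta, *_ = np.linalg.lstsq(features, y, rcond=None)
    return features @ beta

x = np.linspace(0.0, 1.0, 400)
y = np.sin(target_freq * x)  # high-frequency target (toy)
w = shifted[:100]
b = rng.uniform(0.0, 2.0 * np.pi, size=100)
prediction = elm_fit(x, y, w, b)
rel_err = float(np.linalg.norm(prediction - y) / np.linalg.norm(y))
print(var_scaled, var_shifted, rel_err)
```

In a physics-informed setting the least-squares system would be assembled from PDE and boundary residuals at collocation points rather than from function values, but training would still be a single linear solve over the output weights.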
Results
The FS-PIELM framework was tested on seven benchmark problems involving various PDE types, including Helmholtz, wave, Poisson, Klein-Gordon, heat, and advection-diffusion equations. The linear variant of FS-PIELM achieved the best accuracy in six out of seven cases, with improvements in accuracy ranging from one to nearly five orders of magnitude compared to existing PIELM variants.
Implications
The FS-PIELM framework has significant implications for computational science and engineering, particularly in fields requiring the solution of high-frequency PDEs. Its ability to accurately model complex phenomena with rapid oscillations could enhance simulations in fluid dynamics, solid mechanics, and other areas where traditional methods struggle.
How to Allocate Your Tokens? Scaling Laws with Training Steps and Batch Size
Large Language Models
Optimization
Theory
- Introduction of a three-term scaling law that incorporates model size, training steps, and batch size.
- The law can be fitted using fewer training runs, specifically leveraging suboptimal batch sizes.
- It provides a framework for deriving scaling laws for both optimal and suboptimal batch sizes.
- The proposed model aligns with empirical findings regarding critical batch sizes and their scaling.
Summary
In this paper, Fabian Schaipp introduces a scaling law, termed the three-term law, that models loss as a function of model size and training data, with the data term split into training steps and batch size. This law is derived from empirical observations of training runs and provides a robust framework for optimizing batch size allocation. The author demonstrates that the three-term law accurately predicts the optimal batch size while also accommodating suboptimal configurations, thus requiring fewer training runs for fitting. The proposed model combines elements from existing scaling laws and optimization theory, allowing for a comprehensive understanding of how loss relates to model parameters. The findings indicate that the optimal batch size can be determined with only two batch sizes per (N, D) configuration, significantly reducing the number of required training runs. Additionally, the law successfully captures the scaling behavior of critical batch sizes and offers insights into ε-suboptimal batch sizes, making it applicable in practical scenarios where optimal configurations may not be feasible.
Methodology
The author proposes a power-law model for loss as a function of model size (N), batch size (M), and training steps (K). The model is empirically fitted using data from training runs of dense large language models, allowing for the derivation of optimal batch sizes and scaling laws for suboptimal configurations. The methodology emphasizes the integration of theoretical insights from optimization theory to enhance the understanding of hyperparameter scaling.
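The summary does not state the functional form of the law, so the sketch below uses a purely hypothetical additive power law, L(N, K, M) = E + A·N^(-a) + B·K^(-b) + C·M^(-c), with made-up constants. It only illustrates how an optimal batch size can be derived once such a law is fitted: with a fixed data budget D = K·M (counted in samples), setting the derivative with respect to M to zero gives M* = (cC / (bB))^(1/(b+c)) · D^(b/(b+c)), which under this assumed form does not depend on N.

```python
import numpy as np

# Illustrative constants only; none of these values come from the paper.
E, A, B, C = 1.7, 400.0, 30.0, 8.0
a, b, c = 0.34, 0.40, 0.50


def loss(N, K, M):
    """Hypothetical three-term power law in model size N, steps K, batch size M."""
    return E + A * N ** -a + B * K ** -b + C * M ** -c


def optimal_batch_size(D):
    """Closed-form minimizer of loss(N, D / M, M) over M for a fixed budget D."""
    return (c * C / (b * B)) ** (1.0 / (b + c)) * D ** (b / (b + c))


D = 1e7                                   # assumed data budget (samples)
N = 1e8                                   # assumed model size (parameters)
grid = np.logspace(0, 7, 20001)           # candidate batch sizes
losses = loss(N, D / grid, grid)          # steps follow from K = D / M
M_grid = grid[np.argmin(losses)]          # brute-force optimum
M_star = optimal_batch_size(D)            # analytic optimum
print(f"grid optimum ~ {M_grid:.1f}, closed form ~ {M_star:.1f}")
```

Batch sizes away from M* trade step count against batch size at a loss penalty, which is the kind of suboptimal configuration the law is said to cover.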
Results
The proposed scaling law successfully predicts optimal batch sizes consistent with previous empirical findings. By requiring only two batch sizes per configuration, it reduces the number of training runs needed for fitting to 28% of the original count. The law also establishes non-trivial optimal batch sizes that are independent of model size and accurately describes the scaling of critical batch sizes.
Implications
This research has significant implications for the training of large language models, particularly in optimizing resource allocation and improving training efficiency. The findings can guide practitioners in selecting batch sizes and training steps, especially in scenarios with hardware constraints or limited computational resources.
Gaming Consensus: Coordinated Manipulation in Crowdsourced Fact-Checking
Theory
- Demonstrates a coordinated manipulation strategy for crowdsourced fact-checking systems.
- Empirical findings indicate that low-quality notes can be artificially elevated above consensus thresholds.
- Identifies a counterintuitive property of the rating system: 'Not Helpful' ratings can increase helpfulness scores.
- Develops a cost model for quantifying manipulation efforts.
Summary
This paper investigates the vulnerabilities of crowdsourced fact-checking systems, particularly focusing on the matrix factorization algorithms used in platforms like X, Meta, TikTok, and Google. These systems aim to combat misinformation by requiring diverse agreement from users with differing perspectives. The authors present a novel adversarial attack that demonstrates how coordinated groups can manipulate the system to create synthetic consensus. Through a two-phase attack strategy, adversarial accounts can strategically boost a note's helpfulness score by leveraging the latent representations in the matrix factorization algorithm. The study reveals that up to 10.7% of lower quality notes could be manipulated above consensus thresholds with fewer than 10 coordinated ratings. Additionally, a counterintuitive finding shows that rating a note as 'Not Helpful' can paradoxically increase its helpfulness score. The paper also introduces a cost model for manipulation efforts and discusses the mitigations implemented in X's Community Notes to counteract these vulnerabilities.
Methodology
The authors conducted a theoretical and empirical analysis of the matrix factorization component of Community Notes. They developed a two-phase attack strategy to manipulate ratings and analyzed historical production data to quantify the extent of potential manipulation. The study also included a theoretical exploration of the rating dynamics within the system.
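For orientation, here is a minimal sketch of the kind of matrix factorization scoring model under analysis. It assumes the commonly described form in which a rating is predicted as a global mean plus a user intercept, a note intercept, and a user-factor times note-factor term, with the note intercept read as the helpfulness score. The fully observed toy ratings, the one-dimensional factors, and all hyperparameters are illustrative assumptions, and the paper's attack itself is not reproduced.

```python
import numpy as np


def fit_bridging_mf(R, steps=5000, lr=0.1, lam_int=0.01, lam_fac=0.002, seed=0):
    """Fit r_un ~ mu + i_u + i_n + f_u * f_n by gradient descent (1-D factors)."""
    rng = np.random.default_rng(seed)
    n_users, n_notes = R.shape
    mu = 0.0
    user_int, note_int = np.zeros(n_users), np.zeros(n_notes)
    user_fac = 0.1 * rng.standard_normal(n_users)
    note_fac = 0.1 * rng.standard_normal(n_notes)
    mse_history = []
    for _ in range(steps):
        pred = mu + user_int[:, None] + note_int[None, :] + np.outer(user_fac, note_fac)
        err = R - pred
        mse_history.append(float(np.mean(err ** 2)))
        # Gradients of the mean squared error plus L2 penalties.
        g_mu = -2 * err.mean() + 2 * lam_int * mu
        g_ui = -2 * err.sum(axis=1) / R.size + 2 * lam_int * user_int
        g_ni = -2 * err.sum(axis=0) / R.size + 2 * lam_int * note_int
        g_uf = -2 * (err @ note_fac) / R.size + 2 * lam_fac * user_fac
        g_nf = -2 * (err.T @ user_fac) / R.size + 2 * lam_fac * note_fac
        mu -= lr * g_mu
        user_int -= lr * g_ui
        note_int -= lr * g_ni
        user_fac -= lr * g_uf
        note_fac -= lr * g_nf
    return note_int, note_fac, mse_history


# Toy ratings: 1 = helpful, 0 = not helpful. Users 0-9 are camp A, 10-19 camp B.
R = np.zeros((20, 3))
R[:, 0] = 1.0      # note 0: rated helpful by both camps
R[:10, 1] = 1.0    # note 1: rated helpful by camp A only
R[10:, 2] = 1.0    # note 2: rated helpful by camp B only

note_int, note_fac, mse_history = fit_bridging_mf(R)
print("note intercepts:", np.round(note_int, 3))
print("note factors:   ", np.round(note_fac, 3))
```

In this toy setting the note endorsed by both camps ends up with the highest intercept, while one-sided agreement can be absorbed by the factor term. That separation between intercept and factor is the structure the paper's coordinated ratings are said to exploit.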
Results
The analysis revealed that coordinated users could manipulate the helpfulness scores of notes, with empirical evidence showing that up to 10.7% of lower quality notes could be elevated above consensus thresholds using fewer than 10 ratings. The theoretical analysis uncovered that certain rating behaviors could counterintuitively enhance a note's perceived helpfulness.
Implications
The findings highlight significant vulnerabilities in crowdsourced fact-checking systems, suggesting that without robust safeguards, these platforms could be exploited to spread misinformation. The insights could inform the design of more resilient algorithms and contribute to the development of effective mitigation strategies against coordinated manipulation.
Bayesian Sparse Low-Rank Adaptation for Large Language Model Uncertainty Estimation
NLP
Large Language Models
Efficient ML
- DALorRA introduces a new paradigm for uncertainty quantification in LLMs by focusing on low-rank adaptation.
- The method employs stochastic masking to dynamically adjust model capacity, reducing overfitting risks.
- Empirical results show that DALorRA provides excellent calibration and maintains high reasoning accuracy.
- The framework combines principles from variational Bayesian estimation and ensemble methods for effective uncertainty quantification.
Summary
This paper addresses the issue of overconfidence in large language models (LLMs) during task-specific fine-tuning, which can hinder their reliable deployment. The authors propose a novel framework called Data-Adaptive Lower-Rank Adaptation (DALorRA), which shifts the focus of uncertainty quantification from the dense parameter space to a more efficient low-rank adaptation (LoRA) framework. By introducing stochastic masking on rank dimensions, DALorRA enables Bayesian regularization of model capacity during training and ensemble-like calibration during inference. This approach allows for dynamic pruning of unnecessary rank components, thus avoiding overfitting while maintaining model performance. The authors conduct extensive experiments demonstrating that DALorRA achieves superior calibration of LLMs without compromising their reasoning accuracy, effectively bridging the gap between Bayesian neural networks and deep ensemble methods.
Methodology
The authors develop DALorRA by injecting a stochastic diagonal mask matrix into the LoRA framework, allowing for the modeling of rank-level uncertainty. This is achieved through variational inference, where the diagonal elements of the mask are treated as latent variables governed by learnable Bernoulli distributions. This method captures discrete structural uncertainty and enables data-adaptive lower-rank adaptation.
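The masked update can be sketched as follows, assuming the standard LoRA parameterization W0 + B·A with a diagonal mask inserted between the two factors. The shapes, the `keep_logits` values, and the helper names are hypothetical, and the variational training of the Bernoulli parameters is omitted; only the forward pass and the ensemble-style inference are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 5, 4
W0 = rng.standard_normal((d_out, d_in))         # frozen pretrained weight
B = 0.1 * rng.standard_normal((d_out, rank))    # LoRA up-projection
A = 0.1 * rng.standard_normal((rank, d_in))     # LoRA down-projection

# Hypothetical learned logits: one Bernoulli keep-probability per rank dimension.
keep_logits = np.array([3.0, 1.0, -1.0, -3.0])
keep_prob = 1.0 / (1.0 + np.exp(-keep_logits))


def masked_forward(x, mask):
    """Apply W0 + B diag(mask) A to a batch of inputs x of shape (n, d_in)."""
    return x @ (W0 + B @ np.diag(mask) @ A).T


def ensemble_predict(x, n_samples=256):
    """Average outputs over sampled rank masks; the spread is an uncertainty proxy."""
    outs = []
    for _ in range(n_samples):
        mask = (rng.random(rank) < keep_prob).astype(float)
        outs.append(masked_forward(x, mask))
    outs = np.stack(outs)
    return outs.mean(axis=0), outs.std(axis=0)


x = rng.standard_normal((2, d_in))
mean_out, std_out = ensemble_predict(x)
print(mean_out.shape, std_out.shape)
```

A mask of all ones recovers plain LoRA and a mask of all zeros recovers the frozen base model, so rank dimensions with low keep-probability are effectively pruned.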
Results
Extensive experiments on various reasoning benchmarks indicate that DALorRA consistently outperforms existing methods in terms of uncertainty calibration while preserving reasoning accuracy. The results validate the effectiveness of the proposed framework in addressing the overconfidence issue in LLMs.
Implications
The findings suggest that DALorRA can enhance the reliability of LLMs in real-world applications by providing better uncertainty quantification. This could lead to more trustworthy deployments in critical areas such as healthcare, finance, and autonomous systems, where understanding model confidence is crucial.
Spin-Weighted Spherical Harmonics Enable Complete and Scalable E(3)-Equivariant Networks
Theory
Efficient ML
- Introduces SpinGTP to overcome expressivity limitations of existing tensor products in E(3)-equivariant networks.
- Utilizes Spin-Weighted Spherical Harmonics to capture antisymmetric interactions effectively.
- Achieves a computational complexity of O(L^3) while maintaining high expressivity.
- Demonstrates superior performance in tasks involving chiral materials and non-centrosymmetric geometries.
Summary
This paper addresses the limitations of E(3)-equivariant networks in modeling 3D atomistic systems, specifically the computational complexity and expressivity issues associated with the Clebsch-Gordan Tensor Product (CGTP) and its alternative, the Gaunt Tensor Product (GTP). While GTP reduces computational complexity from O(L^6) to O(L^3), it fails to capture essential antisymmetric interactions, leading to incomplete expressivity. The authors propose a novel approach called SpinGTP, which utilizes Spin-Weighted Spherical Harmonics (SWSH) to recover missing antisymmetric paths while maintaining the efficiency of GTP. This method allows for a more expressive equivariant basis that includes parity-odd components. The authors evaluate SpinGTP on various benchmarks, demonstrating that it achieves accuracies comparable to full CGTP and performs significantly better on tasks involving chiral materials and non-centrosymmetric geometries. Overall, SpinGTP represents a significant advancement in the development of scalable and mathematically rigorous equivariant networks for large-scale 3D atomistic simulations.
Methodology
The authors developed SpinGTP by generalizing the Gaunt Tensor Product to incorporate Spin-Weighted Spherical Harmonics. This approach leverages the algebraic properties of SWSH to recover antisymmetric interactions that were previously lost in GTP. The implementation includes a real, parity-labeled SWSH basis and specialized equivariant layers, allowing for efficient computation and representation of complex interactions in 3D systems.
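A short worked note on why the Gaunt product loses antisymmetric paths, using only the standard definition of the Gaunt coefficient and the parity of spherical harmonics (the notation here is a common convention, not necessarily the paper's):

```latex
% Gaunt coefficient: integral of three spherical harmonics over the sphere.
G^{\,l_3 m_3}_{l_1 m_1,\, l_2 m_2}
  = \int_{S^2} Y_{l_1 m_1}(\hat r)\, Y_{l_2 m_2}(\hat r)\, Y^{*}_{l_3 m_3}(\hat r)\, d\Omega .

% Parity of spherical harmonics:
Y_{l m}(-\hat r) = (-1)^{l}\, Y_{l m}(\hat r)

% Substituting \hat r \to -\hat r leaves the integral unchanged but multiplies
% the integrand by (-1)^{l_1 + l_2 + l_3}, hence
G^{\,l_3 m_3}_{l_1 m_1,\, l_2 m_2} = 0
  \quad \text{whenever } l_1 + l_2 + l_3 \text{ is odd}.

% Example: (l_1, l_2, l_3) = (1, 1, 1) has an odd sum, so the Gaunt path vanishes,
% while the Clebsch-Gordan path 1 \otimes 1 \to 1 is nonzero and corresponds to the
% antisymmetric cross product
(\mathbf a \times \mathbf b) = -(\mathbf b \times \mathbf a).
```

Paths with an odd degree sum, such as the cross product, are therefore the ones a Gaunt-based product cannot represent on its own.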
Results
SpinGTP was evaluated across multiple benchmarks, including Tetris, 3BPA, SPICE-MACE-OFF, and OC20. The results indicate that SpinGTP achieves accuracies comparable to full CGTP while explicitly capturing antisymmetric paths, leading to enhanced performance in tasks related to chiral materials and geometries.
Implications
The development of SpinGTP has significant implications for the field of computational materials science and molecular modeling, enabling more accurate simulations of complex 3D atomistic systems. This method can improve the understanding of materials with chiral properties and non-centrosymmetric structures, potentially impacting the design of new materials and drugs.
Revisiting Decentralized Online Convex Optimization with Compressed Communication
Optimization
Theory
Efficient ML
- Introduction of two FTRL-type algorithms for D-OCO with compressed communication.
- First algorithm matches existing regret bounds in full-information settings.
- Second algorithm significantly improves regret bounds and communication costs in bandit settings.
- Simplified analysis and design compared to previous OGD-based approaches.
Summary
This paper addresses the decentralized online convex optimization (D-OCO) problem, focusing on the challenges posed by communication bottlenecks in distributed systems. The authors propose two novel algorithms based on the follow-the-regularized-leader (FTRL) framework that utilize compressed communication, marking a significant advancement over existing online gradient descent (OGD) variants. The first algorithm operates in a full-information setting and matches the best known regret bounds, while the second algorithm is tailored for a bandit setting, achieving improved regret bounds and reduced communication costs compared to prior work. The key insight is the effective use of dual updates in FTRL, which simplifies the application of consensus techniques under communication constraints. The proposed algorithms demonstrate both elegance in design and robustness in performance, paving the way for more efficient decentralized optimization in streaming data scenarios.
Methodology
The authors develop two algorithms based on the FTRL framework, leveraging dual variable updates to facilitate average consensus with compressed communication. The first algorithm is designed for full-information scenarios, while the second addresses the bandit setting, where only loss values are revealed. Both algorithms incorporate techniques from previous studies on decentralized optimization and communication compression.
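The role of the dual variables can be illustrated with a simplified decentralized FTRL (dual averaging) loop. This is a sketch under stated assumptions rather than either of the paper's algorithms: the gossip step is exact, so communication compression is omitted, and the quadratic losses, ring topology, step size, and function names are illustrative choices. Each node accumulates gradients in a dual variable, averages it with its neighbors, and plays the regularized minimizer.

```python
import numpy as np


def project_ball(x, radius):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)


def decentralized_ftrl(targets, W, T=3000, radius=5.0):
    """Node i faces the loss 0.5 * ||x - targets[i]||^2 in every round."""
    n, d = targets.shape
    z = np.zeros((n, d))    # dual variables: gossip-averaged gradient sums
    x = np.zeros((n, d))    # primal decisions, one row per node
    for t in range(1, T + 1):
        grads = x - targets             # local gradients at the current decisions
        # Exact gossip on the duals. A compressed scheme would exchange
        # compressed messages at this step instead of full vectors.
        z = W @ z + grads
        eta = 1.0 / np.sqrt(t)
        # FTRL step with regularizer ||x||^2 / (2 * eta): x = Proj(-eta * z).
        x = np.array([project_ball(-eta * z[i], radius) for i in range(n)])
    return x


# Doubly stochastic mixing matrix for a 4-node ring (illustrative).
W = np.array([
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
])
targets = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 3.0]])
x_final = decentralized_ftrl(targets, W)
gap = np.max(np.linalg.norm(x_final - targets.mean(axis=0), axis=1))
print("largest distance to the global minimizer:", round(float(gap), 4))
```

Because only the linear dual variables are mixed, the consensus step stays a plain weighted average, which is the property the summary credits for simplifying the use of consensus techniques under communication constraints.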
Results
The first algorithm achieves regret bounds comparable to existing methods for full-information settings. The second algorithm improves upon the best known regret bounds for bandit settings, reducing the communication rounds required to achieve these bounds. Specifically, it achieves O(nT^(3/4)) and O(nT^(2/3)(log T)^(1/3)) for convex and strongly convex functions, respectively, with significantly fewer communication rounds than previous algorithms.
Implications
The proposed algorithms enhance the efficiency of decentralized online convex optimization, making them suitable for applications in distributed systems where communication is limited or costly. This work could lead to advancements in various fields, including machine learning, data streaming, and real-time decision-making in networks.
Program-as-Weights: A Programming Paradigm for Fuzzy Functions
NLP
Large Language Models
Efficient ML
- Introduction of Program-as-Weights (PAW) for fuzzy function programming.
- PAW compiles natural language specifications into efficient neural binaries.
- Demonstrated significant performance improvements with a smaller interpreter.
- Five case studies illustrate practical applications of PAW in various fuzzy tasks.
Summary
The paper introduces a novel programming paradigm called Program-as-Weights (PAW) aimed at addressing the challenges of implementing fuzzy functions, that is, tasks that resist precise rule-based programming. Traditional programming methods often fall short for tasks such as alerting on important log lines or repairing malformed JSON, leading developers to rely on large language model (LLM) APIs, which can be costly and lack reproducibility. PAW proposes a three-step approach: developers describe the function in natural language, a neural compiler converts this description into a compact neural binary, and a lightweight interpreter executes the binary locally. The authors present a 4B compiler trained on a newly released dataset, FuzzyBench, which contains 10 million examples of fuzzy tasks. The PAW system demonstrates efficiency by using a 0.6B Qwen3 interpreter that outperforms direct prompting of a larger 32B model while consuming significantly less memory. The paper also showcases five case studies illustrating the practical applications of PAW, emphasizing its potential to facilitate local execution of fuzzy functions without the need for constant API calls. Furthermore, the framework's modality generality is highlighted, showing its adaptability to different types of tasks, including those involving images.
Methodology
The PAW paradigm consists of a two-stage compile pipeline using a 4B Qwen3 model. The first stage involves a pseudo-compiler that transforms user specifications into a clean pseudo-program. The second stage is a trained LoRA compiler that generates a parameter-efficient module from the pseudo-program. The authors also release FuzzyBench, a dataset for training and evaluating fuzzy function implementations.
Results
The PAW system, utilizing a 0.6B Qwen3 interpreter, achieves a 73.78% exact match on fuzzy tasks, outperforming direct prompting of a 32B model (68.70% exact match) while using approximately one-fiftieth of the inference memory. The system runs efficiently at 30 tokens per second on a MacBook M3.
Implications
PAW has the potential to revolutionize how developers approach fuzzy programming tasks by enabling local execution of functions, reducing reliance on external APIs, and improving reproducibility and efficiency in software development.
On the Utility and Factual Reliability of Pruned Mixture-of-Experts Models in the Biomedical Domain
NLP
Large Language Models
Efficient ML
- First systematic study on the impact of expert pruning on factual reliability in high-stakes biomedical settings.
- Moderate pruning preserves utility while extreme pruning increases hallucination risks.
- Utility and reliability degrade more rapidly in general-domain tasks than in in-domain tasks.
- Safe compression of MoE models is highly task- and domain-dependent.
Summary
This paper investigates the impact of structured expert pruning on the utility and factual reliability of Mixture-of-Experts (MoE) models, particularly in the biomedical domain. While MoE models enhance inference speed by activating only a subset of experts, they also require significant memory resources. The authors evaluate four MoE models using six different pruning methods and various pruning ratios across both generation and classification tasks. The study reveals that moderate pruning can maintain in-domain utility without immediate declines in reliability, although extreme pruning increases the risk of hallucinations. Furthermore, the performance degrades significantly when transitioning to general-domain tasks. The findings emphasize that evaluating pruned MoE models solely based on utility is inadequate for high-stakes applications, highlighting the necessity of reliability assessments in such contexts.
Methodology
The authors conducted a systematic evaluation of four MoE models, applying six pruning methods at various ratios. They assessed the models' performance on generation and classification tasks, comparing results in both biomedical and general domains to analyze the effects of pruning on utility and reliability.
Results
The study found that moderate pruning maintains in-domain utility without immediate reliability decline, while extreme pruning ratios lead to increased hallucination risks. In general-domain tasks, both utility and reliability showed rapid degradation, indicating that the effectiveness of pruning is highly dependent on the specific task and domain.
Implications
The findings suggest that for deploying MoE models in critical areas like biomedicine, it is essential to balance pruning for efficiency with the need for factual reliability. This has implications for model deployment strategies in high-stakes environments, where errors can have serious consequences.
HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures
Large Language Models
NLP
Optimization
- HERMES provides a hierarchical labeling substrate that allows for dynamic granularity control in data mixtures.
- The methodology utilizes a Learned Semantic Transform and a three-stage RVQ for efficient document annotation.
- Performance improvements were observed by adjusting sampling strategies based on granularity, demonstrating the interplay between granularity and sampling methods.
- HERMES allows for the annotation of approximately 50 million documents into up to 130,000 cells without the need for re-clustering.
Summary
The paper introduces HERMES, a novel hierarchical labeling substrate designed to enhance the pre-training of machine learning models through improved data mixture strategies. Traditional data-mixing methods rely on fixed label systems that limit the granularity and flexibility of data representation. HERMES addresses this limitation by providing a data-derived hierarchical labeling system that allows for multi-granularity annotations of documents. The methodology involves a Learned Semantic Transform followed by a three-stage residual vector quantization (RVQ) process, enabling the annotation of documents into a coarse-to-fine code structure. This hierarchical approach allows users to control granularity dynamically without the need for re-clustering. The authors demonstrate that HERMES can effectively expose interactions between granularity and sampling strategies, leading to improved performance in various tasks. The results indicate that switching sampling strategies at different granularities can significantly impact model performance, highlighting the importance of a flexible labeling system in data mixture design.
Methodology
HERMES employs a Learned Semantic Transform followed by a three-stage residual vector quantization (RVQ) process to annotate documents into a hierarchical code structure. This allows for the generation of multiple granularities from a single trained codebook, facilitating flexible data mixture designs.
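Residual vector quantization itself is a standard technique, and a minimal three-stage version can be sketched as below. Random vectors stand in for document embeddings, the Learned Semantic Transform is omitted, and the codebook sizes and the small k-means routine are illustrative assumptions rather than the HERMES configuration. Each stage quantizes the residual left by the previous one, so truncating a document's code to its first one, two, or three entries gives progressively finer cells from the same trained codebooks.

```python
import numpy as np


def assign(X, centers):
    """Index of the nearest center for every row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd iterations; empty clusters keep their previous center."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers


def fit_rvq(X, codebook_sizes=(4, 4, 4)):
    """Fit one codebook per stage on the residual of the previous stage."""
    codebooks, codes, errors = [], [], []
    residual = X.copy()
    for stage, k in enumerate(codebook_sizes):
        centers = kmeans(residual, k, seed=stage)
        labels = assign(residual, centers)
        residual = residual - centers[labels]
        codebooks.append(centers)
        codes.append(labels)
        errors.append(float((residual ** 2).sum(axis=1).mean()))
    return codebooks, np.stack(codes, axis=1), errors


rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))      # stand-in for document embeddings
codebooks, codes, errors = fit_rvq(X)

# Coarse-to-fine cells: a code prefix of length `depth` identifies one cell.
n_cells = [len({tuple(row[:depth]) for row in codes}) for depth in (1, 2, 3)]
print("mean squared residual per stage:", np.round(errors, 3))
print("occupied cells per granularity:", n_cells)
```

Changing granularity only changes how many code entries are read, which is why no re-clustering is needed.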
Results
The implementation of HERMES showed that at a fixed granularity, changing the sampling strategy from max-entropy coverage to quality top-30% resulted in a performance increase of +0.0253 in a 16-task capability macro-average. However, this advantage diminished at finer granularities, indicating that candidate pool size impacts the effectiveness of sampling strategies.
Implications
HERMES has the potential to significantly improve the design of data mixtures in pre-training large language models, allowing for more effective utilization of heterogeneous data sources. This could lead to better model performance across various tasks and applications in natural language processing.
Conditional Inference Trees and Forests for Feature Selection
Theory
Efficient ML
- CIT and CIF effectively reduce split-selection bias in feature selection.
- CIF ranks highly among various classification and regression methods in benchmark studies.
- Adaptive stopping and threshold search parameters significantly influence runtime efficiency.
- High-dimensional simulations reveal potential weaknesses in feature recovery using CIF.
Summary
This paper investigates Conditional Inference Trees (CIT) and Conditional Inference Forests (CIF) as methods for feature selection, focusing on their ability to reduce split-selection bias through a two-stage process. The authors highlight the computational challenges associated with these methods due to repeated permutation tests and exhaustive threshold searches. They present a benchmark study that evaluates CIT and CIF against various classification and regression methods across multiple datasets. The study reveals that CIF ranks competitively among other methods, achieving 4th place in classification and 3rd in regression tasks. The authors also explore the impact of runtime hyperparameters, demonstrating that adaptive stopping and the number of thresholds searched significantly affect runtime without compromising ranking quality. Additionally, the paper discusses the limitations of CIF in high-dimensional settings, where informative features may be overlooked due to forest feature sampling. Overall, the findings support the use of CIF as a robust top-k feature-ranking method for downstream prediction tasks.
Methodology
The authors employed a benchmark study to evaluate CIT and CIF as feature selection methods, comparing their performance against other classification and regression techniques. They utilized real-data benchmarks, runtime ablation studies, and synthetic feature-recovery experiments to assess the effectiveness and efficiency of these methods. The study included a fixed-node theorem to guarantee the validity of the feature selection process under certain conditions.
Results
CIF demonstrated strong performance, ranking 4th among 17 classification methods and 3rd among 18 regression methods across multiple datasets. The runtime ablation studies indicated that disabling adaptive stopping and using exact threshold searches significantly increased fitting times while changing downstream scores only marginally. The analysis also revealed that in high-dimensional settings, CIF may fail to utilize informative features effectively due to its sampling strategy.
Implications
The findings suggest that CIF can be a valuable tool for feature selection in various predictive modeling tasks, particularly in scenarios where reducing split-selection bias is critical. However, the limitations identified in high-dimensional contexts highlight the need for careful consideration of feature sampling strategies to ensure informative features are not overlooked.
Ask the Right Comparison: Bias-Aware Bayesian Active Top-$k$ Ranking with LLM Judges
NLP
Large Language Models
Theory
- Introduces a bias-aware Bayesian model for evaluating LLM judges that accounts for verbosity and position biases.
- Develops a top-k-aware active acquisition rule that optimizes the selection of comparisons to identify the top-k items efficiently.
- Demonstrates that naive aggregation methods can lead to incorrect top-k rankings due to inherent biases in LLM judges.
- Shows significant improvements in recall rates for biased judges, with performance gains concentrated on lower-tier models.
Summary
This paper addresses the challenges of using large language models (LLMs) as judges for pairwise comparisons in ranking tasks, particularly focusing on the biases these models exhibit. The authors propose a Bayesian framework that incorporates judge-specific biases, such as verbosity and position effects, to recover the latent quality of items being compared. They introduce a novel active acquisition strategy that prioritizes comparisons that reduce uncertainty about the top-k items rather than the overall ranking. The methodology is validated through experiments with sixteen different LLMs on a controlled benchmark, demonstrating that naive aggregation methods fail to identify the correct top-k due to biases, while the proposed bias-aware model successfully recovers it. The results indicate that the proposed approach significantly improves recall, especially for cheaper, more biased judges, while maintaining robustness against judges with minimal bias.
Methodology
The authors model the judging process as a Bayesian Bradley-Terry process, incorporating observable covariates related to bias. They employ a shrinkage prior to adaptively learn the biases exhibited by each judge. The active acquisition strategy is designed to focus on comparisons that most effectively clarify the membership of items in the top-k set, rather than attempting to resolve the entire ranking.
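A minimal sketch of a Bradley-Terry model with a bias covariate follows. It assumes a single position effect, so the log-odds that the first-shown item wins are theta_first - theta_second + beta. A small ridge penalty stands in for the paper's shrinkage prior, noise-free expected outcomes replace sampled judgments, and the active acquisition rule is not included. All qualities, counts, and the bias value are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


theta_true = np.array([0.5, 0.0, -0.5])   # hypothetical latent qualities
beta_true = 1.0                           # hypothetical bias toward the first-shown item

# (first-shown item, second-shown item, number of comparisons).
# Item 1 is shown first more often than item 0, which skews naive win rates.
design = [(1, 0, 3), (0, 1, 1), (0, 2, 1), (2, 0, 1), (1, 2, 3), (2, 1, 1)]
first = np.array([f for f, _, _ in design])
second = np.array([s for _, s, _ in design])
weight = np.array([c for _, _, c in design], dtype=float)

# Expected share of comparisons won by the first-shown item (soft labels).
p_first = sigmoid(theta_true[first] - theta_true[second] + beta_true)

# Naive aggregation: expected wins divided by appearances, ignoring position.
wins, games = np.zeros(3), np.zeros(3)
np.add.at(wins, first, weight * p_first)
np.add.at(wins, second, weight * (1.0 - p_first))
np.add.at(games, first, weight)
np.add.at(games, second, weight)
naive_score = wins / games

# Bias-aware fit: gradient descent on the weighted logistic loss with a small
# ridge penalty on theta (a stand-in for a proper prior).
theta, beta, lam, lr = np.zeros(3), 0.0, 1e-4, 1.0
for _ in range(5000):
    pred = sigmoid(theta[first] - theta[second] + beta)
    resid = weight * (pred - p_first) / weight.sum()
    grad_theta = lam * theta
    np.add.at(grad_theta, first, resid)
    np.add.at(grad_theta, second, -resid)
    theta -= lr * grad_theta
    beta -= lr * resid.sum()

print("naive win rates:", np.round(naive_score, 3))
print("fitted theta:", np.round(theta, 3), "fitted position bias:", round(float(beta), 3))
```

In this constructed example the naive win rate puts item 1 on top because it benefits from being shown first, while the fit that models the position effect ranks item 0 first.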
Results
The experiments reveal that naive aggregation methods plateau at incorrect top-k rankings, while the bias-aware model accurately identifies the correct top-k. The top-k-aware acquisition strategy achieves this with fewer comparisons compared to traditional methods. The analysis shows a strong correlation between judge competence and the effectiveness of bias correction, with significant improvements in recall for biased judges.
Implications
This work has implications for various applications where LLMs are used for evaluation and ranking tasks, such as model selection, content generation, and document triaging. By addressing biases in LLM judgments, the proposed methods can enhance the reliability and accuracy of automated evaluations in natural language processing tasks.
CALM: Interpretable Cross-Modal Alignment for Biomarker Discovery from Unpaired Data
Multimodal
- CALM enables biomarker discovery from unpaired neuroimaging and genetic datasets.
- The framework uses linear projections for cross-modal alignment in a shared latent space.
- It outperforms existing methods and shows stability in associations across validation folds.
- CALM reveals significant immune and metabolic pathways associated with autism spectrum disorder.
Summary
The paper introduces CALM, a novel framework designed to discover interpretable associations between neuroimaging and genetic data from completely unpaired datasets. This is particularly relevant in the context of neuropsychiatric disorders, where understanding the interplay between brain structure and genetic influences is crucial for biomarker identification. Traditional methods often require paired samples, which are impractical due to the disjoint nature of existing large-scale datasets. CALM addresses this by employing linear projections to align the two modalities in a shared latent space, optimizing for class-conditional alignment and group separability. The framework is validated through experiments on autism spectrum disorder (ASD), demonstrating its ability to uncover immune and metabolic pathways linked to specific cortical regions. The results indicate that CALM outperforms state-of-the-art methods and maintains stability in learned associations, thus paving the way for leveraging large unimodal repositories for cross-modal biomarker discovery.
Methodology
CALM employs pretrained modality-specific encoders to generate latent representations from neuroimaging and genetic data. It utilizes a loss function that combines class-conditional alignment and contrastive loss to ensure that similar diagnostic groups are aligned while maintaining group separability. The framework is validated by training on unimodal datasets and testing on a paired dataset, demonstrating its effectiveness in recovering biologically meaningful associations.
Results
The experiments conducted on autism spectrum disorder data show that CALM successfully identifies immune and metabolic pathways linked to specific brain regions, consistent with existing literature. The framework outperforms several state-of-the-art methods and ablation baselines, indicating its robustness and reliability in cross-modal analysis.
Implications
CALM has significant implications for biomarker discovery in neuropsychiatric disorders, allowing researchers to utilize large unimodal datasets for insights into cross-modal interactions. This could lead to improved understanding and identification of biomarkers for conditions like autism spectrum disorder, ultimately enhancing precision psychiatry.
Optimizing Visual Generative Models via Distribution-wise Rewards
Generative Models
Reinforcement Learning
Computer Vision
- Introduction of distribution-wise rewards to improve generative model training.
- Mitigation of reward hacking and mode collapse issues common in sample-wise reward systems.
- Development of a subset-replace strategy for efficient reward computation.
- Demonstrated significant improvements in FID scores across multiple models.
Summary
This paper addresses the limitations of conventional reinforcement learning (RL) strategies in visual generative models, which often rely on sample-wise reward functions. Such approaches can lead to reward hacking, reducing image diversity and introducing visual anomalies. The authors propose a novel framework that utilizes distribution-wise rewards to fine-tune generative models, aligning them more closely with real-world data distributions. By focusing on the overall data distribution rather than individual samples, the method mitigates the mode collapse problem, where generated samples converge towards similar outputs. To efficiently compute these distribution-wise rewards without incurring high computational costs, the authors introduce a subset-replace strategy that updates only a small subset of a generated reference set. Additionally, they apply RL to optimize post-hoc model merging coefficients, addressing inconsistencies between training and inference phases caused by stochastic differential equations (SDEs). Experimental results demonstrate significant improvements in the Fréchet Inception Distance (FID) metric across various base models, indicating enhanced image quality and diversity. Qualitative evaluations further confirm that the proposed method improves perceptual quality while maintaining sample diversity.
Methodology
The authors propose a reinforcement learning framework that employs distribution-wise rewards instead of sample-wise rewards. They utilize a subset-replace strategy to efficiently compute these rewards, focusing on the overall distribution of generated samples. The method also includes optimizing model merging coefficients to address inconsistencies between training and inference phases.
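A minimal sketch of how a distribution-wise reward with a subset-replace step might be computed. It assumes image features are already extracted, that the reward is the drop in Fréchet distance between Gaussian fits after a small subset of the generated reference set is swapped for new samples, and that all function names and set sizes are illustrative. It is not the paper's implementation.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussian fits of two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def subset_replace_reward(real_feats, ref_feats, new_feats, replace_idx):
    """Reward = drop in Frechet distance after swapping a subset of the reference set."""
    before = frechet_distance(real_feats, ref_feats)
    candidate = ref_feats.copy()
    candidate[replace_idx] = new_feats  # only this small subset changes
    after = frechet_distance(real_feats, candidate)
    return before - after, candidate

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 4))       # stand-in for real-data features
reference = rng.normal(1.0, 1.0, size=(500, 4))   # generated reference set, offset from real
idx = np.arange(50)                               # subset to replace (assumed size)
good_batch = rng.normal(0.0, 1.0, size=(50, 4))   # new samples closer to the real data
bad_batch = rng.normal(3.0, 1.0, size=(50, 4))    # new samples further away
reward_good, _ = subset_replace_reward(real, reference, good_batch, idx)
reward_bad, _ = subset_replace_reward(real, reference, bad_batch, idx)
print(f"reward for closer batch: {reward_good:.3f}, for further batch: {reward_bad:.3f}")
```

Because the reward depends on the statistics of the whole set, a batch of near-identical samples cannot score well just by maximizing a per-image score, which is the intuition behind the reduced reward hacking described above.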
Results
The proposed method significantly reduced FID scores from 8.30 to 5.77 for SiT and from 3.74 to 3.52 for EDM2, indicating improved alignment with real-world data distributions. Qualitative assessments confirmed enhancements in both image quality and diversity.
Implications
This work suggests that using distribution-wise rewards can lead to more robust generative models that better capture the diversity of real-world data. It opens avenues for further research in optimizing generative models and improving their alignment with human preferences.
The Rollout Infrastructure Tax in Coding-Agent Reinforcement Learning
Reinforcement Learning
Efficient ML
- Introduction of the 'rollout infrastructure tax' concept, highlighting the impact of execution substrate on coding-agent RL efficiency.
- Significant variations in performance metrics (cold-start latency and worker-hours) across different execution substrates.
- The necessity for future coding-agent RL systems to optimize execution substrates as part of the training process.
- Identification of specific design requirements for effective rollout-native substrates.
Summary
This paper addresses the overlooked aspect of execution infrastructure in coding-agent reinforcement learning (RL), which significantly impacts efficiency during the rollout phase. The authors introduce the concept of the 'rollout infrastructure tax,' which refers to the latency and costs incurred by the systems executing coding-agent RL trajectories. They conduct a comparative study of four execution substrates: single containers, hosted sandboxes, Kubernetes-orchestrated containers, and cloud virtual machines. The findings reveal substantial variations in cold-start latency (up to 110×) and projected worker-hours for large-scale rollouts (1.8× spread for one million 150-step trajectories). The authors argue that optimizing execution substrates should be integral to the training system, rather than merely a deployment consideration. They propose design requirements for rollout-native substrates, emphasizing the need for locality-aware warm pools, low-latency action APIs, and appropriate isolation mechanisms based on rollout size.
Methodology
The authors conducted a controlled evaluation comparing four common execution substrates under identical coding-agent workloads. They measured various components contributing to the rollout infrastructure tax, including environment creation time, readiness time, per-action costs, and orchestration overhead.
Results
The study found that cold-start latency varied by up to 110× across different substrates, and for one million 150-step trajectories, the choice of substrate resulted in a 1.8× spread in projected worker-hours, equating to an additional 5,316 worker-hours. These results underscore the importance of substrate choice in coding-agent RL performance.
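The two reported figures are enough to back out the implied totals. The sketch below assumes the 1.8× spread is the ratio between the most and least expensive substrate and that 5,316 worker-hours is their difference. The ratio is rounded in the summary, so the derived values are approximate.

```python
spread = 1.8              # reported ratio, most / least expensive substrate (rounded)
extra_hours = 5316        # reported additional worker-hours
trajectories = 1_000_000
steps_per_trajectory = 150

# worst = spread * best and worst - best = extra_hours, so best = extra / (spread - 1)
best_hours = extra_hours / (spread - 1.0)
worst_hours = best_hours * spread

def seconds_per_step(total_hours):
    """Average wall-clock seconds of worker time attributed to one agent step."""
    return total_hours * 3600 / (trajectories * steps_per_trajectory)

print(f"best substrate:  ~{best_hours:,.0f} worker-hours, {seconds_per_step(best_hours):.3f} s/step")
print(f"worst substrate: ~{worst_hours:,.0f} worker-hours, {seconds_per_step(worst_hours):.3f} s/step")
```

Under these assumptions the totals come to roughly 6,645 and 11,961 worker-hours, a gap of well under a second per step that compounds over 150 million steps.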
Implications
The findings suggest that optimizing execution infrastructure can lead to significant efficiency gains in coding-agent RL systems, particularly as workloads scale. This could influence the design of future RL systems and deployment strategies, making infrastructure considerations a core aspect of RL training.
Rank-Then-Act: Reward-Free Control from Frame-Order Progress
Reinforcement Learning
Computer Vision
Robotics
- RTA enables learning control policies from video without environment rewards.
- Introduces a correlation-based reward signal using Spearman rank correlation.
- Demonstrates strong performance across various control tasks and environments.
- Single pretrained progress scorer shows effective transferability across tasks.
Summary
The paper presents Rank-Then-Act (RTA), a novel framework for learning control policies from expert video demonstrations without relying on environment rewards. RTA employs a Vision-Language Model (VLM) trained offline as a progress-based ordinal scorer using a Group Relative Policy Optimization (GRPO) objective on shuffled frame sequences. This approach encourages the model to recover temporal ordering based on visual semantics rather than simple time cues. Instead of using the VLM as a scalar reward model, RTA introduces a correlation-based reward function for reinforcement learning, which computes the Spearman rank correlation between predicted progress rankings and true temporal indices within a sliding window. This design allows for a stable, scale-invariant learning signal that is decoupled from absolute calibration, facilitating effective transfer across tasks and environments. The framework is evaluated on both discrete and continuous control benchmarks, demonstrating that RTA consistently matches or outperforms existing video-based reward learning methods and rank-based baselines. The results indicate that correlation-structured supervision derived from video ordinal signals is sufficient for policy learning, providing a scalable alternative to traditional reward design.
Methodology
The methodology involves two main stages: First, a Vision-Language Model (VLM) is trained offline on shuffled video clips using a Group Relative Policy Optimization (GRPO) objective to produce ordinal progress rankings. Second, a correlation-based reward signal is computed using Spearman rank correlation between predicted rankings and true temporal indices over a sliding window, which drives policy learning without requiring absolute reward calibration.
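The second stage can be sketched as follows, assuming the scorer emits one scalar progress value per frame and that the reward at each step is the Spearman correlation between those values and frame order over the most recent frames. The window length and function names are illustrative assumptions, and ties are broken by position.

```python
import numpy as np

def ranks(values):
    """Ranks 0..n-1 by ascending value (ties broken by position)."""
    order = np.argsort(values, kind="stable")
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def spearman_progress_reward(progress_scores, window=8):
    """Spearman correlation between predicted progress and time order in the last `window` frames."""
    recent = np.asarray(progress_scores[-window:], dtype=float)
    n = len(recent)
    if n < 2:
        return 0.0  # correlation is undefined for a single frame
    d = ranks(recent) - np.arange(n, dtype=float)
    # Classic rank-difference formula, valid because both rankings are permutations.
    return float(1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1)))

steady = [0.1, 0.2, 0.35, 0.5, 0.8]        # scorer sees monotone progress
regressing = [0.8, 0.5, 0.35, 0.2, 0.1]    # scorer sees the agent moving backwards
print(spearman_progress_reward(steady), spearman_progress_reward(regressing))
```

Only the ordering of the scores matters, so rescaling or shifting the scorer's output leaves the reward unchanged, which is the scale-invariance property the summary refers to.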
Results
RTA was evaluated on discrete control benchmarks (PyBoy: Catrap, Kirby) and continuous control tasks (PointMaze, MetaWorld), consistently matching or outperforming prior video-based reward learning methods and rank-based baselines. The results highlight the effectiveness of the correlation-based reward signal in enabling control policy learning in reward-free environments.
Implications
The findings suggest that RTA could be applied in various domains where reward design is challenging, such as robotics and video game environments. The framework's ability to learn from video without explicit rewards could lead to more generalist agents capable of adapting to diverse tasks and environments.
Finite-Lag Operator Geometry of Recurrent Representations
Theory
Time Series
- Introduces finite-lag operator geometry for recurrent representations, emphasizing temporal dynamics.
- Develops a conditional transport law and a source-centered transport tensor that captures the geometry of recurrent states.
- Proves structural results including affine covariance and stability of estimators on bounded trajectory clouds.
- Demonstrates the framework's effectiveness in detecting deterministic recurrent motion not visible to traditional methods.
Summary
This paper introduces a novel framework for analyzing recurrent representations in machine learning through the lens of finite-lag operator geometry. Traditional methods often treat representations as static point clouds, but this work emphasizes the importance of the temporal dynamics inherent in recurrent hidden states. The author develops a conditional transport law, Q(dy | x), which captures the relationship between source and successor states over a fixed lag. From this, a source-centered transport tensor, G, is derived, providing a decomposition into conditional spread and coherent displacement. Additionally, an antisymmetric coordinate circulation, W, is introduced to summarize directed flow in the representation space. The paper proves several structural results, including affine covariance and stability of the dense Gaussian estimator, and demonstrates that deterministic recurrent motion can be detected even when traditional infinitesimal methods fail. Controlled experiments validate the theoretical findings, revealing architecture-dependent differences in transport scale and coherent displacement in performance-matched networks. This framework offers a fresh perspective on recurrent dynamics, moving beyond linear-Gaussian assumptions and providing a robust geometric interpretation of recurrent computations.
Methodology
The methodology involves defining a finite-lag conditional transport law based on observed source-successor pairs, estimating it using a dense Gaussian source-smoothing operator. The resulting transport tensor is analyzed for its geometric properties, and various structural results are derived to establish its stability and covariance characteristics.
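One plausible reading of this estimator, sketched under stated assumptions: source-successor pairs are taken from a single hidden-state trajectory, a Gaussian kernel around a query source weights the pairs, and the source-centered second moment of the successors splits into a spread term plus the outer product of the mean displacement. The summary does not give the paper's exact definitions, so the names and the decomposition below are illustrative.

```python
import numpy as np

def transport_statistics(states, lag, x, bandwidth):
    """Kernel-weighted successor statistics about a query source x (requires lag >= 1)."""
    states = np.asarray(states, dtype=float)
    src, dst = states[:-lag], states[lag:]            # observed source-successor pairs
    sq_dist = np.sum((src - x) ** 2, axis=1)
    w = np.exp(-0.5 * sq_dist / bandwidth ** 2)       # Gaussian source smoothing
    w /= w.sum()
    disp = dst - x                                    # successor displacement about the source
    mean_disp = w @ disp                              # coherent displacement
    tensor = (disp * w[:, None]).T @ disp             # second moment centered at the source
    centered = disp - mean_disp
    spread = (centered * w[:, None]).T @ centered     # conditional spread
    return tensor, spread, mean_disp

# Deterministic rotation in the plane as a toy recurrent trajectory.
t = np.arange(200)
states = np.stack([np.cos(0.1 * t), np.sin(0.1 * t)], axis=1)
tensor, spread, mean_disp = transport_statistics(states, lag=5, x=states[0], bandwidth=0.2)
print(np.allclose(tensor, spread + np.outer(mean_disp, mean_disp)))
```

In this toy case the spread is small and the displacement term dominates, which is the kind of deterministic recurrent motion the framework is said to detect.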
Results
The main results include the successful decomposition of the transport tensor into conditional spread and coherent displacement, the introduction of an antisymmetric circulation statistic, and the validation of these concepts through controlled experiments. The framework reveals significant differences in transport characteristics across different network architectures.
Implications
This work has implications for understanding the dynamics of recurrent neural networks, providing tools for analyzing their behavior in a geometric context. It may enhance the design and interpretation of recurrent architectures in various applications, including time series prediction and sequential data modeling.
Zeus: Towards Tuning-Free Foundation Model for Time Series Analysis
Time Series
- ZEUS is a unified TSFM that operates without task-specific fine-tuning.
- It incorporates a multi-scale Transformer architecture to balance granularity and scalability.
- MOTM enables ZEUS to accommodate diverse task-specific inductive biases.
- ZEUS achieves competitive performance across five downstream tasks in a tuning-free manner.
Summary
The paper introduces ZEUS, a unified tuning-free Time Series Foundation Model (TSFM) designed to enhance performance across various time series analysis tasks without the need for task-specific fine-tuning. ZEUS addresses two major challenges in multi-task generalization: the architectural dilemma between point-level granularity and long-sequence scalability, and the training dilemma posed by divergent inductive biases across different tasks. To tackle these issues, ZEUS employs a multi-scale Transformer architecture that utilizes point-wise tokenization and a U-shaped hierarchy, allowing it to balance fine-grained detail with computational efficiency. Additionally, it introduces Multi-Objective Temporal Masking (MOTM), a strategy that supports various tasks such as extrapolation, interpolation, and global abstraction within a single framework. Extensive experiments demonstrate that ZEUS achieves state-of-the-art performance across five representative tasks, showcasing its potential as a general-purpose TSFM capable of out-of-the-box deployment across diverse downstream applications.
Methodology
ZEUS employs a U-shaped multi-scale Transformer architecture that utilizes point-wise tokenization to preserve fine-grained temporal details while ensuring computational efficiency for long sequences. It integrates Multi-Objective Temporal Masking (MOTM) to expose the model to various corruption patterns during pretraining, allowing it to learn a versatile representation space suitable for multiple tasks.
Results
The experimental results indicate that ZEUS consistently outperforms existing task-specific models in tuning-free settings across five tasks: point forecasting, probabilistic forecasting, anomaly detection, imputation, and classification. This performance underscores ZEUS's capability as a general-purpose TSFM.
Implications
The development of ZEUS has significant implications for time series analysis, enabling practitioners to deploy a single model across various tasks without the need for extensive fine-tuning. This could streamline workflows in fields such as finance, healthcare, and environmental monitoring, where time series data is prevalent.
The risk of KV cache compression
NLP
Large Language Models
Theory
Efficient ML
- Characterizes the minimax risk of KV cache compression, providing a theoretical foundation for its design.
- Identifies the intrinsic compressibility of KV caches based on future query interactions.
- Proposes novel design principles for efficient KV compression during autoregressive decoding.
- Instantiates these principles in a practical algorithm that shows promising performance on LongBench.
Summary
This paper addresses the challenges of KV cache compression in Transformer models, which is essential for efficient inference on long sequences. The authors identify that while KV cache compression can significantly reduce memory and runtime costs, existing methods are primarily based on empirical experimentation without a systematic theoretical framework. They bridge this gap by characterizing the minimax risk of KV cache compression, providing insights into when and how accurate compression is feasible. The authors establish a graded account of compressibility based on the interaction between the cache and future queries, leading to novel design principles for efficient KV compression under causal masking. They propose a practical algorithm based on these principles and evaluate its performance on LongBench, demonstrating promising results. Overall, the paper advances the understanding of KV cache compression, offering both theoretical guarantees and practical methodologies.
Methodology
The authors recast KV cache compression as a sparse approximation problem, formalizing its minimax risk. They derive upper and lower bounds on this risk based on the complexity of future queries and develop design principles for efficient compression. A practical algorithm is instantiated and evaluated through targeted experiments.
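To make the sparse-approximation view concrete, the sketch below keeps only k key-value pairs and measures the worst attention-output error over a set of probe queries. The selection rule (total attention mass over the probes) is a common heuristic used here as a stand-in, not the paper's algorithm, and the empirical worst case over sampled queries only approximates a minimax quantity.

```python
import numpy as np

def attention(q, K, V):
    """Single-query softmax attention output."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def compress_top_k(K, V, probe_queries, k):
    """Keep the k cache entries with the largest total attention mass over the probes (k >= 1)."""
    scores = probe_queries @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    keep = np.sort(np.argsort(w.sum(axis=0))[-k:])    # preserve original token order
    return K[keep], V[keep], keep

def worst_case_error(K, V, Kc, Vc, queries):
    """Largest output deviation over the given queries (empirical worst case)."""
    return max(np.linalg.norm(attention(q, K, V) - attention(q, Kc, Vc)) for q in queries)

rng = np.random.default_rng(0)
n, d = 64, 16
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
probes = rng.normal(size=(32, d))                     # stand-in for future queries

Kc, Vc, keep = compress_top_k(K, V, probes, k=16)
err_16 = worst_case_error(K, V, Kc, Vc, probes)
K_full, V_full, _ = compress_top_k(K, V, probes, k=n)
err_full = worst_case_error(K, V, K_full, V_full, probes)
print(f"kept {len(keep)} of {n} entries, worst-case error {err_16:.3f}; full cache error {err_full:.3f}")
```

How small the error can be for a given k depends on how the future queries interact with the cache, which is the dependence the paper's bounds characterize.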
Results
The proposed algorithm demonstrates promising performance on LongBench, achieving minimax-optimal compression risk. The theoretical bounds established provide a clear understanding of the conditions under which accurate KV cache compression is possible.
Implications
The findings have significant implications for the design of efficient Transformer models, particularly in applications requiring long-context processing. The theoretical insights can guide future research in KV cache compression and related areas, potentially leading to more efficient machine learning models.
Single-Channel EEG-Based Cognitive Load Assessment in Online Learning: A Hybrid Deep Learning Approach
Time Series
- Demonstrates the potential of single-channel EEG for cognitive load assessment in online learning.
- Achieves up to 78.5% accuracy using a hybrid CNN+LSTM+Attention model, outperforming traditional classifiers.
- Advocates for subject-independent evaluation to ensure model generalizability.
- Provides a reproducible evaluation pipeline and an open visualization tool for educators.
Summary
This paper investigates the feasibility of using a single-channel EEG device, specifically the NeuroSky MindWave Mobile 2, to assess cognitive load during online learning. The authors aim to distinguish between easy and difficult educational video content by employing a hybrid deep learning model that integrates Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, and attention mechanisms. The study utilizes a dataset from Wang et al., involving nine learners, and reports a maximum accuracy of 78.5% in a within-subject evaluation, significantly outperforming conventional feature-based classifiers, which achieved only 55%. The authors emphasize the importance of subject-independent evaluation for generalizability and provide a reproducible evaluation pipeline, including leave-one-subject-out cross-validation and significance testing. Additionally, they introduce an open-source visualization tool that allows educators to see cognitive load estimates as a heatmap over the video timeline, aiding in identifying challenging segments for learners. The work is framed as a feasibility study rather than a fully deployable system, highlighting the need for further research and validation with larger datasets.
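Leave-one-subject-out cross-validation, the subject-independent protocol mentioned above, can be sketched in a few lines. The segment counts and identifiers below are hypothetical. Only the nine-learner figure comes from the summary.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx), holding each subject out exactly once."""
    for held_out in sorted(set(subject_ids)):
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train_idx, test_idx

# Nine learners with a hypothetical four EEG segments each.
subject_ids = [subject for subject in range(9) for _ in range(4)]
folds = list(leave_one_subject_out(subject_ids))

for held_out, train_idx, test_idx in folds[:2]:
    print(f"subject {held_out}: {len(train_idx)} training segments, {len(test_idx)} test segments")
```

No segment from the held-out learner appears in training, so the score reflects generalization to an unseen person rather than to unseen segments from a known one.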
Methodology
The authors developed a hybrid model combining CNN, LSTM, and attention mechanisms, utilizing both raw EEG waveforms and band-power features. They conducted experiments on a dataset of nine learners, implementing regularization techniques to mitigate overfitting and improve validation accuracy.
Results
The hybrid model achieved a maximum accuracy of 78.5% in distinguishing cognitive load levels in a within-subject setting, whereas conventional classifiers reached only 55%. Regularization techniques helped stabilize validation accuracy between 68% and 73%. The authors caution that these results may be optimistic due to the small sample size.
Implications
This research suggests that EEG technology can provide valuable insights into cognitive load during online learning, potentially guiding instructional design and interventions. The open-source tools developed could facilitate further research and practical applications in educational settings.
EHHN: An Event-driven Heterogeneous Hypergraph Network for Object-Centric Next Activity Prediction
Graph Learning
Time Series
Optimization
- Introduction of a heterogeneous hypergraph representation for object-centric next activity prediction.
- Development of a micro-spatial encoder that models the asymmetric roles of events and objects.
- Design of a macro-evolution encoder that captures inter-event timing and global execution patterns.
- EHHN achieves state-of-the-art performance on OCEL benchmarks, outperforming nine baseline methods.
Summary
The paper presents EHHN, an innovative approach for next activity prediction in service-oriented processes, which often involve multiple typed business objects. Traditional methods typically rely on single-case event logs, which fail to capture the complexity of interactions among various objects. EHHN addresses this limitation by utilizing object-centric event logs (OCELs) that represent events alongside all involved objects. The authors propose a heterogeneous hypergraph representation where event-object hyperedges connect events with their co-participating objects, and lifecycle hyperedges group events related to a primary object. EHHN employs a dual-stream architecture: a micro-spatial stream that models object-state evolution driven by events, and a macro-evolution stream that captures temporal dynamics through global prototypes. This architecture allows for a more comprehensive understanding of the relationships and timing between events and objects, leading to improved prediction accuracy. The experimental results demonstrate that EHHN outperforms existing methods, achieving significant improvements in accuracy and macro F1-score across multiple datasets while also reducing peak GPU memory usage.
Methodology
EHHN utilizes a dual-stream architecture with a heterogeneous hypergraph representation. The micro-spatial stream models event-driven object-state evolution, while the macro-evolution stream employs time-aware attention to capture temporal dynamics and retrieve global execution patterns. This combined approach allows for effective next activity prediction by integrating local and global information.
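The two hyperedge types described in the summary can be illustrated on a toy object-centric log. The record layout, the activity names, and the choice of `order` as the primary object type are assumptions made for this sketch, not the paper's data format.

```python
from collections import defaultdict

# Hypothetical OCEL-style records: (event_id, activity, timestamp, {object_type: [object_ids]})
events = [
    ("e1", "place order", 1, {"order": ["o1"], "item": ["i1", "i2"]}),
    ("e2", "pick item", 2, {"item": ["i1"]}),
    ("e3", "pick item", 3, {"item": ["i2"]}),
    ("e4", "ship order", 4, {"order": ["o1"], "item": ["i1", "i2"]}),
]

def build_hyperedges(events, primary_type):
    """Return event-object hyperedges and lifecycle hyperedges for the primary object type."""
    event_object = {}
    lifecycle = defaultdict(list)
    for event_id, _activity, _timestamp, objects in sorted(events, key=lambda e: e[2]):
        members = [(otype, oid) for otype, ids in objects.items() for oid in ids]
        # One hyperedge per event: the event node plus every co-participating object.
        event_object[event_id] = [("event", event_id)] + members
        # One hyperedge per primary object: all events in its lifecycle, in time order.
        for oid in objects.get(primary_type, []):
            lifecycle[(primary_type, oid)].append(event_id)
    return event_object, dict(lifecycle)

event_object, lifecycle = build_hyperedges(events, primary_type="order")
print(event_object["e1"])
print(lifecycle)
```

A single-case event log would have to flatten `e1` onto one object, whereas the hyperedge keeps the order and both items attached to the same event.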
Results
EHHN demonstrated superior performance on four public OCEL benchmarks, achieving the highest accuracy and macro F1-score compared to nine baseline models. The improvements ranged from 8.1 to 12.4 percentage points over the strongest baseline, and EHHN also reduced peak GPU memory usage by a factor of up to 24.
Implications
The findings suggest that EHHN can significantly enhance predictive process monitoring in service-oriented applications, enabling better resource allocation and risk management. Its ability to handle complex interactions among multiple objects opens avenues for more sophisticated predictive analytics in various domains.
Self-explainable Operator Learning for Discovering Spatial Patterns in Functional Data
Interpretability
- Introduces a self-explainable operator learning framework for functional data.
- Enhances interpretability by linking input regions to output predictions.
- Demonstrates effectiveness in fluid flow problems, revealing spatial feature importance.
- Offers a transparent alternative to traditional opaque neural network-based models.
Summary
This paper presents a novel self-explainable operator learning framework aimed at enhancing interpretability in modeling complex physical systems within functional spaces. Traditional operator learning methods, particularly those based on neural networks, often lack transparency, making it difficult to understand the reasoning behind their predictions. The authors propose a reformulation of operator learning as a linear combination of generalized functional linear models expressed through integral equations. By leveraging the additive decomposability of these equations, the framework divides the input domain into subdomains, allowing for the computation of localized integrals that clarify the contribution of each region to the final prediction. This approach not only provides direct interpretability but also links specific input regions to corresponding output patterns, revealing the spatial features that influence predictions. The framework is validated through function-to-scalar and function-to-function mappings in fluid flow problems, such as blood flow and unsteady aerodynamics. The results indicate that the model prioritizes regions with strong feature gradients, offering insights into the decision-making process. Compared to existing post-hoc explainability methods, this framework embeds explainability directly within the operator structure, enhancing trust in machine learning applications for scientific analysis.
Methodology
The authors reformulate operator learning as a linear combination of generalized functional linear models using integral equations. They exploit the additive decomposability of these equations to compute localized integrals over subdomains, allowing for an interpretable evaluation of contributions from different input regions.
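The additive decomposition can be illustrated for a single function-to-scalar term with an identity link, assuming a prediction of the form of an integral of a learned kernel times the input function. The kernel, input function, and subdomain boundaries below are made up for the example.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule on a sampled grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def regional_contributions(u, beta, x, boundaries):
    """Split the integral of beta(x) * u(x) into one integral per subdomain.

    `boundaries` holds grid indices; adjacent subdomains share their boundary point,
    so the pieces add up to the integral over the whole domain.
    """
    integrand = beta * u
    return [trapezoid(integrand[a:b + 1], x[a:b + 1])
            for a, b in zip(boundaries[:-1], boundaries[1:])]

x = np.linspace(0.0, 1.0, 101)
u = 1.0 + np.tanh(20.0 * (x - 0.5))            # input function with a sharp gradient near 0.5
beta = np.exp(-((x - 0.5) / 0.1) ** 2)         # hypothetical learned kernel
boundaries = [0, 25, 50, 75, 100]              # four equal subdomains

parts = regional_contributions(u, beta, x, boundaries)
total = trapezoid(beta * u, x)
for (a, b), part in zip(zip(boundaries[:-1], boundaries[1:]), parts):
    print(f"x in [{x[a]:.2f}, {x[b]:.2f}]: contribution {part:.4f}")
print(f"sum of parts {sum(parts):.4f}, full integral {total:.4f}")
```

Each printed contribution is the explanation for its region, so no separate post-hoc attribution step is needed.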
Results
The framework successfully identifies and prioritizes regions with strong feature gradients in fluid flow problems, providing meaningful insights into the model's decision-making process. Comparisons with established post-hoc methods show qualitative agreement, but the proposed approach offers inherent explainability without requiring additional tools.
Implications
This framework has the potential to foster trust in machine learning applications within scientific domains by providing interpretable models that can uncover relationships in complex physical systems. It may also facilitate more informed data-driven analyses and enhance the usability of machine learning in scientific research.
Denser $\neq$ Better: Limits of On-Policy Self-Distillation for Continual Post-Training
NLP
Large Language Models
Reinforcement Learning
- On-policy self-distillation can enhance specialization but is fragile and prone to forgetting.
- SDPO shows weaker retention compared to traditional on-policy reinforcement learning methods like GRPO.
- Increased supervision density can lead to sensitivity and accumulated artifacts, complicating continual learning.
- The study highlights the importance of teacher stability and token reliability in the effectiveness of self-distillation.
Summary
This paper investigates the effectiveness of on-policy self-distillation for continual post-training in large language models (LLMs). The authors challenge the prevailing notion that on-policy learning, particularly through self-distillation policy optimization (SDPO), inherently mitigates forgetting while enhancing specialization. The study reveals that while SDPO can accelerate in-domain specialization when teacher signals are stable, it struggles with generalization to out-of-distribution scenarios and exhibits increased forgetting and potential collapse during continual post-training. The authors analyze the impact of supervision density on performance, demonstrating that denser self-distillation can lead to larger parameter and response drift, amplifying artifacts through a self-reinforcing loop. The findings suggest that on-policy data alone is insufficient for effective continual learning, and that SDPO should not be viewed as a default stabilizer for continual post-training. The paper emphasizes the need to differentiate between the source of data and the training objective to better understand the dynamics of continual learning.
Methodology
The authors conducted experiments comparing self-distillation policy optimization (SDPO) with sequence-level reward optimization (GRPO) in both single-domain and multi-domain continual settings. They varied supervision density and assessed performance on in-distribution and out-of-distribution benchmarks to evaluate specialization, retention, and transfer. The analysis included diagnostics for parameter drift, response drift, and collapse modes.
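The difference in supervision density can be sketched as follows. The group-normalized advantage is the standard GRPO form, with one scalar per sampled response. The token-level signal is a hypothetical stand-in for self-distillation, using a per-token gap between teacher and student log-probabilities, and is not the paper's SDPO objective.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward normalized by its group's mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def sequence_level_signal(rewards, lengths):
    """Sparse supervision: one scalar per response, broadcast to all of its tokens."""
    advantages = group_relative_advantages(rewards)
    return [np.full(n, a) for a, n in zip(advantages, lengths)]

def token_level_signal(teacher_logprobs, student_logprobs):
    """Dense supervision (hypothetical form): a separate value for every token."""
    return [np.asarray(t) - np.asarray(s) for t, s in zip(teacher_logprobs, student_logprobs)]

rewards = [1.0, 0.0, 0.0, 1.0]                 # one verifier score per sampled response
lengths = [3, 5, 4, 2]                         # response lengths in tokens
sparse = sequence_level_signal(rewards, lengths)
dense = token_level_signal([[-0.1, -0.2, -0.9]], [[-0.5, -0.1, -0.3]])

print("distinct values per response (sequence-level):", [len(set(s.tolist())) for s in sparse])
print("distinct values per response (token-level):  ", [len(set(d.tolist())) for d in dense])
```

With a dense signal, every unreliable teacher token contributes its own gradient term, which is one way to read the sensitivity and artifact accumulation the study reports.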
Results
The results indicate that while SDPO can significantly improve performance on current tasks, it also increases the risk of forgetting and model collapse. The study found that forgetting is particularly pronounced on tasks that are misaligned with the current training domain, and that the benefits of SDPO are contingent on the stability of teacher signals and the reliability of token-level supervision.
Implications
The findings suggest that practitioners should exercise caution when applying on-policy self-distillation for continual learning, as it may not provide the expected retention benefits. The insights into the trade-offs of supervision density and the dynamics of continual learning could inform future research and applications in training large language models.
Multilayer Q-Matrix-Embedded Neural Network for Cognitive Diagnosis (M-QCDNet): Structure-Aware Deep Learning Architecture for Psychometric Interpretability
Interpretability
- M-QCDNet integrates Q-matrix structure into a neural network for enhanced interpretability and predictive accuracy in cognitive diagnosis.
- The model introduces new evaluation metrics to quantify alignment between predicted skill activations and cognitive theory.
- M-QCDNet supports practical applications in educational settings, enabling early detection of learning difficulties.
- The architecture maintains psychometric meaning while leveraging deep learning capabilities, distinguishing it from prior models.
Summary
The paper introduces M-QCDNet, a multilayer Q-matrix-embedded neural network designed for cognitive diagnosis, which merges the structural interpretability of cognitive diagnostic models (CDMs) with the capabilities of deep neural networks. By embedding the Q-matrix as a structural prior, M-QCDNet ensures that latent mastery profiles are interpretable and aligned with cognitive theory. The model employs a novel loss function with an L2 penalty to maintain this alignment while balancing predictive performance. New evaluation metrics, including the Q-Matrix Consistency Ratio (QCR) and Off-Support Activation (OSA), are proposed to assess how well the learned representations conform to cognitive theory. M-QCDNet enhances classroom practices by facilitating early detection of learning difficulties and supporting mastery-based interventions. It bridges the gap between psychometric transparency and neural flexibility, promoting interpretable and actionable AI in cognitive diagnostics. The model is distinct from previous neural CDMs as it integrates the Q-matrix throughout multiple layers, enhancing both predictive accuracy and interpretability, while also demonstrating compatibility with various CDM frameworks.
Methodology
M-QCDNet employs a multilayer architecture that embeds the Q-matrix as a structural prior across multiple neural layers. It introduces a specialized loss function with an L2 penalty to ensure that latent skill representations align with the Q-matrix. Additionally, it develops new interpretability metrics to evaluate the model's performance against cognitive theory.
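A minimal sketch of what an L2 alignment penalty against the Q-matrix could look like, assuming an item-by-skill weight matrix and a binary Q-matrix of the same shape. The penalty weight and the off-support diagnostic are hypothetical, since the summary does not give the paper's definitions of the loss, QCR, or OSA.

```python
import numpy as np

def off_support_penalty(activations, q_matrix, weight=0.1):
    """L2 penalty on item-skill activations where the Q-matrix entry is zero."""
    off_support = activations * (1.0 - q_matrix)
    return weight * float(np.sum(off_support ** 2))

def off_support_share(activations, q_matrix):
    """Hypothetical OSA-style diagnostic: share of absolute activation mass off the Q-matrix support."""
    total = np.abs(activations).sum()
    off = np.abs(activations * (1.0 - q_matrix)).sum()
    return float(off / total) if total > 0 else 0.0

# Three items, two skills: item 0 needs skill 0, item 1 needs skill 1, item 2 needs both.
q_matrix = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
aligned = 0.8 * q_matrix                       # activations only where the Q-matrix allows
leaky = aligned + 0.3 * (1.0 - q_matrix)       # some activation leaks onto unrelated skills

print(off_support_penalty(aligned, q_matrix), off_support_penalty(leaky, q_matrix))
print(off_support_share(aligned, q_matrix), round(off_support_share(leaky, q_matrix), 3))
```

Adding such a term to the prediction loss lets the network trade a small amount of fit for activations that stay on the expert-specified item-skill map.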
Results
M-QCDNet shows improved predictive performance and structural interpretability compared to traditional cognitive diagnostic models. The new evaluation metrics effectively quantify the alignment of predicted skill profiles with predefined item-skill mappings, demonstrating the model's capability to maintain cognitive interpretability while leveraging deep learning.
Implications
The findings suggest that M-QCDNet can significantly enhance cognitive diagnostics in educational settings by providing interpretable insights into student learning and mastery. Its design allows for actionable interventions based on early detection of learning difficulties, promoting a more personalized learning experience.
Towards Learning Representations of Policies in Two-Player Zero-Sum Imperfect-Information Games
Reinforcement Learning
Theory
- Introduction of methods for creating diverse datasets of policies in games.
- Proposal of multiple techniques for learning policy representations, including weight autoencoders and functional encoders.
- Evaluation of learned representations through downstream tasks, confirming the presence of useful behavioral embeddings.
- Focus on two-player zero-sum imperfect-information games, particularly Kuhn and Leduc Poker.
Summary
This paper addresses the challenge of learning effective policy representations in two-player zero-sum imperfect-information games, such as poker. The authors present three main contributions: the development of methods to create datasets of policies, the proposal of various techniques for learning policy representations, and the introduction of downstream tasks to evaluate these representations. The study focuses on Kuhn and Leduc Poker, demonstrating that even basic methods yield embeddings that capture useful behavioral information. This work is notable for being one of the first systematic comparisons of self-supervised learning techniques for policy representation in game-theoretic contexts, providing a foundation for future research in this area.
Methodology
The authors developed three methods for generating datasets of policies: random initialization of policy neural networks, the Policy-Space Response Oracles (PSRO) algorithm, and a variant of neural population learning (NeuPL). For learning policy representations, they proposed several methods, including a weight autoencoder that reconstructs policy weights and a functional encoder that reconstructs behavior based on action distributions. They evaluated these methods using downstream tasks to assess the effectiveness of the learned embeddings.
Results
The evaluation of the proposed methods on Kuhn and Leduc Poker showed that the learned embeddings contained useful behavioral representations, indicating that even basic techniques can yield significant insights into policy behavior in imperfect-information games.
Implications
The findings suggest that effective policy representations can enhance agents' reasoning capabilities in complex game scenarios, potentially improving strategies in competitive environments. This research lays the groundwork for future advancements in self-supervised learning techniques applied to game-theoretic contexts.
Model Merging as Probabilistic Inference in Fine-Tuning Parameter Space
Optimization
Theory
Efficient ML
- Introduces a probabilistic framework for model merging that improves upon traditional geometric approaches.
- Formulates model merging as MAP inference under a product of task-specific energy-based experts.
- Identifies the limitations of Gaussian assumptions in existing methods and proposes a heavy-tailed PoE design.
- Demonstrates significant performance improvements over state-of-the-art methods in empirical evaluations.
Model Merging as Probabilistic Inference in Fine-Tuning Parameter Space
Summary
This paper introduces a novel framework for model merging, which combines multiple task-specific models into a single multi-task model without requiring additional fine-tuning. Traditional methods often rely on geometric properties of solution spaces, which limit their ability to assess the statistical utility of each task-specific update direction. The authors propose a probabilistic inference approach using a product-of-experts (PoE) model, where each task-specific solution acts as an energy-based expert model (EBM). This framework allows for a more nuanced understanding of how to aggregate updates by considering the confidence of each direction based on its support across tasks. The authors identify that existing methods often assume Gaussian distributions for directional residuals, which do not fit well with the observed heavy-tailed behavior of these residuals. To address this, they develop a heavy-tailed PoE design using Cauchy experts, which better captures the characteristics of the residuals and ensures a convergent inference procedure. Empirical evaluations demonstrate that this new approach significantly outperforms state-of-the-art baselines across various tasks and architectures, highlighting its effectiveness in practical applications.
Methodology
The authors develop a probabilistic model merging framework that formulates merging as MAP inference in the fine-tuning parameter space. They utilize energy-based expert models to represent task-specific solutions and aggregate evidence from these experts to determine the merged model parameters. The framework is extended to incorporate heavy-tailed distributions to better align with the observed behavior of directional residuals.
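The contrast between Gaussian and heavy-tailed experts can be illustrated on a single scalar parameter. The sketch below is not the paper's procedure: the function names, the fixed scale `gamma`, and the iteratively reweighted update are illustrative assumptions. With equal-variance Gaussian experts the MAP estimate is the plain mean of the task updates, whereas Cauchy experts downweight an update that disagrees with the rest.

```python
def gaussian_poe_map(updates):
    """MAP under a product of equal-variance Gaussian experts: the mean."""
    return sum(updates) / len(updates)


def cauchy_poe_map(updates, gamma=1.0, iters=100):
    """Approximate MAP under a product of Cauchy experts (hypothetical sketch).

    Energy: sum_i log(1 + ((theta - u_i) / gamma) ** 2).
    Setting its gradient to zero gives a weighted mean with weights
    w_i = 1 / (gamma**2 + (theta - u_i)**2), iterated towards a fixed point.
    """
    theta = sorted(updates)[len(updates) // 2]  # start near the median
    for _ in range(iters):
        weights = [1.0 / (gamma ** 2 + (theta - u) ** 2) for u in updates]
        theta = sum(w * u for w, u in zip(weights, updates)) / sum(weights)
    return theta


updates = [0.9, 1.0, 1.1, 10.0]  # three tasks agree, one is an outlier
mean_merge = gaussian_poe_map(updates)
robust_merge = cauchy_poe_map(updates)
```

Here `mean_merge` is pulled far towards the outlier, while `robust_merge` stays close to the three agreeing updates.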
Results
The proposed heavy-tailed PoE model shows substantial improvements in merging performance compared to existing state-of-the-art methods across multiple tasks and architectures. The empirical results validate the effectiveness of the new framework in capturing the complexities of model merging.
Implications
This work has significant implications for the deployment of multi-task models in real-world applications, particularly in scenarios where data is decentralized or privacy-sensitive. The framework can enhance the efficiency of model merging, making it more applicable in resource-constrained environments such as edge devices.
SINA: A Fully Automated Circuit Schematic Image to Netlist Generator Using Artificial Intelligence
Computer Vision
NLP
Multimodal
- SINA achieves a netlist generation accuracy of 96.67%, outperforming existing methods by 2.72 times.
- The system is capable of processing both IC and PCB schematics, including hand-drawn and scanned images.
- SINA incorporates advanced techniques such as deep learning, OCR, and VLM for enhanced component detection and connectivity inference.
- The methodology addresses common pitfalls in existing automated conversion methods, such as misidentifying wire connections and reference designators.
SINA: A Fully Automated Circuit Schematic Image to Netlist Generator Using Artificial Intelligence
Summary
The paper presents SINA, an open-source, fully automated pipeline designed to convert circuit schematic images into machine-readable netlists, addressing the limitations of existing methods in Electronic Design Automation (EDA). Current approaches struggle with generalization across Integrated Circuit (IC) and Printed Circuit Board (PCB) schematics, accurate component recognition, reliable connectivity inference, distinguishing crossing wires from connections, and effective reference designator extraction. SINA integrates deep learning for component detection, connected-component labeling for connectivity inference, Optical Character Recognition (OCR) for extracting component designators, and a Vision-Language Model (VLM) for assigning these designators accurately. The system is robust across various circuit styles and can handle hand-drawn and scanned schematics without assumptions about color or resolution. Validation of the generated netlists is performed using graph isomorphism techniques, demonstrating a significant improvement in accuracy over state-of-the-art methods.
Methodology
SINA employs a combination of deep learning for component detection, connected-component labeling for connectivity inference, Optical Character Recognition (OCR) for extracting reference designators, and a Vision-Language Model (VLM) for accurate assignment of these designators. The system is designed to be robust and generalizable across various schematic styles and formats.
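As an illustration of connected-component labeling for connectivity inference, the sketch below labels wire pixels on a tiny binary grid so that each label corresponds to one candidate net. The grid, the choice of 4-connectivity, and the function name are assumptions made for the example; the summary does not describe SINA's actual implementation.

```python
from collections import deque


def label_components(mask):
    """Label 4-connected foreground regions of a binary grid with 1, 2, ..."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and labels[r][c] == 0:
                # Unvisited wire pixel: start a new component and flood it.
                next_label += 1
                labels[r][c] = next_label
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
    return labels, next_label


# Hypothetical wire mask: 1 marks a wire pixel, 0 marks background.
wire_mask = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
]
labels, net_count = label_components(wire_mask)
```

Components that touch the same component terminal would then be merged into one net; that step is omitted here.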
Results
The experiments conducted show that SINA achieves an overall netlist generation accuracy of 96.67%, which is significantly higher than the accuracy of existing automated conversion methods. The validation of netlists is performed using graph isomorphism techniques to ensure correctness.
Implications
SINA's automated approach to converting circuit schematics into netlists can streamline the workflow in Electronic Design Automation, reduce manual transcription errors, and facilitate the creation of comprehensive databases for AI-based circuit design models. This could lead to enhanced efficiency in circuit design processes and broader accessibility to circuit design knowledge.
Many Voices, One Reward: Multi-Role Rubric Generation for LLM Judging and Reward Modeling
NLP
Large Language Models
Reinforcement Learning
- Identification of dimensional blind spots as a critical failure mode in single-voiced rubric generation.
- Introduction of Multi-Role Rubric Generation (MRRG) to elicit diverse evaluative perspectives.
- MRRG consistently outperforms existing single-role rubric generation methods across multiple benchmarks.
- Demonstrated improvements in reward signals for RLVR applications, enhancing LLM performance.
Many Voices, One Reward: Multi-Role Rubric Generation for LLM Judging and Reward Modeling
Summary
This paper addresses the challenge of generating reliable reward and preference signals for evaluating large language models (LLMs) on open-ended tasks. The authors introduce a novel framework called Multi-Role Rubric Generation (MRRG), which aims to overcome the limitations of existing single-role rubric generators that often lead to dimensional blind spots in human preference evaluation. MRRG elicits evaluation criteria from multiple complementary roles, such as users, domain experts, and educators, thereby creating a more comprehensive and auditable rubric-based scorer. This approach not only enhances the accuracy of preference validation but also improves the reward signals used in Reinforcement Learning with Verifiable Rewards (RLVR). The authors validate MRRG through experiments on various preference validation benchmarks, demonstrating its superiority over single-role baselines across multiple models. The results indicate that MRRG significantly enhances the quality of reward signals for open-ended generation tasks, paving the way for more effective LLM optimization.
Methodology
The authors propose MRRG as a training-free and reference-free framework that generates rubrics by instructing an LLM to adopt multiple evaluative roles. Each role produces distinct, verifiable rubric items, which are then consolidated into a comprehensive rubric-based scorer. This scorer is utilized for both validating pairwise preferences and providing rewards in RLVR settings.
Results
Experimental results show that MRRG outperforms the strongest single-role baselines by 3.1–16.4 percentage points on preference validation benchmarks. In RLVR experiments, MRRG improves reward signals by 1.7 points on BiGGen Bench and 3.4 points on HealthBench-Hard, indicating its effectiveness in enhancing open-ended generation.
Implications
The findings suggest that incorporating multiple evaluative perspectives can significantly improve the robustness and accuracy of reward modeling in LLMs. This approach may lead to more reliable evaluations and optimizations in various applications involving open-ended tasks, potentially transforming how LLMs are trained and assessed.
ART for Diffusion Sampling: Continuous-Time Control and Actor-Critic Learning
Generative Models
Reinforcement Learning
Optimization
- Introduction of Adaptive Reparameterized Time (ART) for optimal timestep allocation in diffusion sampling.
- Development of ART-RL, a reinforcement learning framework that learns sampling clock rates using Gaussian policies.
- Establishment of a theoretical link between ART and ART-RL, ensuring optimality in policy learning.
- Demonstration of ART's superior performance over traditional sampling schedules in various experimental settings.
ART for Diffusion Sampling: Continuous-Time Control and Actor-Critic Learning
Summary
This paper introduces Adaptive Reparameterized Time (ART), a novel framework for optimizing timestep allocation in score-based diffusion sampling. Traditional methods often rely on uniform or handcrafted schedules, which can be suboptimal. ART formulates the problem as a continuous-time optimal control problem, allowing for adaptive timestep selection that maximizes sampling efficiency within a fixed computational budget. The authors develop ART-RL, a reinforcement learning approach that utilizes Gaussian policies to learn the optimal sampling clock rate. They establish a theoretical connection between ART and ART-RL, demonstrating that the optimal policy from ART-RL aligns with the ART objective. Extensive experiments show that ART significantly enhances sample quality across various datasets and sampling budgets, outperforming established baseline schedules. Additionally, the learned schedules exhibit strong generalizability, transferring effectively across different datasets and sampling conditions without retraining. This work represents a significant advancement in the control-theoretic approach to generative diffusion sampling, providing a principled alternative to fixed heuristic grids.
Methodology
The authors formulate timestep allocation as a continuous-time optimal control problem, introducing ART. They then develop ART-RL, a reinforcement learning approach that employs Gaussian policies to learn the optimal sampling clock rate. The methodology includes rigorous theoretical analysis linking ART and ART-RL, along with actor-critic updates for policy improvement.
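One way to picture a sampling clock rate is as a positive function that says how fast diffusion time advances per unit of solver effort; integrating it yields a non-uniform timestep grid under a fixed step budget. The sketch below uses a hand-written rate in place of a learned policy, and the function name, the midpoint-rule integration, and the example rate are all assumptions rather than the paper's formulation.

```python
def schedule_from_clock_rate(rate, num_steps, t_start=1.0, t_end=0.0, grid=1000):
    """Turn a positive clock rate into num_steps + 1 sampling times.

    The rate is integrated over a uniform "solver clock" in [0, 1] with the
    midpoint rule, then rescaled so the schedule runs from t_start to t_end.
    `grid` should be a multiple of `num_steps`.
    """
    cumulative = [0.0]
    for i in range(grid):
        s_mid = (i + 0.5) / grid
        cumulative.append(cumulative[-1] + rate(s_mid) / grid)
    total = cumulative[-1]
    times = []
    for k in range(num_steps + 1):
        frac = cumulative[k * grid // num_steps] / total
        times.append(t_start + (t_end - t_start) * frac)
    return times


# Constant rate recovers the uniform schedule.
uniform = schedule_from_clock_rate(lambda s: 1.0, num_steps=10)
# Hypothetical decreasing rate: large steps early, fine steps near t_end.
adaptive = schedule_from_clock_rate(lambda s: 2.0 - 1.5 * s, num_steps=10)
```

Both schedules use the same number of steps; only the spacing differs, which is the quantity a learned clock rate would control.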
Results
ART consistently outperformed traditional sampling schedules, such as Uniform and DPM, across various datasets and sampling budgets. The learned schedules demonstrated broad generalizability, effectively transferring across different datasets and sampling conditions without the need for retraining.
Implications
The ART framework provides a robust, data-driven method for optimizing sampling in generative models, potentially enhancing the efficiency and quality of generative AI systems. Its generalizability suggests that it could be widely applicable across different domains and applications in generative modeling.
Expander Sparse Autoencoders: Parameter-Efficient Dictionaries for Mechanistic Interpretability
Interpretability
Efficient ML
Large Language Models
- Introduction of Expander SAEs, a parameter-efficient architecture for sparse autoencoders.
- Demonstrated a storage-fidelity trade-off across multiple language models, achieving 293× fewer learned decoder values with minimal loss in fidelity.
- Proposed a parallel implementation of OMP that enhances decoding efficiency.
- Theoretical proofs supporting the identifiability of k-sparse codes under specific conditions.
Expander Sparse Autoencoders: Parameter-Efficient Dictionaries for Mechanistic Interpretability
Summary
This paper introduces Expander Sparse Autoencoders (Expander SAEs), a novel architecture designed to enhance mechanistic interpretability in neural networks by efficiently learning sparse representations. Traditional Sparse Autoencoders (SAEs) face challenges with high storage and computational costs due to dense decoders, especially as feature counts increase. Expander SAEs address this by utilizing a left-d-regular expander mask, significantly reducing the number of learned decoder values while maintaining the integrity of the sparse-coding problem. The authors demonstrate that varying the sparsity parameter (d) leads to a consistent trade-off between storage efficiency and reconstruction fidelity across several language models, achieving substantial reductions in learned values while preserving performance. Additionally, they propose a parallel implementation of Orthogonal Matching Pursuit (OMP) that leverages the expander structure for efficient decoding. The theoretical contributions include proofs of identifiability for noiseless k-sparse codes under specific conditions, providing a solid foundation for the architecture's effectiveness. Overall, the paper presents a significant advancement in the field of mechanistic interpretability through efficient parameterization and improved decoding strategies.
Methodology
The authors developed Expander SAEs by masking the encoder and decoder dictionaries with a left-d-regular expander graph, which reduces the number of learned decoder values from mn to dn. They conducted experiments on various language models to analyze the trade-off between storage and fidelity, and implemented a parallel OMP for efficient decoding. Theoretical analysis was performed to establish conditions for identifiability of sparse codes.
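The reduction from mn to dn learned values follows directly from the mask: each of the n dictionary atoms keeps only d of its m coordinates, so the saving factor is m / d. The sketch below builds a random left-d-regular mask as a stand-in (the summary does not say how the expander graph is constructed) and counts the surviving entries. The sizes are hypothetical; m = 2048 is chosen only because m / d then rounds to the roughly 293-fold figure quoted for d = 7.

```python
import random


def left_d_regular_mask(n_features, m_dims, d, seed=0):
    """For each of n_features atoms, pick d distinct coordinates out of m_dims.

    A random left-d-regular bipartite graph is used as a stand-in for an
    expander; this is an assumption, not the paper's construction.
    """
    rng = random.Random(seed)
    return [sorted(rng.sample(range(m_dims), d)) for _ in range(n_features)]


n, m, d = 4096, 2048, 7  # hypothetical sizes, not taken from the paper
mask = left_d_regular_mask(n, m, d)
dense_values = m * n                                    # dense decoder: mn
masked_values = sum(len(support) for support in mask)   # masked decoder: dn
reduction = dense_values / masked_values                # equals m / d
```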
Results
Experiments showed that Expander SAEs could achieve 84% of the full dense cross-entropy loss while using 293× fewer learned decoder values in the Qwen2.5-3B model with d=7. The proposed architecture demonstrated a smooth trade-off between storage and reconstruction fidelity across different models.
Implications
The findings suggest that Expander SAEs can significantly reduce the computational and storage overhead associated with traditional sparse autoencoders, making them more practical for large-scale applications in mechanistic interpretability. This could lead to better understanding and interpretability of neural network representations, particularly in complex models like large language models.
Predictive Conformal Slip Monitoring: An Empirical Evaluation of Rolling Split Conformal Prediction for Pre-Incident Traction Loss Detection
Theory
Time Series
Optimization
- The study evaluates Rolling Split Conformal Prediction for detecting pre-incident traction loss in motorsport.
- Results showed a mean precision and recall of essentially 0.0, indicating the method's ineffectiveness.
- The high false-alarm rate (15.3%) suggests significant limitations in the current approach.
- Methodological rigor is emphasized, with diagnostics revealing violations of underlying assumptions.
Predictive Conformal Slip Monitoring: An Empirical Evaluation of Rolling Split Conformal Prediction for Pre-Incident Traction Loss Detection
Summary
This paper investigates the efficacy of Rolling Split Conformal Prediction as a method for pre-incident traction loss detection in motorsport telemetry. The study utilizes telemetry data from the 2023 Italian Grand Prix, focusing on 19 drivers and employing a Random Forest model to predict expected slip behavior. The authors aimed to identify rising volatility in prediction errors as a potential early warning signal for traction loss. However, the results were disappointing, with the method achieving a mean precision and recall of essentially 0.0 against 14 ground-truth incidents, while flagging an average of 15.3% of all samples as anomalous, indicating a high false-alarm rate. The study highlights methodological rigor in its negative findings, diagnosing issues such as the violation of the exchangeability assumption in conformal prediction, which contributed to the poor performance. The paper concludes by outlining necessary changes for future iterations of this predictive approach to be viable.
Methodology
The methodology involved training a Random Forest Regressor on telemetry data to predict slip behavior, using a corrected theoretical framework that included vehicle speed as a model feature. The study employed a rolling volatility metric to flag potential anomalies, which were then validated against real incident records from FIA Race Control Messages.
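The rolling split-conformal idea can be sketched as follows: the most recent prediction residuals form a calibration window, and a new residual is flagged when it exceeds the conformal quantile of that window. This is a simplified illustration, not the paper's pipeline: the window length, `alpha`, the function name, and the synthetic residuals are assumptions, and the paper's rolling volatility metric is replaced here by the raw absolute residual.

```python
import math


def rolling_conformal_flags(residuals, window=50, alpha=0.1):
    """Flag points whose absolute residual exceeds a rolling conformal quantile.

    For each time t, the previous `window` absolute residuals act as the
    calibration set. The threshold is their ceil((window + 1) * (1 - alpha))-th
    smallest value, clamped to the largest calibration score.
    """
    rank = min(math.ceil((window + 1) * (1 - alpha)), window)
    flags = []
    for t, r in enumerate(residuals):
        if t < window:
            flags.append(False)  # not enough calibration data yet
            continue
        calib = sorted(abs(x) for x in residuals[t - window:t])
        threshold = calib[rank - 1]
        flags.append(abs(r) > threshold)
    return flags


# Synthetic residuals: a small repeating pattern followed by one large spike.
residuals = [0.1 * ((i % 5) - 2) for i in range(60)] + [5.0]
flags = rolling_conformal_flags(residuals)
```

The coverage guarantee behind this quantile assumes the residuals are exchangeable, which is the assumption the paper reports as violated for every driver.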
Results
The method achieved a mean precision and recall of essentially 0.0 across 55,563 telemetry samples from 19 drivers, with a false-alarm rate of 15.3%. Diagnostics indicated that the split-conformal exchangeability assumption was violated for all drivers, contributing to the high false-alarm rate.
Implications
The findings suggest that while the approach of using conformal prediction for early warning signals is theoretically appealing, practical implementation requires significant refinement. Future work should address the identified methodological issues to enhance predictive capabilities in traction loss detection.
SCAPE: Accurate and Efficient LLM Training with Extreme Sparse Communication
Large Language Models
Optimization
Efficient ML
- SCAPE enables extreme sparsification of communication in LLM training without sacrificing model performance.
- The optimizer is built on AdamS, which shows improved robustness to high sparsity compared to AdamW.
- The method achieves up to 43.3% reduction in pre-training wall-clock time while maintaining model quality.
- SCAPE's approach allows for efficient synchronization of masks and computation, enhancing overall training efficiency.
SCAPE: Accurate and Efficient LLM Training with Extreme Sparse Communication
Summary
The paper introduces SCAPE, a novel communication-efficient distributed optimizer designed for training large language models (LLMs). As communication costs rise with model size and system scale, SCAPE addresses the limitations of existing methods that either sparsify gradients or quantize communication. SCAPE leverages the stability of the AdamS optimizer's first-moment statistics to enable aggressive sparsification without compromising model quality. It employs a partitioned mask-refresh mechanism aligned with optimizer sharding, computes masks from momentum-based statistics, and delays mask usage to allow for overlapping computation and synchronization. Implemented in Megatron-LM, SCAPE was evaluated by pre-training GPT-345M and Llama-500M on specific datasets using 32 NVIDIA GH200 GPUs. The results demonstrate that SCAPE maintains training stability and model performance even at 90% and 99% sparsity, significantly reducing pre-training time while preserving accuracy on downstream tasks.
Methodology
SCAPE utilizes a partitioned mask-refresh mechanism that aligns with optimizer sharding, computes masks from first-moment statistics instead of raw gradients, and delays mask usage to overlap with computation. This approach minimizes communication overhead by reconstructing necessary quantities for second-moment updates from a single synchronized sparse buffer.
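To show what computing a mask from first-moment statistics could look like, the sketch below keeps only the coordinates with the largest momentum magnitude and builds the sparse payload a worker would communicate. Selecting by top-k magnitude, the function names, and the toy vectors are assumptions; the summary does not specify SCAPE's selection rule, and partitioning, mask delay, and second-moment reconstruction are left out.

```python
def topk_mask_from_momentum(momentum, sparsity):
    """Keep the (1 - sparsity) fraction of coordinates with largest |momentum|."""
    keep = max(1, round(len(momentum) * (1.0 - sparsity)))
    order = sorted(range(len(momentum)),
                   key=lambda i: abs(momentum[i]), reverse=True)
    kept = set(order[:keep])
    return [i in kept for i in range(len(momentum))]


def sparse_payload(gradient, mask):
    """(index, value) pairs for the coordinates selected by the mask."""
    return [(i, g) for i, (g, keep) in enumerate(zip(gradient, mask)) if keep]


# Hypothetical first-moment estimates and a fresh gradient for ten parameters.
momentum = [0.01, -0.9, 0.02, 0.5, -0.03, 0.0, 0.04, -0.02, 0.01, 0.03]
gradient = [0.1, -0.8, 0.1, 0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
mask = topk_mask_from_momentum(momentum, sparsity=0.9)
payload = sparse_payload(gradient, mask)
```

At 90% sparsity only one of the ten coordinates is communicated in this toy case.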
Results
SCAPE was tested on GPT-345M and Llama-500M models, achieving significant reductions in pre-training time (up to 43.3%) while maintaining training stability and accuracy on downstream tasks, even at high sparsity levels (90% and 99%). For Llama-1.8B, SCAPE achieved a speedup of up to 3.26 times per step compared to dense AdamS.
Implications
SCAPE's communication-efficient training method could facilitate faster and more resource-efficient training of large language models, making it applicable in environments with limited computational resources. Its design may also inspire future optimizers that prioritize communication efficiency in distributed training settings.
Evolutionary Feature Engineering for Structured Data
Time Series
Optimization
Interpretability
- EFE framework utilizes LLMs for evolving preprocessing transformations in structured data.
- EFE-Time improves time-series forecasting accuracy with dataset-specific normalization.
- EFE-Tab evolves compact feature programs, enhancing interpretability and performance.
- The methodology integrates feedback from downstream performance to refine transformations.
Evolutionary Feature Engineering for Structured Data
Summary
The paper introduces Evolutionary Feature Engineering (EFE), a novel framework that leverages large language models (LLMs) to discover preprocessing transformations for structured data. EFE represents transformations as Python programs with a standardized fit/transform interface, enabling seamless integration into existing machine learning pipelines. The framework operates in two primary settings: EFE-Time for time-series forecasting and EFE-Tab for tabular prediction. EFE-Time evolves dataset-specific, invertible normalization programs that enhance the performance of time-series foundation models, achieving significant reductions in forecasting errors across various datasets. EFE-Tab focuses on evolving compact feature programs that improve or match the performance of existing LLM-based feature engineering methods, particularly excelling with classical decision trees. The results demonstrate that EFE can enhance both accuracy and interpretability in structured data tasks, showcasing the potential of LLM-based evolution in automating feature engineering.
Methodology
The EFE framework employs an evolutionary search process where LLMs propose transformations based on dataset metadata, summary statistics, and past trial outcomes. Each transformation is evaluated against a fixed downstream model, with validation performance feeding back into the evolutionary loop to refine candidate programs. EFE-Time focuses on evolving invertible normalization programs for time-series data, while EFE-Tab optimizes small, high-value feature programs for tabular data.
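A candidate program with a fit/transform interface and an inverse might look like the sketch below. The class name and the specific transformation (log1p followed by standardization) are invented for illustration and are not taken from the paper; the point is only that forecasts made in the normalized space can be mapped back to the original scale.

```python
import math


class LogStandardizer:
    """Hypothetical candidate program: log1p followed by standardization.

    Assumes inputs greater than -1 so that log1p is defined.
    """

    def fit(self, series):
        logged = [math.log1p(x) for x in series]
        self.mean_ = sum(logged) / len(logged)
        var = sum((v - self.mean_) ** 2 for v in logged) / len(logged)
        self.scale_ = math.sqrt(var) or 1.0  # fall back to 1.0 if constant
        return self

    def transform(self, series):
        return [(math.log1p(x) - self.mean_) / self.scale_ for x in series]

    def inverse_transform(self, values):
        return [math.expm1(v * self.scale_ + self.mean_) for v in values]


history = [3.0, 10.0, 45.0, 120.0, 400.0]
program = LogStandardizer().fit(history)
normalized = program.transform(history)
restored = program.inverse_transform(normalized)
```

In an evolutionary loop, each such program would be scored by the downstream model's validation error and the score fed back to propose the next candidates.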
Results
In experiments, EFE-Time demonstrated a reduction in forecasting errors by over 3% on average across datasets, with improvements up to 19% on the COVID-Deaths dataset. EFE-Tab achieved the best mean rank among compared feature-engineering methods, particularly excelling with decision trees, where evolved features provided competitive accuracy while maintaining interpretability.
Implications
The findings suggest that LLM-based evolution can significantly enhance feature engineering processes, making it easier to automate the discovery of effective preprocessing transformations. This has implications for improving model performance in various structured data tasks, potentially leading to more efficient and interpretable machine learning applications.
SABER: A Semantic-Aligned Brain Network Analysis Framework via Multi-scale Hypergraphs
Graph Learning
NLP
Large Language Models
- SABER integrates LLM-derived semantics directly into the brain network classification process.
- The framework employs multi-scale hypergraphs to capture complex interactions among brain regions.
- A decision-level semantic alignment mechanism allows for patient-specific semantic information to influence predictions.
- SABER outperforms existing methods on benchmark datasets, showcasing improved robustness and interpretability.
SABER: A Semantic-Aligned Brain Network Analysis Framework via Multi-scale Hypergraphs
Summary
The paper introduces SABER, a novel framework for brain network analysis that integrates high-level semantic knowledge from large language models (LLMs) into the decision-making process for brain disease diagnosis. Traditional methods often treat semantic information as auxiliary, limiting its impact on classification stability and robustness. SABER addresses this by incorporating ROI-level semantics through global self-attention, enriching node representations and providing a whole-brain context. The framework constructs multi-scale hypergraphs to model functional subnetworks and multi-ROI interactions, overcoming the limitations of conventional graph neural networks (GNNs) in capturing high-order dependencies. A decision-level semantic alignment mechanism is employed to inject patient-specific textual embeddings into graph representations, allowing semantics to guide predictions directly. Experiments conducted on public brain network datasets, ABIDE and ADHD-200, demonstrate that SABER achieves state-of-the-art performance, enhanced stability, and improved interpretability, particularly in scenarios with small sample sizes.
Methodology
The SABER framework consists of three main stages: (1) Multi-scale node-level brain network encoding, where ROI semantics are injected into node features using a global self-attention mechanism; (2) Construction of multi-scale hypergraphs to model high-order and multi-ROI interactions; (3) Implementation of a decision-level semantic alignment mechanism that integrates patient-specific textual embeddings into the graph representation for direct influence on predictions.
Results
SABER demonstrated superior performance compared to existing methods on the ABIDE and ADHD-200 datasets, particularly excelling in small-sample settings. The framework not only improved classification accuracy but also enhanced the interpretability of the results, allowing for better clinical insights.
Implications
The integration of semantic knowledge into brain network analysis could lead to more accurate and robust diagnostic tools for brain diseases, potentially improving clinical decision-making and patient outcomes. The framework may also inspire further research into the synergy between LLMs and neuroimaging data.
EPnG: Adaptive Expert Prune-and-Grow for Parameter-Efficient MoE Fine-tuning
NLP
Large Language Models
Efficient ML
- EPnG dynamically reallocates resources based on expert importance derived from routing dynamics.
- The framework prunes under-utilized experts and expands high-importance experts, maintaining a fixed parameter budget.
- EPnG outperforms traditional LoRA methods while updating significantly fewer parameters.
- The approach aligns parameter-efficient fine-tuning with the unique characteristics of MoE architectures.
EPnG: Adaptive Expert Prune-and-Grow for Parameter-Efficient MoE Fine-tuning
Summary
The paper introduces EPnG, an innovative framework designed to enhance the parameter-efficient fine-tuning of Mixture-of-Experts (MoE) models. Traditional fine-tuning methods, such as LoRA, fail to account for the unique routing dynamics of MoE architectures, leading to inefficient resource allocation and underutilization of experts. EPnG addresses this issue by implementing an adaptive prune-and-grow strategy that reallocates LoRA capacity based on the importance of experts, as determined by router gate probabilities. The framework prunes low-importance experts and expands the ranks of high-importance experts, all while adhering to a fixed parameter budget. The authors demonstrate that EPnG consistently outperforms static LoRA methods across various MoE models, achieving performance levels comparable to full fine-tuning while updating only a fraction of the parameters (0.55%–0.72%). This approach not only optimizes resource use but also aligns fine-tuning with the routing behavior of MoE, resulting in a more scalable and effective strategy for adapting large language models.
Methodology
EPnG employs a prune-and-grow mechanism where it first estimates expert importance from router gate probabilities collected during training. It prunes low-importance experts based on a defined threshold and reallocates the freed capacity to high-importance experts by expanding their LoRA ranks with orthogonal initialization. This process is repeated iteratively to optimize the allocation of parameters.
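A single prune-and-grow step under a fixed rank budget can be sketched as below. The greedy allocation rule, the function name, and the example numbers are assumptions made for illustration; the paper's importance estimate comes from router gate probabilities, and its orthogonal initialization of newly added ranks is not modelled here.

```python
def prune_and_grow(ranks, importance, threshold):
    """Move LoRA rank from low-importance experts to high-importance ones.

    Experts with importance below `threshold` are pruned to rank 0. The freed
    ranks are handed out one at a time to the surviving expert with the
    highest importance per allocated rank, so the total rank is unchanged.
    """
    new_ranks = [r if imp >= threshold else 0
                 for r, imp in zip(ranks, importance)]
    survivors = [i for i, imp in enumerate(importance) if imp >= threshold]
    if not survivors:
        return list(ranks)  # nothing to grow into; keep the allocation
    freed = sum(ranks) - sum(new_ranks)
    for _ in range(freed):
        best = max(survivors, key=lambda i: importance[i] / (new_ranks[i] + 1))
        new_ranks[best] += 1
    return new_ranks


# Hypothetical per-expert LoRA ranks and normalized routing importance.
ranks = [8, 8, 8, 8]
importance = [0.55, 0.30, 0.12, 0.03]
new_ranks = prune_and_grow(ranks, importance, threshold=0.1)
```

In this toy case the least-used expert loses its adapter and its eight ranks go to the most heavily routed expert, while the total of 32 ranks is preserved.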
Results
The experimental results show that EPnG consistently outperforms the static LoRA method under the same parameter budget across OLMoE and Qwen1.5-MoE models. EPnG achieves performance levels comparable to full fine-tuning while updating only 0.55%–0.72% of parameters, that is, roughly 140x to 180x fewer updated parameters.
Implications
The findings suggest that aligning parameter-efficient fine-tuning strategies with the routing dynamics of MoE models can lead to more effective adaptations of large language models, potentially influencing future research and applications in scalable model training and deployment.
SA-HGNN: Sample-Adaptive Hyperbolic Graph Neural Network for EEG-Based Depression Recognition
Graph Learning
- SA-HGNN dynamically constructs individualized brain network topologies to enhance representation accuracy.
- The use of hyperbolic geometry allows for better modeling of hierarchical structures in brain connectivity.
- An attention pooling mechanism effectively reduces noise in EEG signals, preserving essential topological features.
- The model outperforms traditional GNNs based on Euclidean metrics in EEG-based depression recognition tasks.
SA-HGNN: Sample-Adaptive Hyperbolic Graph Neural Network for EEG-Based Depression Recognition
Summary
The paper introduces the Sample-Adaptive Hyperbolic Graph Neural Network (SA-HGNN), a novel approach designed to enhance EEG-based depression recognition by accurately modeling the hierarchical structure of brain networks affected by Major Depressive Disorder (MDD). Traditional Graph Neural Networks (GNNs) struggle to capture the inherent hierarchical nature of brain connectivity due to their reliance on Euclidean space, which can distort complex relationships. SA-HGNN addresses this limitation through three core modules: a Sample-Adaptive Graph Construction module that dynamically creates personalized brain network topologies, a Hyperbolic Graph Convolution module that utilizes hyperbolic geometry to better represent hierarchical relationships, and an Attention Pooling module that filters out noise from EEG signals. The model's effectiveness is validated through extensive experiments on public EEG datasets, demonstrating superior performance in both resting-state and task-related paradigms, thus showcasing its robustness against noise and its ability to capture abnormal functional connectivity patterns in depressed patients.
Methodology
The methodology involves three main components: (1) Sample-Adaptive Graph Construction for personalized topology creation, (2) Hyperbolic Graph Convolution to leverage hyperbolic space for capturing hierarchical relationships, and (3) Attention Pooling to filter out noise from EEG signals, thereby enhancing the model's ability to represent the authentic structure of brain networks.
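To see why hyperbolic space suits hierarchical structure, consider distances in the Poincaré ball, one common model of hyperbolic geometry (the summary does not say which model SA-HGNN uses, so this choice is an assumption). The same Euclidean gap corresponds to a much larger hyperbolic distance near the boundary, which leaves room for many well-separated leaf nodes.

```python
import math


def squared_norm(x):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(a * a for a in x)


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    diff = squared_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - squared_norm(u)) * (1.0 - squared_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


# Two pairs of points with the same Euclidean separation of 0.05.
near_center = poincare_distance([0.0, 0.0], [0.05, 0.0])
near_boundary = poincare_distance([0.9, 0.0], [0.95, 0.0])
```

Here `near_boundary` is several times larger than `near_center`, even though both pairs are equally far apart in Euclidean terms.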
Results
The SA-HGNN demonstrated superior performance compared to traditional GNNs in recognizing depression from EEG data, achieving significant improvements in both resting-state and task-related paradigms. The model's robustness to noise and its efficacy in capturing abnormal functional connectivity patterns were validated through extensive experiments on public EEG datasets.
Implications
The findings suggest that SA-HGNN could serve as a powerful tool for automated depression diagnosis, potentially leading to earlier and more accurate identification of Major Depressive Disorder. This approach may also inspire further research into the application of hyperbolic geometry in other areas of neuroscience and mental health diagnostics.