gistml

By James Asher

Daily summaries of the latest machine learning research papers from arXiv.

2026-02-05 • Found 24 papers

Agentic AI-Empowered Dynamic Survey Framework

Furkan Mumcu, Lokman Bekit, Michael J. Jones, Anoop Cherian, Yasin Yilmaz
  • The paper formalizes survey writing as a long-term maintenance problem, treating surveys as dynamic, evolving documents.
  • The proposed framework uses agentic AI to incrementally update surveys, ensuring coherence and minimizing unnecessary disruptions.
  • Conservative update mechanisms, including abstention and lightweight validation, are introduced to prevent structural and stylistic drift.
  • A retrospective evaluation protocol is designed to systematically assess the framework's performance in routing accuracy, update quality, and document stability.
  • The framework addresses the growing challenge of maintaining up-to-date surveys in the face of rapid research output growth.
Abstract
This paper introduces the Agentic AI-Empowered Dynamic Survey Framework, a novel approach to addressing the challenges of maintaining survey papers in rapidly evolving research landscapes. Traditional survey papers often become outdated as new research emerges, leading to redundancy and fragmentation in the literature. The authors propose treating surveys as 'living documents' that evolve over time through continuous updates, rather than static artifacts capturing a single snapshot of the field. The framework leverages agentic AI systems to incrementally integrate new research into existing surveys while preserving their structure, coherence, and writing style. By decomposing the update process into distinct stages and enforcing conservative editing constraints, the framework minimizes disruption and ensures factual accuracy. A retrospective experimental protocol is designed to evaluate the framework's effectiveness, simulating real-world survey maintenance scenarios. The results demonstrate that the framework can effectively identify and incorporate new research while maintaining the stability and quality of the survey.
Methodology
The authors propose a structured framework that decomposes the survey update process into three stages: paper analysis, section routing, and conservative localized synthesis. The framework treats the survey as a persistent document state and uses agentic AI systems to integrate new research incrementally. Conservative editing constraints and abstention mechanisms are employed to minimize disruptions and ensure updates are accurate and relevant. A retrospective experimental protocol is used to simulate real-world survey maintenance by withholding portions of existing surveys and reintroducing them as new research.
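To make the routing-with-abstention idea concrete, here is a minimal sketch. The TF-IDF scorer, the threshold, and the data structures are illustrative assumptions, not the paper's actual agent design.

```python
# Illustrative sketch of a conservative survey-update step: score a new
# paper against each survey section and abstain when confidence is low.
# TF-IDF cosine similarity is an assumed stand-in for the paper's router.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ABSTAIN_THRESHOLD = 0.2  # hypothetical confidence cutoff

def route_paper(paper_text: str, sections: dict[str, str]) -> str | None:
    """Return the section title to update, or None to abstain."""
    titles = list(sections)
    vec = TfidfVectorizer().fit([paper_text, *sections.values()])
    paper_v = vec.transform([paper_text])
    section_v = vec.transform([sections[t] for t in titles])
    scores = cosine_similarity(paper_v, section_v)[0]
    best = scores.argmax()
    return titles[best] if scores[best] >= ABSTAIN_THRESHOLD else None

survey = {"Detection": "anomaly detection methods for video streams ...",
          "Generation": "generative modeling of video sequences ..."}
target = route_paper("a new transformer for video anomaly detection ...", survey)
print("route to:", target)  # None means abstain rather than force an edit
```

Abstaining on low-confidence papers is what keeps updates conservative: an uncertain match is deferred rather than synthesized into the wrong section.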
Results
The framework effectively identifies and integrates emerging research into existing surveys while preserving their structure and coherence. The retrospective evaluation demonstrates high routing accuracy, quality of updates, and minimal disruption to the original document. The framework's conservative mechanisms successfully prevent incorrect or out-of-scope edits, ensuring the stability of the survey over time.
Implications
The proposed framework has significant implications for the academic community, offering a scalable solution to maintaining up-to-date surveys in rapidly evolving fields. By reducing redundancy and fragmentation in the literature, it can improve knowledge synthesis and accessibility. The framework could also serve as a model for other dynamic document maintenance tasks, such as updating textbooks, technical reports, or policy documents.
View on arXiv

An Empirical Survey and Benchmark of Learned Distance Indexes for Road Networks

Gautam Choudhary, Libin Zhou, Yeasir Rayhan, Walid G. Aref
  • The paper systematically evaluates ten ML-based distance indexes for road networks, contrasting them with four classical baselines.
  • A unified encoder-decoder abstraction is introduced to structure the design space of learned distance indexes.
  • The benchmark uses real-world query workloads to evaluate methods on approximation error, preprocessing time, query latency, and storage requirements.
  • Learned indexes achieve significant query latency reductions but involve trade-offs in accuracy and training overhead.
  • An open-source codebase is provided to facilitate reproducibility and further research in this domain.
Abstract
This paper presents the first comprehensive empirical survey and benchmark of machine learning-based distance indexes for shortest-path distance estimation in road networks. While classical algorithms like Dijkstra’s provide exact solutions, their computational latency makes them impractical for large-scale, real-time applications. Learned distance indexes leverage machine learning techniques, such as neural networks, graph neural networks, and tree-based models, to approximate shortest-path distances with reduced storage and faster query times. The authors evaluate ten ML-based methods alongside four classical baselines across seven real-world road networks, focusing on dimensions such as training time, query latency, storage overhead, and accuracy. The study introduces a unified encoder-decoder abstraction to conceptualize learned distance indexes and employs a workload-driven benchmark derived from real-world query datasets. Key insights into trade-offs between accuracy, efficiency, and scalability are provided, along with an open-source codebase to support reproducibility and future research.
Methodology
The authors benchmark ten ML-based distance indexes, including neural networks, graph neural networks, and gradient-boosted trees, against four classical baselines (three approximate indexes and one exact index). They use seven real-world road networks and workload-driven query datasets derived from trajectory data. The evaluation is conducted along key dimensions such as accuracy, query latency, preprocessing time, and storage overhead. A unified encoder-decoder abstraction is introduced to conceptualize the design of learned distance indexes.
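As a concrete instance of the encoder-decoder abstraction (with illustrative sizes and an embedding-table encoder; the surveyed methods vary in both components):

```python
# Generic encoder-decoder learned distance index: an embedding table
# encodes nodes, and a small MLP decodes a pair of embeddings into an
# approximate shortest-path distance. Sizes are illustrative only.
import torch
import torch.nn as nn

class LearnedDistanceIndex(nn.Module):
    def __init__(self, num_nodes: int, dim: int = 64):
        super().__init__()
        self.encoder = nn.Embedding(num_nodes, dim)        # node -> vector
        self.decoder = nn.Sequential(                      # pair -> distance
            nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.encoder(src), self.encoder(dst)], dim=-1)
        return self.decoder(z).squeeze(-1)

model = LearnedDistanceIndex(num_nodes=10_000)
src = torch.randint(0, 10_000, (32,))
dst = torch.randint(0, 10_000, (32,))
pred = model(src, dst)  # train against exact distances (e.g., MSE loss)
print(pred.shape)
```

Training pairs would come from exact shortest-path distances on sampled node pairs; in the benchmark, the workload-driven query datasets play that role.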
Results
The study reveals that learned distance indexes can answer queries in as little as tens of nanoseconds, significantly outperforming classical methods in speed. However, these gains come with trade-offs in approximation error and training overhead. The benchmark highlights the scalability of ML-based approaches to large road networks and provides insights into their relative strengths and weaknesses compared to classical baselines.
Implications
The findings have significant implications for real-time navigation systems, location-based services, and spatial analytics. By leveraging learned distance indexes, applications can achieve faster query processing with reduced storage requirements, enabling scalability to large road networks. The open-source codebase facilitates future research and development in this area, potentially driving innovation in transportation and urban planning systems.
View on arXiv

Causal Discovery for Cross-Sectional Data Based on Super-Structure and Divide-and-Conquer

Wenyu Wang, Yaping Wan
  • Introduces a lightweight causal discovery framework that uses weakly constrained Super-Structures for efficient graph partitioning.
  • Shifts the focus from high-recall Super-Structures to high-precision scaffolds, reducing computational costs without sacrificing accuracy.
  • Empirical validation shows substantial reductions in CI tests while maintaining competitive structural accuracy on synthetic and real-world datasets.
  • Demonstrates practical applicability in large-scale domains like biomedical and social science research.
  • Establishes a scalable approach to causal discovery under minimal assumptions about initial graph structures.
Abstract
This paper addresses the computational challenges in causal discovery for cross-sectional data, particularly in constructing accurate Super-Structures for divide-and-conquer approaches. The authors propose a novel framework that relaxes the strict requirements on Super-Structure construction, focusing on high precision rather than high recall. By integrating weakly constrained Super-Structures with efficient graph partitioning and merging strategies, the framework reduces the computational overhead associated with conditional independence (CI) tests while maintaining competitive accuracy. The proposed algorithm is validated on synthetic benchmarks and real-world datasets, including the China Health and Retirement Longitudinal Study (CHARLS). Results demonstrate that the method achieves structural accuracy comparable to established algorithms like PC and FCI, while significantly reducing the number of CI tests required. This work opens new possibilities for scalable causal discovery in large-scale, knowledge-scarce domains such as biomedical and social sciences.
Methodology
The proposed framework integrates weakly constrained Super-Structures with divide-and-conquer strategies for causal discovery. It uses graph partitioning and merging techniques to decompose high-dimensional data into smaller subsets, enabling efficient local causal discovery. The algorithm prioritizes precision in Super-Structure construction to reduce CI test overhead. Validation is performed on synthetic Gaussian Bayesian networks and real-world datasets, comparing structural accuracy and computational efficiency against established methods like PC and FCI.
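A compact sketch of the divide-and-conquer pattern under stated assumptions: a correlation threshold stands in for the paper's high-precision Super-Structure construction, and the local step is a placeholder rather than a full PC/FCI run.

```python
# Sketch of the pattern: build a high-precision scaffold, partition it,
# discover locally within each part, and merge. All choices here
# (threshold, community partitioning) are illustrative assumptions.
import numpy as np
import networkx as nx

def super_structure(X: np.ndarray, thresh: float = 0.3) -> nx.Graph:
    """Keep only strongly correlated pairs (precision over recall)."""
    corr = np.corrcoef(X.T)
    g = nx.Graph()
    g.add_nodes_from(range(X.shape[1]))
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            if abs(corr[i, j]) > thresh:
                g.add_edge(i, j)
    return g

def discover(X: np.ndarray) -> nx.Graph:
    scaffold = super_structure(X)
    merged = nx.Graph()
    merged.add_nodes_from(scaffold)
    # Partition into communities, then discover locally inside each part;
    # the subgraph copy below is a placeholder for a local PC-style search.
    for part in nx.community.greedy_modularity_communities(scaffold):
        sub = scaffold.subgraph(part)
        merged.add_edges_from(sub.edges)
    return merged

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:, 1] += 0.8 * X[:, 0]  # inject one dependency
print(discover(X).edges)
```

Because CI tests only run inside small partitions, the total number of tests grows with the size of the largest part rather than with the full variable set, which is where the reported savings come from.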
Results
The framework achieves structural accuracy comparable to PC and FCI algorithms while drastically reducing the number of CI tests required. Experiments on synthetic datasets (e.g., magic-NIAB, ECOLI70, magic-IRRI) and the CHARLS dataset confirm its scalability and practical applicability. The method demonstrates significant computational savings with minimal loss in fidelity, making it suitable for large-scale causal discovery tasks.
Implications
This work enables scalable causal discovery in domains with limited domain knowledge and high-dimensional data, such as biomedical and social sciences. By reducing computational costs, the framework can facilitate the analysis of large datasets, improving the feasibility of causal inference in real-world applications. It also provides a foundation for further research into efficient causal discovery methods that leverage weakly constrained graph structures.
View on arXiv

Child Mortality Prediction in Bangladesh: A Decade-Long Validation Study

Md Muhtasim Munif Fahim, Md Rezaul Karim
  • Temporal validation was implemented to avoid look-ahead bias, using training data from 2011–2014, validation data from 2017, and testing data from 2022.
  • Neural Architecture Search identified a simple single-layer neural network that outperformed XGBoost in predicting child mortality (AUROC 0.76 vs. 0.73).
  • The model exhibited a 'Socioeconomic Predictive Gradient,' with stronger performance in poorer regions (AUROC 0.74 in Sylhet/Rangpur) compared to wealthier regions (AUROC 0.66 in Dhaka/Khulna).
  • At a 10% screening threshold, the model could identify approximately 1,300 additional at-risk children annually compared to traditional methods.
  • The study highlights the importance of equity-focused machine learning models for addressing structural mortality risk factors in low-resource settings.
Abstract
This study addresses the challenge of predicting child mortality in Bangladesh using machine learning models validated over a decade-long temporal framework. The authors highlight the issue of 'look-ahead bias' in traditional random cross-validation approaches and propose a strict temporal validation methodology to ensure realistic performance estimates. Using data from the Bangladesh Demographic and Health Surveys (BDHS) spanning 2011–2022 (n=33,962 births), the study employs Neural Architecture Search (NAS) optimized via genetic algorithms to identify an effective model architecture. The resulting single-layer neural network (64 units) outperformed traditional gradient boosting methods (AUROC 0.76 vs. 0.73). A notable finding is the 'Socioeconomic Predictive Gradient,' where the model demonstrated higher predictive accuracy in poorer regions compared to wealthier ones, suggesting structural risk factors are more detectable in under-resourced settings. The study provides a robust, equity-focused tool for identifying at-risk children, enabling targeted maternal and child health interventions in resource-constrained regions.
Methodology
The study utilized BDHS data collected over four survey periods (2011, 2014, 2017, and 2022) and divided it temporally to simulate real-world deployment scenarios. Neural Architecture Search (NAS) optimized via genetic algorithms was employed to identify the best-performing model architecture. Feature engineering transformed raw survey data into clinically relevant predictors based on prior epidemiological studies. The model was evaluated using AUROC and fairness audits across Bangladesh's eight administrative divisions.
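The temporal split itself is simple to express; a minimal pandas sketch with hypothetical column names:

```python
# Leakage-free temporal split as described above: train on the 2011 and
# 2014 waves, validate on 2017, test on 2022. Column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "survey_year": [2011, 2014, 2017, 2022] * 3,
    "feature": range(12),
    "died_before_5": [0, 1] * 6,
})

train = df[df["survey_year"].isin([2011, 2014])]  # fit models here only
val = df[df["survey_year"] == 2017]               # tune NAS / thresholds
test = df[df["survey_year"] == 2022]              # report final AUROC once
print(len(train), len(val), len(test))
```

The point of the split is that nothing fitted on the training waves ever sees later survey years, which is exactly the look-ahead bias that random cross-validation would introduce.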
Results
The single-layer neural network achieved an AUROC of 0.76, outperforming XGBoost (AUROC 0.73, p<0.01). The model demonstrated higher predictive accuracy in poorer regions (AUROC 0.74 in Sylhet/Rangpur) compared to wealthier regions (AUROC 0.66 in Dhaka/Khulna). At a 10% screening threshold, the model identified approximately 1,300 additional at-risk children annually compared to gradient boosting approaches.
Implications
This study provides a production-ready machine learning tool for identifying at-risk children in resource-constrained settings, enabling targeted maternal and child health interventions. The findings also emphasize the need for equity-focused models that account for socioeconomic disparities in predictive performance. The methodology can be applied to other global health challenges requiring robust temporal validation and fairness audits.
View on arXiv

Dynamical Regimes of Multimodal Diffusion Models

Emil Albrychiewicz, Andrés Franco Valiente, Li-Ching Chen
  • Introduces a theoretical framework for coupled diffusion models using Ornstein-Uhlenbeck processes to study multimodal generation dynamics.
  • Identifies the 'synchronization gap,' a temporal hierarchy in the reverse generative process where eigenmodes stabilize at different rates.
  • Derives analytical conditions for speciation and collapse times, providing bounds on coupling strength to prevent unstable symmetry breaking.
  • Demonstrates that coupling strength acts as a spectral filter, enforcing temporal hierarchies in multimodal generation.
  • Validates theoretical predictions through controlled experiments on MNIST datasets and exact score samplers.
Abstract
This paper provides a theoretical framework for understanding the dynamics of multimodal diffusion models using coupled Ornstein-Uhlenbeck (OU) processes. The authors analyze the reverse generative process in multimodal settings, identifying a 'synchronization gap'—a temporal window where different eigenmodes stabilize at distinct rates. This gap explains desynchronization artifacts commonly observed in multimodal generation. The study derives analytical conditions for speciation and collapse times under symmetric and anisotropic coupling regimes, offering strict bounds on coupling strength to avoid unstable symmetry breaking. The coupling strength is shown to act as a spectral filter, enforcing a tunable temporal hierarchy on the generation process. Controlled experiments on MNIST datasets and exact score samplers validate the theoretical predictions. The findings suggest that time-dependent coupling schedules targeting mode-specific timescales could improve multimodal generation, offering an alternative to heuristic guidance tuning.
Methodology
The authors model multimodal diffusion processes using coupled Ornstein-Uhlenbeck processes, leveraging nonequilibrium statistical physics and dynamical phase transitions. They derive analytical solutions for symmetric and anisotropic coupling regimes, using tools like random energy models (REM) and spectral analysis. Controlled experiments on MNIST datasets and exact score samplers are conducted to validate the theoretical predictions.
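The underlying object is easy to simulate; below is an Euler-Maruyama sketch of two symmetrically coupled OU processes with illustrative constants, showing how coupling strength separates the relaxation rates of the eigenmodes.

```python
# Euler-Maruyama simulation of two coupled Ornstein-Uhlenbeck processes,
# the basic object the analysis builds on. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
theta, kappa, sigma, dt, steps = 1.0, 0.5, 0.3, 1e-3, 10_000

x = np.zeros(steps)
y = np.zeros(steps)
x[0], y[0] = 1.0, -1.0
for t in range(steps - 1):
    dw_x, dw_y = rng.normal(scale=np.sqrt(dt), size=2)
    x[t + 1] = x[t] + (-theta * x[t] + kappa * (y[t] - x[t])) * dt + sigma * dw_x
    y[t + 1] = y[t] + (-theta * y[t] + kappa * (x[t] - y[t])) * dt + sigma * dw_y

# The eigenmodes relax at different rates: (x + y)/2 decays at rate theta,
# (x - y)/2 at rate theta + 2*kappa, so stronger coupling widens the gap
# between mode timescales, which is the temporal hierarchy described above.
print(x[-1], y[-1])
```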
Results
The study identifies a 'synchronization gap' during the reverse generative process, where different eigenmodes stabilize at distinct rates. Analytical conditions for speciation and collapse times are derived, showing how coupling strength influences the temporal hierarchy of generation. Experimental results on MNIST datasets confirm the theoretical predictions, demonstrating the impact of coupling strength on multimodal synchronization and desynchronization artifacts.
Implications
The findings provide a deeper theoretical understanding of multimodal diffusion models, offering insights into the dynamics of multimodal generation. The proposed framework could guide the design of time-dependent coupling schedules to improve the quality of multimodal outputs, potentially replacing heuristic guidance tuning. This has applications in text-to-image, audio-video synthesis, and other multimodal generative tasks.
View on arXiv

Echo State Networks for Time Series Forecasting: Hyperparameter Sweep and Benchmarking

Alexander Häußer
  • Echo State Networks (ESNs) are evaluated for univariate time series forecasting using the M4 dataset, focusing on monthly and quarterly series with up to 20 years of historical data.
  • A hyperparameter sweep of over four million ESN configurations reveals interpretable patterns, with optimal settings varying by data frequency and temporal resolution.
  • ESNs achieve competitive forecasting accuracy, outperforming statistical models like ARIMA and TBATS for quarterly data and matching their performance for monthly data.
  • The ESN framework is computationally efficient, making it suitable for large-scale, automated forecasting tasks.
  • The study demonstrates the practical applicability of ESNs in business and economic forecasting scenarios with short historical data.
Abstract
This paper explores the use of Echo State Networks (ESNs) for univariate time series forecasting, focusing on monthly and quarterly data from the M4 Forecasting Competition dataset. ESNs, a type of reservoir computing model, offer a balance between computational efficiency and predictive accuracy by using a fixed, randomly initialized reservoir and training only a linear readout layer. The study conducts a large-scale hyperparameter sweep, evaluating over four million ESN configurations to optimize parameters such as leakage rate, spectral radius, reservoir size, and regularization criteria. Forecast accuracy is assessed using MASE and sMAPE metrics and benchmarked against statistical models like ARIMA, ETS, and TBATS, as well as naive methods. Results show that ESNs perform competitively, achieving the lowest mean MASE for quarterly data and comparable accuracy to ARIMA and TBATS for monthly data, while requiring less computational effort. The findings highlight ESNs as a robust and scalable option for automated time series forecasting, particularly in scenarios with limited historical data.
Methodology
The study employs a two-stage evaluation approach. First, a hyperparameter sweep is conducted on a dedicated tuning split (the Parameter dataset) to optimize ESN configurations, including leakage rate, spectral radius, reservoir size, and regularization criteria. Second, out-of-sample forecasting accuracy is assessed on a disjoint evaluation split (the Forecast dataset) using standardized metrics (MASE and sMAPE). Preprocessing steps include stationarity testing, differencing, and scaling. Forecasts are generated recursively over standard M4 horizons (18 months for monthly data and 8 quarters for quarterly data). ESN performance is benchmarked against naive methods and statistical models like ARIMA, ETS, and TBATS.
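A minimal leaky-integrator ESN exposing the swept hyperparameters (leakage rate, spectral radius, reservoir size, ridge regularization); the constants are examples, not the tuned values from the sweep.

```python
# Minimal echo state network with a ridge readout. Only the linear readout
# is trained; the reservoir is fixed after random initialization.
import numpy as np

rng = np.random.default_rng(42)
n_res, leak, rho, ridge = 200, 0.9, 0.8, 1e-6   # swept hyperparameters

W_in = rng.uniform(-0.5, 0.5, (n_res,))
W = rng.normal(size=(n_res, n_res))
W *= rho / np.abs(np.linalg.eigvals(W)).max()    # set spectral radius to rho

def run_reservoir(u: np.ndarray) -> np.ndarray:
    states = np.zeros((len(u), n_res))
    x = np.zeros(n_res)
    for t, u_t in enumerate(u):
        pre = np.tanh(W_in * u_t + W @ x)
        x = (1 - leak) * x + leak * pre           # leaky integration
        states[t] = x
    return states

series = np.sin(np.linspace(0, 20 * np.pi, 400))  # toy periodic series
S = run_reservoir(series[:-1])
y = series[1:]                                    # one-step-ahead targets
W_out = np.linalg.solve(S.T @ S + ridge * np.eye(n_res), S.T @ y)
print("next value:", S[-1] @ W_out)  # recurse for multi-step M4 horizons
```

The cheapness of the method is visible here: fitting is a single ridge regression, which is why sweeping millions of configurations is feasible.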
Results
The hyperparameter sweep reveals that monthly series favor moderately persistent reservoirs, while quarterly series prefer more contractive dynamics. High leakage rates are optimal across both frequencies, with spectral radii and reservoir sizes varying by temporal resolution. In out-of-sample evaluations, ESNs perform on par with ARIMA and TBATS for monthly data and achieve the lowest mean MASE for quarterly data. ESNs also demonstrate lower computational costs compared to complex statistical models.
Implications
The findings position Echo State Networks as a viable alternative to traditional statistical methods for automated time series forecasting, particularly in business and economic applications with short historical data. Their computational efficiency and robustness make them suitable for large-scale forecasting tasks, addressing the growing demand for scalable and automated solutions in data-driven decision-making.
View on arXiv

Forget to Generalize: Iterative Adaptation for Generalization in Federated Learning

Abdulrahman Alotaibi, Irene Tenison, Miriam Kim, Isaac Lee, Lalana Kagal
  • Introduces Iterative Federated Adaptation (IFA), a 'forget and evolve' training paradigm for federated learning.
  • Addresses the challenge of non-IID data distributions by periodically resetting a fraction of model parameters.
  • Demonstrates significant improvements in global accuracy (average 21.5%) across three datasets: CIFAR-10, MIT-Indoors, and Stanford Dogs.
  • IFA can be integrated with any existing federated learning algorithm to enhance generalization performance.
  • The method explicitly combats representational over-specialization and encourages learning globally relevant features.
Abstract
This paper introduces Iterative Federated Adaptation (IFA), a novel training paradigm designed to improve generalization in federated learning (FL) under non-IID (not independent and identically distributed) client data distributions. Federated learning enables privacy-preserving, decentralized model training across diverse user devices and datasets, but its performance often suffers in real-world scenarios where data distributions vary significantly across clients. IFA addresses this challenge by introducing a 'forget and evolve' strategy, where training is divided into multiple generations. At the end of each generation, a fraction of the model parameters is re-initialized, either randomly or by targeting the later layers of the model. This periodic resetting helps the model escape local minima, shed client-specific biases, and improve global generalization. The approach is evaluated on three datasets—CIFAR-10, MIT-Indoors, and Stanford Dogs—demonstrating an average improvement of 21.5% in global accuracy compared to baseline methods. IFA is designed to be a plug-and-play enhancement that can be integrated with existing federated learning algorithms, making it a practical solution for improving performance in heterogeneous and distributed web systems.
Methodology
The authors propose dividing the federated training process into sequential generations. At the end of each generation, a fraction of the model parameters is re-initialized using one of two strategies: (1) Random Parameter Selection, where parameters are randomly chosen, or (2) Later Layer Parameter Selection, where parameters from the later layers (closer to the classification head) are reset. This iterative 'forget and evolve' process allows the model to avoid overfitting to client-specific data and promotes learning generalizable features. The method is evaluated using standard federated learning benchmarks and compared against existing approaches.
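The two reset strategies are straightforward to sketch in PyTorch; the fractions and the layer-order heuristic below are illustrative, and the paper defines its own selection details.

```python
# Sketch of the two 'forget and evolve' re-initialization strategies,
# called at the end of each federated generation. Values are illustrative.
import torch
import torch.nn as nn

def reset_random_fraction(model: nn.Module, frac: float = 0.2) -> None:
    """Randomly re-initialize a fraction of all parameter entries."""
    for p in model.parameters():
        mask = torch.rand_like(p) < frac
        p.data[mask] = torch.randn_like(p)[mask] * 0.02

def reset_later_layers(model: nn.Module, frac: float = 0.2) -> None:
    """Re-initialize the last `frac` of parameter tensors (near the head)."""
    params = list(model.parameters())
    for p in params[int(len(params) * (1 - frac)):]:
        nn.init.normal_(p, std=0.02)

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
reset_later_layers(model)  # e.g., between generation k and generation k+1
```

Targeting later layers is the more surgical option: the layers closest to the classification head are the ones most prone to absorbing client-specific biases.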
Results
The proposed IFA method achieves an average improvement of 21.5% in global accuracy across three datasets (CIFAR-10, MIT-Indoors, and Stanford Dogs) under non-IID data distributions. The results demonstrate that IFA effectively mitigates client drift and enhances generalization performance, outperforming baseline federated learning algorithms.
Implications
IFA has significant implications for real-world federated learning applications, particularly in scenarios with highly heterogeneous data distributions, such as personalized recommendation systems, mobile applications, and distributed web services. By improving generalization and reducing client-specific biases, IFA enables more robust and scalable privacy-preserving machine learning systems.
View on arXiv

From Sparse Sensors to Continuous Fields: STRIDE for Spatiotemporal Reconstruction

Yanjie Tong, Peng Chen
  • STRIDE combines a temporal encoder and a modulated implicit neural representation (INR) decoder to reconstruct continuous spatiotemporal fields from sparse sensor data.
  • The framework is resolution- and discretization-invariant, enabling application to irregular meshes and super-resolution tasks.
  • The use of the FMMNN backbone improves the representation of complex spatial fields and ensures stable optimization compared to sine-activated INRs.
  • Theoretical justification is provided under a stable delay-observability assumption, supporting the architecture's design and effectiveness.
  • Extensive experiments show STRIDE's robustness to noise and superior performance across multiple challenging benchmarks.
Abstract
This paper introduces STRIDE (Spatio-Temporal Recurrent Implicit DEcoder), a novel two-stage framework for reconstructing high-dimensional spatiotemporal fields from sparse point-sensor measurements. STRIDE addresses challenges in learning parametric partial differential equation (PDE) dynamics, particularly in scenarios with sparse sensing, irregular meshes, and varying resolutions. The framework consists of a temporal encoder that maps a short history of sensor measurements to a latent state and a modulated implicit neural representation (INR) decoder that reconstructs the field at arbitrary spatial locations. STRIDE leverages the Fourier Multi-Component and Multi-Layer Neural Network (FMMNN) as the INR backbone, which enhances the representation of complex spatial fields and improves optimization stability compared to sine-based INRs. The authors provide a theoretical justification for STRIDE under a stable delay-observability condition, showing that the reconstruction operator factors through a finite-dimensional embedding. Extensive experiments on four benchmarks (chaotic dynamics, fluid flow, shallow water, and seismic wave propagation) demonstrate STRIDE's superior performance in sparse sensing, super-resolution, and robustness to noise, outperforming existing methods.
Methodology
STRIDE is a two-stage framework. In the first stage, a temporal encoder (e.g., LSTM) processes a short window of sparse sensor measurements to produce a latent state. In the second stage, a modulated implicit neural representation (INR) decoder, built on the Fourier Multi-Component and Multi-Layer Neural Network (FMMNN), reconstructs the spatiotemporal field at arbitrary query locations. The model is trained using randomized spatial sampling to ensure resolution-invariant decoding. Theoretical analysis is provided to justify the architecture under stable delay-observability conditions.
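A stripped-down sketch of the two-stage design, with random Fourier features standing in for the FMMNN backbone (an assumption made here for brevity):

```python
# Two-stage sketch: an LSTM encoder maps a window of sparse sensor readings
# to a latent state; a coordinate decoder conditioned on that latent predicts
# the field at arbitrary (x, y) query points. Sizes are illustrative.
import torch
import torch.nn as nn

class Stride(nn.Module):
    def __init__(self, n_sensors: int, latent: int = 64, n_freq: int = 32):
        super().__init__()
        self.encoder = nn.LSTM(n_sensors, latent, batch_first=True)
        self.register_buffer("B", torch.randn(2, n_freq) * 10.0)  # coord freqs
        self.decoder = nn.Sequential(
            nn.Linear(2 * n_freq + latent, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, obs: torch.Tensor, xy: torch.Tensor) -> torch.Tensor:
        # obs: (batch, window, n_sensors); xy: (batch, n_query, 2) in [0, 1]
        _, (h, _) = self.encoder(obs)
        feats = torch.cat([torch.sin(xy @ self.B), torch.cos(xy @ self.B)], -1)
        z = h[-1].unsqueeze(1).expand(-1, xy.shape[1], -1)
        return self.decoder(torch.cat([feats, z], -1)).squeeze(-1)

model = Stride(n_sensors=16)
field = model(torch.randn(4, 10, 16), torch.rand(4, 100, 2))
print(field.shape)  # (4, 100): field values at arbitrary query locations
```

Because the decoder takes continuous coordinates rather than a grid, the same trained model can be queried at any resolution or on irregular meshes, which is the source of the discretization invariance.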
Results
STRIDE outperforms strong baselines on four benchmarks involving chaotic dynamics, fluid flow, shallow water, and seismic wave propagation. It demonstrates superior performance in scenarios with extremely sparse sensing, supports super-resolution, and remains robust to noise. Ablation studies confirm the effectiveness of the FMMNN backbone, temporal encoder choices, and the model's ability to generalize across spatial resolutions and irregular meshes.
Implications
STRIDE has significant implications for scientific and engineering applications where high-dimensional spatiotemporal fields need to be reconstructed from sparse measurements. Its resolution-invariant decoding and robustness to noise make it suitable for tasks such as weather modeling, fluid dynamics simulations, seismic analysis, and real-time monitoring of physical systems governed by PDEs. The framework's ability to generalize across parameter settings and trajectories also makes it a promising tool for data-driven surrogate modeling in computational physics and engineering.
View on arXiv

GeoIB: Geometry-Aware Information Bottleneck via Statistical-Manifold Compression

Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Shui Yu
  • GeoIB reformulates the Information Bottleneck (IB) principle using statistical manifold geometry, avoiding the need for mutual information estimation.
  • The method introduces two geometry-aware penalties: the Fisher–Rao (FR) discrepancy and the Jacobian–Frobenius (JF) penalty, which together ensure stable and faithful compression.
  • A natural-gradient optimizer consistent with the FR metric is derived, providing first-order equivalence to geodesic updates.
  • GeoIB outperforms traditional IB baselines in accuracy-compression trade-offs and demonstrates robustness under strong compression.
  • The approach unifies distributional and geometric regularization under a single bottleneck multiplier, improving optimization stability.
Abstract
This paper introduces the Geometric Information Bottleneck (GeoIB), a novel reformulation of the Information Bottleneck (IB) principle that leverages statistical manifold geometry to improve stability, robustness, and controllability in representation learning under strong compression. Unlike traditional IB methods that rely on variational bounds or neural mutual information (MI) estimators, GeoIB eliminates the need for explicit MI estimation by reframing IB objectives as minimal Kullback-Leibler (KL) distances to independence manifolds. GeoIB incorporates two complementary components: (i) a distribution-level Fisher–Rao (FR) discrepancy, which is reparameterization-invariant and matches KL to second order, and (ii) a geometry-level Jacobian–Frobenius (JF) penalty, which discourages pullback volume expansion of the encoder. Additionally, the authors propose a natural-gradient optimization method consistent with the FR metric, ensuring stable and geometry-aware updates. Empirical evaluations demonstrate that GeoIB achieves superior accuracy-compression trade-offs compared to state-of-the-art IB baselines, particularly in high-compression regimes, while improving optimization stability and invariance.
Methodology
GeoIB reframes the IB problem by interpreting mutual information terms as minimal KL distances to independence manifolds. It introduces two complementary penalties: (i) the Fisher–Rao (FR) discrepancy, which provides a distribution-level regularization invariant to reparameterizations, and (ii) the Jacobian–Frobenius (JF) penalty, which controls local capacity by penalizing pullback volume expansion of the encoder. A natural-gradient optimizer consistent with the FR metric is developed to ensure stable and geometry-aware updates. The method is empirically validated on popular datasets against state-of-the-art IB baselines.
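One plausible reading of the JF penalty is the squared Frobenius norm of the encoder's Jacobian at each input; the sketch below implements that reading and should not be taken as the paper's exact estimator.

```python
# Jacobian-Frobenius penalty sketch: penalize the squared Frobenius norm of
# the encoder Jacobian, discouraging local volume expansion. This is an
# assumed instantiation, not the paper's exact estimator.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 4))

def jf_penalty(x: torch.Tensor) -> torch.Tensor:
    jac = torch.autograd.functional.jacobian(encoder, x)  # shape (4, 8)
    return (jac ** 2).sum()

x = torch.randn(8)
print(jf_penalty(x))  # add beta * jf_penalty(x) to the bottleneck objective
```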
Results
GeoIB achieves a better trade-off between prediction accuracy and compression ratio in the information plane compared to mainstream IB baselines. It demonstrates improved robustness and stability in high-compression regimes, outperforming traditional methods in both utility and compression. The experiments confirm that GeoIB's geometry-aware approach enhances invariance and optimization stability.
Implications
GeoIB has potential applications in tasks requiring robust and stable representation learning under strong compression, such as privacy-preserving machine learning, domain generalization, and efficient neural network training. Its geometry-aware framework could inspire further research into leveraging statistical manifold geometry for other machine learning problems.
View on arXiv

Greedy-Gnorm: A Gradient Matrix Norm-Based Alternative to Attention Entropy for Head Pruning

Yuxi Guo, Paul Sheridan
  • Greedy-Gnorm dynamically recalculates head importance during pruning using gradient-based scores, addressing the limitations of static scoring methods.
  • The algorithm uses the ℓ2-norms of Q/K/V gradient matrices to compute head importance, ensuring gradient-informed pruning decisions.
  • Greedy-Gnorm outperforms attention entropy (AE) and random baselines in preserving accuracy across multiple transformer architectures.
  • The method introduces an ε-rectified entropy variant to stabilize AE-based methods and prevent numerical instability.
  • Greedy-Gnorm enables substantial model compression while maintaining high task performance, supporting energy-efficient transformer deployment.
Abstract
This paper introduces Greedy-Gnorm, a novel attention head pruning algorithm for transformer models that dynamically recalculates head importance during pruning. Unlike traditional methods that rely on static importance scores, Greedy-Gnorm uses the elementwise product of the ℓ2-norms of the Q/K/V gradient matrices, estimated from a hold-out validation set, to dynamically update head importance after each pruning step. This approach addresses the limitations of static scoring methods and improves pruning decisions by adapting to evolving model gradients. The authors validate their method on four transformer architectures—BERT, ALBERT, RoBERTa, and XLM-RoBERTa—and demonstrate that Greedy-Gnorm consistently preserves task accuracy under significant head removal. The method outperforms the commonly used attention entropy (AE) approach and a random pruning baseline, achieving smoother pruning trajectories and higher accuracy at equivalent pruning rates. By reducing model size while maintaining performance, Greedy-Gnorm contributes to the development of energy-efficient transformer models, aligning with the goals of Green AI.
Methodology
Greedy-Gnorm employs a dynamic, gradient-driven pruning strategy that recalculates head importance after each pruning step. The importance of each attention head is scored using the elementwise product of the ℓ2-norms of the Q/K/V gradient matrices, estimated from a hold-out validation set. The algorithm iteratively prunes the least important heads and updates the importance scores to reflect changes in model gradients. To address numerical instability in attention entropy (AE) methods, the authors introduce an ε-rectified entropy variant, which adds a small constant to attention probabilities to prevent log(0) errors.
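The per-head score is easy to sketch once the validation-set gradients are in hand; the head-major slicing of the projection weights below is an assumption about one common transformer layout.

```python
# Per-head Greedy-Gnorm score sketch: the product of the l2 (Frobenius)
# norms of each head's Q, K, and V gradient slices, computed on a hold-out
# set. The weight layout (head-major rows) is an assumed convention.
import torch

def head_scores(grad_q: torch.Tensor, grad_k: torch.Tensor,
                grad_v: torch.Tensor, n_heads: int) -> torch.Tensor:
    def per_head_norm(g: torch.Tensor) -> torch.Tensor:
        return g.view(n_heads, -1).norm(dim=1)  # l2 norm per head slice
    return per_head_norm(grad_q) * per_head_norm(grad_k) * per_head_norm(grad_v)

grads = [torch.randn(768, 768) for _ in range(3)]  # stand-in Q/K/V gradients
scores = head_scores(*grads, n_heads=12)
prune = scores.argmin()  # greedily drop the lowest-scoring head, then rescore
print(scores.shape, int(prune))
```

The "greedy" part is the loop around this: after each head is removed, gradients are re-estimated and scores recomputed, so pruning decisions track the evolving model rather than a stale static ranking.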
Results
Greedy-Gnorm was evaluated on BERT, ALBERT, RoBERTa, and XLM-RoBERTa. The method consistently preserved task accuracy under significant head pruning. For example, BERT retained 90.08% accuracy after pruning 80% of its attention heads, compared to 96.82% accuracy before pruning. Greedy-Gnorm demonstrated smoother and more reliable pruning trajectories compared to AE and random baselines, achieving higher accuracy at equivalent pruning rates.
Implications
Greedy-Gnorm provides a robust and efficient method for transformer model compression, enabling significant reductions in model size and computational cost without sacrificing task performance. This makes it particularly valuable for deploying large-scale transformer models in resource-constrained environments, contributing to the broader goals of Green AI and sustainable machine learning.
View on arXiv

Hand Gesture Recognition from Doppler Radar Signals Using Echo State Networks

Towa Sano, Gouhei Tanaka
  • This is the first application of Echo State Networks (ESNs) to radar-based hand gesture recognition (HGR).
  • A multi-reservoir ESN architecture is proposed to process heterogeneous feature maps independently, reducing interference and improving performance.
  • The method achieves 98.84% accuracy on the Soli dataset, outperforming existing methods while operating with lower computational costs.
  • The approach is validated using multiple evaluation settings, including cross-validation and leave-one-subject-out testing.
  • The method is particularly suited for resource-constrained environments, such as edge devices, due to its lightweight computational requirements.
Abstract
This paper introduces an Echo State Network (ESN)-based approach for hand gesture recognition (HGR) using Doppler radar signals, specifically frequency-modulated continuous-wave (FMCW) radar. The proposed method addresses the high computational cost of deep learning models, making it suitable for resource-constrained environments such as edge devices. Radar signals are transformed into feature maps (e.g., range-time and Doppler-time maps) and processed using ESNs, which are a type of reservoir computing model. A novel multi-reservoir architecture is proposed to handle heterogeneous feature maps, mitigating interference and improving recognition performance. The approach is evaluated on two datasets: the Soli dataset (11-class HGR task) and the Dop-NET dataset (4-class HGR task). Results show that the ESN-based method achieves high accuracy (98.84% on the Soli dataset) while significantly reducing computational costs compared to deep learning methods. The study demonstrates the potential of ESNs for efficient and accurate HGR in human-computer interaction (HCI) systems.
Methodology
The method involves transforming raw radar signals into feature maps (range-time and Doppler-time maps) and processing them using Echo State Networks (ESNs). A multi-reservoir architecture is employed, where different feature maps are fed into separate reservoirs to avoid interference. The reservoir states are then integrated and passed to a readout layer, which uses classifiers such as ridge regression, support vector machines (SVMs), or random forests. The approach is evaluated on two datasets: the Soli dataset (11-class task) and the Dop-NET dataset (4-class task).
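The multi-reservoir integration in brief: each feature map drives its own reservoir and the readout sees the concatenated states. The sketch below uses random data and untuned reservoirs purely to show the wiring.

```python
# Multi-reservoir ESN wiring sketch: separate reservoirs for the two
# heterogeneous feature maps, concatenated final states, shared readout.
# Data and reservoir settings are stand-ins for illustration only.
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(0)
range_time = rng.normal(size=(200, 40))    # stand-in range-time profiles
doppler_time = rng.normal(size=(200, 40))  # stand-in Doppler-time profiles
labels = rng.integers(0, 11, 200)          # 11 gesture classes, as in Soli

def final_state(u: np.ndarray, seed: int, n_res: int = 100) -> np.ndarray:
    r = np.random.default_rng(seed)
    W_in = r.uniform(-0.5, 0.5, n_res)
    W = r.normal(size=(n_res, n_res)) * 0.05  # small weights keep echo state
    x = np.zeros(n_res)
    for u_t in u:
        x = np.tanh(W_in * u_t + W @ x)
    return x

# One reservoir per feature map (different seeds), then concatenate:
feats = np.array([np.concatenate([final_state(a, seed=1), final_state(b, seed=2)])
                  for a, b in zip(range_time, doppler_time)])
clf = RidgeClassifier().fit(feats, labels)
print(clf.score(feats, labels))
```

Keeping the reservoirs separate is the point: each map's dynamics evolve independently, so the heterogeneous inputs cannot interfere before the readout combines them.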
Results
The proposed ESN-based method achieves 98.84% accuracy on the 11-class HGR task using the Soli dataset, outperforming deep learning-based methods while requiring significantly less computational power. On the Dop-NET dataset, the method also demonstrates competitive performance on a 4-class HGR task. The multi-reservoir architecture is shown to effectively handle heterogeneous feature maps, improving recognition accuracy and efficiency.
Implications
The proposed method has significant implications for human-computer interaction (HCI) systems, particularly in resource-constrained environments such as edge devices, in-vehicle interfaces, and robotic systems. By combining high accuracy with low computational cost, the approach enables real-time gesture recognition in scenarios where deep learning models may be impractical due to hardware limitations.
View on arXiv

Identifying Intervenable and Interpretable Features via Orthogonality Regularization

Moritz Miller, Florent Draye, Bernhard Schölkopf
  • The paper introduces an orthogonality regularization technique for sparse autoencoders to improve feature identifiability and interpretability.
  • The method reduces interference and superposition between features, aligning with the Independent Causal Mechanisms (ICM) principle.
  • Empirical results show that the approach maintains performance on reasoning tasks while improving interpretability and enabling isolated interventions.
  • Theoretical analysis connects feature interference to intervenability, leveraging finite frame theory.
  • The method allows for human-interpretable interventions, as demonstrated by feature-swapping experiments in language models.
Abstract
This paper introduces a novel approach to improve the identifiability, interpretability, and intervenability of features in sparse autoencoders (SAEs) by enforcing an orthogonality regularization on the decoder matrix. The authors argue that high self-coherence in feature dictionaries, a common issue in language model (LM) representations, hinders the identifiability of features and their causal interpretability. By applying an orthogonality penalty, the proposed method reduces interference and superposition between features, ensuring that features are more distinct and interpretable. This approach aligns with the Independent Causal Mechanisms (ICM) principle, promoting modular representations that allow for isolated causal interventions. Empirical results demonstrate that the method maintains performance on downstream tasks, enhances interpretability, and enables controlled interventions in the representation space without unintended side effects. The paper also provides theoretical insights into the relationship between feature interference and intervenability using finite frame theory.
Methodology
The authors fine-tune a language model (LM) around a sparse autoencoder (SAE) with an orthogonality penalty applied to the decoder matrix. This regularization ensures that the learned features are nearly orthogonal, reducing interference and promoting distinct, interpretable representations. The approach is evaluated on mathematical reasoning tasks, interpretability metrics, and feature-swapping experiments to test intervenability. Theoretical analysis is conducted using finite frame theory to relate feature interference to intervenability.
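The soft penalty below (deviation of the decoder's Gram matrix from the identity) is the standard way to implement such an orthogonality constraint; whether the paper uses exactly this form is an assumption.

```python
# Soft orthogonality penalty on an SAE decoder: drive the Gram matrix of
# normalized feature directions toward the identity, i.e., low self-coherence.
import torch

def orthogonality_penalty(W_dec: torch.Tensor) -> torch.Tensor:
    # W_dec: (n_features, d_model); rows are decoder feature directions
    W = torch.nn.functional.normalize(W_dec, dim=1)
    gram = W @ W.T
    off_diag = gram - torch.eye(W.shape[0])
    return (off_diag ** 2).sum()  # near zero when features are orthogonal

W_dec = torch.randn(512, 128)
print(orthogonality_penalty(W_dec))  # add with a coefficient to the SAE loss
```

With an overcomplete dictionary (more features than model dimensions), exact orthogonality is impossible, so the penalty acts as a pressure toward minimal interference rather than a hard constraint.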
Results
The proposed method achieves comparable performance to non-penalized SAEs on reasoning tasks while significantly improving interpretability by reducing feature similarity. Embedded feature explanations become more distinct under the orthogonality penalty, and the method enables isolated interventions in the representation space. Feature-swapping experiments demonstrate that the model maintains reasoning capabilities while correctly adapting to the swapped features.
Implications
This work has implications for improving the reliability and interpretability of machine learning models, particularly in applications requiring causal reasoning and controlled interventions. The method could be applied to enhance modularity in representations for tasks like explainable AI, causal inference, and human-in-the-loop systems, where interpretable and intervenable features are critical.
View on arXiv

Improved Dimension Dependence for Bandit Convex Optimization with Gradient Variations

Hang Yu, Yu-Hu Yan, Peng Zhao
  • Improves dimension dependence in gradient-variation regret bounds for two-point BCO, achieving tighter results for convex and strongly convex functions.
  • Introduces a refined analysis of non-consecutive gradient variation, addressing the inherent challenges of bandit feedback.
  • Extends the methodology to one-point bandit linear optimization, achieving the first gradient-variation bounds in this setting.
  • Applies the techniques to dynamic and universal regret minimization, as well as bandit games, establishing new bounds and faster convergence rates.
  • Provides a theoretical framework that bridges adversarial and stochastic optimization in bandit settings.
Abstract
This paper addresses the challenge of improving the dimension dependence in Bandit Convex Optimization (BCO) with gradient variation, a key problem in online learning with limited feedback. The authors focus on the two-point feedback setting, where the learner queries two function values per round, and propose a refined analysis of non-consecutive gradient variation, a critical quantity in bandit optimization. Their work improves the dimension dependence for both convex and strongly convex functions compared to prior results by Chiang et al. (2013). Additionally, the authors extend their techniques to one-point bandit linear optimization and establish the first gradient-variation bounds for this setting. They also demonstrate the applicability of their methods to more complex tasks, such as dynamic and universal regret minimization, as well as bandit games, achieving new bounds and faster convergence rates. This work advances the theoretical understanding of gradient-variation regret in bandit settings and provides tighter guarantees for high-dimensional problems.
Methodology
The authors propose a refined analysis of non-consecutive gradient variation, which decouples the dependencies arising from the sampling gap in bandit feedback. This involves unraveling the correlation structure in gradient estimations across rounds. They leverage this analysis to derive improved regret bounds for two-point BCO and extend it to one-point bandit linear optimization. The methodology also incorporates problem-dependent quantities like gradient variance and small-loss regret to achieve tighter bounds.
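For context, the classical two-point gradient estimator that such analyses build on is the following textbook construction (prior work, not this paper's refinement):

```latex
% Standard two-point gradient estimator in bandit convex optimization:
% query two symmetric perturbations and scale the difference along the
% random direction u_t drawn uniformly from the unit sphere.
\hat{g}_t \;=\; \frac{d}{2\delta}\,\bigl(f_t(x_t + \delta u_t) - f_t(x_t - \delta u_t)\bigr)\,u_t,
\qquad u_t \sim \mathrm{Unif}\bigl(\mathbb{S}^{d-1}\bigr)
```

Its expectation is the gradient of a smoothed surrogate of f_t rather than of f_t itself, and the gap between the rounds at which perturbed points are queried is what makes gradient-variation arguments in the bandit setting delicate.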
Results
The paper achieves improved dimension-dependent regret bounds for two-point BCO: O(d^{3/2}√V_T) for convex functions and O((d/λ) log V_T) for strongly convex functions, improving upon the previous best results by Chiang et al. (2013). For one-point bandit linear optimization, the authors derive the first gradient-variation bounds over hyper-rectangular domains. Additionally, they establish gradient-variation-based dynamic and universal regret bounds for two-point BCO and demonstrate faster convergence rates in bandit games.
Implications
This work has significant implications for high-dimensional online learning and optimization under limited feedback. The improved dimension dependence makes bandit optimization more practical for real-world applications in areas such as game theory, reinforcement learning, and adversarial machine learning. The techniques introduced could also inspire further research into adaptive algorithms and problem-dependent guarantees in other bandit and optimization settings.
View on arXiv

Interval-Based AUC (iAUC): Extending ROC Analysis to Uncertainty-Aware Classification

Yuqi Li, Matthew M. Engelhard
  • Introduces AUCL and AUCU as new metrics for evaluating interval-valued predictions, capturing uncertainty in ranking performance.
  • Proposes a three-region decomposition of the ROC plane into confident correct, confident incorrect, and uncertain rankings.
  • Demonstrates that AUCL and AUCU provide formal lower and upper bounds on the theoretical optimal AUC (AUC*).
  • Supports selective prediction by allowing models to abstain from ranking ambiguous cases, optimizing the trade-off between abstention and reliability.
  • Empirical experiments validate the framework's theoretical properties and practical utility.
Abstract
This paper introduces a novel framework for evaluating uncertainty-aware classification models that produce interval-valued predictions instead of point estimates. Traditional evaluation metrics like the Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC) are not designed to handle predictive uncertainty, as they rely on scalar scores. The authors propose an interval-based ROC framework that incorporates uncertainty into the evaluation process by defining two new metrics, AUCL and AUCU, which represent lower and upper bounds on ranking performance. These metrics enable a three-region decomposition of the ROC plane into confident correct, confident incorrect, and uncertain pairwise rankings. The framework also supports selective prediction, allowing models to abstain from ranking ambiguous cases, thereby balancing abstention rates with discriminative reliability. Theoretical analysis demonstrates that AUCL and AUCU provide bounds on the optimal AUC (AUC*), which represents the theoretical limit of discrimination performance. Empirical validation on real-world datasets, such as the Pima Indians Diabetes dataset, confirms the framework's correctness and utility for uncertainty-aware evaluation and decision-making.
Methodology
The authors define new ROC-style curves for interval-valued predictions based on strict and relaxed criteria for ranking intervals. They introduce AUCL and AUCU as probabilistic measures of ranking performance and prove their theoretical properties, including their relationship to the optimal AUC (AUC*). The framework is validated empirically using real-world datasets, with interval predictions generated via bootstrap-based methods.
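One plausible instantiation of the strict and relaxed pairwise criteria is sketched below; the paper's exact ranking rules and tie handling may differ.

```python
# Assumed reading of the strict/relaxed pairwise criteria: AUCL counts
# positive-negative pairs whose positive interval lies entirely above the
# negative interval; AUCU also credits pairs whose intervals overlap.
import numpy as np

def interval_auc(pos: np.ndarray, neg: np.ndarray) -> tuple[float, float]:
    # pos, neg: (n, 2) arrays of [lower, upper] predicted intervals
    lo_p, hi_p = pos[:, 0][:, None], pos[:, 1][:, None]
    lo_n, hi_n = neg[:, 0][None, :], neg[:, 1][None, :]
    strict = lo_p > hi_n    # confidently correct ranking
    relaxed = hi_p > lo_n   # not confidently incorrect
    return float(strict.mean()), float(relaxed.mean())

rng = np.random.default_rng(1)
pos = np.sort(rng.uniform(0.3, 1.0, (200, 2)), axis=1)
neg = np.sort(rng.uniform(0.0, 0.7, (200, 2)), axis=1)
auc_l, auc_u = interval_auc(pos, neg)
print(f"AUCL={auc_l:.3f} <= AUC* <= AUCU={auc_u:.3f}")
```

The gap between the two numbers is exactly the "uncertain" region of the three-region decomposition: widening the intervals lowers AUCL and raises AUCU.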
Results
The proposed framework successfully quantifies ranking uncertainty and provides bounds on the optimal AUC. Empirical experiments on the Pima Indians Diabetes dataset confirm the theoretical equivalences and demonstrate how interval width affects the three-region decomposition of the ROC plane. The results highlight the framework's ability to evaluate uncertainty-aware classifiers effectively.
Implications
This framework has significant implications for high-stakes decision-making tasks, such as medical risk prediction, where uncertainty quantification is critical. By integrating uncertainty into evaluation metrics, the approach enables more reliable and interpretable model assessments, supports selective prediction, and provides insights into the limits of achievable discrimination. It can be applied broadly to any interval-valued prediction models, regardless of the method used to construct the intervals.
View on arXiv

Legendre Memory Unit with A Multi-Slice Compensation Model for Short-Term Wind Speed Forecasting Based on Wind Farm Cluster Data

Mumin Zhang, Haochen Zhang, Xin Zhi Khoo, Yilin Zhang, Nuo Chen, Ting Zhang, Junjie Tang
  • The WMF-CPK-MSLMU ensemble model integrates data preprocessing, forecasting, and multi-slice compensation to enhance wind speed prediction accuracy.
  • Legendre Memory Unit (LMU) effectively captures spatial-temporal correlations by modeling both linear and nonlinear dependencies among wind farms.
  • The compensating parameter based on Kendall rank correlation coefficient (CPK) adaptively weights spatial correlations and complements missing data.
  • The model outperforms traditional statistical, machine learning, and deep learning approaches in short-term wind speed forecasting for wind farm clusters.
  • The approach is robust and scalable, making it suitable for large-scale wind farm clusters with complex spatial dependencies.
Abstract
This paper addresses the challenge of short-term wind speed forecasting for wind farm clusters, a critical task for optimizing power system operations. The authors propose an innovative ensemble model, WMF-CPK-MSLMU, which integrates weighted mean filtering (WMF), a multi-slice Legendre Memory Unit (MSLMU), and a compensating parameter based on Kendall rank correlation coefficient (CPK). The model leverages spatial-temporal correlations within wind farm cluster data to improve prediction accuracy, robustness, and computational efficiency. Key contributions include the use of LMU to jointly model linear and nonlinear dependencies, the introduction of CPK-derived weights to enhance spatial correlation modeling, and adaptive compensation for missing data. Experimental results demonstrate the superiority of the proposed model over existing methods in terms of accuracy and robustness across various wind farm cluster datasets.
Methodology
The methodology involves three main components: (1) Weighted Mean Filtering (WMF) for denoising wind speed data at the single-farm level; (2) Multi-Slice Legendre Memory Unit (MSLMU), which uses LMU combined with CPK-derived weights to model spatial-temporal correlations and enhance forecasting; and (3) Adaptive compensation using CPK to address missing data and improve robustness. The ensemble model is trained using backpropagation to optimize spatial-temporal dependencies across clustered wind farms.
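A sketch of Kendall-tau-based spatial weighting under stated assumptions; how the paper folds these weights into the multi-slice LMU is more involved than shown here.

```python
# Assumed sketch of CPK-style weights: each neighboring farm's wind series
# is weighted by its Kendall rank correlation with the target farm, and the
# weighted average can compensate for gaps in the target series.
import numpy as np
from scipy.stats import kendalltau

def cpk_weights(target: np.ndarray, neighbors: list) -> np.ndarray:
    taus = np.array([max(kendalltau(target, s)[0], 0.0) for s in neighbors])
    if taus.sum() == 0:
        return np.full(len(neighbors), 1 / len(neighbors))
    return taus / taus.sum()

rng = np.random.default_rng(0)
target = rng.normal(size=100).cumsum()
neighbors = [target + rng.normal(scale=s, size=100) for s in (0.5, 2.0, 5.0)]
w = cpk_weights(target, neighbors)
compensated = sum(wi * s for wi, s in zip(w, neighbors))  # proxy for gaps
print(w.round(3))  # noisier neighbors receive smaller weights
```

Rank correlation rather than Pearson correlation makes the weighting robust to the heavy-tailed, nonlinear relationships typical of wind speed data.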
Results
The proposed WMF-CPK-MSLMU model demonstrated superior performance in short-term wind speed forecasting compared to existing methods, achieving higher accuracy and robustness across multiple wind farm cluster datasets. The model effectively captured spatial-temporal correlations and compensated for missing data, resulting in reliable predictions under varying conditions.
Implications
This research has significant implications for renewable energy integration and power system optimization. The proposed model can be applied to improve the operational efficiency of wind farm clusters, enhance grid stability, and support sustainable energy development. Additionally, the methodology can be extended to other spatial-temporal forecasting tasks in domains such as meteorology and environmental monitoring.
View on arXiv

Let Experts Feel Uncertainty: A Multi-Expert Label Distribution Approach to Probabilistic Time Series Forecasting

Zhen Zhou, Zhirui Wang, Qi Hong, Yunyang Shi, Ziyuan Gu, Zhiyuan Liu
  • Introduces Multi-Expert LDL and Pattern-Aware LDL-MoE frameworks for probabilistic time series forecasting.
  • Combines mixture-of-experts architectures with distributional learning to capture diverse temporal patterns and quantify uncertainty.
  • Pattern-Aware LDL-MoE decomposes time series into interpretable components, enabling precise uncertainty attribution.
  • Achieves superior predictive accuracy and interpretability compared to baseline methods on the M5 dataset.
  • Provides actionable insights for decision-making in domains such as retail, supply chain, and finance.
Abstract
This paper introduces a novel framework for probabilistic time series forecasting that balances predictive accuracy with interpretable uncertainty quantification. The authors propose two complementary methods: Multi-Expert Label Distribution Learning (LDL) and Pattern-Aware LDL-MoE. The Multi-Expert LDL employs multiple experts with distinct learned parameters to capture diverse temporal patterns, while the Pattern-Aware LDL-MoE decomposes time series into interpretable components such as trend, seasonality, changepoints, and volatility using specialized sub-experts. Both methods extend traditional point prediction approaches to distributional learning, enabling richer uncertainty quantification through Maximum Mean Discrepancy (MMD). The framework is evaluated on aggregated sales data from the M5 dataset, demonstrating superior performance in predictive accuracy compared to baseline models. Additionally, the Pattern-Aware LDL-MoE provides enhanced interpretability by attributing uncertainty to specific components, making it suitable for real-world applications requiring actionable insights.
Methodology
The proposed framework employs mixture-of-experts architectures combined with Label Distribution Learning (LDL) to model predictive distributions. Multi-Expert LDL uses multiple experts with diverse learned parameters to capture temporal patterns, while Pattern-Aware LDL-MoE decomposes time series into interpretable components (trend, seasonality, changepoints, volatility) using specialized sub-experts. Maximum Mean Discrepancy (MMD) is used for uncertainty quantification, and the models are trained on aggregated sales data from the M5 dataset.
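The MMD objective itself is compact; a minimal RBF-kernel estimate with an illustrative bandwidth:

```python
# Minimal biased estimate of squared MMD between predicted samples and
# observed targets under an RBF kernel. The bandwidth is an example value,
# not the paper's setting.
import torch

def mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    def k(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        d = torch.cdist(a, b) ** 2
        return torch.exp(-d / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

pred = torch.randn(64, 1)            # samples from the predicted distribution
obs = torch.randn(64, 1) * 0.5 + 1   # observed sales values
print(mmd2(pred, obs))               # minimize this during training
```

Training against a distributional discrepancy like MMD, rather than a point loss, is what lets the experts express calibrated uncertainty instead of single forecasts.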
Results
The Multi-Expert LDL framework achieved the best overall predictive accuracy, while the Pattern-Aware LDL-MoE provided enhanced interpretability by attributing uncertainty to specific components. Both methods outperformed baseline approaches in forecasting accuracy and uncertainty quantification on the M5 dataset.
Implications
The proposed frameworks have significant implications for real-world applications where both predictive accuracy and uncertainty quantification are critical. They can be applied in domains such as retail forecasting, supply chain management, and financial planning, enabling decision-makers to better understand and act on uncertainty in dynamic environments. The interpretability of the Pattern-Aware LDL-MoE also supports actionable insights for risk management and strategy development.
View on arXiv

LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane
  • LoRDO addresses the communication and memory bottlenecks of distributed training by combining low-rank optimization with infrequent synchronization.
  • The framework introduces a full-rank quasi-hyperbolic momentum term to prevent subspace stagnation caused by global low-rank projections.
  • LoRDO achieves near-parity with low-rank DDP in perplexity and accuracy while reducing communication overhead by 8× to 12×.
  • In memory-constrained settings, LoRDO outperforms DDP, achieving up to 4.7% better perplexity and demonstrating robustness to small-batch regimes.
  • The framework is scalable to model sizes ranging from 125M to 720M parameters, making it suitable for large-scale language modeling and downstream tasks.
Abstract
This paper introduces LoRDO, a distributed optimization framework that combines low-rank optimization with infrequent communication to address the memory and bandwidth constraints in large-scale distributed training. Traditional distributed data-parallel (DDP) training is limited by the high communication overhead of synchronizing optimizer states across workers. While low-rank optimizers reduce memory and communication requirements, they suffer from performance degradation in the local-update regime due to noisy projections and restricted subspace exploration. LoRDO resolves these issues by introducing a full-rank quasi-hyperbolic momentum term into the optimization process, which restores subspace exploration while maintaining the efficiency benefits of low-rank structures. The framework achieves near-parity with synchronous low-rank DDP in terms of perplexity and downstream task accuracy, while reducing communication overhead by approximately 10×. LoRDO also demonstrates superior performance in memory-constrained settings, outperforming DDP in perplexity by up to 4.7% under low-rank and small-batch conditions.
Methodology
LoRDO combines low-rank optimization with infrequent synchronization in distributed training. It uses global projections derived from aggregated pseudo-gradients to reduce noise and maintain a stable low-rank subspace. To address the issue of subspace stagnation, LoRDO incorporates a full-rank quasi-hyperbolic momentum term into the local updates, enabling full subspace exploration without increasing memory or communication overhead. The framework also integrates aligned momenta and an error-feedback mechanism to compensate for information discarded by the low-rank projection.
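A hedged sketch of what one local update could look like under this description: a rank-r projection refreshed from the aggregated pseudo-gradient at each synchronization, mixed with a quasi-hyperbolic momentum term. The SVD-based refresh and the (beta, nu) coefficients are assumptions; the explicit full-rank buffer below is kept for clarity, whereas the paper reports no added memory overhead, so its actual mechanism presumably avoids materializing it this way.

```python
# Illustrative sketch; not the paper's exact update rule.
import torch

def refresh_projection(pseudo_grad: torch.Tensor, rank: int) -> torch.Tensor:
    """Derive a rank-r column basis from the aggregated pseudo-gradient."""
    U, _, _ = torch.linalg.svd(pseudo_grad, full_matrices=False)
    return U[:, :rank]                              # (m, r) orthonormal basis

def lordo_local_step(w, g, m_full, P, lr=1e-3, beta=0.9, nu=0.7):
    """One local update: low-rank projected gradient plus a full-rank
    quasi-hyperbolic momentum term that restores subspace exploration."""
    g_lowrank = P @ (P.T @ g)                    # gradient restricted to subspace
    m_full.mul_(beta).add_(g, alpha=1 - beta)    # full-rank momentum buffer
    update = (1 - nu) * g_lowrank + nu * m_full  # quasi-hyperbolic mixing
    w.sub_(lr * update)
    return w, m_full
```

Between synchronizations each worker would run `lordo_local_step` repeatedly; every H steps, workers would all-reduce their accumulated weight deltas as a pseudo-gradient and call `refresh_projection` to update the shared subspace.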
Results
LoRDO reduces communication overhead by approximately 10× compared to low-rank DDP while maintaining near-parity in perplexity (gap < 1%) and downstream task accuracy. In memory-constrained settings, LoRDO surpasses DDP, achieving 3.36–4.7% better perplexity. The framework demonstrates scalability across model sizes from 125M to 720M parameters and exhibits resilience to small-batch regimes, outperforming local projection methods.
Implications
LoRDO has significant implications for large-scale distributed training, particularly in resource-constrained environments. By reducing memory and communication requirements, it enables the training of larger models on limited hardware. Its scalability and efficiency make it a promising approach for training foundation models in natural language processing and other domains requiring distributed optimization.
View on arXiv

Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism

Chenwei Cui, Rockwell Jackson, Benjamin Joseph Herrera, Ana María Tárano, Hannah Kerner
  • Proposes Multi-Head LatentMoE and Head Parallel (HP) to address communication inefficiencies in Mixture of Experts (MoE) training.
  • Achieves O(1) communication cost, perfect load balance, and deterministic communication patterns, overcoming limitations of Expert Parallel (EP).
  • Introduces IO-aware routing and expert computation to optimize memory and computation, reducing high-bandwidth memory (HBM) access.
  • Demonstrates up to 1.61× faster training speeds compared to standard MoE with EP, with identical or improved model performance.
  • Enables more efficient and accessible training of multi-billion-parameter foundation models for the academic community.
Read More
Abstract
This paper introduces Multi-Head LatentMoE and Head Parallel (HP), a novel architecture and parallelism strategy for training sparse Mixture of Experts (MoE) models. MoE architectures enable large language models to scale efficiently by activating only a subset of experts for each input token. However, the standard training method, Expert Parallel (EP), suffers from high communication costs, load imbalance, and non-deterministic communication patterns. The proposed Multi-Head LatentMoE decomposes a single MoE into multiple smaller, independent modules, each processing sub-tokens, while Head Parallel ensures that all routing and expert computations occur locally on GPUs, eliminating the need for costly inter-GPU communication. The authors also introduce IO-aware routing and expert computation to optimize memory and computation efficiency. Experiments demonstrate that the proposed method achieves up to 1.61× faster training speeds compared to standard MoE with EP, while maintaining or improving model performance. This work significantly reduces the computational and communication overhead of training large-scale MoE models, making them more accessible to the research community.
Methodology
The authors propose Multi-Head LatentMoE, which splits input tokens into sub-tokens processed by independent MoE modules, and Head Parallel, which ensures all routing and expert computations occur locally on GPUs. IO-aware routing and expert computation are developed to reduce memory and IO costs by performing operations directly in SRAM and leveraging block-sparse attention techniques. The approach is evaluated on a 10B-token language modeling task using the FineWebEdu dataset.
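The core architectural idea is easy to show in a toy, single-process form: split each hidden vector into H sub-tokens and route each through its own small expert set. Under Head Parallel, each head and its experts would live on one GPU, so routing never crosses devices. The sketch below uses top-1 routing and made-up sizes, and omits the IO-aware kernels entirely.

```python
# Toy single-process illustration; names and sizes are assumptions.
import torch
import torch.nn as nn

class HeadMoE(nn.Module):
    """Independent MoE over one sub-token; top-1 routing for brevity."""
    def __init__(self, d_sub: int, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_sub, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_sub, 2 * d_sub), nn.GELU(),
                          nn.Linear(2 * d_sub, d_sub))
            for _ in range(n_experts))

    def forward(self, x):                          # x: (tokens, d_sub)
        probs = torch.softmax(self.router(x), dim=-1)
        top_p, top_i = probs.max(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):  # all-local dispatch
            mask = top_i == e
            if mask.any():
                out[mask] = top_p[mask, None] * expert(x[mask])
        return out

class MultiHeadLatentMoE(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        assert d_model % n_heads == 0
        self.heads = nn.ModuleList(HeadMoE(d_model // n_heads)
                                   for _ in range(n_heads))

    def forward(self, x):                          # x: (tokens, d_model)
        subs = x.chunk(len(self.heads), dim=-1)    # split into sub-tokens
        return torch.cat([h(s) for h, s in zip(self.heads, subs)], dim=-1)
```

Because each sub-token is processed entirely within its head, the only cross-device traffic in the parallel setting is the fixed split/concatenate of activations, which is what yields the deterministic, O(1) communication pattern.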
Results
The proposed method achieves up to 1.61× faster training speeds compared to standard MoE with EP while maintaining identical performance. When doubling the granularity, it achieves higher overall accuracy (up to 6.9 percentage points improvement) while still being 1.11× faster. Inter-GPU communication volume falls to 25% of standard EP's when activating four experts (k=4).
Implications
This work significantly reduces the computational and communication overhead of training large-scale sparse MoE models, making them more accessible to researchers with limited computational resources. It also provides a scalable and efficient framework for training multi-billion-parameter foundation models, which could accelerate advancements in natural language processing and other machine learning domains.
View on arXiv

Protein Autoregressive Modeling via Multiscale Structure Generation

Yanru Qu, Cheng-Yen Hsieh, Zaixiang Zheng, Ge Liu, Quanquan Gu
  • PAR is the first multi-scale autoregressive framework for protein backbone generation, leveraging hierarchical protein structures for coarse-to-fine modeling.
  • The framework avoids discretization loss and unidirectional ordering limitations by directly modeling atomic coordinates and using multi-scale representations.
  • Exposure bias is mitigated through noisy context learning and scheduled sampling, improving structure generation quality.
  • PAR exhibits strong zero-shot generalization, supporting tasks like prompt-based generation and motif scaffolding without fine-tuning.
  • The multi-scale approach enables faster sampling (2.5× speedup) and competitive performance on distributional metrics like Fréchet Protein Structure Distance (FPSD).
Read More
Abstract
This paper introduces Protein Autoregressive Modeling (PAR), a novel multi-scale autoregressive framework for generating protein backbone structures. PAR leverages the hierarchical nature of protein structures to generate coarse-to-fine representations, mimicking the process of sculpting a statue. The framework consists of three key components: multi-scale downsampling operations to represent protein structures at varying granularities, an autoregressive transformer to encode multi-scale information and produce conditional embeddings, and a flow-based backbone decoder to generate atomic coordinates conditioned on these embeddings. To address exposure bias, a common issue in autoregressive models, PAR employs noisy context learning and scheduled sampling, enhancing robustness and generation quality. PAR demonstrates strong zero-shot generalization capabilities, enabling flexible conditional generation and motif scaffolding without fine-tuning. It achieves competitive results on unconditional generation benchmarks, exhibits favorable scaling behavior, and provides a 2.5× speedup in sampling compared to single-scale baselines. The framework establishes itself as a promising tool for protein structure generation, with potential applications in biomedicine and nanotechnology.
Methodology
PAR employs a multi-scale autoregressive framework with three components: (i) multi-scale downsampling to create hierarchical structural representations, (ii) an autoregressive transformer to encode multi-scale information and generate conditional embeddings, and (iii) a flow-based backbone decoder to model atomic coordinates directly. Training incorporates noisy context learning and scheduled sampling to address exposure bias.
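A schematic of the coarse-to-fine loop this describes, with average pooling standing in for the paper's downsampling operator and plain modules standing in for the transformer and flow decoder; scale factors, shapes, and the residual-refinement step are assumptions for illustration.

```python
# Schematic only; components are simplified stand-ins for PAR's modules.
import torch
import torch.nn as nn

def downsample(coords: torch.Tensor, factor: int) -> torch.Tensor:
    """Coarsen an (L, 3) backbone by averaging groups of residues; training
    would use this to build the multi-scale targets."""
    L = (coords.shape[0] // factor) * factor
    return coords[:L].reshape(-1, factor, 3).mean(dim=1)

class PARSketch(nn.Module):
    def __init__(self, d: int = 128):
        super().__init__()
        self.embed = nn.Linear(3, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        self.decode = nn.Linear(d, 3)  # stand-in for the flow-based decoder

    @torch.no_grad()
    def generate(self, length: int, scales=(8, 4, 2, 1)):
        coords = torch.randn(max(length // scales[0], 1), 3)  # coarsest guess
        for s in scales[1:]:
            n = max(length // s, 1)
            # Interpolate the previous scale up to the next resolution, then
            # refine it conditioned on the encoded context (residual update).
            up = torch.nn.functional.interpolate(
                coords.T[None], size=n, mode="linear").squeeze(0).T
            ctx = self.trunk(self.embed(up)[None])
            coords = up + self.decode(ctx)[0]
        return coords
```

Each pass through the loop plays the role of one autoregressive step over scales: the coarser structure conditions the finer one, which is what lets the model commit to global topology before local geometry.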
Results
PAR achieves high-quality protein backbone generation, demonstrating competitive performance on unconditional generation benchmarks and favorable scaling behavior. It supports zero-shot generalization for tasks like motif scaffolding and prompt-based generation, and achieves a 2.5× speedup in sampling compared to single-scale baselines.
Implications
PAR has significant potential in protein design and modeling, with applications in biomedicine and nanotechnology. Its ability to generate high-quality structures and generalize to unseen tasks could accelerate advancements in drug discovery, synthetic biology, and materials science.
View on arXiv

QUATRO: Query-Adaptive Trust Region Policy Optimization for LLM Fine-tuning

Doyeon Lee, Eunyi Lyou, Hyunsoo Cho, Sookyung Kim, Joonseok Lee, Jaemoo Choi
  • QUATRO introduces a principled query-adaptive trust-region optimization framework for LLM fine-tuning, addressing limitations of heuristic-based methods like GRPO.
  • The method eliminates the need for importance-ratio clipping, resolving issues such as gradient masking and instability under policy staleness.
  • QUATRO maintains controlled entropy, preventing premature entropy collapse and ensuring robust exploration during training.
  • Empirical results show significant improvements in Pass@k and UCC@k metrics on mathematical reasoning tasks, with gains increasing as the sampling budget grows.
  • The approach demonstrates strong robustness to hyperparameter variations, including learning rates and rollout reuse.
Read More
Abstract
This paper introduces QUATRO (Query-Adaptive Trust-Region Policy Optimization), a novel reinforcement learning (RL)-based algorithm for fine-tuning large language models (LLMs). Existing methods like Group Relative Policy Optimization (GRPO) rely on heuristic mechanisms such as importance-ratio clipping to enforce trust-region constraints, which can lead to instability, gradient masking, and suboptimal performance. QUATRO addresses these limitations by formulating a principled, query-conditioned trust-region optimization problem. Through a Lagrangian dual analysis, the authors derive an exact query-adaptive objective that directly enforces trust-region constraints without relying on heuristic approximations. This approach ensures stable training, mitigates entropy collapse, and maintains robust performance under varying hyperparameters. Empirical evaluations on mathematical reasoning benchmarks demonstrate that QUATRO outperforms GRPO-style baselines in both accuracy (Pass@k) and diversity (Unique Correct Count, UCC@k), particularly as the sampling budget increases.
Methodology
The authors propose a query-conditioned trust-region optimization framework that explicitly accounts for heteroscedasticity across queries, i.e., that update noise and reward variance differ from one query to the next. Using a Lagrangian dual analysis, they derive an exact query-adaptive objective that enforces trust-region constraints without heuristic approximations. This approach ensures controlled policy updates, mitigates gradient masking, and stabilizes training dynamics. The method is empirically validated on mathematical reasoning benchmarks using metrics like Pass@k and UCC@k.
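The paper's exact objective is not reproduced here, but the general shape of a query-adaptive trust region can be sketched as a per-query KL penalty with its own Lagrange multiplier, updated by generic projected dual ascent toward a shared budget delta. Everything below is that generic construction, not QUATRO's closed form.

```python
# Generic per-query trust-region sketch; not QUATRO's derived objective.
import torch

def query_adaptive_loss(logp_new, logp_old, advantages, lam_q,
                        delta=0.01, dual_lr=0.1):
    """
    logp_new, logp_old: (Q, S) log-probs of S sampled responses per query
    advantages:         (Q, S) group-normalized advantages
    lam_q:              (Q,)   per-query multipliers, updated in place
    """
    ratio = (logp_new - logp_old.detach()).exp()
    kl_q = (logp_old.detach() - logp_new).mean(dim=1)  # per-query KL estimate
    pg = (ratio * advantages).mean(dim=1)              # no ratio clipping
    loss = -(pg - lam_q.detach() * kl_q).mean()
    with torch.no_grad():                              # dual ascent on lambda
        lam_q.add_(dual_lr * (kl_q - delta)).clamp_(min=0.0)
    return loss
```

The design point this illustrates is that the trust region is enforced as a constraint per query rather than by masking gradients through a global clip, which is what avoids the gradient-masking and staleness issues attributed to GRPO-style clipping.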
Results
QUATRO consistently outperforms GRPO-style baselines on mathematical reasoning tasks, achieving higher Pass@k accuracy and greater diversity in generated solutions (UCC@k). The method demonstrates stable training dynamics, robustness to policy staleness, and resilience to hyperparameter variations such as learning rates. Gains are particularly pronounced as the sampling budget (k) increases.
Implications
QUATRO's principled approach to trust-region optimization has significant implications for fine-tuning LLMs in complex reasoning tasks like mathematics and coding. By improving stability, accuracy, and diversity, the method could enhance the performance of LLMs in real-world applications requiring long-horizon reasoning, such as automated theorem proving, scientific discovery, and advanced problem-solving.
View on arXiv

REDistill: Robust Estimator Distillation for Balancing Robustness and Efficiency

Ondrej Tybl, Lukas Neumann
  • REDistill introduces a robust power divergence loss to handle noisy teacher predictions in Knowledge Distillation.
  • The method adaptively downweights unreliable teacher outputs while preserving informative logit relationships.
  • REDistill eliminates the need for model-specific hyperparameter tuning, enhancing generalizability across teacher–student pairs.
  • Extensive experiments on CIFAR-100 and ImageNet-1k show consistent accuracy improvements over existing KD methods.
  • The framework is computationally efficient and integrates seamlessly into existing KD pipelines.
Read More
Abstract
This paper introduces REDistill, a novel framework for Knowledge Distillation (KD) that addresses the challenge of noisy and overconfident teacher predictions. Traditional KD methods rely on the Kullback–Leibler (KL) divergence to align the predictive distributions of teacher and student models, but this approach assumes that teacher predictions are always reliable. REDistill replaces the standard KD objective with a power divergence loss, a generalization of KL divergence from robust statistics. This new loss function adaptively downweights unreliable teacher outputs while preserving the essential relationships between logits. The method is simple, computationally efficient, and integrates seamlessly into existing KD pipelines without requiring extensive hyperparameter tuning. Experimental results on CIFAR-100 and ImageNet-1k demonstrate that REDistill consistently improves student model accuracy across a wide range of teacher–student architectures, achieving state-of-the-art performance without model-specific adjustments. The framework's robustness and generalizability make it a promising solution for real-world KD applications.
Methodology
REDistill replaces the standard KL divergence-based KD objective with a power divergence loss, a robust statistical measure that dynamically adjusts the influence of unreliable teacher predictions. The method is theoretically grounded, requiring minimal hyperparameter tuning, and is designed to preserve the relationships between logits while mitigating the impact of noisy teacher outputs. The approach is evaluated on diverse teacher–student architectures using CIFAR-100 and ImageNet-1k datasets.
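As a concrete instance, here is what a power-divergence distillation loss can look like using the density power divergence of Basu et al., which recovers the KL objective as beta → 0; the temperature scaling and T² factor follow standard KD practice, and whether this matches the paper's exact parameterization is an assumption.

```python
# One standard power-divergence instance; the paper's exact form may differ.
import torch
import torch.nn.functional as F

def power_divergence_kd(student_logits, teacher_logits, beta=0.5, T=4.0):
    s = F.softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1).detach()
    # Density power divergence D_beta(t || s); as beta -> 0 this tends to
    # KL(t || s). The beta-power weighting is the robustness mechanism that
    # damps the influence of unreliable low-density components relative to KL.
    loss = (s.pow(1 + beta)
            - (1 + 1 / beta) * t * s.pow(beta)
            + (1 / beta) * t.pow(1 + beta)).sum(dim=-1)
    return (T * T) * loss.mean()  # T^2 rescaling, as in standard KD
```

A drop-in swap of this loss for the usual KL term is all the integration an existing KD pipeline would need, which is consistent with the seamless-integration claim above.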
Results
REDistill consistently outperforms existing KD methods across 14 different teacher–student pairs on CIFAR-100 and ImageNet-1k datasets. It achieves these improvements without requiring model-specific hyperparameter tuning, demonstrating strong generalization to unseen teacher–student pairs. The method incurs negligible computational overhead and maintains a simple integration process.
Implications
REDistill has significant implications for the deployment of efficient and robust deep learning models in resource-constrained environments. By improving the reliability and generalizability of Knowledge Distillation, it can facilitate the development of smaller, more efficient models without sacrificing performance. This makes it particularly valuable for applications in edge computing, mobile devices, and other scenarios where computational resources are limited.
View on arXiv

Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

Jaemoo Choi, Yuchen Zhu, Wei Guo, Petr Molodyk, Bo Yuan, Jinbin Bai, Yi Xin, Molei Tao, Yongxin Chen
  • ELBO-based likelihood estimation is the dominant factor driving efficient and stable RL optimization for diffusion models.
  • The choice of policy-gradient objectives has a comparatively smaller impact on performance compared to likelihood estimation strategies.
  • ODE-based sampling methods provide additional efficiency and stability benefits during training and evaluation.
  • The proposed method achieves state-of-the-art performance across multiple benchmarks, including GenEval, OCR, and DrawBench.
  • The approach is up to 4.6× more efficient than FlowGRPO and 2× more efficient than DiffusionNFT.
Read More
Abstract
This paper investigates the design space of reinforcement learning (RL) for fine-tuning diffusion models, particularly for visual tasks such as text-to-image generation. The authors systematically analyze three key components: policy-gradient objectives, likelihood estimation methods, and sampling strategies, to identify the primary drivers of efficiency and performance. They find that likelihood estimation, specifically using evidence lower bound (ELBO)-based estimators computed from the final generated sample, is the dominant factor in achieving effective and stable RL optimization. This approach outperforms trajectory-based estimators, which are computationally expensive and slow to converge. The study also highlights the benefits of ODE-based sampling methods for improved efficiency and stability. The proposed methodology achieves state-of-the-art performance across multiple benchmarks, including GenEval, OCR, and DrawBench, while being significantly more efficient than existing methods such as FlowGRPO and DiffusionNFT.
Methodology
The authors conduct a systematic study of RL design choices for diffusion models, focusing on policy-gradient objectives, likelihood estimation methods, and sampling strategies. They compare trajectory-based estimators with ELBO-based estimators and evaluate the impact of SDE-based and ODE-based sampling methods. Controlled experiments are performed on Stable Diffusion 3.5 Medium (SD3.5-M) to assess performance and efficiency across multiple reward benchmarks.
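The distinction between estimator families can be made concrete with a sketch: rather than accumulating per-step log-probabilities along the sampling trajectory, an ELBO-based estimate needs only the final sample x0. The uniform-t, epsilon-prediction form below is the textbook simplified ELBO with an assumed cosine schedule and an assumed model(x_t, t) noise-prediction interface; the paper's exact weighting may differ.

```python
# Textbook simplified-ELBO surrogate; schedule and interface are assumptions.
import torch

def elbo_logp_estimate(model, x0, n_t=4):
    """Monte-Carlo ELBO surrogate for log p_theta(x0); higher is better.
    model(x_t, t) is assumed to predict the added noise, and x0 is assumed
    flattened to shape (B, D)."""
    B = x0.shape[0]
    total = torch.zeros(B, device=x0.device)
    for _ in range(n_t):
        t = torch.rand(B, device=x0.device)          # t ~ U(0, 1)
        alpha = torch.cos(0.5 * torch.pi * t)        # assumed cosine schedule
        sigma = torch.sin(0.5 * torch.pi * t)
        eps = torch.randn_like(x0)
        x_t = alpha.view(-1, 1) * x0 + sigma.view(-1, 1) * eps
        total -= (model(x_t, t) - eps).pow(2).flatten(1).mean(dim=1)
    return total / n_t
```

The returned surrogate would then stand in for trajectory log-probabilities inside the policy-gradient objective, which is why it pairs naturally with ODE-based samplers that do not expose per-step stochastic log-probs.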
Results
The study demonstrates that ELBO-based likelihood estimation significantly improves optimization efficiency and performance, achieving a GenEval score of 0.95 in 90 GPU hours. This method is 4.6× more efficient than FlowGRPO and 2× more efficient than DiffusionNFT. ODE-based sampling further enhances efficiency and stability, requiring fewer function evaluations and aligning well with deterministic evaluation procedures.
Implications
The findings have significant implications for the development of more efficient and effective RL-based fine-tuning methods for diffusion models. By prioritizing likelihood estimation strategies, researchers can achieve better performance with reduced computational costs. This work also opens avenues for improving generative tasks such as text-to-image and text-to-video synthesis, enabling more precise control and alignment with external reward signals.
View on arXiv

Reversible Deep Learning for 13C NMR in Chemoinformatics: On Structures and Spectra

Stefan Kuhn, Vandana Dwarka, Przemyslaw Karol Grenda, Eero Vainikko
  • Introduces a reversible deep learning model using conditional invertible neural networks (INNs) for 13C NMR spectroscopy.
  • Unifies spectrum prediction and structure elucidation tasks within a single end-to-end framework.
  • Employs i-RevNet-style bijective blocks to ensure exact invertibility and information preservation.
  • Captures uncertainty in the inverse mapping (spectrum → structure) using latent variables and probabilistic modeling.
  • Demonstrates above-chance accuracy in spectrum prediction and meaningful structural inference during validation.
Read More
Abstract
This paper introduces a reversible deep learning model based on conditional invertible neural networks (INNs) for 13C nuclear magnetic resonance (NMR) spectroscopy, addressing both spectrum prediction (structure → spectrum) and structure elucidation (spectrum → structure) within a unified framework. The model employs i-RevNet-style bijective blocks to ensure exact invertibility, enabling bidirectional mapping between molecular structures and their corresponding NMR spectra. The forward direction predicts a 128-bit binned spectrum code from graph-based molecular representations, while the inverse direction generates structure candidates from spectral data, explicitly capturing the inherent uncertainty and one-to-many nature of the spectrum-to-structure mapping. The approach leverages latent dimensions to encode residual variability, ensuring information preservation and probabilistic modeling. Experimental results demonstrate the model's ability to achieve above-chance accuracy in spectrum prediction, numerical invertibility on trained examples, and meaningful structural signals during inverse inference. This work highlights the potential of invertible architectures to unify predictive modeling and uncertainty-aware inference in chemoinformatics.
Methodology
The model is built using i-RevNet-style bijective blocks to ensure invertibility, enabling bidirectional mapping between molecular structures and spectra. It predicts a 128-bit binned spectrum code from graph-based molecular representations and uses latent dimensions to encode residual variability. The inverse mapping generates structure candidates from spectral data, explicitly modeling uncertainty. Training involves optimizing the network for both directions using conditional invertible neural network principles.
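The exact-invertibility property is easiest to see in a single coupling block: split the state into two halves and apply only additive cross-updates, so the inverse is closed-form. The block below is generic and omits the conditioning on molecular graphs and the 128-bit spectrum code.

```python
# Generic additive coupling block; conditioning and dimensions are omitted.
import torch
import torch.nn as nn

class BijectiveBlock(nn.Module):
    def __init__(self, d_half: int, hidden: int = 64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d_half, hidden), nn.ReLU(),
                               nn.Linear(hidden, d_half))

    def forward(self, x1, x2):
        # y1 = x2, y2 = x1 + f(x2): invertible by construction, since f
        # is only ever evaluated on a half that is passed through unchanged.
        return x2, x1 + self.f(x2)

    def inverse(self, y1, y2):
        return y2 - self.f(y1), y1

# Round-trip check: forward then inverse recovers the input exactly (up to
# floating point), which is the information-preservation property relied on.
block = BijectiveBlock(d_half=8)
x1, x2 = torch.randn(2, 8), torch.randn(2, 8)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
assert torch.allclose(r1, x1, atol=1e-6) and torch.allclose(r2, x2, atol=1e-6)
```

Stacking such blocks gives a network that is bijective end to end, so the same trained weights serve both the structure → spectrum and spectrum → structure directions, with latent dimensions absorbing the one-to-many ambiguity of the inverse task.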
Results
The model achieves numerical invertibility on trained examples, above-chance accuracy in spectrum-code prediction, and produces coarse but meaningful structural signals during inverse inference on validation spectra. These results validate the feasibility of using invertible architectures for unified spectrum prediction and structure elucidation tasks.
Implications
This work has significant implications for chemoinformatics, particularly in computer-aided structure elucidation (CASE) systems and NMR spectrum prediction. The reversible deep learning approach could enhance the accuracy and efficiency of molecular identification and characterization, while also providing uncertainty-aware candidate generation for ambiguous spectral data. The methodology may inspire broader applications in other domains requiring bidirectional mappings between physical parameters and observations.
View on arXiv

Safe Urban Traffic Control via Uncertainty-Aware Conformal Prediction and World-Model Reinforcement Learning

Joydeep Chandra, Satyam Kumar Navneet, Aleksandr Algazinov, Yong Zhang
  • Introduces STREAM-RL, a unified framework that integrates uncertainty-aware forecasting, anomaly detection, and safe RL with theoretical guarantees.
  • Proposes PU-GAT+, a novel graph attention mechanism that dynamically reweights attention based on uncertainty, ensuring calibrated predictions.
  • Develops CRFN-BY, a conformal residual flow network that achieves robust anomaly detection with valid FDR control under dependence.
  • Presents LyCon-WRL+, a safe RL agent with Lyapunov stability certificates and certified Lipschitz bounds, ensuring safe policy learning.
  • Achieves significant improvements in safety rate (95.2% vs. 69% for standard PPO) and reward, with 23ms end-to-end inference latency.
Read More
Abstract
This paper introduces STREAM-RL, a novel framework for safe urban traffic control that integrates uncertainty-aware forecasting, anomaly detection, and safe reinforcement learning (RL) with end-to-end theoretical guarantees. The framework addresses key challenges in urban traffic management, including the need for calibrated uncertainty in predictions, robust anomaly detection under dependence, and safe adaptive control with verified stability. STREAM-RL comprises three main components: PU-GAT+, an uncertainty-guided graph attention mechanism for forecasting; CRFN-BY, a conformal residual flow network for robust anomaly detection with false discovery rate (FDR) control; and LyCon-WRL+, a safe RL agent with Lyapunov stability certificates and certified Lipschitz bounds. The framework propagates calibrated uncertainty across all stages, ensuring reliable and safe decision-making. Experimental results on real-world traffic data demonstrate significant improvements in coverage efficiency, FDR control, safety, and reward compared to baseline methods, with low computational overhead.
Methodology
The authors propose three novel components: (1) PU-GAT+, which uses confidence-monotonic attention to dynamically reweight graph attention based on prediction uncertainty; (2) CRFN-BY, which models uncertainty-normalized residuals using normalizing flows and applies the Benjamini-Yekutieli procedure for robust FDR control; and (3) LyCon-WRL+, a safe RL agent that employs spectral normalization to enforce Lipschitz bounds and derives Lyapunov stability certificates. The framework integrates these components to propagate calibrated uncertainty from forecasting through anomaly detection to safe policy learning. The methods are evaluated on real-world traffic trajectory datasets.
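Of the three components, the anomaly-detection stage is the most self-contained to sketch: conformal p-values computed from calibration scores, followed by the Benjamini-Yekutieli step-up rule, whose harmonic-sum correction is what keeps FDR control valid under arbitrary dependence. Raw |residual|/sigma scores stand in here for the paper's normalizing-flow residual model, which is an assumption.

```python
# Conformal p-values + Benjamini-Yekutieli; flow-based scoring is omitted.
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """p_i = (1 + #{calibration scores >= s_i}) / (n_cal + 1)."""
    cal = np.sort(np.asarray(cal_scores))
    n = len(cal)
    ge = n - np.searchsorted(cal, np.asarray(test_scores), side="left")
    return (1 + ge) / (n + 1)

def benjamini_yekutieli(pvals, alpha=0.05):
    """Step-up BY procedure; returns a boolean mask of flagged anomalies."""
    m = len(pvals)
    c_m = np.sum(1.0 / np.arange(1, m + 1))  # correction for dependence
    order = np.argsort(pvals)
    thresh = alpha * np.arange(1, m + 1) / (m * c_m)
    below = np.asarray(pvals)[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Example: scores would be |y - y_hat| / sigma_hat from the forecaster.
cal = np.abs(np.random.randn(500))
test = np.abs(np.random.randn(20)) + np.array([3.0] + [0.0] * 19)  # one outlier
flags = benjamini_yekutieli(conformal_pvalues(cal, test))
```

Because the forecaster's calibrated uncertainty feeds the residual scores, and the flagged anomalies in turn gate the safe RL policy, this stage is the hinge through which uncertainty propagates end to end.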
Results
STREAM-RL achieves 91.4% coverage efficiency, controls FDR at 4.1% with validity maintained under dependence, and improves the safety rate to 95.2% compared to 69% for standard PPO. It also achieves higher rewards and maintains a low end-to-end inference latency of 23ms, demonstrating its practical applicability for real-time urban traffic control.
Implications
The proposed framework has significant implications for urban traffic management, enabling safer and more efficient traffic control systems. By providing calibrated uncertainty estimates, robust anomaly detection, and verified safe control, STREAM-RL can help reduce traffic congestion, improve road safety, and minimize environmental impact. The framework's low computational overhead makes it suitable for real-time deployment in smart city applications.
View on arXiv