AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 48
Update frequency: 8h
Days of history: 7
Geodesics of Dynamic Graphs for Regime Change Detection
Graph Learning
Time Series
Theory
- Introduces a geodesic-based framework for detecting regime changes in dynamic graphs.
- Defines regimes as coherent dynamics characterized by geodesics in graph space.
- Outperforms existing change point detection methods on synthetic and real-world data.
- Aligns detected change points with significant external events during the Covid-19 pandemic.
Summary
This paper addresses the limitations of traditional change point detection methods in dynamic networks, which typically assume abrupt transitions between stationary states. The authors propose a novel framework that defines regimes as periods of coherent dynamics in temporal graphs, characterized as trajectories along geodesics in a defined graph space. This approach allows for the detection of regime changes as significant drifts in dynamics, either toward new trajectories or with changes in pace. The authors utilize graph regression techniques to measure the cumulative distance of observed graph sequences from estimated geodesics, integrating this with change point detection algorithms. Through experiments on dynamic networks with varying trajectories and speeds, the proposed method outperforms existing state-of-the-art change point detection models. Additionally, the framework is applied to mobility data during the Covid-19 pandemic, revealing that the detected change points align more closely with external events compared to baseline methods. This work represents a significant advancement in modeling and detecting changes in evolving regimes within graph space, offering a robust tool for analyzing complex temporal graph data.
Methodology
The authors formulate the change point detection problem as identifying geodesics within observed graph sequences. They introduce a Residual Sum of Squares (RSS) cost function to quantify the alignment of graph subsequences with geodesics connecting their endpoints. The framework includes two implementations based on different graph metrics and proposes strategies for sampling graphs from continuous geodesics.
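The RSS cost described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes graphs are weighted adjacency matrices compared under the Frobenius metric (so the geodesic between two graphs is a straight line) and that observations are evenly spaced in time. The function names `interpolate`, `sq_frobenius`, and `geodesic_rss` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def interpolate(a: Matrix, b: Matrix, t: float) -> Matrix:
    """Point at fraction t along the straight line from graph a to graph b."""
    return [[(1 - t) * x + t * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def sq_frobenius(a: Matrix, b: Matrix) -> float:
    """Squared Frobenius distance between two adjacency matrices."""
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def geodesic_rss(graphs: List[Matrix]) -> float:
    """Residual sum of squares of a graph subsequence around the
    (assumed straight-line) geodesic joining its first and last graph."""
    n = len(graphs)
    if n < 3:
        return 0.0  # two endpoints always lie on their own geodesic
    start, end = graphs[0], graphs[-1]
    return sum(sq_frobenius(g, interpolate(start, end, i / (n - 1)))
               for i, g in enumerate(graphs))


def edge(w: float) -> Matrix:
    """Two-node graph with a single undirected edge of weight w."""
    return [[0.0, w], [w, 0.0]]


straight = [edge(0.0), edge(1.0), edge(2.0)]  # steady drift: one regime
bent = [edge(0.0), edge(2.0), edge(2.0)]      # pace changes mid-sequence
```

A subsequence that moves at constant pace along one geodesic has zero cost, while a change of pace or direction raises it, which is the signal a change point detector can segment on.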
Results
The proposed method consistently outperformed established change point detection techniques in experiments on synthetic data and real-world mobility data. The analysis of mobility data during the Covid-19 pandemic demonstrated that the method identified meaningful shifts corresponding to major lockdown events, showing a better alignment with external circumstances than baseline methods.
Implications
This research provides a powerful tool for analyzing dynamic networks in various applications, such as social network analysis, communication systems, and biological processes. The ability to detect gradual changes in regimes can enhance understanding of complex systems and inform decision-making in real-time scenarios.
WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing
NLP
Large Language Models
Efficient ML
- WhiFlash is the first cross-paradigm speculative decoding method that integrates autoregressive and diffusion-based drafting.
- The method employs a token-level routing mechanism to dynamically select the most effective drafting model during inference.
- Novel cache-management optimizations reduce switching overhead to below 7% of per-round latency.
- WhiFlash reports throughput gains of up to 69.6% over EAGLE-3 and 37.3% over DFlash.
Summary
The paper introduces WhiFlash, a novel method for accelerating inference in large language models (LLMs) by addressing the limitations of existing speculative decoding (SD) approaches. Traditional SD methods rely on static drafting paradigms, which do not adapt well to the fluctuating accuracy of draft models during token generation. WhiFlash unifies autoregressive and diffusion-based drafting under a single token-level controller, allowing for dynamic routing between these paradigms based on real-time performance. This method employs a fine-grained routing mechanism, utilizing either a lightweight entropy-based approach or a learned neural policy to optimize the balance between token gain and latency. Additionally, the authors introduce cache-management optimizations, Lazy Catch-up and KV-only Prefill, which significantly reduce the overhead associated with high-frequency switching. The empirical results demonstrate that WhiFlash achieves higher acceptance lengths and throughput gains of up to 69.6% over the autoregressive model EAGLE-3 and 37.3% over the diffusion-based model DFlash, showcasing its effectiveness in handling complex agentic workloads.
Methodology
WhiFlash utilizes a token-level router that dynamically selects between autoregressive and diffusion-based drafting models based on their performance at each decoding step. The routing decision can be made using either an entropy-based method or a learned neural policy. The authors also implement cache-management strategies to minimize latency during frequent model switching.
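To make the entropy-based variant of the router concrete, here is a rough sketch. The summary does not specify the routing rule, so the direction (low entropy routes to the autoregressive drafter), the threshold value, and the names `entropy` and `route` are all assumptions made for illustration.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route(probs: Sequence[float], threshold: float = 1.0) -> str:
    """Pick a drafter for the next round from the drafter's uncertainty.

    Assumption: a confident (low-entropy) distribution keeps the
    autoregressive drafter; an uncertain one switches to the diffusion drafter.
    """
    return "autoregressive" if entropy(probs) < threshold else "diffusion"


confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
```

A learned neural policy, the other option mentioned above, would replace the fixed threshold with a small model that predicts which drafter yields more accepted tokens per unit of latency.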
Results
WhiFlash demonstrates substantial improvements in acceptance lengths and throughput, achieving gains of up to 69.6% over EAGLE-3 and 37.3% over DFlash across various tasks. The method effectively addresses the inefficiencies of static drafting paradigms by adapting to the dynamic nature of token generation.
Implications
The advancements presented in WhiFlash could lead to more efficient deployment of large language models in real-time applications, particularly in agentic systems that require rapid and accurate decision-making. This method may enhance the performance of LLMs in diverse tasks, including reasoning, coding, and planning.
$\alpha$-PFN: Fast Entropy Search via In-Context Learning
Optimization
Efficient ML
- Introduces $\alpha$-PFN to improve the efficiency of Entropy Search in Bayesian optimization.
- Utilizes a two-stage amortization strategy with Prior-data Fitted Networks for rapid acquisition function evaluation.
- Achieves significant speed improvements (over 50x) compared to traditional Monte Carlo methods.
- Demonstrates competitive performance against state-of-the-art ES methods on various benchmarks.
Summary
The paper introduces $\alpha$-PFN, a novel approach to enhance the efficiency of Entropy Search (ES) in Bayesian optimization (BO) by leveraging Prior-data Fitted Networks (PFNs). Traditional ES methods rely on complex Monte Carlo approximations for estimating information gain, which can be slow and error-prone. The authors propose a two-stage amortization strategy where the first PFN is trained to condition on optimal information, and the $\alpha$-PFN is trained to predict expected information gain based on this. This allows for a rapid, single forward pass evaluation of acquisition functions, significantly speeding up the process. The empirical results demonstrate that $\alpha$-PFN is competitive with existing state-of-the-art ES implementations, achieving over 50x speedup while maintaining optimization quality across synthetic and real-world benchmarks.
Methodology
The methodology involves training two PFNs: the first to condition on optimal information and the second, the $\alpha$-PFN, to predict expected information gain. This approach replaces traditional Monte Carlo sampling with a learned approximation, allowing for efficient acquisition function evaluations in a single forward pass.
Results
The $\alpha$-PFN was empirically validated against existing sampling-based approximations of Predictive Entropy Search (PES), Max-value Entropy Search (MES), and Joint Entropy Search (JES). The results showed that $\alpha$-PFN not only maintained competitive optimization quality but also provided substantial reductions in computational time, achieving speedups greater than 50x across all tested scenarios.
Implications
The findings suggest that $\alpha$-PFN can significantly enhance the efficiency of Bayesian optimization processes, making it more feasible for high-throughput optimization tasks. This could have broad applications in fields requiring optimization of costly black-box functions, such as hyperparameter tuning in machine learning.
Network Recovery from Cascade Data: A Debiased Jacobian-Based Machine Learning Approach
Graph Learning
Theory
Time Series
- CascadeNet does not require a predefined diffusion model, reducing the risk of misspecification.
- The method employs a flexible estimator for the transition function, allowing for a wide range of applications.
- Neyman-orthogonal debiasing ensures unbiased estimates of the network Jacobian, facilitating formal statistical inference.
- CascadeNet outperforms existing methods in both simulated and real-world scenarios, particularly in recovering true network structures.
Summary
This paper addresses the challenge of recovering hidden influence networks from dynamic cascades, such as product adoption and disease spread, without relying on specific diffusion models. The proposed method, CascadeNet, utilizes a Jacobian-based machine learning framework that characterizes the influence structure through the Jacobian of the one-step transition function. By constructing a flexible estimator for the transition function and applying Neyman-orthogonal debiasing, CascadeNet achieves $\sqrt{n}$-consistency and asymptotic normality, enabling formal inference on network structures. The method is validated through simulations, demonstrating superior accuracy in network recovery across various data-generating processes, and in a real-world application to COVID-19 transmission in Spain, where it successfully recovers transmission networks that align closely with actual mobility patterns, outperforming baseline methods.
Methodology
CascadeNet constructs a flexible estimator for the one-step transition function and applies Neyman-orthogonal debiasing using the Riesz representer. This approach allows for the estimation of the Jacobian matrix, which encodes the influence network, without the need for specific parametric assumptions about the diffusion process.
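The idea that the Jacobian of the one-step transition function encodes the influence network can be illustrated with a toy example. This sketch only shows that idea: it uses a known, hand-written transition function and central finite differences, whereas CascadeNet learns the transition function from data and debiases the estimate. The matrix `W`, the function `toy_transition`, and the helper `numerical_jacobian` are hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]


def numerical_jacobian(f: Callable[[Vector], Vector], x: Vector,
                       eps: float = 1e-5) -> List[Vector]:
    """Central-difference Jacobian: entry [i][j] approximates d f_i / d x_j."""
    n, m = len(x), len(f(x))
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        up, down = list(x), list(x)
        up[j] += eps
        down[j] -= eps
        f_up, f_down = f(up), f(down)
        for i in range(m):
            jac[i][j] = (f_up[i] - f_down[i]) / (2 * eps)
    return jac


# Hypothetical ground-truth influence weights: W[i][j] is node j's pull on node i.
W = [[0.0, 0.8, 0.0],
     [0.0, 0.0, 0.5],
     [0.3, 0.0, 0.0]]


def toy_transition(x: Vector) -> Vector:
    """Toy one-step dynamics: each node responds to a weighted sum of the others."""
    return [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in W]


# At the zero state, tanh is locally linear, so the Jacobian equals W.
J = numerical_jacobian(toy_transition, [0.0, 0.0, 0.0])
```

Nonzero entries of `J` mark which nodes influence which, without the sketch ever assuming a specific diffusion model beyond the transition function itself.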
Results
In simulations, CascadeNet achieved the highest accuracy in network recovery across nine common data-generating processes. In the empirical application to COVID-19 transmission in Spain, the recovered networks showed significant correlation with the true inter-province mobility network, while baseline methods failed to demonstrate such alignment.
Implications
The findings suggest that CascadeNet can be a powerful tool for decision-makers in various fields, including marketing, public health, and finance, by providing accurate insights into the underlying influence networks that drive dynamic cascades. This can enhance the effectiveness of interventions and strategies based on network structures.
QDSP: An Interpretable Structured Learning Framework for Predicting Death or Cerebral Palsy in Very Low Birth Weight Infants
Interpretability
- QDSP integrates QSS and DSP for robust and interpretable predictions in clinical settings.
- The framework achieved high accuracy and AUC in predicting outcomes for VLBWI.
- QDSP outperformed traditional machine learning and deep learning methods in the evaluation.
- The model identified clinically relevant predictors, enhancing interpretability.
Summary
The paper presents QDSP, an interpretable structured learning framework designed to predict mortality and cerebral palsy in very low birth weight infants (VLBWI). The framework addresses the challenges of high-dimensional and data-limited clinical settings by integrating two key components: Quota-guided Sub-space Sampling (QSS) and Differentiable-decision-guided Structure Perception (DSP). QSS focuses on constructing stability-aware and low-redundancy feature subspaces through bootstrap-based feature consistency estimation, while DSP utilizes differentiable soft oblique decision structures to model complex clinical interactions while maintaining interpretability. The framework was evaluated on a cohort of 51 VLBWI and validated against three public medical datasets. QDSP achieved an accuracy of 0.9200 and an AUC of 0.9714, outperforming several machine learning and deep learning baselines. Additionally, the framework demonstrated robust performance across varying sample sizes and clinical distributions, with SHAP-based analyses identifying clinically relevant predictors consistent with existing neonatal pathophysiological knowledge. The results indicate that QDSP is a promising tool for discharge-time risk stratification in VLBWI, potentially aiding in individualized clinical decision-making in neonatal intensive care.
Methodology
The QDSP framework combines two modules: Quota-guided Sub-space Sampling (QSS) for creating stable and low-redundancy feature subspaces, and Differentiable-decision-guided Structure Perception (DSP) for modeling nonlinear interactions with traceable decision paths. The framework was evaluated using a real-world cohort of VLBWI and validated on public datasets, employing metrics such as accuracy and AUC for performance assessment.
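For readers unfamiliar with soft oblique decision structures, the following sketch shows a single generic node of that kind. It is not the DSP module: the summary gives no architectural details, so the sigmoid gate, the single split, and the name `soft_oblique_node` are illustrative assumptions.

```python
import math
from typing import Sequence


def soft_oblique_node(x: Sequence[float], weights: Sequence[float], bias: float,
                      left_value: float, right_value: float) -> float:
    """One soft oblique split.

    "Oblique": the split uses a linear combination of features, not one feature.
    "Soft": the sample goes left with probability sigmoid(w.x + b), so the
    output is differentiable in the weights and can be trained by gradient descent.
    """
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    p_left = 1.0 / (1.0 + math.exp(-z))
    return p_left * left_value + (1.0 - p_left) * right_value


# A sample on the decision boundary (w.x + b = 0) is split evenly.
on_boundary = soft_oblique_node([1.0, 2.0], [2.0, -1.0], 0.0, 0.9, 0.1)
# A sample far on the left side receives almost exactly the left leaf value.
far_left = soft_oblique_node([10.0, 0.0], [2.0, -1.0], 0.0, 0.9, 0.1)
```

Because each node is a weighted sum followed by a gate, the path a sample takes can be read off node by node, which is what makes such structures traceable.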
Results
QDSP achieved an accuracy of 0.9200 and an AUC of 0.9714 on the primary VLBWI cohort, outperforming models like XGBoost, TabNet, and TabPFN. The framework maintained competitive performance across external datasets, demonstrating robustness under varying conditions. SHAP analyses revealed significant predictors such as cystic periventricular leukomalacia and birth weight.
Implications
The QDSP framework offers a robust and interpretable approach for risk stratification in VLBWI, potentially improving clinical decision-making and outcomes in neonatal intensive care. Its ability to identify clinically relevant predictors may enhance personalized care strategies for vulnerable infants.
Covariance Shrinkage via Stochastic Interpolation
Theory
Optimization
- Covariance shrinkage is reformulated as empirical risk minimization over a stochastic interpolant.
- Three mechanisms for risk reduction are identified: scheduling, coupling, and early stopping.
- The method allows for non-linear flow maps that escape the limitations of classical shrinkage methods.
- The neural estimator outperforms traditional shrinkage methods in terms of out-of-sample performance.
Summary
This paper presents a novel approach to covariance shrinkage in high-dimensional settings by framing it as empirical risk minimization over a parametric stochastic interpolant between a source and a target distribution. The authors identify three mechanisms for reducing statistical risk: (1) Scheduling, which determines the class of admissible covariances; (2) Flow maps and couplings, where specific coupling structures can lower empirical risk and enable eigenvector regularization; and (3) Early stopping, which allows for a bias-variance trade-off through approximation of the true interpolant distribution. The proposed neural estimator of the interpolant is validated through synthetic experiments and applied to real neuroimaging data, demonstrating its regularization capabilities compared to traditional methods.
Methodology
The authors recast covariance shrinkage as empirical risk minimization using a parametric stochastic interpolant. They explore different interpolation types, coupling structures, and training strategies, including early stopping, to optimize the covariance estimation process. A neural network is employed to estimate the interpolant, allowing for flexibility in the covariance structure.
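One way to see why the schedule determines the class of admissible covariances: for an interpolant $x_t = a(t)\,x_0 + b(t)\,x_1$ with independent, zero-mean endpoints, the covariance is $a(t)^2 \Sigma_0 + b(t)^2 \Sigma_1$. The sketch below assumes that independent coupling and a square-root schedule, under which the result reduces to classical linear shrinkage $(1-t)\Sigma_0 + t\Sigma_1$. It illustrates the scheduling mechanism only, not the paper's neural estimator; treating the source as the sample covariance and the target as the shrinkage target is also an assumption.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def sqrt_schedule(t: float) -> Tuple[float, float]:
    """Assumed schedule: a(t) = sqrt(1 - t), b(t) = sqrt(t)."""
    return math.sqrt(1.0 - t), math.sqrt(t)


def interpolant_covariance(source_cov: Matrix, target_cov: Matrix,
                           t: float) -> Matrix:
    """Covariance of a(t) x0 + b(t) x1 for independent zero-mean x0, x1."""
    a, b = sqrt_schedule(t)
    return [[a * a * s + b * b * g for s, g in zip(row_s, row_g)]
            for row_s, row_g in zip(source_cov, target_cov)]


sample_cov = [[2.0, 1.0], [1.0, 2.0]]   # noisy sample covariance (toy values)
target_cov = [[2.0, 0.0], [0.0, 2.0]]   # scaled-identity shrinkage target
shrunk = interpolant_covariance(sample_cov, target_cov, 0.25)
```

Here `t` plays the role of the shrinkage intensity: off-diagonal entries are pulled toward zero by a factor of `1 - t` while the diagonal is preserved. Other schedules or non-independent couplings would yield covariances outside this linear family.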
Results
The proposed estimator was validated on synthetic Gaussian benchmarks and real neuroimaging data, showing improved performance over traditional methods like Ledoit-Wolf and Wasserstein-2 shrinkage in terms of out-of-sample negative log-likelihood. The results indicate that the new approach effectively reduces statistical risk and enhances covariance estimation.
Implications
This work has significant implications for fields requiring accurate covariance estimation, such as finance, neuroimaging, and machine learning. The ability to reduce statistical risk through innovative methods can lead to better model performance and insights in high-dimensional data analysis.
ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research
Large Language Models
Optimization
Theory
- ResearchClawBench provides a structured evaluation framework for autonomous scientific research across diverse domains.
- The benchmark includes 40 tasks based on real scientific papers, enabling a realistic assessment of AI capabilities.
- Current autonomous research agents and LLMs show limited effectiveness in achieving target-paper-level re-discovery.
- Expert-curated rubrics allow for detailed evaluation of scientific outputs, addressing the complexity of scientific research.
Summary
The paper introduces ResearchClawBench (RCBench), a comprehensive benchmark designed to evaluate the end-to-end autonomous scientific research capabilities of AI systems. It encompasses 40 tasks across 10 scientific domains, including Chemistry, Physics, and Neuroscience, each grounded in real published research. The benchmark aims to fill the gap in existing evaluations by providing a structured way to assess AI's ability to conduct research autonomously, starting from raw data to producing complete outputs. The evaluation framework includes expert-curated multimodal rubrics that break down expected outputs into weighted criteria, allowing for a nuanced assessment of AI performance. The authors evaluated seven autonomous research agents and seventeen native large language models (LLMs) using this benchmark. Results indicate that current systems, including the top-performing agent Claude Code and LLM Claude-Opus-4.7, fall short of achieving reliable target-paper-level re-discovery, with average scores of 21.5 and 20.7, respectively. The study highlights significant challenges in experimental protocol adherence and evidence alignment, underscoring the need for further advancements in autonomous scientific research capabilities.
Methodology
The authors developed ResearchClawBench by selecting real published papers and converting them into executable tasks. They created expert-curated rubrics to evaluate outputs against hidden target papers, allowing for a comprehensive assessment of AI performance. The evaluation involved systematic testing of seven autonomous research agents and seventeen LLMs under a unified protocol.
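A rubric that breaks an expected output into weighted criteria is typically aggregated as a weighted average. The sketch below assumes that aggregation and a 0 to 100 score per criterion; the benchmark's actual scale and aggregation rule are not given in this summary, and `rubric_score` is a hypothetical name.

```python
from typing import List, Tuple


def rubric_score(items: List[Tuple[float, float]]) -> float:
    """Weighted average of per-criterion scores.

    items: (weight, score) pairs, with score assumed to lie in [0, 100].
    """
    total_weight = sum(weight for weight, _ in items)
    if total_weight <= 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    return sum(weight * score for weight, score in items) / total_weight


# Toy task: a heavily weighted core finding and a lightly weighted figure check.
task_items = [(3.0, 50.0), (1.0, 10.0)]
task_score = rubric_score(task_items)
```

Weighting lets a task reward reproducing the central result more than peripheral details, so two outputs with the same number of satisfied criteria can still score differently.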
Results
The strongest autonomous agent, Claude Code, achieved an average score of 21.5, while the best LLM, Claude-Opus-4.7, averaged 20.7. The overall LLM frontier mean was only 26.5, indicating that both autonomous agents and LLMs struggle with reliable re-discovery of scientific knowledge.
Implications
ResearchClawBench serves as a foundational tool for measuring progress in autonomous scientific research, guiding future developments in AI systems designed for scientific inquiry. It highlights the need for improved methodologies in AI to enhance their effectiveness in real-world research scenarios.
Theoretical Foundations of Continual Learning via Drift-Plus-Penalty
Theory
Optimization
- Introduces a control-theoretic approach to continual learning, framing it as a dynamic process.
- Proposes the COLD framework, which regulates forgetting through a virtual queue and the DPP principle.
- Establishes stability and convergence guarantees for the proposed methods.
- Demonstrates superior performance compared to existing CL methods on benchmark datasets.
Summary
This paper addresses the challenges of continual learning (CL) in nonstationary data streams, where learning systems must adapt continuously without catastrophic forgetting. The authors introduce a control-theoretic perspective on CL, framing the adaptation to new tasks as a controlled process that maintains long-term stability. They propose a novel framework called COntinual Learning with Drift-Plus-Penalty (COLD), which utilizes the Drift-Plus-Penalty (DPP) principle from stochastic optimization. COLD employs a finite memory buffer to preserve representative samples from prior tasks, allowing for explicit regulation of forgetting. The framework minimizes the current task loss while tracking deviations from stability, thus capturing the stability-plasticity trade-off as a regulated dynamical process. The authors establish stability and convergence guarantees for the proposed methods and demonstrate through empirical results that COLD consistently outperforms various state-of-the-art CL baselines, achieving superior accuracy and tunable forgetting behavior.
Methodology
The authors utilize tools from stochastic control, particularly Lyapunov analysis, to model and regulate the dynamics of forgetting in continual learning. They introduce the COLD framework, which minimizes a weighted combination of immediate learning objectives and the Lyapunov drift of a virtual queue that tracks cumulative constraint violations over time.
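The virtual-queue mechanism can be sketched using the textbook drift-plus-penalty form. The summary does not give COLD's exact constraint, so this sketch assumes the constraint is "loss on the memory buffer stays under a budget"; the names `update_queue`, `dpp_objective`, `run_queue`, and the parameters `v` and `budget` are hypothetical.

```python
from typing import List


def update_queue(queue: float, memory_loss: float, budget: float) -> float:
    """Virtual queue: accumulates constraint violations, never goes below zero."""
    return max(queue + memory_loss - budget, 0.0)


def dpp_objective(task_loss: float, memory_loss: float, queue: float,
                  v: float, budget: float) -> float:
    """Per-step quantity to minimize: V * penalty + queue * violation.

    A large backlog (queue) makes forgetting expensive; a large V favors
    the current task (plasticity) over the constraint (stability).
    """
    return v * task_loss + queue * (memory_loss - budget)


def run_queue(memory_losses: List[float], budget: float) -> List[float]:
    """Trace the queue over a sequence of observed memory-buffer losses."""
    queue, history = 0.0, []
    for loss in memory_losses:
        queue = update_queue(queue, loss, budget)
        history.append(queue)
    return history


# The queue grows while forgetting exceeds the budget and drains once it recovers.
history = run_queue([0.5, 0.6, 0.2, 0.1], budget=0.3)
```

The single parameter `v` is what makes the forgetting behavior tunable: it sets how much queue backlog is tolerated before the constraint dominates the update.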
Results
Empirical evaluations on standard benchmark datasets show that COLD consistently achieves higher accuracy than a wide range of state-of-the-art continual learning methods, while also exhibiting competitive forgetting behavior that reflects the explicit regulation of the stability-plasticity trade-off.
Implications
The proposed framework has significant implications for developing robust continual learning systems that can effectively adapt to new tasks while preserving previously learned knowledge, making it applicable in various real-world scenarios such as robotics, autonomous systems, and adaptive AI.
CoMetaPNS: Continually Meta-learning Personalized Neural Surrogates for Cardiac Electrophysiology Simulations
Generative Models
Optimization
Efficient ML
- Introduces a continual meta-learning framework for personalized cardiac simulations.
- Addresses the challenge of catastrophic forgetting in traditional meta-learning models.
- Utilizes a Bayesian Gaussian Mixture Model for effective data integration and identification.
- Demonstrates superior performance in simulation accuracy and computational scalability.
Summary
The paper introduces CoMetaPNS, a novel framework for creating personalized neural surrogates for cardiac electrophysiology simulations that addresses the challenges of model personalization and computational cost. Traditional methods require extensive retraining when new data is introduced, leading to issues like catastrophic forgetting. CoMetaPNS employs a continual meta-learning approach that allows the model to adapt to new subjects without the need for retraining on all previous data. By utilizing a continual Bayesian Gaussian Mixture Model over a memory buffer, the framework can identify and integrate new data while maintaining knowledge of past data. This method enhances the efficiency of personalized heart models, making them more scalable for clinical applications. Empirical evaluations on synthetic cardiac data demonstrate that CoMetaPNS outperforms existing baselines in simulation forecasting accuracy, computational efficiency, and resilience to forgetting.
Methodology
The methodology involves a continual meta-learning framework that integrates a Bayesian Gaussian Mixture Model with a memory buffer to manage incoming data. This allows the model to adapt to new subjects and identify data sources while avoiding the need for full retraining. The approach leverages few-shot generative modeling and amortized inference to personalize neural surrogates based on limited subject-specific data.
Results
The empirical results indicate that CoMetaPNS significantly improves simulation forecasting accuracy and computational efficiency compared to existing methods. The framework effectively integrates new data while preserving knowledge from previous data, demonstrating resilience to catastrophic forgetting.
Implications
The findings suggest that CoMetaPNS can enhance the scalability and applicability of personalized cardiac simulations in clinical environments, potentially improving treatment planning and risk stratification for patients with cardiac conditions.
When Should an AI Scientist Stop? Verifiable Experiment Steering and Refusal for Autonomous Discovery
Theory
Optimization
Interpretability
- CARTOGRAPH introduces a verification layer for AI scientists that integrates experiment selection, ambiguity resolution, and model refusal.
- The framework outperforms traditional methods in various experimental settings, demonstrating significant advantages in decision-making.
- CARTOGRAPH can retract incorrect model identifications based on new evidence, enhancing the reliability of autonomous discovery.
- The study emphasizes the need for AI systems to have mechanisms for stopping claims when the underlying model library is inadequate.
Summary
This paper introduces CARTOGRAPH, a novel verification layer designed for AI scientists engaged in autonomous scientific discovery. The framework addresses three critical decisions: selecting the most informative experiments, resolving scientific ambiguities, and refusing to identify models when the underlying library is inadequate. The authors highlight the limitations of existing Bayesian experimental design methods, which do not adequately handle the refusal aspect. CARTOGRAPH integrates unresolved-subspace steering for selection and resolution with a residual-based refusal mechanism. Empirical evaluations across various testbeds demonstrate that CARTOGRAPH-A significantly outperforms traditional projection methods, achieving a record of 129 wins against 15 losses in a structured cascade experiment. Notably, the framework successfully identifies and later retracts three out-of-library pharmacokinetic mechanisms as evidence reveals structural misfit, while maintaining identification of an in-library control. The findings underscore the importance of having a verifiable 'stop and escalate' signal in AI-driven scientific discovery processes.
Methodology
The authors developed CARTOGRAPH, which combines unresolved-subspace steering for selecting and resolving scientific ambiguities with a residual-based mechanism for refusing inadequate models. The framework was evaluated across multiple testbeds, including pharmacokinetic models and real-world data, using statistical comparisons to assess performance.
Results
CARTOGRAPH-A demonstrated superior performance in a structured cascade experiment, achieving 129 wins, 0 ties, and 15 losses ($p < 10^{-21}$). The framework effectively identified three out-of-library pharmacokinetic mechanisms but later retracted these identifications as evidence indicated structural misfit. In a retrospective audit of claims from an autonomous materials system, CARTOGRAPH flagged all inconclusive claims while confirming the majority of valid ones.
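The summary does not say which test produced the reported p-value. As a plausibility check, a two-sided exact sign test on the win/loss record (a common choice for paired comparisons, and an assumption here) can be computed directly; `sign_test_p_value` is a hypothetical helper.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test; ties are assumed to be dropped beforehand.

    Under the null hypothesis each non-tied comparison is a fair coin flip,
    so the p-value is twice the binomial tail at the smaller count.
    """
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1))
    return min(1.0, 2 * tail / 2 ** n)


p = sign_test_p_value(wins=129, losses=15)
```

Under this assumed test, the 129 to 15 record gives a p-value below the reported bound, so the stated significance is consistent with the win/loss counts alone.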
Implications
The findings suggest that integrating a verification layer like CARTOGRAPH can enhance the reliability of AI systems in scientific discovery, particularly in high-stakes fields such as pharmacokinetics and materials synthesis. This framework could lead to more robust decision-making processes in autonomous research environments.
RETROSPECT: RETROsynthesis via Sequential Prediction, and Chemically Transformed-ranking
Generative Models
Graph Learning
Optimization
- Introduction of RETROSPECT, a modular framework for retrosynthesis that separates proposal generation from candidate selection.
- Development of the ChemAlign Transformer, which employs advanced training techniques for improved prediction accuracy.
- Implementation of a LambdaMART reranker that enhances candidate selection based on various chemical descriptors.
- Demonstration of high accuracy rates on the USPTO-50K dataset, supporting the effectiveness of the proposed methods.
Summary
The paper presents RETROSPECT, a novel framework for single-step retrosynthesis that separates the candidate generation and selection processes. The authors introduce the ChemAlign Transformer, a proposal model that utilizes hybrid SMILES augmentation and various advanced training techniques to enhance its predictive capabilities. Following the generation of precursor candidates, a LambdaMART reranker is employed to refine the candidate list based on structural and reaction-template features, among others. The study demonstrates that the proposal model achieves a top-1 exact-match accuracy of 55.00% and a top-10 accuracy of 86.18% on the USPTO-50K test set, with a high validity rate. The reranking process further improves the top-1 accuracy to 59.4% on a merged candidate pool benchmark. The findings indicate that the proposal model can function effectively as a standalone system while also serving as a valuable component in ensemble approaches, highlighting the importance of separating proposal generation from candidate ranking in retrosynthesis tasks.
Methodology
The methodology involves a two-stage process: first, generating precursor candidates using the ChemAlign Transformer, which incorporates hybrid SMILES augmentation and advanced training techniques; second, reranking the generated candidates with a LambdaMART model based on structural and reaction-template features, as well as optional DFT-derived descriptors.
Results
The ChemAlign Transformer achieved a top-1 exact-match accuracy of 55.00% and a top-10 accuracy of 86.18% on the USPTO-50K test set, with a validity rate of 99.86%. The LambdaMART reranker improved the top-1 accuracy to 59.4% on a merged candidate pool benchmark, indicating the effectiveness of the reranking process.
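Top-k exact-match accuracy, the metric quoted above, counts a prediction as correct when the reference precursor set appears among the first k ranked candidates. The sketch below assumes candidates and references are already canonicalized strings (canonicalizing SMILES would require a cheminformatics toolkit and is omitted); `top_k_accuracy` and the toy data are hypothetical.

```python
from typing import List


def top_k_accuracy(ranked_candidates: List[List[str]], references: List[str],
                   k: int) -> float:
    """Fraction of examples whose reference appears in the top-k candidates."""
    if not references:
        raise ValueError("need at least one example")
    hits = sum(1 for candidates, ref in zip(ranked_candidates, references)
               if ref in candidates[:k])
    return hits / len(references)


# Toy ranked lists for three products; strings stand in for canonical SMILES.
ranked = [["A.B", "C.D", "E"],
          ["F", "G.H", "I"],
          ["J", "K", "L.M"]]
truth = ["A.B", "G.H", "Z"]

top1 = top_k_accuracy(ranked, truth, k=1)
top3 = top_k_accuracy(ranked, truth, k=3)
```

A reranker cannot raise top-k accuracy beyond the fraction of examples whose reference is somewhere in the candidate pool; it only moves correct candidates earlier in the list, which is why reranking gains show up at top-1.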
Implications
The findings suggest that RETROSPECT can enhance the efficiency and accuracy of retrosynthesis planning in computational chemistry, providing a robust framework for future research and applications in chemical synthesis and drug discovery.
From Shortcuts to Reasoning: Robust Post-Training of Theory of Mind with Reinforcement Learning
Reinforcement Learning
NLP
Large Language Models
- Identification of pervasive shortcut issues in popular ToM datasets that mislead model evaluations.
- Development of a framework for auditing ToM datasets to quantify shortcut prevalence.
- Introduction of Thinking-RFT, which significantly improves ToM reasoning capabilities over traditional methods.
- Demonstration of robust generalization and performance gains in complex reasoning tasks.
Summary
This paper addresses the critical need for Theory of Mind (ToM) capabilities in AI systems, particularly in the context of post-training methods. The authors identify a significant issue with existing ToM datasets, where models can achieve high accuracy by exploiting spurious correlations, leading to misleading assessments of their reasoning abilities. To tackle this, they develop a framework to audit ToM datasets for such shortcuts and propose a novel post-training approach called Thinking-RFT (Reinforcement Fine-Tuning with verifiable rewards and explicit reasoning chains). The study employs four shortcut-free datasets across three ToM contexts to evaluate the effectiveness of Thinking-RFT compared to traditional Supervised Fine-Tuning (SFT). The findings reveal that Thinking-RFT consistently enhances ToM performance, particularly in complex reasoning scenarios, and demonstrates improved generalization to unseen domains. The authors conclude that integrating reasoning with reinforcement learning is crucial for advancing ToM capabilities in AI, providing valuable insights for future dataset development and model training strategies.
Methodology
The authors conducted a systematic examination of ToM datasets to identify shortcuts, followed by an empirical study comparing three post-training strategies: Supervised Fine-Tuning (SFT), Thinking-RFT, and No-Thinking-RFT. They utilized four shortcut-free datasets across various ToM contexts to evaluate the effectiveness of these methods, focusing on reasoning and generalization capabilities.
Results
Thinking-RFT outperformed SFT by an average of 6%, with notable improvements of 10% in complex reasoning tasks and 7% in multimodal scenarios. The model demonstrated better generalization to unseen domains and higher-order queries, and the combination of reasoning and reinforcement learning was found to be particularly beneficial, with Thinking-RFT outperforming No-Thinking-RFT by 7% on average.
Implications
The findings suggest that addressing shortcut issues in ToM datasets is essential for developing robust AI systems capable of genuine reasoning. The proposed Thinking-RFT method could serve as a foundation for future advancements in ToM capabilities, enhancing AI's ability to interact naturally and safely in real-world scenarios.
Heterogeneous Effects of Green Finance on Urban Decarbonization: Evidence from 285 Cities in China
Theory
Interpretability
- Green finance significantly lowers carbon intensity in urban areas.
- The effects of green finance are most pronounced in less developed cities.
- Energy structure optimization is the primary mechanism through which green finance operates.
- Different financial instruments have varying impacts on decarbonization.
Read more
Heterogeneous Effects of Green Finance on Urban Decarbonization: Evidence from 285 Cities in China
Summary
This study investigates the impact of green finance on urban decarbonization across 285 cities in China, focusing on the mechanisms through which it affects carbon intensity. Utilizing econometric models and machine learning techniques, the research finds that green finance significantly reduces carbon intensity, with green bonds and green investments showing the strongest effects. The study highlights that the impact of green finance varies by city development level, being most effective in Fourth- and Fifth-tier cities. Mediation analysis indicates that green finance primarily influences carbon reduction through energy structure optimization, industrial upgrading, foreign direct investment, and technological innovation. SHAP analysis reveals that different financial instruments contribute differently to decarbonization, with green bonds, funds, and credit being the most effective. The findings suggest that cities with lower technological capacity, higher industrial dependency, and coal-based energy mixes experience stronger marginal impacts from green finance. This research provides valuable insights for developing a differentiated green finance system to support inclusive low-carbon transitions.
Methodology
The study employs a combination of fixed effects models, spatial econometric approaches, and causal forest algorithms to analyze panel data from 285 prefecture-level cities in China from 2000 to 2022.
Results
The analysis reveals that green finance has a significant decarbonization effect, particularly through green bonds and investments. The impact varies by city development level, with stronger effects observed in Fourth- and Fifth-tier cities. Mediation analysis identifies energy structure optimization as the main pathway for carbon reduction, followed by industrial upgrading and technological innovation.
Implications
The findings suggest that policymakers should consider the heterogeneous effects of green finance when designing strategies for urban decarbonization. A multi-level, regionally differentiated approach to green finance could enhance the effectiveness of low-carbon transitions across diverse urban contexts.
Product units in gated recurrent units improve nuclear-mass prediction
Time Series
- Introduction of MI-PU-GRU and AM-PU-GRU architectures for nuclear mass prediction.
- Utilization of complex-valued computations to capture amplitude and phase dynamics.
- Significant reduction in prediction errors compared to traditional GRU and other models.
- Establishment of a new benchmark for sequence-based nuclear mass prediction.
Read more
Product units in gated recurrent units improve nuclear-mass prediction
Summary
This paper presents a novel machine learning approach for predicting nuclear masses using gated recurrent units (GRUs) enhanced with product units. The authors introduce two new architectures: the multiplicative-interaction product-unit GRU (MI-PU-GRU) and the additive-multiplicative product-unit GRU (AM-PU-GRU). These architectures leverage multiplicative interactions and complex-valued computations to improve the model's ability to capture long-term dependencies and higher-order nonlinear relationships in nuclear mass data. The models are evaluated on interpolation and extrapolation tasks using the atomic mass evaluation datasets (AME2016 and AME2020). The AM-PU-GRU model achieves the lowest root mean square error (RMSE) for both interpolation (0.227 ± 0.004 MeV) and extrapolation (0.179 ± 0.015 MeV), outperforming traditional GRU baselines and other state-of-the-art machine learning models. The findings suggest that complex-valued product-unit recurrent networks set a new benchmark for sequence-based nuclear mass prediction, offering significant improvements over existing methods.
Methodology
The authors developed two novel GRU architectures that incorporate product units and multiplicative interactions to enhance the model's capacity for nonlinear relationships. The models were trained and evaluated on nuclear mass data, specifically focusing on interpolation and extrapolation tasks using AME2016 and AME2020 datasets.
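As a rough illustration of the building block named above, the sketch below computes a single product unit, that is, a product of inputs raised to learnable exponents, evaluated through complex logarithms so that negative inputs yield an amplitude and a phase. The function name `product_unit` is hypothetical, and the summary does not give the MI-PU-GRU or AM-PU-GRU gate equations, so this is only the generic unit and not the authors' architecture.

```python
import cmath

def product_unit(inputs, weights):
    """Return prod_i inputs[i] ** weights[i] as a complex number.

    Computed as exp(sum_i w_i * log(x_i)); cmath.log gives negative
    inputs a phase of pi, which is where the complex values come from.
    """
    log_sum = 0j
    for x, w in zip(inputs, weights):
        if x == 0:
            raise ValueError("a zero input has no logarithm")
        log_sum += w * cmath.log(x)
    return cmath.exp(log_sum)

# Positive inputs behave like an ordinary product of powers: 2**2 * 3**1.
positive = product_unit([2.0, 3.0], [2.0, 1.0])
# A negative input with a fractional exponent produces a nonzero phase.
negative = product_unit([-2.0], [0.5])
amplitude, phase = abs(negative), cmath.phase(negative)
```

Here `amplitude` is the square root of 2 and `phase` is pi/2, which shows how a real-valued input can carry phase information once the exponent is fractional.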
Results
The AM-PU-GRU model achieved an interpolation RMSE of 0.227 ± 0.004 MeV and an extrapolation RMSE of 0.179 ± 0.015 MeV, outperforming both traditional GRU baselines and other state-of-the-art machine learning models.
Implications
The proposed models can significantly advance the field of nuclear physics by providing more accurate predictions of nuclear masses, which are crucial for understanding nuclear structure, nucleosynthesis pathways, and applications in nuclear energy and astrophysics.
SRT: Super-Resolution for Time Series via Disentangled Rectified Flow
Time Series
Generative Models
- Introduces SRT, a framework for time series super-resolution that reconstructs high-resolution data from low-resolution inputs.
- Utilizes a disentangled rectified flow approach to decompose time series into trend and seasonal components.
- Implements a novel cross-resolution attention mechanism to enhance detail generation.
- SRT-large variant shows strong zero-shot super-resolution capabilities through extensive pre-training.
Read more
SRT: Super-Resolution for Time Series via Disentangled Rectified Flow
Summary
The paper presents a novel framework called Super-Resolution for Time series (SRT) aimed at reconstructing high-resolution time series data from low-resolution inputs. The authors highlight the critical need for fine-grained time series data across various applications, such as healthcare and industrial IoT, where high-resolution signals are essential for accurate analytics. The SRT framework addresses the challenges of directly applying image super-resolution techniques to time series data by introducing a disentangled rectified flow approach. This involves decomposing the low-resolution input into trend and seasonal components, aligning them to the target resolution using an implicit neural representation, and employing a cross-resolution attention mechanism to generate high-resolution details. Additionally, the paper introduces SRT-large, a scaled-up version with extensive pre-training, which enhances zero-shot super-resolution capabilities. Experimental results on nine public datasets demonstrate that both SRT and SRT-large outperform existing methods across multiple scale factors, showcasing robust performance and the effectiveness of the proposed architecture components.
Methodology
The SRT framework decomposes low-resolution time series into trend and seasonal components, aligns them to the target resolution using an implicit neural representation, and generates high-resolution details through two separate rectified flow models. A cross-resolution attention mechanism is employed to effectively fuse information and guide the generation process.
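The decomposition step can be pictured with a minimal sketch. It assumes a centred moving average as the trend extractor and treats the remainder as the seasonal part; the summary does not say which decomposition SRT uses, and the function name `decompose` and the `window` parameter are hypothetical.

```python
def decompose(series, window):
    """Split a series into a moving-average trend and the remaining seasonal part."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(series)
    trend = []
    for i in range(n):
        # Clamp indices at the edges so the trend keeps the input length.
        idxs = [min(max(j, 0), n - 1) for j in range(i - half, i + half + 1)]
        trend.append(sum(series[j] for j in idxs) / window)
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal

low_res = [1.0, 2.0, 3.0, 4.0, 5.0]
trend, seasonal = decompose(low_res, window=3)
```

By construction the two components add back to the input, so each could in principle be upsampled and refined by its own generative model, as the methodology describes.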
Results
Extensive experiments on nine public datasets indicate that SRT and SRT-large consistently outperform existing super-resolution methods, achieving robust performance across various scale factors. The results validate the effectiveness of the disentangled rectified flow approach and the cross-resolution attention mechanism.
Implications
The proposed SRT framework has significant implications for applications requiring high-resolution time series data, such as healthcare diagnostics, industrial monitoring, and climate modeling. By enabling the reconstruction of high-resolution signals from low-resolution inputs, it can enhance the accuracy of analytics and decision-making in these domains.
Beyond Homophily: Towards Generalized Graph Reconstruction Attack and Defense
Graph Learning
- Graph reconstruction attacks can expose sensitive information from GNNs, necessitating effective defense mechanisms.
- The study provides a systematic characterization of adjacency recoverability influenced by graph homophily and heterophily.
- The proposed MC-GRA (+) attack method enhances reconstruction fidelity over prior techniques.
- The MC-GPB (+) defense method successfully mitigates reconstruction success with only slight accuracy trade-offs.
Read more
Beyond Homophily: Towards Generalized Graph Reconstruction Attack and Defense
Summary
This paper investigates the vulnerabilities of Graph Neural Networks (GNNs) to graph reconstruction attacks (GRA), which aim to recover sensitive training graph information from a trained model. The authors systematically characterize the conditions under which adjacency information can be reconstructed, highlighting the roles of graph homophily, heterophily, and model inductive bias. They propose a novel perspective on GNN inference through a Markov chain approximation, treating the forward computation as a chain of topology-dependent representations. To address the identified vulnerabilities, the authors develop two methods: MC-GRA (+) for attack, which optimizes a surrogate adjacency to align with the target model's representations, and MC-GPB (+) for defense, which suppresses adjacency-dependent information while maintaining classification accuracy. Experimental results demonstrate that the proposed attack method significantly improves reconstruction fidelity compared to existing approaches, while the defense method effectively reduces reconstruction success with minimal accuracy loss.
Methodology
The authors employ a Markov chain approximation framework to analyze GNN inference, treating the forward computation as a sequence of topology-dependent representations. They develop the MC-GRA (+) attack method to reconstruct the training adjacency by optimizing a surrogate adjacency, and the MC-GPB (+) defense method to suppress adjacency-dependent information while preserving classification accuracy.
Results
The experimental results indicate that the MC-GRA (+) attack method achieves higher reconstruction fidelity than previous methods across various benchmarks. The MC-GPB (+) defense method effectively reduces the success of reconstruction attacks while maintaining classification performance with only minor accuracy loss.
Implications
This research highlights the need for robust defenses in GNN applications, particularly in privacy-sensitive domains. The findings can inform the design of more secure GNN architectures and contribute to the development of privacy-preserving machine learning techniques.
A Held-Out Transition-Pair Falsifier for Long-Horizon Non-Abelian State Tracking
Theory
- Introduces a held-out transition-pair falsifier to evaluate non-Abelian state tracking.
- Demonstrates that a projected recurrent state model can achieve perfect predictions over long horizons.
- Mechanism diagnostics reveal the relationship between projection temperature and model performance.
- Confirms that blocking local transition memorization pathways is crucial for accurate state tracking.
Read more
A Held-Out Transition-Pair Falsifier for Long-Horizon Non-Abelian State Tracking
Summary
This paper addresses the limitations of sequence models in state tracking, particularly in scenarios where the relevant signal is an ordered latent state evolving through non-commutative transformations. The author introduces a held-out transition-pair falsifier designed for finite non-Abelian group tracking, which prevents certain ordered generator pairs from being included in training sequences while requiring them in evaluation sequences. This approach effectively blocks memorization of local transition patterns. The study employs a projected recurrent state model that maintains a non-commutative recurrent hidden state, demonstrating that it can achieve error-free predictions in a controlled S3 × S3 benchmark, even at evaluation horizons significantly longer than the training length. The paper also provides mechanism diagnostics that reveal how the model's performance correlates with various projection temperatures. The findings suggest that explicit projected non-commutative state composition serves as a beneficial inductive bias for long-horizon hidden-state tracking, highlighting the importance of distinguishing between genuine state composition and mere memorization of local patterns.
Methodology
The paper employs a held-out transition-pair falsifier protocol, which excludes specific ordered generator pairs from training sequences while requiring them in evaluation sequences. A projected recurrent state model is utilized to maintain a continuous-valued non-commutative hidden state, with symbolic outputs generated through temperature-controlled projections. Mechanism diagnostics assess model performance across various axes, including final-token accuracy and homomorphism error.
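The data split described above can be sketched as follows. For brevity the sketch uses a single copy of S3 rather than the S3 × S3 benchmark, and the generator names, the held-out pair `("a", "b")`, and the sampling functions are all hypothetical choices made for illustration, not the paper's protocol.

```python
import random

# Three generators of S3, written as permutations of (0, 1, 2).
GENERATORS = {"a": (1, 0, 2), "b": (0, 2, 1), "c": (1, 2, 0)}
NAMES = sorted(GENERATORS)
HELD_OUT = [("a", "b")]  # ordered pairs that never appear adjacently in training

def compose(p, q):
    """Apply permutation p first, then q."""
    return tuple(q[p[i]] for i in range(3))

def final_state(tokens):
    """Group element reached after applying all tokens in order."""
    state = (0, 1, 2)
    for t in tokens:
        state = compose(state, GENERATORS[t])
    return state

def has_held_out_pair(tokens):
    return any(pair in HELD_OUT for pair in zip(tokens, tokens[1:]))

def sample_train(length, rng):
    """Sample token by token, skipping any token that would form a held-out pair."""
    tokens = []
    for _ in range(length):
        allowed = [g for g in NAMES if not tokens or (tokens[-1], g) not in HELD_OUT]
        tokens.append(rng.choice(allowed))
    return tokens

def sample_eval(length, rng):
    """Sample freely, then plant one held-out pair (length must be at least 2)."""
    tokens = [rng.choice(NAMES) for _ in range(length)]
    i = rng.randrange(length - 1)
    tokens[i:i + 2] = rng.choice(HELD_OUT)
    return tokens

rng = random.Random(0)
train_seq = sample_train(50, rng)
eval_seq = sample_eval(50, rng)
```

Because `final_state(["a", "b"])` differs from `final_state(["b", "a"])`, a model that has never seen "a" followed by "b" has to compose states rather than recall a memorised local transition.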
Results
The projected recurrent state model achieved perfect final-state predictions (250/250) across evaluation horizons up to 1,048,576 tokens in the S3 × S3 benchmark. Baseline models, including GRU and structured state-space models, performed poorly under the same conditions, indicating that they relied on memorized local transition patterns. Mechanism diagnostics showed that hard projection correlated with low homomorphism error and state-consistency drift, while softened projection led to decreased accuracy.
Implications
The findings suggest that the proposed falsifier and model can serve as a diagnostic tool for evaluating the capabilities of sequence models in state tracking tasks, particularly in scenarios involving non-commutative operations. This could have implications for developing more robust models in applications requiring long-context reasoning and accurate state management, such as robotics and complex workflow control.
scCBGM: Interpretable Single-Cell Counterfactual Editing
Generative Models
Interpretability
- Introduction of scCBGM for interpretable single-cell counterfactual editing.
- Architectural innovations enhance model performance without dimensional constraints.
- Development of a synthetic benchmark for rigorous evaluation of counterfactuals.
- Demonstrated superior performance in editing accuracy and generalization across datasets.
Read more
scCBGM: Interpretable Single-Cell Counterfactual Editing
Summary
The paper introduces the Single-Cell Concept Bottleneck Generative Models (scCBGM), a novel framework designed for interpretable and precise counterfactual editing of individual cells in the context of single-cell RNA sequencing (scRNA-seq). The authors address the challenge of understanding cellular responses to perturbations, which is crucial for disease biology and therapeutic design. Traditional experimental mapping of cellular responses is infeasible due to the vast combinatorial space of conditions. scCBGM adapts concept bottleneck architectures to single-cell data, incorporating architectural modifications such as decoder skip connections and a cross-covariance penalty to enhance disentanglement without dimensional constraints. The framework is versatile, extending to flow matching models for concept-guided editing in both encoding-decoding and generation regimes. To validate the model, the authors develop a synthetic benchmark with ground-truth counterfactuals and demonstrate that scCBGM outperforms existing methods in combinatorial generalization and counterfactual prediction across multiple real datasets. The results indicate strong cell-level editing accuracy and competitive performance on population-level metrics, showcasing the model's potential for elucidating mechanistic hypotheses in therapeutic contexts.
Methodology
The scCBGM framework employs concept bottleneck architectures tailored for single-cell RNA-seq data. It integrates decoder skip connections and a cross-covariance penalty to promote disentangled embeddings. The model is evaluated using a synthetic data generation process that separates noise from conditions, allowing for systematic testing of counterfactual editing capabilities.
Results
scCBGM exhibits strong cell-level editing accuracy on synthetic benchmarks with ground-truth counterfactuals and competitive performance on population-level metrics across three real-world datasets. It outperforms several state-of-the-art methods in both combinatorial and zero-shot generalization, demonstrating its effectiveness in predicting cellular responses to specified interventions.
Implications
The scCBGM framework has significant implications for advancing our understanding of cellular phenotypes and responses to treatments, facilitating causal discovery and therapeutic design in precision medicine. Its ability to provide interpretable control over biological concepts enhances its utility in research and clinical applications.
ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning
NLP
Large Language Models
Reinforcement Learning
- ConSteer-RL integrates model-internal confidence signals into RLVR training.
- The framework employs a confidence-aware reward shaping mechanism to improve reasoning accuracy.
- Experimental results show significant performance improvements over existing GRPO methods.
- The approach does not require additional human annotations or complex verification systems.
Read more
ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning
Summary
The paper introduces ConSteer-RL, a novel framework designed to enhance the reasoning capabilities of Large Language Models (LLMs) through a confidence-aware reinforcement learning approach. Traditional Reinforcement Learning from Verifiable Rewards (RLVR) methods face challenges due to sparse binary rewards and the neglect of model-internal uncertainty. ConSteer-RL addresses these limitations by integrating token-level confidence signals derived from model log-probabilities into the RLVR training process. The framework utilizes a confidence-aware reward mechanism that aggregates per-token probabilities into a scalar confidence score, which is then incorporated into a reward shaping strategy that penalizes overconfident errors while reinforcing correct and confident reasoning. Experimental evaluations demonstrate that ConSteer-RL consistently outperforms existing GRPO baselines, achieving average performance improvements of 2.3% to 4.0% across various model scales and mathematical reasoning tasks. This approach not only enhances the reasoning performance of LLMs but also allows for a more nuanced understanding of model confidence during the training process.
Methodology
The methodology involves extracting token-level confidence from log probabilities during the rollout generation of LLMs. This confidence is transformed into a composite reward that combines correctness with confidence-aware shaping. The resulting reward is then optimized using the Group Relative Policy Optimization (GRPO) framework, which enhances training stability and efficiency.
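A minimal sketch of this pipeline is given below. The geometric-mean aggregation, the shaping rule, and the weight `alpha` are assumptions chosen for illustration, since the summary does not state the exact formulas; only the group-wise standardisation follows the usual GRPO recipe.

```python
import math

def sequence_confidence(token_logprobs):
    """Geometric-mean token probability of a rollout, a value in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def shaped_reward(correct, confidence, alpha=0.5):
    """Correctness reward with a confidence bonus or penalty.

    Assumed form: confident correct answers earn more, and confident
    wrong answers are penalised more than hesitant wrong answers.
    """
    return 1.0 + alpha * confidence if correct else -alpha * confidence

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardise rewards within one prompt's rollouts."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical rollouts for one prompt: (is_correct, per-token log-probs).
rollouts = [
    (True, [-0.1, -0.2]),
    (True, [-1.5, -2.0]),
    (False, [-0.1, -0.1]),
    (False, [-2.0, -2.5]),
]
rewards = [shaped_reward(ok, sequence_confidence(lp)) for ok, lp in rollouts]
advantages = group_relative_advantages(rewards)
```

With this shaping, the overconfident wrong rollout receives the lowest advantage in the group and the confident correct rollout the highest.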
Results
ConSteer-RL achieved average performance improvements of 2.3% to 4.0% across different model scales on seven mathematical reasoning benchmarks, demonstrating its effectiveness in enhancing reasoning capabilities compared to strong GRPO baselines.
Implications
The findings suggest that incorporating confidence signals into reinforcement learning can lead to more reliable reasoning behaviors in LLMs, potentially improving their performance in complex reasoning tasks such as mathematical problem-solving, code generation, and question answering. This approach may also pave the way for more robust AI systems that can better understand and manage uncertainty in their predictions.
Trajectory Geometry of Transformer Representations Across Layers
NLP
Large Language Models
Interpretability
- Introduces a trajectory-geometric framework for transformer interpretability.
- Identifies significant trajectory convergence for semantically related prompts in deeper layers.
- Demonstrates that reasoning tasks exhibit higher trajectory curvature than lexical tasks.
- Shows measurable trajectory bifurcation for ambiguous tokens, indicating effective disambiguation.
Read more
Trajectory Geometry of Transformer Representations Across Layers
Summary
This paper addresses the mechanistic interpretability of transformer models by analyzing how their representations evolve across layers. The authors propose a framework that views the forward pass of transformers as a discrete trajectory through a high-dimensional representation manifold. They define five metrics to characterize this trajectory geometry: trajectory length, curvature, a semantic convergence index, layerwise cosine similarity, and representational stability. The study is conducted on three transformer architectures (GPT-2, TinyLlama, Qwen2.5) using five semantically controlled prompt families. The findings reveal significant convergence of semantically related prompts in middle-to-late layers, indicating attractor-like dynamics. Additionally, reasoning tasks show greater trajectory curvature compared to lexical tasks, suggesting that curvature reflects computational complexity. The paper also identifies trajectory bifurcation for ambiguous tokens, indicating a clear disambiguation signature, and establishes a universal three-phase computational structure across layers. The authors release an open-source pipeline for trajectory analysis, emphasizing that their approach offers a probe-free lens for understanding transformer dynamics.
Methodology
The authors recast the transformer forward pass as a discrete trajectory through a high-dimensional representation manifold. They compute five metrics (trajectory length, curvature, semantic convergence index, layerwise cosine similarity, representational stability) to analyze the geometry of these trajectories across three transformer architectures and various prompt families.
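Three of the five metrics can be sketched directly from their names. The definitions below (summed step norms for length, mean turning angle between consecutive steps for curvature, and cosine similarity between adjacent layers) are plausible assumptions rather than the paper's exact formulas, and `states` stands for one token's hidden vector at each layer.

```python
import math

def _sub(a, b):
    return [x - y for x, y in zip(a, b)]

def _norm(a):
    return math.sqrt(sum(x * x for x in a))

def _cos(a, b):
    # Assumes neither vector is zero.
    return sum(x * y for x, y in zip(a, b)) / (_norm(a) * _norm(b))

def trajectory_length(states):
    """Sum of Euclidean distances between consecutive layer states."""
    return sum(_norm(_sub(b, a)) for a, b in zip(states, states[1:]))

def mean_curvature(states):
    """Mean turning angle in radians between consecutive displacement vectors."""
    steps = [_sub(b, a) for a, b in zip(states, states[1:])]
    angles = [math.acos(max(-1.0, min(1.0, _cos(u, v))))
              for u, v in zip(steps, steps[1:])]
    return sum(angles) / len(angles)

def layerwise_cosine(states):
    """Cosine similarity between each layer's state and the next layer's."""
    return [_cos(a, b) for a, b in zip(states, states[1:])]

# A toy 2-D trajectory with one right-angle turn.
states = [[1.0, 0.0], [2.0, 0.0], [2.0, 1.0]]
length = trajectory_length(states)
curvature = mean_curvature(states)
```

For this toy trajectory the length is 2 and the curvature is pi/2; a trajectory that never changes direction has curvature 0.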
Results
The study finds that semantically related prompts converge significantly in the middle-to-late layers, with convergence indices ranging from 0.41 to 0.58. Reasoning tasks show greater curvature (0.71–0.83 rad) compared to lexical tasks (0.27–0.31 rad). Ambiguous tokens demonstrate trajectory bifurcation with up to 5.6× representational separation in the final layer. The analysis reveals a consistent three-phase computational structure across architectures.
Implications
The findings suggest that trajectory geometry can serve as a powerful tool for understanding the internal workings of transformer models, providing insights into their representational dynamics without the need for probing classifiers. This approach could enhance the interpretability of large language models and inform future research in mechanistic interpretability.
Finite Certificates for In-Context Determinacy and a Threshold Theory of Emergence in Language Models
NLP
Large Language Models
Theory
- Introduces finite semantic certificates for verifying language model behavior.
- Establishes an exact row-space criterion for finite determinacy in context-conditioned queries.
- Proves an anti-mirage theorem to differentiate between genuine semantic transitions and scoring discontinuities.
- Demonstrates NP-completeness in extracting the smallest forcing subcontext.
Read more
Finite Certificates for In-Context Determinacy and a Threshold Theory of Emergence in Language Models
Summary
This paper investigates the behavior of context-conditioned language models through two verification problems: finite determinacy and threshold emergence. The authors introduce finite semantic certificates to replace benchmark labels, focusing on when examples in a context can force a specific answer to a query without altering model parameters. They establish an exact row-space criterion for finite-field linear task families, compute the residual hypothesis count, and derive a determination curve. The problem of extracting the smallest forcing subcontext is shown to be NP-complete. The second problem, threshold emergence, examines when a benchmark score jump indicates a genuine semantic transition rather than a mere discontinuity in the scoring map. The authors prove an anti-mirage theorem that distinguishes thresholded metrics from semantic confidence and provide a rate-sensitive crossing bound for latent commitments. The paper also presents a calculus that yields reusable objects such as finite context certificates and query teaching dimensions. The authors conduct experiments with contemporary language models to assess the reach of the proposed certificates, demonstrating that the predicted threshold jump aligns with observed model behavior. The findings suggest a framework for verifying language model behavior that separates semantic transitions from metric artifacts.
Methodology
The authors utilize mathematical proofs to establish criteria for finite determinacy and threshold emergence, employing exact row-space analysis and deriving determination curves. They also run empirical benchmarks on contemporary language models to validate their theoretical findings.
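For intuition about a row-space criterion, consider the simplest linear task family over GF(2), where a hidden weight vector w labels each input x with the dot product w·x mod 2. In that setting a query's answer is forced by the context exactly when the query lies in the row space of the context inputs, and a consistent context leaves 2^(d - rank) hypotheses. The sketch below checks both facts with bitmask arithmetic; it is a standard linear-algebra illustration under that assumed setup, not the paper's formal construction, and all function names are hypothetical.

```python
def gf2_rank(rows):
    """Rank over GF(2) of vectors encoded as integer bitmasks."""
    basis = []
    for row in rows:
        # Reduce the row against the basis; XOR is addition in GF(2).
        for b in basis:
            row = min(row, row ^ b)
        if row:
            basis.append(row)
    return len(basis)

def is_determined(context_rows, query):
    """True if the query lies in the GF(2) row space of the context inputs."""
    return gf2_rank(context_rows) == gf2_rank(context_rows + [query])

def residual_hypotheses(context_rows, dim):
    """Count of weight vectors in GF(2)^dim consistent with a consistent context."""
    return 2 ** (dim - gf2_rank(context_rows))

context = [0b110, 0b011]                   # two context inputs in GF(2)^3
forced = is_determined(context, 0b101)     # 0b101 = 0b110 XOR 0b011
open_query = is_determined(context, 0b001)
remaining = residual_hypotheses(context, dim=3)
```

Here the first query is forced because it is the XOR of the two context inputs, the second is not, and two hypotheses remain consistent with the context.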
Results
The paper presents a sound oracle-free checker for language model behavior, demonstrating that the predicted threshold jump occurs in model evaluations. The results indicate that the proposed finite context certificates effectively isolate semantic transitions and provide a framework for further verification.
Implications
The findings have significant implications for the evaluation and understanding of language models, offering a formalized approach to verify model behavior and assess the impact of context on output determinacy. This could enhance the reliability of language models in practical applications.
Fourier fractal dimension to predict the generalization of deep neural networks
Theory
Optimization
- Introduces a novel generalization measure based on Fourier fractal dimension.
- Demonstrates strong correlation between the proposed measure and actual generalization gap.
- Outperforms existing methods in predicting generalization without validation data.
- Presents a customized Fourier-based optimizer to regularize fractal dimension during training.
Read more
Fourier fractal dimension to predict the generalization of deep neural networks
Summary
This paper addresses the challenge of predicting the generalization performance of deep neural networks without relying on hold-out validation data. The authors propose a novel measure based on the Fourier fractal dimension of the network's weight variations, which captures the geometric complexity of the learning process. By analyzing the characteristic function of Lévy-driven stochastic differential equations in the frequency domain, they derive a metric that correlates strongly with the generalization gap. Additionally, the authors introduce a customized Fourier-based optimizer that regularizes this fractal dimension during training. Empirical evaluations on datasets such as CIFAR-10, SVHN, and MNIST demonstrate that their Fourier generalization measure outperforms existing norm-based, margin-based, and PAC-Bayesian measures, achieving state-of-the-art Kendall rank correlation coefficients. This work emphasizes the potential of frequency-domain fractal analysis as a robust predictor for model generalizability and a foundation for developing more stable optimization algorithms.
Methodology
The authors analyze the weight variations of deep neural networks using the Fourier transform to derive a fractal dimension measure. They also develop a Fourier-based optimizer that actively regulates this fractal dimension during the training process, allowing for improved generalization performance.
Results
The proposed Fourier generalization measure showed a strong correlation with the actual generalization gap across multiple datasets, achieving state-of-the-art Kendall rank correlation coefficients. The method outperformed various existing measures, indicating its effectiveness in predicting model generalizability.
Implications
The findings suggest that frequency-domain fractal analysis can serve as a powerful tool for predicting the generalization capabilities of deep neural networks. This could lead to advancements in automated machine learning (autoML) and the development of more effective training algorithms that prioritize generalization.
GlucoFM-Bench: Benchmarking Time-Series Foundation Models for Blood Glucose Forecasting
Time Series
- Introduction of GlucoFM-Bench, the first benchmark for evaluating TSFMs in blood glucose forecasting.
- Assessment of eight state-of-the-art models across 15 diabetes-relevant datasets.
- Demonstration of strong transferability of pre-trained TSFMs, particularly in zero-shot and few-shot scenarios.
- Highlighting the superior performance of lightweight LSTM models when abundant task-specific data is available.
Read more
GlucoFM-Bench: Benchmarking Time-Series Foundation Models for Blood Glucose Forecasting
Summary
GlucoFM-Bench presents a comprehensive benchmark for evaluating time-series foundation models (TSFMs) in the context of blood glucose forecasting, which is crucial for diabetes management. The study addresses the challenges of glucose prediction due to heterogeneous physiological dynamics across different diabetes populations. It evaluates eight representative architectures, including pre-trained TSFMs and task-specific deep learning models, across 15 publicly available diabetes-relevant datasets involving 1,117 individuals. The models are assessed under various protocols: zero-shot, few-shot, and full-shot, with systematic variations in context length and prediction horizon. The findings reveal that pre-trained TSFMs, particularly Chronos-2 and TimesFM, exhibit strong zero-shot and few-shot transfer capabilities, with the best zero-shot model performing within 5% of the best full-shot supervised model. However, when ample task-specific data is available, a lightweight LSTM outperforms TSFMs by 4-21% under full-shot training. The study also highlights persistent challenges in type 1 diabetes cohorts and hypo-/hyperglycemic ranges, emphasizing the need for evaluations beyond aggregate error metrics. GlucoFM-Bench establishes a standardized and reproducible foundation for future research in blood glucose forecasting, with publicly available datasets and code.
Methodology
The study employs a benchmarking approach that includes zero-shot, few-shot, and full-shot evaluations of various models. It systematically varies context length and prediction horizon while analyzing performance across different diabetes types and glycemic conditions. The evaluation metrics include standard accuracy measures (RMSE and MAE) and clinical risk metrics based on Clarke and Surveillance Error Grids.
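The two standard accuracy measures, and the idea of reporting them separately for a glycemic range, can be sketched as below. The helper `errors_in_range` and the 70 mg/dL cut-off used in the example are illustrative assumptions; the summary does not specify the benchmark's thresholds, and the clinical error-grid metrics are not reproduced here.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def errors_in_range(y_true, y_pred, low, high):
    """(RMSE, MAE) restricted to samples whose true value is in [low, high)."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if low <= t < high]
    if not pairs:
        return None
    ts, ps = zip(*pairs)
    return rmse(ts, ps), mae(ts, ps)

# Hypothetical glucose readings and forecasts in mg/dL.
truth = [60.0, 100.0, 200.0]
forecast = [70.0, 100.0, 190.0]
overall = (rmse(truth, forecast), mae(truth, forecast))
low_range = errors_in_range(truth, forecast, 0.0, 70.0)
```

Splitting the error by range shows why aggregate metrics can hide poor performance in the hypo- and hyperglycemic regions that the study flags as persistent challenges.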
Results
Pre-trained TSFMs, especially Chronos-2 and TimesFM, showed strong performance in zero-shot and few-shot settings, with the best zero-shot model performing within 5% of the best full-shot supervised model. However, in full-shot training scenarios, a lightweight LSTM outperformed TSFMs by 4-21%. The study also identified persistent challenges in predicting for type 1 diabetes cohorts and in hypo-/hyperglycemic ranges.
Implications
GlucoFM-Bench provides a standardized framework for evaluating and improving blood glucose forecasting models, which can enhance diabetes management systems and support automated insulin delivery. The findings can inform future research directions and model development in healthcare applications.
Uniform Stability and Generalization Error of GD and SGD on Fixed-Point Parameters
Theory
Optimization
- Deterministic rounding degrades GD's generalization error from O(T/n) to O(T/√n).
- Uniform stability of GD becomes Ω(T), leading to vacuous generalization bounds.
- SGD with deterministic rounding achieves tighter uniform stability bounds depending on dimensionality.
- Stochastic rounding can increase generalization error with higher dimensions.
Summary
This paper investigates the effects of discrete parameter spaces on the generalization error and stability of gradient descent (GD) and stochastic gradient descent (SGD) algorithms, particularly when updates involve deterministic or stochastic rounding. The authors demonstrate that deterministic rounding significantly worsens the generalization error of GD, increasing it from O(T/n) to O(T/√n) for convex, Lipschitz, and smooth loss functions. They establish that the uniform stability of GD in this context is lower bounded by Ω(T), rendering stability-based generalization bounds ineffective. Conversely, for SGD with deterministic rounding, the authors provide nontrivial uniform stability guarantees, showing bounds of O(T/n) in one dimension and O(T²/n) in higher dimensions. The paper also highlights that stochastic rounding can lead to an increase in generalization error with dimensionality, a phenomenon not observed in traditional real-valued optimization. Additionally, the authors present upper bounds on uniform argument stability for stochastic rounding schemes, which are tight when the loss function can be expressed as a sum of coordinate-wise functions. Overall, the findings illustrate how finite-precision effects alter the behavior of optimization algorithms, necessitating a reevaluation of generalization guarantees in discrete settings.
Methodology
The authors analyze the generalization error and stability of GD and SGD in discrete parameter spaces using theoretical proofs. They derive bounds for generalization error and uniform stability under deterministic and stochastic rounding conditions, employing techniques from statistical learning theory and algorithmic stability frameworks.
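To make the two rounding schemes concrete, the sketch below runs gradient descent on a fixed-point grid with spacing `delta`. The grid, the quadratic loss, and all function names are illustrative assumptions rather than the paper's exact setup. It shows one well-known effect: with round-to-nearest, an update smaller than half the grid spacing is rounded away and the iterate stalls, whereas stochastic rounding is unbiased in expectation.

```python
import numpy as np

def round_nearest(w, delta):
    """Deterministic rounding to the nearest multiple of delta (ties follow np.round)."""
    return delta * np.round(w / delta)

def round_stochastic(w, delta, rng):
    """Round up with probability equal to the fractional part, so E[result] = w."""
    scaled = w / delta
    low = np.floor(scaled)
    round_up = rng.random(np.shape(w)) < (scaled - low)
    return delta * (low + round_up)

def rounded_gd(grad, w0, lr, steps, rounder):
    """Gradient descent where every iterate is mapped back onto the grid."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = rounder(w - lr * grad(w))
    return w

# Toy loss 0.5 * (w - 0.3)^2 and grid spacing 0.25 (both hypothetical choices).
grad = lambda w: w - 0.3
delta = 0.25
rng = np.random.default_rng(0)

# Each deterministic step moves by 0.07 < delta / 2, so it is rounded back to 1.0.
w_dr = rounded_gd(grad, [1.0], lr=0.1, steps=50,
                  rounder=lambda w: round_nearest(w, delta))
# Stochastic rounding occasionally jumps a full grid step instead of stalling.
w_sr = rounded_gd(grad, [1.0], lr=0.1, steps=50,
                  rounder=lambda w: round_stochastic(w, delta, rng))
```

The stalled deterministic iterate is only a toy picture of why rounding changes the analysis; the bounds in the paper come from stability arguments, not from this example.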
Results
The paper presents several key results: 1) The generalization error for T-step DR-GD is O(ηT/√n), indicating a degradation in sample complexity due to deterministic rounding. 2) The uniform stability of DR-GD is lower bounded by Ω(T), making stability-based bounds ineffective. 3) For DR-SGD, uniform stability is bounded by O(ηT/n) in one dimension and O(ηT²/n) in higher dimensions. 4) Stochastic rounding introduces a dimension-dependent increase in generalization error for both SR-GD and SR-SGD, which is not present in the other cases.
Implications
The findings suggest that traditional analyses of optimization algorithms may not hold in discrete settings, particularly in applications involving fixed-point arithmetic. This has implications for the design and understanding of machine learning algorithms deployed on digital hardware, where rounding errors are prevalent.
Closed-Form Spectral Regularization for Multi-Task Model Merging
Efficient ML
Multimodal
Optimization
- Introduces a closed-form spectral regularization approach for multi-task model merging.
- Demonstrates that iterative optimization acts as an implicit spectral regularizer.
- Proposes SWUDI and SWUDI-A, which significantly reduce computational costs while maintaining or improving accuracy.
- Achieves substantial reductions in wall-clock time and GPU memory usage compared to state-of-the-art methods.
Summary
This paper addresses the challenge of model merging, which combines multiple independently fine-tuned models into a single multi-task model without requiring additional training data. Existing methods such as WUDI and OptMerge rely on iterative optimization to minimize layer-wise quadratic interference. These methods are computationally expensive, yet they often outperform the exact closed-form minimizers of the same objective. The authors investigate why and find that iterative optimization acts as an implicit spectral regularizer, filtering out noise in the small-eigenvalue directions of the interference operator. Building on this insight, they propose SWUDI, a closed-form spectral filtering estimator that combines a soft exponential filter with a hard top-K truncation to suppress noise. An adaptive variant, SWUDI-A, adjusts the rank parameter per layer to improve robustness. Both methods match or exceed existing merging techniques in accuracy while reducing wall-clock time by 28–72 times and peak GPU memory usage by up to 50%. The methods are validated across four general benchmarks and a multimodal merging benchmark.
Methodology
The authors formalize multi-task model merging as a noisy linear inverse problem and propose a spectral filtering estimator parameterized by per-direction filters. They introduce SWUDI, which combines a soft exponential filter with a hard top-K truncation, and SWUDI-A, an adaptive version that adjusts rank parameters based on the eigenspectrum of the interference operator. Both methods require only a single symmetric eigendecomposition per layer and do not need any training data.
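A minimal sketch of a spectral filtering estimator for a symmetric positive semi-definite system A x ≈ b is given below. The specific soft filter, (1 − exp(−tλ))/λ, is an assumption: it is what gradient flow on the quadratic objective produces after time t when started from zero. The paper's exact filter, its rank rule, and the parameter names `t` and `k` are not taken from the source.

```python
import numpy as np

def spectral_filter_solve(A, b, t=1.0, k=None):
    """Approximate A^{-1} b using filtered eigen-directions of symmetric PSD A.

    t: strength of the soft exponential filter (larger t is closer to exact inversion).
    k: if given, keep only the k largest-eigenvalue directions (hard truncation).
    """
    lam, U = np.linalg.eigh(A)                 # eigenvalues in ascending order
    lam = np.clip(lam, 0.0, None)              # guard against tiny negative values
    positive = lam > 1e-12
    safe = np.where(positive, lam, 1.0)
    # Soft filter gain (1 - exp(-t*lam)) / lam, with its limit t as lam -> 0.
    gain = np.where(positive, -np.expm1(-t * lam) / safe, t)
    if k is not None:
        keep = np.zeros(lam.size, dtype=bool)
        keep[lam.size - k:] = True             # last k entries are the largest
        gain = np.where(keep, gain, 0.0)
    return U @ (gain * (U.T @ b))

A = np.diag([4.0, 1.0])
b = np.array([4.0, 1.0])
x_soft = spectral_filter_solve(A, b, t=50.0)        # close to the exact solution [1, 1]
x_top1 = spectral_filter_solve(A, b, t=50.0, k=1)   # small-eigenvalue direction dropped
```

A single `eigh` call per system is the only expensive step here, which is consistent with the summary's point that one symmetric eigendecomposition per layer suffices.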
Results
The proposed methods, SWUDI and SWUDI-A, match or exceed the performance of existing state-of-the-art merging techniques across four general benchmarks and a multimodal merging benchmark. They achieve reductions in wall-clock time by 28–72 times and peak GPU memory usage by up to 50%, demonstrating significant efficiency improvements.
Implications
The findings suggest that closed-form solutions can be effectively utilized in model merging, potentially leading to more efficient deployment of large foundation models in various applications. The methods can be applied in scenarios where computational resources are limited, making them suitable for real-time applications and edge computing.
Learn to Match: Two-Sided Matching with Temporally Extended Feedback
Reinforcement Learning
Theory
Optimization
- Introduces a framework for two-sided matching with temporally extended feedback.
- Models matching as a partially observable Markov game with evolving agent profiles.
- LEARN2MATCH benchmark supports decentralized decision-making in dynamic matching markets.
- Independent PPO outperforms CA-ETC in social welfare and regret but has higher information-friction loss.
Summary
This paper introduces a novel framework for two-sided matching markets that incorporates temporally extended feedback, addressing the limitations of traditional matching models that rely on immediate feedback. The authors propose a partially observable Markov game model that captures the dynamics of matching decisions over time, including costly pre-match screening, noisy post-match observations, and evolving latent profiles. They instantiate this framework in a benchmark called LEARN2MATCH, which facilitates decentralized decision-making regarding interviews, matches, and relationship continuations or dissolutions. The evaluation of multi-agent reinforcement learning (MARL) methods, particularly independent PPO, against a bandit-style baseline (CA-ETC) demonstrates that MARL can achieve higher cumulative social welfare and lower cumulative regret under temporally extended feedback. However, the MARL approach incurs a higher information-friction loss, indicating that it lacks the coordinated exploration structure present in bandit methods. The findings position LEARN2MATCH as a benchmark for future research in matching algorithms that combine adaptive reinforcement learning with the statistical rigor of bandit algorithms and the structural awareness of stable-matching mechanisms.
Methodology
The authors developed a framework for two-sided matching modeled as a partially observable Markov game, incorporating elements such as evolving latent profiles, costly pre-match and post-match decisions, and noisy observations. They created the LEARN2MATCH benchmark to evaluate multi-agent reinforcement learning methods against bandit-style algorithms.
Results
The independent PPO algorithm demonstrated superior performance in terms of cumulative social welfare and lower cumulative regret compared to the CA-ETC baseline under temporally extended feedback. However, PPO exhibited a higher information-friction loss, highlighting the need for improved exploration strategies in MARL.
Implications
The findings suggest that integrating reinforcement learning with matching theory can enhance decision-making in dynamic markets. LEARN2MATCH serves as a platform for future research, encouraging the development of algorithms that effectively manage the complexities of temporally extended feedback in matching scenarios.
RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning
NLP
Large Language Models
Reinforcement Learning
- RASFT addresses the limitations of traditional SFT by introducing a policy-aware framework for reasoning tasks.
- The framework dynamically adjusts expert supervision based on the model's problem-solving ability, enhancing adaptability.
- Empirical results show significant performance improvements over standard SFT and its variants.
- RASFT preserves the model's inherent reasoning capabilities while effectively integrating expert guidance.
Summary
The paper introduces Rollout-Adaptive Supervised Fine-Tuning (RASFT), a novel framework designed to enhance the adaptation of large language models (LLMs) for reasoning tasks. Traditional supervised fine-tuning (SFT) relies heavily on imitating expert trajectories, which can lead to overfitting and suppress the model's inherent reasoning capabilities. RASFT addresses this limitation by employing a policy-aware approach that calibrates expert supervision based on the model's current problem-solving ability, as assessed through verified on-policy rollouts. The framework dynamically adjusts the influence of expert guidance depending on the model's performance on specific problems, strengthening expert supervision when the model struggles and allowing for self-generated trajectories when the model demonstrates reliable reasoning. Additionally, RASFT incorporates a clipped inverse ratio to prevent excessive policy drift. Experimental results across multiple models on various mathematical and code reasoning benchmarks indicate that RASFT significantly outperforms traditional SFT and its variants, demonstrating the effectiveness of rollout-adaptive supervision in enhancing reasoning performance.
Methodology
RASFT employs a policy-aware regularization strategy that utilizes verified on-policy rollouts to estimate problem solvability. It constructs a local candidate pool of expert and self-generated trajectories, adjusting the influence of expert guidance based on the model's performance. The method also introduces a clipped inverse ratio to constrain excessive policy drift, ensuring that the model retains its reasoning capabilities while benefiting from expert supervision.
Results
The experiments demonstrate that RASFT achieves a 10.9% relative improvement on math reasoning with Qwen2.5-Math-1.5B and up to 26.9% on code generation with Llama-3.2-3B. Compared to existing methods, RASFT improves average math accuracy by 15.9%, highlighting its effectiveness in enhancing reasoning performance across various tasks.
Implications
The RASFT framework has significant implications for the development of adaptive learning systems, particularly in reasoning and problem-solving domains. By effectively balancing expert guidance with self-generated reasoning, RASFT can enhance the performance of large language models in various applications, including education, automated reasoning, and programming assistance.
Improved Convergence Analysis of Topology Dependence in Decentralized SGD
Theory
Optimization
Federated Learning
- Developed a novel proof technique for improved convergence rates in Decentralized SGD.
- Showed that the full eigenvalue spectrum of the mixing matrix governs convergence rates, not just the spectral gap.
- Provided experimental evidence that aligns theoretical predictions with observed training behaviors.
- Demonstrated that sparse topologies with small spectral gaps can perform better than previously thought.
Summary
This paper addresses the convergence behavior of Decentralized Stochastic Gradient Descent (SGD), a key algorithm in decentralized learning, focusing on how network topology influences convergence rates. Previous analyses primarily relied on the spectral gap of the mixing matrix, defined as one minus its second-largest eigenvalue in absolute value, to characterize convergence. However, this approach has shown limitations, particularly in homogeneous settings where nodes have similar data, leading to discrepancies between theoretical predictions and experimental results. The authors propose a novel convergence analysis that incorporates the entire eigenvalue spectrum of the mixing matrix, rather than just the spectral gap. This new approach provides a more accurate understanding of how different topologies affect convergence rates, especially in near-homogeneous scenarios. The authors validate their theoretical findings through experiments, demonstrating that their analysis aligns more closely with observed training behaviors. The results indicate that conventional spectral-gap-based analyses may underestimate the performance of certain topologies, such as ring and torus structures. The paper concludes by suggesting that future topology design should consider the full eigenvalue spectrum to enhance convergence rates while maintaining communication efficiency.
Methodology
The authors employed a novel analytical approach that considers all eigenvalues of the mixing matrix to derive convergence rates for Decentralized SGD. They conducted both theoretical analysis and empirical experiments to validate their findings, comparing the performance of various topologies under different conditions.
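The distinction between the spectral gap and the full spectrum can be made concrete with a small example. The sketch below builds a uniform-weight ring mixing matrix (an illustrative choice, not necessarily the weighting used in the paper) and returns all eigenvalue magnitudes alongside the gap; the function names are hypothetical.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic mixing matrix for a ring of n >= 3 nodes.

    Each node averages itself and its two neighbours with weight 1/3.
    """
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0
    return W

def spectrum_and_gap(W):
    """Return eigenvalue magnitudes (descending) and the spectral gap 1 - |lambda_2|."""
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return magnitudes, 1.0 - magnitudes[1]

W = ring_mixing_matrix(8)
magnitudes, gap = spectrum_and_gap(W)
```

The gap summarizes only `magnitudes[1]`; the remaining entries of `magnitudes` are the extra information that a full-spectrum analysis can exploit.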
Results
The study found that the new convergence analysis leads to significantly improved rates, particularly in near-homogeneous settings. The results showed that many commonly used topologies, which were previously thought to have poor convergence due to small spectral gaps, actually perform well in practice. The theoretical predictions closely matched the experimental observations, indicating a more nuanced understanding of topology's impact on convergence.
Implications
The findings suggest that topology design in decentralized learning should focus on the full eigenvalue spectrum rather than just the spectral gap, potentially leading to more efficient communication strategies and improved convergence rates. This work could influence the design of decentralized algorithms beyond SGD, enhancing various decentralized learning methods.
Loss-Guided Adaptive Scale Refinement for Molecular Force Prediction
Graph Learning
- Introduces a loss-guided adaptive scale refinement framework for molecular force prediction.
- Demonstrates substantial complementarity between short- and long-range force prediction branches.
- Shows that continuous scale interpolation outperforms hard routing, indicating the existence of effective intermediate scales.
- Establishes that a compact set of discrete scale anchors can approximate the continuous oracle scale space.
Summary
This paper presents a framework for molecular force prediction that addresses the limitations of fixed-scale modeling in molecular representation learning. The proposed loss-guided adaptive scale refinement framework treats predefined scales as initial anchors and aims to discover task-effective modeling resolutions through techniques such as scale interpolation, routing, differentiable updates, and scale pool refinement. The study uses a NaCl aqueous ionic system to construct short-range and long-range force prediction branches and analyzes their complementarity. Results indicate that oracle hard routing improves the mean absolute error (MAE) from 399.65 to 382.67, with a notable 17.35% improvement in close-contact regimes. Continuous oracle interpolation further reduces the overall MAE to 380.96, demonstrating that optimal scales may lie between predefined endpoints. Additionally, a minimal scale pool update experiment shows that loss-guided updates can generate intermediate scale anchors, achieving an overall MAE of 381.23, close to the continuous oracle MAE. The study also evaluates learnable scale routing and differentiable scale update models, with a gradient-updated MLP scale gate achieving a non-oracle MAE of 396.55. These findings support the potential of adaptive scale refinement in enhancing molecular representation learning, particularly in scenarios where fixed-scale modeling is inadequate.
Methodology
The methodology involves constructing short-range and long-range force prediction branches for a NaCl aqueous ionic system. It employs techniques such as oracle hard routing, continuous scale interpolation, and loss-guided scale updates to refine the modeling scales based on prediction loss and local molecular environments.
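The difference between oracle hard routing and continuous oracle interpolation can be illustrated on scalar toy predictions. The sketch below is an assumption-laden simplification: real forces are vectors, the numbers are invented, and the function names are hypothetical. "Oracle" means the choice is made with access to the ground truth, so both quantities are upper bounds on what a learned router could achieve.

```python
import numpy as np

def mae(y, pred):
    return float(np.mean(np.abs(pred - y)))

def oracle_hard_routing(y, short_pred, long_pred):
    """Per sample, pick whichever branch is closer to the ground truth."""
    use_short = np.abs(short_pred - y) <= np.abs(long_pred - y)
    return np.where(use_short, short_pred, long_pred)

def oracle_interpolation(y, short_pred, long_pred):
    """Per sample, pick the best mixing weight alpha in [0, 1]."""
    diff = long_pred - short_pred
    safe = np.where(diff != 0, diff, 1.0)          # avoid division by zero
    alpha = np.clip((y - short_pred) / safe, 0.0, 1.0)
    return (1.0 - alpha) * short_pred + alpha * long_pred

# Invented targets and branch predictions.
y = np.array([1.0, 2.0, 3.0])
short_pred = np.array([0.0, 2.5, 5.0])
long_pred = np.array([3.0, 1.0, 4.0])

mae_hard = mae(y, oracle_hard_routing(y, short_pred, long_pred))
mae_interp = mae(y, oracle_interpolation(y, short_pred, long_pred))
```

Whenever the target lies strictly between the two branch predictions, interpolation can hit it exactly while hard routing cannot, which is why the interpolation oracle is never worse than the routing oracle.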
Results
The study reports a reduction in overall MAE from 399.65 to 382.67 with oracle hard routing, and further to 380.96 with continuous oracle interpolation. In close-contact regimes, the MAE improves significantly to 260.51. The final updated scale pool achieves an MAE of 381.23, closely approximating the continuous oracle performance.
Implications
The findings suggest that adaptive scale refinement could significantly enhance molecular representation learning, making it more effective for various molecular systems and properties, particularly in cases where traditional fixed-scale approaches fall short.
Towards Serverless Semi-Decentralized Federated Learning with Heterogeneous Optimizers
Federated Learning
Optimization
Theory
- Introduces SSD-FL, a serverless approach to semi-decentralized federated learning.
- Addresses the challenges of cluster formation in decentralized environments with heterogeneous optimizers.
- Implements a unique scoring metric for assessing data and optimizer heterogeneity.
- Demonstrates improved convergence speeds and communication efficiency in experiments.
Summary
This paper explores the challenges of cluster formation in decentralized federated learning (FL) environments, particularly when utilizing heterogeneous machine learning optimizers. The authors introduce a novel methodology called serverless semi-decentralized federated learning (SSD-FL), which eliminates the need for persistent server infrastructure. SSD-FL initiates a lightweight, one-time device-to-device (D2D) initialization phase for cluster formation, after which the training process is entirely serverless. The methodology divides global training rounds into intra-cluster and inter-cluster phases, ensuring convergence and consensus through innovative 'effective loss functions' that combine device-specific optimizers with network graph-based regularization. The authors develop an iterative clustering algorithm that leverages the Cheeger inequality to address the consensus gap and incorporates a unique scoring metric to assess data and optimizer heterogeneity. Experimental results demonstrate that SSD-FL significantly enhances convergence speeds and communication efficiency across various network configurations, datasets, and local optimizer settings, outperforming existing decentralized FL methodologies.
Methodology
The authors propose SSD-FL, which involves a one-time D2D initialization for cluster formation, followed by serverless model training. They utilize effective loss functions to ensure convergence and consensus, and develop an iterative clustering algorithm informed by the Cheeger inequality to optimize cluster composition and number.
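For reference, one standard statement of the Cheeger inequality relates the conductance h(G) of a graph to the second-smallest eigenvalue λ_2 of its normalized Laplacian. Which variant the authors use, and how it enters their clustering objective, is not stated in the summary, so this is background only.

```latex
\frac{\lambda_2}{2} \;\le\; h(G) \;\le\; \sqrt{2\lambda_2},
\qquad
h(G) = \min_{S \subset V,\ 0 < \mathrm{vol}(S) \le \mathrm{vol}(V)/2}
\frac{|\partial S|}{\mathrm{vol}(S)}
```

A small λ_2 therefore signals a bottleneck cut in the graph, which is the kind of structure that slows consensus across a network.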
Results
Experimental evaluations indicate that SSD-FL improves convergence rates and communication efficiency compared to traditional decentralized FL methods, effectively handling various network topologies and heterogeneous local optimizers.
Implications
The findings suggest that SSD-FL can be applied in decentralized edge/fog networks, such as energy grids and disaster recovery communications, where efficient cluster formation can lead to faster consensus and improved resource management.
Asymptotic Optimality of Thompson Sampling for Risk-Averse Bandits with Sub-Gaussian Rewards
Theory
- Introduces ρ-NPTSSG, a nonparametric Thompson Sampling algorithm for risk-averse bandits.
- Achieves instance-optimal regret for any continuous risk functional on distributions with bounded density and sub-Gaussian tails.
- Resolves an open problem regarding the optimality of the ρ-NPTS algorithm from previous work.
- Develops key technical contributions, including discretization lemmas that facilitate the algorithm's performance.
Summary
This paper addresses the problem of risk-averse multi-armed bandits (MAB) by introducing the ρ-NPTSSG algorithm, a nonparametric Thompson Sampling method that achieves asymptotic optimality in terms of regret for risk functionals defined on distributions with bounded density and sub-Gaussian tails. The author proves that this algorithm matches the instance-dependent lower bound of regret, specifically achieving a logarithmic regret rate that is optimal for any continuous risk functional, including well-known measures like Conditional Value-at-Risk (CVaR) and the Sharpe ratio. The work builds on previous results by Chang and Tan (2022) and resolves an open problem regarding the optimality of the ρ-NPTS algorithm. The key contributions include a discretization lemma that allows for effective control over the growing-alphabet Dirichlet posterior, which is crucial for managing the complexity of the algorithm without requiring parametric assumptions about the rewards. This paper not only establishes theoretical guarantees for risk-averse decision-making but also provides a framework that can be applied in various fields such as finance and medical trials where risk management is critical.
Methodology
The paper employs a nonparametric Thompson Sampling approach, specifically the ρ-NPTSSG algorithm, which builds a Dirichlet posterior over observed rewards. The methodology includes the use of discretization lemmas to project the Dirichlet posterior onto a fixed grid, thus managing the complexity of the algorithm and ensuring that the regret bounds are maintained at optimal levels.
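The sampling step of a nonparametric Thompson Sampling index can be sketched as follows, under several assumptions not taken from the source: rewards are reweighted with a flat Dirichlet draw, an extra support point `upper_bound` is appended as an optimistic outcome (a device used in bounded-reward variants), and the risk functional is lower-tail CVaR of rewards. The sub-Gaussian algorithm in the paper, including its discretization, will differ in detail, and all function names are hypothetical.

```python
import numpy as np

def cvar_lower_tail(values, weights, alpha):
    """CVaR_alpha of a discrete reward distribution: mean of the worst alpha-fraction."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    mass_before = np.concatenate(([0.0], np.cumsum(w)[:-1]))
    # Probability mass of each atom that falls inside the lowest alpha-quantile.
    tail_mass = np.clip(alpha - mass_before, 0.0, w)
    return float(np.sum(v * tail_mass) / alpha)

def npts_index(rewards, upper_bound, alpha, rng):
    """One posterior sample of the risk functional for a single arm."""
    support = np.append(np.asarray(rewards, dtype=float), upper_bound)
    weights = rng.dirichlet(np.ones(support.size))   # flat Dirichlet reweighting
    return cvar_lower_tail(support, weights, alpha)

def choose_arm(history, upper_bound, alpha, rng):
    """Pull the arm whose sampled risk value is largest."""
    indices = [npts_index(r, upper_bound, alpha, rng) for r in history]
    return int(np.argmax(indices))

rng = np.random.default_rng(0)
history = [[0.1, 0.2, 0.1], [0.9, 0.8, 0.95]]        # invented reward histories
arm = choose_arm(history, upper_bound=1.0, alpha=0.5, rng=rng)
```

Swapping `cvar_lower_tail` for another functional of the reweighted distribution is what makes this style of index generic across risk measures.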
Results
The primary result demonstrates that the ρ-NPTSSG algorithm achieves a regret rate of Σ_k (Δ_k^ρ / K_inf^ρ(ν_k, r_1^ρ)) log n + o(log n) for any continuous risk functional on the class of distributions with bounded density and sub-Gaussian tails. This establishes the algorithm as asymptotically optimal and provides the first instance-optimal guarantees for risk functionals like the Sharpe ratio without requiring parametric assumptions.
Implications
The findings have significant implications for fields requiring risk management in decision-making processes, such as finance, healthcare, and robotics. The ability to apply Thompson Sampling in a risk-averse context without strict parametric assumptions opens up new avenues for research and practical applications in these areas.
Assessing Sample Quality in Conditional Generation under Compositional Shift
Generative Models
Computer Vision
- Introduces a reference-free trust score for assessing sample quality in conditional generation.
- Combines global realism and attribute-wise faithfulness to evaluate generated samples.
- Demonstrates empirical effectiveness on biological imaging and vision benchmarks.
- Enables early sample rejection during the generation process, improving efficiency.
Summary
This paper addresses the challenge of evaluating the quality of samples generated by conditional generators, particularly in scenarios where the target distribution is unavailable, such as in scientific applications. The authors propose a novel post-hoc trust score that assesses conditional samples based solely on the training distribution. This score combines two key components: global realism, which measures how well a sample aligns with the real data manifold, and attribute-wise faithfulness, which evaluates whether the sample adheres to the requested attributes over plausible alternatives. The authors demonstrate that this trust score can effectively filter, rank, and abstain from low-quality generations, even in extrapolative settings. They validate their approach empirically on datasets like RxRx1 and CelebA, showing that their method improves the preservation of real morphological structures in biological imaging and enhances downstream predictive performance. Additionally, the trust score can be applied during the generation process, allowing for early rejection of low-quality samples before full decoding.
Methodology
The authors develop a post-hoc trust score that utilizes a feature extractor to compute two metrics: global realism and attribute-wise faithfulness. The score is calculated using only the generated sample and statistics from the training set, allowing for evaluation without access to the target distribution. The method also includes a geometric alignment objective to facilitate early quality assessment during the generation process.
Results
The proposed trust score successfully filters and ranks generated samples, demonstrating reliability in both KID-based and downstream task-based validations. The empirical results indicate that the method preserves real morphological structures in biological imaging and improves predictive performance on controlled vision benchmarks.
Implications
This work has significant implications for the use of conditional generators in scientific research and applications where real samples are scarce or costly. The ability to assess sample quality without needing a reference distribution enhances the utility of generative models in exploratory settings, potentially accelerating research and development in various fields.
Unsupervised Continual Clustering via Forward-Backward Knowledge Distillation
Theory
Efficient ML
- FBCC is the first framework for unsupervised continual clustering that integrates representation learning and clustering in a sequential manner without replay.
- The dual-phase forward-backward knowledge distillation strategy mitigates catastrophic forgetting by using lightweight student models to guide the teacher.
- FBCC provides a memory-efficient solution by storing task-specific knowledge in compact student models instead of large-scale models or past samples.
- Extensive experiments show that FBCC outperforms state-of-the-art unsupervised and supervised continual learning methods in terms of clustering accuracy.
Summary
This paper addresses the challenge of Unsupervised Continual Learning (UCL), where neural networks learn sequential tasks without labels or access to past data, leading to catastrophic forgetting. The authors introduce Unsupervised Continual Clustering (UCC) and propose a novel method called Forward-Backward Knowledge Distillation for Continual Clustering (FBCC). FBCC employs a continual teacher network with a clustering projector and lightweight task-specific student models. The dual-phase forward-backward distillation process allows the teacher to learn new clusters while preserving previously discovered cluster structures without storing past data. This approach is the first to integrate representation learning and clustering in a sequential, no-replay setting. The experiments conducted on four benchmark datasets demonstrate that FBCC significantly outperforms existing continual learning baselines in clustering accuracy while effectively reducing catastrophic forgetting.
Methodology
FBCC utilizes a continual teacher network with a clustering projector and lightweight student models. The teacher learns from new data while being regularized by previously learned student models through a dual-phase forward-backward knowledge distillation process, which helps maintain knowledge from earlier tasks without the need for past data storage.
Results
The experimental results indicate that FBCC consistently achieves higher clustering accuracy compared to existing continual learning methods, while also demonstrating a significant reduction in catastrophic forgetting across four benchmark datasets.
Implications
The proposed FBCC framework has potential applications in scenarios where data privacy is a concern, and where continuous learning from evolving data streams is required without the ability to revisit past data. This could be particularly useful in fields such as autonomous systems, online learning, and real-time data analysis.
Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path
Generative Models
Theory
Audio & Speech
- Rectified Flows can encode subtle traces of training data that may not be directly observable.
- A bell-shaped curve characterizes the reconstruction gap between training and test data along the interpolation path.
- The peak location of the membership signal can be derived mathematically under certain assumptions.
- The findings are validated across different modalities, including audio and images.
Summary
This paper investigates the subtle ways in which generative models, specifically Rectified Flows, can retain information from their training data, raising concerns about copyright and privacy. The authors focus on the interpolation path defined by X_λ = (1 − λ)X_0 + λX_1, which allows for an analysis of how the model treats training versus held-out data. They demonstrate that there exists a gap in the reconstruction accuracy between training and test data that follows a bell-shaped curve as λ varies, indicating that membership signals emerge during the interpolation process. The authors derive a closed-form expression for the peak location of this curve under Gaussian assumptions and validate their findings through experiments on audio and image datasets. As a proof of concept, they utilize this λ-resolved structure to conduct a Membership Inference Attack, successfully distinguishing members of the training set from non-members. The study highlights the importance of understanding these membership signals in the context of generative models and their implications for data privacy.
Methodology
The authors analyze the interpolation path of Rectified Flows to characterize membership signals. They derive mathematical expressions for the reconstruction gap and validate their predictions through experiments on various datasets. Additionally, they implement a Membership Inference Attack to demonstrate the exploitability of the membership signals.
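To make the idea of a gap measured along the interpolation path concrete, here is a minimal sketch. It assumes that X_0 is Gaussian noise and X_1 is a data sample, and it uses the squared error between a predicted velocity and X_1 − X_0 as the reconstruction measure. Both choices, along with the names `velocity_fn`, `velocity_error`, and `membership_gap`, are illustrative assumptions and not the authors' exact procedure.

```python
import numpy as np

def interpolate(x0, x1, lam):
    """Point on the straight path X_lam = (1 - lam) * X0 + lam * X1."""
    return (1.0 - lam) * x0 + lam * x1

def velocity_error(velocity_fn, x0, x1, lam):
    """Squared error between the predicted velocity at X_lam and X1 - X0."""
    x_lam = interpolate(x0, x1, lam)
    pred = velocity_fn(x_lam, lam)  # hypothetical trained model
    return np.sum((pred - (x1 - x0)) ** 2, axis=-1)

def membership_gap(velocity_fn, members, non_members, lambdas, rng):
    """Mean error on non-members minus mean error on members, per lambda."""
    gaps = []
    for lam in lambdas:
        noise_m = rng.standard_normal(members.shape)      # assumed X0 for members
        noise_n = rng.standard_normal(non_members.shape)  # assumed X0 for non-members
        err_m = velocity_error(velocity_fn, noise_m, members, lam).mean()
        err_n = velocity_error(velocity_fn, noise_n, non_members, lam).mean()
        gaps.append(err_n - err_m)
    return np.array(gaps)
```

Plotting the returned array against `lambdas` for a trained model is how one would look for the bell-shaped curve described in the summary; a threshold on the per-sample error near the peak would give a simple attack score.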
Results
The study confirms the existence of a bell-shaped curve in the reconstruction error between training and test data, with a peak that can be predicted mathematically. The experiments validate the universality of this structure across different data modalities, and the Membership Inference Attack successfully distinguishes training set members from non-members.
Implications
The findings have significant implications for the deployment of generative models, particularly concerning copyright and privacy. Understanding how models retain information about their training data can inform better practices for data handling and model training to mitigate risks of unauthorized data leakage.
Inferring hidden forcing in a biological oscillator using Kolmogorov-Arnold networks
Interpretability
Time Series
Theory
- Kolmogorov-Arnold networks (KAN) effectively reconstruct hidden forces in dynamical systems from partial observations.
- The study reveals a two-phase activation pattern in avian respiratory dynamics that is not apparent from pressure measurements alone.
- Electromyographic recordings validate the predictions made by the reconstructed dynamics.
- The approach highlights the potential of interpretable machine learning methods in uncovering hidden structures in complex biological systems.
Inferring hidden forcing in a biological oscillator using Kolmogorov-Arnold networks
Summary
This paper addresses the challenge of inferring the hidden forces driving a dynamical system from partial observations, specifically focusing on avian respiratory dynamics. The authors utilize Kolmogorov-Arnold networks (KAN) to reconstruct the effective muscular forcing from air-sac pressure measurements alone. The study reveals a nontrivial structure in the underlying forcing that is not evident from the pressure signal, which on its own suggests a relaxation-like oscillation. The reconstructed dynamics predict a two-phase activation pattern within each respiratory cycle, validated through electromyographic recordings of expiratory muscles. This work demonstrates that data-driven reconstruction of dynamical laws can uncover hidden physical structures and provide insights into unobserved driving variables in partially observed systems, establishing a general approach for inferring latent forces in such contexts.
Methodology
The authors applied Kolmogorov-Arnold networks (KAN), a type of neural architecture that represents multivariate functions as compositions of trainable univariate functions. This approach allows for the extraction of interpretable functional relationships from data without requiring strong prior assumptions about the form of the governing equations.
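As an illustration of "compositions of trainable univariate functions", the sketch below implements the forward pass of a single KAN-style layer in which each output is a sum of one-dimensional functions of the inputs. It is a simplification under stated assumptions: the univariate functions are piecewise-linear interpolants on a fixed grid rather than the splines commonly used in KAN implementations, and the class name `TinyKANLayer` is hypothetical.

```python
import numpy as np

class TinyKANLayer:
    """Forward pass of y_j = sum_i phi_{j,i}(x_i) with piecewise-linear phi."""

    def __init__(self, n_in, n_out, grid_size=8, x_range=(-1.0, 1.0), seed=0):
        rng = np.random.default_rng(seed)
        self.grid = np.linspace(x_range[0], x_range[1], grid_size)
        # values[j, i, :] stores phi_{j,i} at the grid knots; these would be
        # the trainable parameters in a full implementation.
        self.values = 0.1 * rng.standard_normal((n_out, n_in, grid_size))

    def forward(self, x):
        """x has shape (batch, n_in); the result has shape (batch, n_out)."""
        n_out, n_in, _ = self.values.shape
        out = np.zeros((x.shape[0], n_out))
        for j in range(n_out):
            for i in range(n_in):
                # np.interp evaluates the univariate function phi_{j,i} at x_i.
                out[:, j] += np.interp(x[:, i], self.grid, self.values[j, i])
        return out
```

Stacking two such layers gives a composition of univariate functions, and because each `values[j, i]` row is a one-dimensional curve, it can be plotted and inspected directly, which is the source of the interpretability mentioned above.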
Results
The KAN-based reconstruction revealed a qualitatively different organization of the underlying muscular forcing compared to what was suggested by the air-sac pressure signal. The inferred dynamics indicated a two-phase activation structure within each respiratory cycle, which was subsequently validated through independent electromyographic recordings.
Implications
The findings suggest that KANs can be a powerful tool for uncovering hidden dynamics in biological systems, potentially leading to better understanding and modeling of complex physiological processes. This methodology could be applied to other fields where only partial observations of a system are available, enhancing the ability to infer latent variables and dynamics.
SafeECGMatch: Calibration-Aware Joint Frequency and Time Space Semi-Supervised Learning for Open-Set ECG Classification
Time Series
- Introduction of SafeECGMatch, a novel SSL framework for ECG classification addressing label distribution mismatch.
- Implementation of a dual-view calibration mechanism that integrates time and frequency domain learning.
- Demonstration of state-of-the-art accuracy and calibration performance on benchmark ECG datasets.
- Focus on reducing overconfidence in predictions, enhancing model reliability in clinical applications.
SafeECGMatch: Calibration-Aware Joint Frequency and Time Space Semi-Supervised Learning for Open-Set ECG Classification
Summary
The paper presents SafeECGMatch, a novel calibration-aware safe semi-supervised learning (SSL) framework designed for single-label ECG classification, particularly addressing the challenges posed by label distribution mismatch in clinical settings. Traditional SSL approaches often fail when unlabeled data includes out-of-distribution (OOD) samples, leading to overconfident predictions and unreliable model performance. SafeECGMatch tackles this issue by employing a dual-branch architecture that extracts time-frequency latent representations from ECG signals using ECG-specific augmentations. The framework dynamically aligns prediction confidence with empirical accuracy through adaptive label smoothing and temperature scaling, effectively calibrating both the multiclass classifier and the OOD detector. This joint optimization enhances the model's ability to reject unreliable OOD samples and improve pseudo-labeling accuracy. The proposed method demonstrates state-of-the-art performance on benchmark datasets PTB-XL and PhysioNet/CinC Challenge, showcasing significant advancements in both classification accuracy and calibration, thereby facilitating more reliable knowledge discovery in physiological time-series data.
Methodology
SafeECGMatch utilizes a dual-branch architecture to extract time-frequency representations from ECG signals. It employs ECG-specific augmentations and incorporates adaptive label smoothing and temperature scaling to calibrate prediction confidence. The framework jointly optimizes the multiclass classifier and OOD detector to improve pseudo-labeling and OOD rejection.
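The two calibration tools named here, temperature scaling and label smoothing, can be sketched in their standard fixed form as follows. The adaptive schedules that SafeECGMatch uses to choose the temperature and smoothing strength are not described in this summary, so `temperature` and `epsilon` below are plain constants and the function names are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature > 1 flattens the output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Standard label smoothing: (1 - epsilon) * one_hot + epsilon / num_classes."""
    off = epsilon / num_classes
    return [1.0 - epsilon + off if k == true_class else off
            for k in range(num_classes)]
```

Raising the temperature lowers the top-class probability without changing the predicted class, which is why it is a common way to reduce overconfidence.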
Results
SafeECGMatch achieved state-of-the-art accuracy and calibration on the PTB-XL and PhysioNet/CinC Challenge benchmarks, significantly outperforming existing methods in both metrics, thereby validating its effectiveness in handling label distribution mismatch and enhancing model reliability.
Implications
The proposed framework has significant implications for clinical ECG classification, enabling more reliable automated systems that can effectively handle diverse and potentially misleading unlabeled data. This advancement could lead to improved diagnostic tools and better patient outcomes in cardiovascular care.
Spatiotemporal Imputation with Graph-Informed Flow Matching
Generative Models
Graph Learning
Time Series
- Introduction of GiFlow, a novel framework for spatiotemporal imputation.
- Utilization of a graph-informed prior based on adaptive spatiotemporal filtering.
- Demonstration of GiFlow's superior performance over existing methods across various datasets.
- Integration of spatial and temporal attention mechanisms for improved modeling.
Spatiotemporal Imputation with Graph-Informed Flow Matching
Summary
The paper addresses the challenge of missing data in spatiotemporal systems, which is prevalent in applications like air quality monitoring and urban traffic management. Traditional machine learning methods, such as recurrent and graph neural networks, often suffer from error accumulation due to iterative propagation. While recent diffusion-based methods have improved this by mitigating error propagation, they still rely on problem-agnostic Gaussian priors and require extensive iterative sampling, which can hinder efficiency. To overcome these limitations, the authors propose GiFlow, a novel Graph-Informed Flow Matching framework that utilizes a graph-informed prior derived from spatiotemporal filtering of observable signals. This approach aligns the source distribution more closely with the target, simplifying the generation process. GiFlow incorporates a hybrid vector field model that combines spatial and temporal attention mechanisms, allowing for effective joint modeling of dependencies. The authors conduct extensive experiments on both synthetic and real-world datasets, demonstrating that GiFlow outperforms existing state-of-the-art methods in spatiotemporal imputation tasks.
Methodology
The proposed GiFlow framework replaces traditional Gaussian priors with a graph-informed prior constructed through spatiotemporal filtering. It employs a hybrid vector field model that integrates spatial and temporal attention, allowing for efficient joint modeling of dependencies without the iterative propagation typical in RNNs and GNNs. The framework is designed to facilitate deterministic sampling, enhancing efficiency and robustness.
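The two ingredients described above, a prior built from the graph and deterministic sampling along a learned vector field, can be sketched as follows. This is only an illustration under assumptions: the prior here is plain neighbour averaging rather than the paper's adaptive spatiotemporal filtering, `vector_field` stands in for the trained attention-based model, and all function names are hypothetical.

```python
import numpy as np

def graph_informed_prior(values, observed_mask, adjacency, n_iters=10):
    """Fill missing node values by repeatedly averaging over graph neighbours."""
    x = np.where(observed_mask, values, 0.0).astype(float)
    degree = np.maximum(adjacency.sum(axis=1), 1.0)
    for _ in range(n_iters):
        smoothed = adjacency @ x / degree
        # Observed entries stay fixed; only missing entries are updated.
        x = np.where(observed_mask, values, smoothed)
    return x

def euler_sample(x_start, vector_field, n_steps=10):
    """Deterministic sampling: integrate dx/dt = v(x, t) from t = 0 to t = 1."""
    x = x_start.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * vector_field(x, k * dt)
    return x
```

Starting the integration from `graph_informed_prior(...)` instead of Gaussian noise is the point of an informed prior: the starting sample is already close to the target, so the flow has less distance to cover.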
Results
Extensive experiments reveal that GiFlow achieves competitive or superior performance in spatiotemporal imputation tasks compared to state-of-the-art baselines, effectively handling diverse missing patterns and rates in both synthetic and real-world datasets.
Implications
GiFlow has significant potential applications in fields requiring reliable spatiotemporal data analysis, such as environmental monitoring, urban planning, and climate forecasting. Its ability to efficiently impute missing data can enhance the reliability of analyses and decision-making processes in these domains.
Test-Time Adaptive Composition for Machine Learning as a Service (MLaaS) in IoT Environments
Efficient ML
- Introduction of a TTA-aware composability model to assess service compatibility in MLaaS compositions.
- Development of a service-level adaptation model to regulate personalized adaptations during inference.
- Demonstration of improved computational efficiency over traditional adaptive approaches.
- Focus on enabling personalized adaptation at the composition level, addressing the dynamic nature of IoT environments.
Test-Time Adaptive Composition for Machine Learning as a Service (MLaaS) in IoT Environments
Summary
This paper addresses the challenges of adapting Machine Learning as a Service (MLaaS) compositions in dynamic Internet of Things (IoT) environments. Traditional adaptive composition methods often rely on service substitution or re-composition, which can be time-consuming and ineffective. The authors propose a novel Test-Time Adaptive (TTA) composition framework that introduces a TTA-aware composability model to evaluate the compatibility of adapted services within existing compositions. Additionally, a service-level adaptation model is designed to adjust individual services during inference while maintaining overall composition performance. The framework allows for personalized adaptation at the composition level, leveraging TTA to handle domain shifts without requiring service replacement. Experimental results indicate that this approach significantly reduces computational time compared to traditional methods, demonstrating its effectiveness in enhancing the adaptability and performance of MLaaS in IoT settings.
Methodology
The authors developed a TTA composition framework that includes a TTA-aware composability model and a service-level adaptation model. The composability model employs five adaptive rules to evaluate the compatibility of services post-adaptation. The service-level adaptation model regulates updates to prevent bias and maintain performance across the composition.
Results
The proposed framework showed a significant reduction in computational time compared to traditional adaptive composition methods. It effectively maintained performance while allowing for personalized adaptations in response to dynamic data shifts in IoT environments.
Implications
The findings suggest that the TTA composition framework can enhance the robustness and efficiency of MLaaS applications in IoT, enabling better performance in real-time scenarios. This approach could be applied in various domains, such as healthcare and smart cities, where adaptability to changing conditions is crucial.
RECAP: Regression Evaluation for Continual Adaptation of Prompts
NLP
Large Language Models
- RECAP benchmark measures continual-learning phenomena at the constraint level under a proactive protocol.
- Existing prompt adaptation methods are inadequate for proactive adaptation, showing no significant performance improvement.
- The study highlights the need for new methodologies that can adapt to evolving constraints without prior feedback.
- The benchmark transforms static datasets into temporal evaluation streams, allowing for rigorous evaluation of prompt adaptation methods.
RECAP: Regression Evaluation for Continual Adaptation of Prompts
Summary
The paper introduces RECAP, a benchmark designed to evaluate the continual adaptation of prompts in production agentic systems facing evolving constraints. Current benchmarks typically assume static constraints or rely on reactive protocols, which do not reflect the proactive adaptation required in real-world applications. RECAP focuses on a proactive adapt-then-test protocol where prompt optimization methods must generalize from constraint specifications without prior exposure to test data. The authors evaluate six prompt adaptation methods across four large language models (LLMs) and three schedules with evolving constraints, revealing that these methods show no significant improvement in performance, highlighting their inadequacy for proactive settings. The study emphasizes the necessity for developing robust proactive prompt adaptation techniques that can handle evolving requirements in deployment scenarios.
Methodology
The authors developed RECAP by transforming static instruction-following datasets into temporal evaluation streams through typed operations (ADD, EDIT, DELETE) that evolve constraints. They employed a proactive adapt-then-test protocol, where methods adapt to new constraints without seeing test data. The evaluation metrics included constraint satisfaction, regression, edit uptake, unlearning fidelity, and efficiency.
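A static constraint set can be turned into a temporal stream with typed operations roughly as sketched below. The `Op` record, the dictionary representation of constraints, and the function names are assumptions made for illustration; the benchmark's actual data format is not given in this summary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    kind: str       # "ADD", "EDIT", or "DELETE"
    key: str        # hypothetical constraint identifier
    text: str = ""  # constraint wording (unused for DELETE)

def apply_ops(constraints, ops):
    """Return a new constraint dict after applying the typed operations."""
    updated = dict(constraints)
    for op in ops:
        if op.kind == "ADD":
            if op.key in updated:
                raise KeyError(f"constraint {op.key!r} already exists")
            updated[op.key] = op.text
        elif op.kind == "EDIT":
            if op.key not in updated:
                raise KeyError(f"constraint {op.key!r} does not exist")
            updated[op.key] = op.text
        elif op.kind == "DELETE":
            del updated[op.key]
        else:
            raise ValueError(f"unknown operation {op.kind!r}")
    return updated

def build_stream(initial, schedule):
    """Return the constraint set in force after each step of the schedule."""
    current = dict(initial)
    snapshots = []
    for ops in schedule:
        current = apply_ops(current, ops)
        snapshots.append(current)
    return snapshots
```

In an adapt-then-test loop, a prompt adaptation method would see only the snapshot for step t before being evaluated on that step's test inputs.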
Results
The evaluation of six prompt adaptation methods revealed that none of the methods significantly improved performance in the proactive setting, even with increased latency. This indicates that existing methods, which are designed for offline or reactive scenarios, are not suitable for the proactive adaptation required in real-world applications.
Implications
The findings suggest a critical need for the development of new prompt adaptation methods that can effectively handle evolving constraints in production environments. This could lead to more robust and reliable agentic systems capable of maintaining compliance with dynamic requirements.
Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels
Theory
Efficient ML
Optimization
- Introduces a Mathematics of Arrays (MoA) framework for optimizing transformer attention mechanisms.
- Achieves theoretical minimum memory traffic by eliminating intermediate arrays through algebraic construction.
- Demonstrates a formal lower bound for data movement, significantly reducing memory costs compared to standard implementations.
- Projects substantial speedup and energy reduction in real-world applications, especially at large sequence lengths.
Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels
Summary
This paper addresses the inefficiencies of the standard implementation of the attention mechanism in transformer models, which incurs quadratic memory traffic relative to sequence length. The authors introduce a Mathematics of Arrays (MoA) framework that reformulates scaled dot-product attention and its softmax normalization to minimize memory usage. By deriving a Denotational Normal Form (DNF), the authors eliminate all intermediate arrays, achieving the theoretical minimum memory traffic through algebraic construction. This approach contrasts with existing optimization strategies that often rely on hardware-specific solutions or empirical tuning. The paper also presents a formal lower bound argument demonstrating that the DNF achieves O(ndk + ndv) data movement, significantly improving upon the standard implementation's O(n² + ndk + ndv). The authors project substantial performance improvements, including 2–100× speedup and 2–50× energy reduction across various deployment scenarios, particularly at exascale sequence lengths. The MoA framework provides a rigorous foundation for developing performance-portable AI kernels, relevant to both DARPA edge-deployment and DOE exascale computing priorities.
Methodology
The authors apply the Mathematics of Arrays (MoA) framework to reformulate the attention mechanism, deriving a Denotational Normal Form (DNF) that eliminates intermediate arrays. They utilize systematic application of MoA's psi-reduction calculus on a PyTorch reference implementation and verify the results numerically at full double-precision accuracy.
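To show what avoiding the n × n intermediate means in practice, and how a reformulated kernel can be checked numerically against a reference, the sketch below compares standard attention with a blockwise version that keeps only running softmax statistics. Note that this uses the well-known online-softmax trick, not the paper's DNF or psi-reduction derivation, and it uses NumPy rather than PyTorch; the function names are illustrative.

```python
import numpy as np

def reference_attention(Q, K, V):
    """Standard attention; materialises the full (n, n) score matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

def streaming_attention(Q, K, V, block=4):
    """Same result, processing keys in blocks with running max and sum."""
    n, dk = Q.shape
    out = np.zeros((n, V.shape[1]))
    run_max = np.full(n, -np.inf)
    run_sum = np.zeros(n)
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(dk)            # only an (n, block) slice
        new_max = np.maximum(run_max, s.max(axis=1))
        scale = np.exp(run_max - new_max)     # rescale earlier partial results
        p = np.exp(s - new_max[:, None])
        run_sum = run_sum * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        run_max = new_max
    return out / run_sum[:, None]
```

Comparing the two outputs with `np.allclose` in double precision mirrors the kind of numerical verification the methodology describes.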
Results
The derived DNF achieves O(ndk + ndv) data movement, matching the information-theoretic minimum and outperforming the standard implementation's O(n² + ndk + ndv). The predictive performance model indicates potential speedups of 2–100× and energy reductions of 2–50× across various deployment scenarios, with benefits increasing at exascale sequence lengths.
Implications
This work has significant implications for the development of efficient transformer models, particularly in resource-constrained environments. The MoA framework could lead to more portable and optimized AI kernels, enhancing performance in both edge-deployment scenarios and large-scale scientific computing.
OffQ: Taming Structured Outliers in LLM Quantization by Offsetting
Large Language Models
Efficient ML
NLP
- Introduces OffQ, a method to mitigate activation outliers in low-bit quantization.
- Utilizes a top-1 PCA to identify low-dimensional outlier subspaces.
- Concentrates outliers into fewer channels and converts them into group-wise offsets.
- Achieves effective W4A4KV4 quantization with uniform precision.
OffQ: Taming Structured Outliers in LLM Quantization by Offsetting
Summary
The paper presents OffQ, a novel method aimed at improving low-bit quantization of large language models (LLMs) by addressing the challenge posed by activation outliers. These outliers can significantly degrade performance during quantization, particularly in the context of 4-bit quantization (W4A4KV4). OffQ employs a unique offsetting mechanism that first identifies a low-dimensional subspace of outliers using a top-1 PCA approach. It then concentrates high-magnitude activations into a single channel through rotation, allowing the outlier energy to be transformed into a shared offset. This strategy reduces the standard deviation of activations, making them more amenable to effective quantization. The method enables uniform-grid and uniform-precision quantization without the need for mixed-precision components or complex transformations. Extensive experiments demonstrate that OffQ consistently outperforms state-of-the-art baselines across various LLM architectures and benchmarks, achieving improved model accuracy while maintaining low-bit efficiency.
Methodology
OffQ identifies structured activation outliers through a top-1 PCA approach, rotates activations to concentrate these outliers into a minimal number of channels, and applies Hadamard rotation to convert the concentrated outlier energy into group-wise offsets. This process reduces the standard deviation of the activations, facilitating effective quantization.
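The first two steps, finding the dominant outlier direction with top-1 PCA and rotating so that this direction lands in a single channel, can be sketched as below. A Householder reflection is used as one convenient orthogonal transform; whether OffQ constructs its rotation this way is an assumption, and the subsequent Hadamard step that produces group-wise offsets is not shown. Function names are hypothetical.

```python
import numpy as np

def top1_direction(X):
    """Leading principal direction of X (rows = tokens, columns = channels)."""
    centered = X - X.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def householder_to_first_axis(u):
    """Orthogonal, symmetric H with H @ u = e_0 for a unit-norm vector u."""
    e0 = np.zeros_like(u)
    e0[0] = 1.0
    w = u - e0
    norm = np.linalg.norm(w)
    if norm < 1e-12:          # u already equals e_0
        return np.eye(len(u))
    w = w / norm
    return np.eye(len(u)) - 2.0 * np.outer(w, w)
```

Applying `X @ H` moves the component of every activation along the outlier direction into channel 0, so the remaining channels have a much smaller dynamic range. Because `H` is orthogonal, the transform can be undone, or folded into adjacent weights, without changing the layer's output.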
Results
The experiments conducted across various LLM architectures and benchmarks show that OffQ significantly enhances the perplexity and accuracy of models under 4-bit quantization settings compared to state-of-the-art methods, while preserving the efficiency of low-bit inference.
Implications
OffQ provides a practical solution for deploying large language models in resource-constrained environments, enhancing the feasibility of low-cost LLM deployments without compromising model performance.
STELLAR: Spatio-Temporal Environmental Learning with Latent Alignment and Refinement for Long-Tailed Species Distribution Modeling
Graph Learning
Time Series
Generative Models
- STELLAR addresses the limitations of existing JSDM approaches by integrating spatio-temporal dynamics and community structure.
- The framework employs a Graph-Temporal Encoder to capture historical environmental changes and species interactions.
- A novel Context-Anchored Latent Alignment mechanism enhances species clustering based on environmental preferences.
- The Imbalance-Aware Decoupled Decoding module effectively tackles the long-tail distribution of species.
STELLAR: Spatio-Temporal Environmental Learning with Latent Alignment and Refinement for Long-Tailed Species Distribution Modeling
Summary
The paper presents STELLAR, a novel framework for Joint Species Distribution Modeling (JSDM) that addresses the challenges of spatio-temporal dynamics and long-tailed species distributions. Traditional JSDM approaches often treat environmental factors and species distributions in isolation, leading to inadequate modeling of complex ecological interactions. STELLAR integrates three key components: a Graph-Temporal Encoder that captures spatial and temporal dynamics through graph attention and recurrent units; a Context-Anchored Latent Alignment mechanism that structures the latent space to cluster species based on shared environmental preferences; and an Imbalance-Aware Decoupled Decoding module that employs Asymmetric Loss to focus on rare species, mitigating the long-tail problem. The framework was validated using a large-scale eBird dataset, demonstrating superior performance over existing models, particularly in predicting rare species and elucidating species interactions. The collaboration with conservation biologists ensured the model's ecological validity and operational relevance for biodiversity monitoring.
Methodology
The methodology involves a three-component framework: (1) a Graph-Temporal Encoder for capturing spatial and temporal dynamics, (2) a Context-Anchored Latent Alignment mechanism for structuring the latent space, and (3) an Imbalance-Aware Decoupled Decoding module that utilizes Asymmetric Loss to prioritize learning from rare species samples.
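For the third component, the sketch below shows the commonly used form of asymmetric loss for a single binary label: positives get a focal-style term with exponent `gamma_pos`, while negatives are first shifted by a probability margin and then down-weighted with a larger exponent `gamma_neg`. Whether STELLAR uses exactly this form is an assumption, and the default values are illustrative only.

```python
import math

def asymmetric_loss(prob, target, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Asymmetric loss for one predicted probability and a 0/1 target."""
    if target == 1:
        # Positive term: (1 - p)^gamma_pos * log(p)
        return -((1.0 - prob) ** gamma_pos) * math.log(max(prob, eps))
    # Negative term: easy negatives (p <= margin) contribute nothing.
    shifted = max(prob - margin, 0.0)
    return -(shifted ** gamma_neg) * math.log(max(1.0 - shifted, eps))
```

With many absent species per site, most labels are easy negatives. Discarding or shrinking their contribution keeps the gradient from being dominated by them, which is how such a loss helps with rare species.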
Results
STELLAR significantly outperformed state-of-the-art baselines in predicting rare species and provided interpretable insights into species interactions, demonstrating its effectiveness in addressing the challenges of long-tailed species distributions.
Implications
The proposed framework has potential applications in biodiversity monitoring and conservation planning, enabling more accurate predictions of species distributions and informing conservation strategies.
GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution
Theory
Efficient ML
Large Language Models
- GRASP formalizes pretraining data attribution as subset-level counterfactual utility prediction.
- The method incorporates a quadratic geometric penalty to model interactions between data subsets.
- GRASP significantly reduces artifact construction costs and improves efficiency in evaluating data subsets.
- The approach demonstrates superior performance compared to existing scalable data attribution methods.
GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution
Summary
The paper introduces GRASP, a novel method for scalable data attribution in machine learning that addresses the limitations of existing methods which assign isolated utility scores to individual training examples. Traditional approaches fail to account for the interactions between data subsets, leading to suboptimal data curation. GRASP reframes data attribution as subset-level counterfactual utility prediction, incorporating an interaction-aware surrogate that models subset dynamics through a quadratic geometric penalty. This method is grounded in a theoretical framework that ensures smoothness in utility predictions. By utilizing low-dimensional feature sketches, GRASP achieves significant efficiency improvements, allowing for rapid evaluation of large data subsets without the need for extensive retraining. The authors demonstrate that GRASP outperforms existing scalable baselines, achieving over double the task-level rank correlation for counterfactual subset fidelity while drastically reducing the time and resources required for artifact construction. The findings suggest that GRASP can effectively enhance data curation processes in large-scale pretraining scenarios, providing a robust foundation for optimizing pretraining corpora across various applications.
Methodology
GRASP employs a geometry-aware residual alignment approach that combines an additive relevance core with a quadratic geometric penalty to predict the counterfactual utility of data interventions. It utilizes low-dimensional feature sketches to enhance computational efficiency and avoids the need for combinatorial retraining by selecting weights and modes in development environments before testing.
Results
GRASP achieved over double the task-level rank correlation for counterfactual subset fidelity compared to existing methods. It also reduced the time required for artifact construction by nearly an order of magnitude, scoring 100,000 subsets in just 5 seconds.
Implications
The development of GRASP has significant implications for data curation in machine learning, particularly in optimizing pretraining datasets for large language models and other applications. Its ability to efficiently evaluate and rank data subsets can lead to improved model performance and more effective data management strategies.
Principles and Practice of Deep Representation Learning: or a Mathematical Theory of Memory
Theory
Optimization
Efficient ML
- Establishes a mathematical framework for understanding deep representation learning.
- Unifies classical and modern approaches to data representation and compression.
- Introduces auto-encoding architectures for self-correction and improvement.
- Connects theoretical principles to practical applications in AI tasks.
Principles and Practice of Deep Representation Learning: or a Mathematical Theory of Memory
Summary
This manuscript serves as a textbook aimed at providing a systematic introduction to the mathematical and computational principles of deep representation learning, focusing on how to effectively learn low-dimensional distributions from high-dimensional data. The authors argue that understanding these representations is crucial for developing intelligent systems, which they define as the ability to create and correct memory from empirical knowledge. The book contrasts deductive methodologies with the prevalent inductive approaches in AI, advocating for a principled understanding of intelligence as a scientific subject. It presents a unified framework that reconciles classical analytical models with modern data-driven techniques, emphasizing the importance of compression as a common principle across various methodologies. The content is organized into six chapters, covering topics from basic models like PCA and ICA to advanced concepts such as auto-encoding architectures and Bayesian inference. The authors aim to clarify misconceptions about deep neural networks and provide guiding principles for future intelligent system development.
Methodology
The authors employ a deductive approach to develop a theoretical framework that integrates classical models (like PCA and ICA) with modern deep learning techniques. They explore concepts such as compression, auto-encoding, and Bayesian inference to create a comprehensive understanding of representation learning.
Results
The book provides a structured methodology for learning low-dimensional representations, demonstrating how various computational techniques can be unified under a common theoretical framework. It also offers insights into new architecture designs for deep learning that are simpler and more efficient.
Implications
The framework proposed in this book has the potential to enhance the development of intelligent systems by providing a clearer understanding of representation learning. It could lead to more effective AI applications in various domains, including image and text generation, conditional estimation, and data completion.
GeoGNN: Time Series Geo-Localization using Two-Tower Graph Neural Networks
Graph Learning
Time Series
- GeoGNN is a two-tower architecture combining spatial and temporal learning for time series geolocalization.
- The model leverages graph neural networks to embed geographic candidates and extract features from time series data.
- GeoGNN outperforms traditional baselines, enhancing geolocalization accuracy by about 27% on average.
- The approach addresses unique challenges in geolocalizing time series, which lack explicit geographic cues.
GeoGNN: Time Series Geo-Localization using Two-Tower Graph Neural Networks
Summary
This paper introduces GeoGNN, a novel approach for time series geolocalization aimed at inferring the geographic origin of raw time series data. The authors highlight the importance of geolocalization in providing spatial context to time series, which can enhance various location-aware applications. The proposed GeoGNN architecture consists of two towers: a spatial tower that learns embeddings of geographic cell candidates using a geographic adjacency graph, and a temporal tower that extracts informative representations from time series data. During inference, the model matches temporal representations with geographic embeddings using dot-product similarity, supplemented by an auxiliary classification head. The authors establish strong baselines by adapting methods from image geolocalization and conduct extensive experiments on large-scale electricity consumption datasets from the US and Spain. The results demonstrate that GeoGNN significantly outperforms existing methods, achieving an average increase of approximately 27% in both fine- and coarse-grained geolocalization accuracy.
Methodology
GeoGNN employs a two-tower architecture where the spatial tower uses graph attention networks to learn embeddings of geographic cell candidates, while the temporal tower utilizes a TimesNet-based backbone to extract features from time series data. The embeddings are fused in a two-tower fusion module, and an auxiliary classification head is used for final predictions.
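The matching step of a two-tower model reduces to scoring every candidate cell embedding against the time series embedding with a dot product and ranking the results. The sketch below assumes both embeddings have already been produced by their towers; `rank_cells` is a hypothetical helper, and the auxiliary classification head is omitted.

```python
import numpy as np

def rank_cells(series_embedding, cell_embeddings, top_k=3):
    """Rank geographic cells by dot-product similarity to one series embedding.

    series_embedding: shape (d,), output of the temporal tower (assumed given).
    cell_embeddings:  shape (n_cells, d), output of the spatial tower.
    """
    scores = cell_embeddings @ series_embedding
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]
```

Because the cell embeddings do not depend on the query, they can be computed once and reused for every incoming time series.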
Results
GeoGNN achieved the best performance across multiple datasets, improving geolocalization accuracy by approximately 27% on average compared to established baselines, which included retrieval-based methods and classification models adapted from image geolocalization.
Implications
The ability to geolocalize time series data can enhance decision-making in various fields such as healthcare, environmental monitoring, and smart city applications. It also raises concerns about privacy in data collection practices, highlighting the need for more robust privacy measures.
The Geometry of Last-Layer Model Stealing
Theory
Large Language Models
NLP
- Introduces a geometric perspective on last-layer model stealing using exterior differential systems.
- Identifies the polar space of the quadratic generator as key to recovering the projection matrix.
- Demonstrates that the intrinsic dimension of the hidden-state manifold reveals information about nonlinear sublayers.
- Establishes a clear identifiability boundary for parameters beneath the last layer.
The Geometry of Last-Layer Model Stealing
Summary
This paper provides a geometric interpretation of the last-layer model-stealing attack originally proposed by Carlini et al. The author utilizes the framework of exterior differential systems (EDS) to analyze the recovery of the final embedding projection layer in transformer models. The study reveals that the logit vectors emitted by a transformer form a common zero locus of an ideal characterized by both linear and quadratic components. The recovery of the projection matrix is governed by the polar space of the quadratic part, which corresponds to the tangent space of the output manifold. The author establishes that recovery is successful under specific regularity conditions analogous to Kähler-regularity. Additionally, the paper explores the intrinsic dimension of the recoverable hidden-state manifold, identifying it as a crucial observable that can detect nonlinear sublayers and measure their effective rank. The findings indicate that many parameters of these sublayers lie in non-identifiable fibers, explaining the limitations of the model-stealing attack to only the last layer. The paper emphasizes that while the EDS framework organizes the findings, the results are fundamentally grounded in classical neural network identifiability principles.
Methodology
The author employs exterior differential systems to recast the last-layer model-stealing attack, analyzing the geometry of logit vectors and their recovery through singular value decomposition and the properties of quadratic surfaces. The study includes numerical verification on a controlled toy model.
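The linear part of this recovery can be illustrated with a small sketch. It is not the author's toy model: the sizes, the random stand-in hidden states, and the names `W`, `H`, and `W_hat` are assumptions, and the final normalization layer (a plausible source of the quadratic structure) is omitted. The sketch only shows that the singular value decomposition of observed logits exposes the hidden dimension and recovers the projection matrix up to an invertible linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, n_queries = 50, 8, 200       # toy sizes (assumptions)

W = rng.normal(size=(vocab, hidden))        # secret projection matrix
H = rng.normal(size=(n_queries, hidden))    # stand-in for final hidden states
logits = H @ W.T                            # what the attacker observes

# The logit matrix has rank `hidden`, so the singular values collapse
# to numerical zero after the first `hidden` entries.
U, S, Vt = np.linalg.svd(logits, full_matrices=False)
est_hidden = int(np.sum(S > 1e-8 * S[0]))

# The leading right singular vectors span the column space of W,
# so W is recovered only up to an invertible hidden x hidden matrix G.
W_hat = Vt[:est_hidden].T * S[:est_hidden]

# Verification only: this step uses the secret W, which an attacker lacks.
G, *_ = np.linalg.lstsq(W_hat, W, rcond=None)
residual = np.linalg.norm(W_hat @ G - W) / np.linalg.norm(W)

print("estimated hidden dimension:", est_hidden)
print("relative residual after alignment:", residual)
```

The unknown matrix `G` is a simple instance of a non-identifiable direction: any invertible reparameterization of the hidden space yields the same logits.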
Results
The paper confirms that recovery of the projection matrix is feasible under specific regularity conditions. It also reveals that the intrinsic dimension of the hidden-state manifold can detect nonlinear sublayers, and many parameters remain non-identifiable, which limits the attack's effectiveness to just the last layer.
Implications
The findings have implications for understanding the limitations of model-stealing attacks on language models, emphasizing the importance of the last layer while providing insights into the structure of hidden layers. This could inform the design of more secure models and enhance the understanding of model identifiability.
Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors
Reinforcement Learning
Large Language Models
Graph Learning
- The K-nearest neighbour approach using knowledge graphs achieves competitive performance in predicting transcriptomic perturbations.
- Reinforcement learning can optimize LLMs to enhance their predictive capabilities for biological responses.
- The proposed methods generalize well to unseen perturbations, indicating their robustness.
- The RL-optimized LLM improves performance on downstream tasks, such as differential expression prediction.
Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors
Summary
This paper addresses the challenge of predicting the effects of unseen gene knockout perturbations on transcriptomic gene expression using knowledge graphs and large language models (LLMs). The authors propose a K-nearest neighbour (KNN) approach based on biological knowledge graphs, which demonstrates competitive performance in predicting perturbation effects. They further enhance this model by training an LLM via reinforcement learning (RL) to optimize the selection of gene neighborhoods, leading to improved predictive accuracy. The results indicate that the KNN model outperforms most existing methods in out-of-distribution perturbation predictions. Additionally, the RL-optimized LLM performs on par with state-of-the-art methods on specific cell lines and improves performance on downstream tasks such as differential expression prediction. This work highlights the potential of knowledge graphs as effective model priors and suggests that RL can refine LLMs into robust tools for predicting complex biological responses.
Methodology
The authors utilize a K-nearest neighbour approach informed by biological knowledge graphs to predict gene perturbation effects. They then train a reasoning LLM using reinforcement learning to refine the selection of gene neighborhoods, rewarding the model for improvements in predictive performance. This combination allows for effective generalization to unseen perturbations.
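A minimal sketch of the KNN idea follows. It assumes that each gene has a knowledge-graph embedding and that neighbours are ranked by cosine similarity; the summary does not say how the authors define neighbourhoods. The gene names, embeddings, and effect vectors are made-up illustrative values, and `knn_predict` is a hypothetical helper.

```python
import numpy as np

# Hypothetical knowledge-graph embeddings, one vector per gene (assumption).
kg_embed = {
    "G1": np.array([1.0, 0.0]),
    "G2": np.array([0.9, 0.1]),
    "G3": np.array([0.0, 1.0]),
    "G4": np.array([0.1, 0.9]),
    "G5": np.array([0.8, 0.2]),   # knockout with no measured effect
}

# Made-up expression changes for knockouts that were observed.
effects = {
    "G1": np.array([1.0, 0.0, -1.0]),
    "G2": np.array([0.8, 0.2, -0.6]),
    "G3": np.array([-1.0, 1.0, 0.0]),
    "G4": np.array([-0.8, 0.6, 0.2]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def knn_predict(query_gene, k=2):
    """Average the effects of the k observed genes closest to the query."""
    q = kg_embed[query_gene]
    sims = {g: cosine(q, kg_embed[g]) for g in effects}
    neighbours = sorted(sims, key=sims.get, reverse=True)[:k]
    prediction = np.mean([effects[g] for g in neighbours], axis=0)
    return neighbours, prediction

neighbours, pred = knn_predict("G5", k=2)
print(neighbours, pred)
```

In this framing, the RL-trained LLM would take over the neighbour-selection step and be rewarded when its chosen neighbourhood improves the prediction.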
Results
The KNN model outperforms nearly all existing methods for out-of-distribution perturbation predictions. The RL-optimized LLM achieves performance comparable to state-of-the-art models on specific cell lines and shows improved results on the differential expression prediction task, despite not being directly trained for it.
Implications
This research suggests that integrating knowledge graphs with LLMs can significantly enhance predictive modeling in computational biology. The findings may lead to more accurate virtual cell models and improve our understanding of biological processes, ultimately aiding in drug discovery and personalized medicine.
Beyond Linear and Overcomplete Regimes: A Mean-Field Analysis of Bottleneck Autoencoders
Theory
Optimization
- The paper provides a theoretical framework for analyzing nonlinear bottleneck autoencoders using mean-field methods.
- It establishes that the learning dynamics of finite-width networks closely track the mean-field risk trajectory.
- The study highlights the importance of both nonlinearity and bottleneck constraints in representation learning.
- The authors derive a system of coupled PDEs to characterize the learning dynamics of BAEs.
Beyond Linear and Overcomplete Regimes: A Mean-Field Analysis of Bottleneck Autoencoders
Summary
This paper addresses the theoretical understanding of nonlinear bottleneck autoencoders (BAEs), which have been underexplored in the context of representation learning. The authors derive mean-field (MF) learning dynamics for both the encoder and decoder in a finite-dimensional bottleneck setting, providing a framework that captures the optimization dynamics of these models. They demonstrate that the empirical risk of finite-width networks trained with stochastic gradient descent closely follows the MF risk trajectory, indicating that finite networks can approximate the infinite-width solutions effectively. This work fills a gap in the theoretical analysis of autoencoders by focusing on the interplay between nonlinearity and compression, which is essential for understanding the learned representations in practical applications.
Methodology
The authors employ mean-field analysis to derive the learning dynamics of nonlinear bottleneck autoencoders. They focus on shallow autoencoders with a one-dimensional bottleneck and characterize the training dynamics through a system of coupled partial differential equations (PDEs). The analysis also involves stochastic gradient descent to relate empirical risk with mean-field risk.
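The finite-width side of this setup can be sketched as follows. This is a toy under stated assumptions and not the authors' architecture: a width-N tanh encoder and decoder around a scalar bottleneck, outputs averaged with a 1/N factor, SGD steps scaled by N (the usual mean-field convention), and synthetic data on a line. All names and hyperparameters are hypothetical.

```python
import numpy as np

def make_data(rng, m, u):
    # Synthetic inputs on a 1-D line spanned by u (assumption).
    return np.outer(rng.uniform(-1.0, 1.0, size=m), u)

def forward(params, X):
    W, a, c, B = params
    N = a.size
    He = np.tanh(X @ W.T)              # encoder hidden units, (m, N)
    z = He @ a / N                     # scalar bottleneck with 1/N averaging
    Hd = np.tanh(np.outer(z, c))       # decoder hidden units, (m, N)
    return He, z, Hd, Hd @ B / N       # reconstruction, (m, d)

def risk(params, X):
    Xhat = forward(params, X)[3]
    return 0.5 * np.mean(np.sum((Xhat - X) ** 2, axis=1))

def sgd_step(params, X, lr):
    W, a, c, B = params
    N, m = a.size, X.shape[0]
    He, z, Hd, Xhat = forward(params, X)
    E = Xhat - X
    gB = Hd.T @ E / (N * m)
    d_dec = (E @ B.T / N) * (1.0 - Hd ** 2)    # grad w.r.t. decoder pre-activations
    gc = (d_dec * z[:, None]).sum(axis=0) / m
    dz = d_dec @ c                             # grad w.r.t. the bottleneck value
    ga = He.T @ dz / (N * m)
    d_enc = (np.outer(dz, a) / N) * (1.0 - He ** 2)
    gW = d_enc.T @ X / m
    step = lr * N                              # mean-field step-size scaling
    return (W - step * gW, a - step * ga, c - step * gc, B - step * gB)

def train(N, d=10, steps=3000, batch=32, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    u = np.ones(d) / np.sqrt(d)
    X_eval = make_data(rng, 512, u)
    params = (rng.normal(size=(N, d)), rng.normal(size=N),
              rng.normal(size=N), rng.normal(size=(N, d)))
    traj = [risk(params, X_eval)]
    for s in range(1, steps + 1):
        params = sgd_step(params, make_data(rng, batch, u), lr)
        if s % 500 == 0:
            traj.append(risk(params, X_eval))
    return np.array(traj)

risk_narrow = train(N=50)
risk_wide = train(N=400)
print("N=50 :", risk_narrow)
print("N=400:", risk_wide)
```

Comparing the two printed trajectories gives an informal view of how the held-out risk behaves as the width grows; it does not reproduce the paper's high-probability bound.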
Results
The main results indicate that the empirical risk of finite-width networks trained with stochastic gradient descent closely tracks the mean-field risk trajectory with high probability. At optimality, the finite-width risk converges to the mean-field optimum, demonstrating that finite networks are capable of approximating the solutions of infinite-width networks.
Implications
The findings have significant implications for the design and understanding of neural networks in representation learning, particularly in applications requiring efficient compression of high-dimensional data. This work can enhance the theoretical foundation for deploying autoencoders in critical domains such as healthcare and data analysis.