AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
48 papers today · Updated every 8 hours · 7 days of history
Superposition Is Not Necessary: A Mechanistic Interpretability Analysis of Transformer Representations for Time Series Forecasting
Time Series
- A single-layer transformer can match the performance of deeper models in time series forecasting.
- Expanding the dictionary size in sparse autoencoders yields minimal impact on forecasting performance.
- Transformers do not rely on superposition for effective representation in time series tasks.
- The findings suggest that the complexity of transformers may not be justified for time series forecasting.
Summary
This paper investigates the internal representations of transformer models, specifically PatchTST, in the context of time series forecasting. The author addresses the ongoing debate regarding the effectiveness of complex transformer architectures compared to simpler models like DLinear. By employing sparse autoencoders (SAEs) for mechanistic interpretability, the study reveals that a single-layer, narrow-dimensional transformer can achieve comparable forecasting performance to deeper configurations across various benchmarks. The analysis shows that expanding the dictionary size of the SAEs results in negligible changes in performance, indicating that the representations learned by the transformers are sparse and stable, rather than relying on superposition. This finding suggests that the competitive performance of transformers in time series forecasting does not necessitate the rich compositional representations that are crucial in natural language processing, providing a mechanistic explanation for the success of simpler linear models in this domain.
Methodology
The study utilizes sparse autoencoders (SAEs) to analyze the internal representations of the PatchTST transformer model. The author conducts experiments with varying dictionary sizes on the post-GELU intermediate feedforward network (FFN) activations to assess the impact on forecasting performance and the nature of the learned representations.
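As a rough illustration of this probing setup, the sketch below trains a minimal sparse autoencoder on cached FFN activations; the dictionary size, sparsity penalty, and activation dimensions are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE probe for FFN activations (illustrative dimensions)."""
    def __init__(self, d_act: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(d_act, dict_size)
        self.decoder = nn.Linear(dict_size, d_act)

    def forward(self, h):
        z = torch.relu(self.encoder(h))   # sparse latent codes
        return self.decoder(z), z

# Train to reconstruct post-GELU FFN activations with an L1 sparsity penalty.
d_act, dict_size = 128, 1024              # hypothetical sizes
sae = SparseAutoencoder(d_act, dict_size)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
h = torch.randn(4096, d_act)              # stand-in for cached activations
for _ in range(100):
    recon, z = sae(h)
    loss = ((recon - h) ** 2).mean() + 1e-3 * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```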
Results
The analysis demonstrates that a single-layer PatchTST achieves competitive forecasting performance across benchmarks, and that increasing the dictionary size of the SAEs does not significantly alter performance. Additionally, targeted interventions on dominant latent features produce minimal changes in forecasts, indicating that the representations are sparse and stable, contradicting the superposition hypothesis.
Implications
These findings challenge the necessity of complex transformer architectures for time series forecasting, suggesting that simpler models may suffice. This could influence future research and model development in time series analysis, emphasizing the need for mechanistic understanding over mere empirical performance comparisons.
Towards Metric-Faithful Neural Graph Matching
Graph Learning
Theory
Optimization
- Introduces a geometric framework linking encoder geometry to GED estimation quality.
- Demonstrates that bi-Lipschitz encoders improve stability and accuracy in GED surrogates.
- Establishes a theoretical basis for the impact of encoder distortion on downstream estimators.
- Empirical results show significant performance improvements using geometry-aware variants.
Summary
This paper addresses the challenge of estimating Graph Edit Distance (GED), a crucial metric for structural graph similarity that is NP-hard to compute. The authors propose a theoretical framework that connects the geometry of graph encoders, specifically focusing on bi-Lipschitz properties, to the quality of GED estimation. They categorize neural GED estimators into two classes: graph similarity predictors and matching-based methods, and demonstrate that the encoder's geometric properties significantly influence the performance of these estimators. By employing a bi-Lipschitz encoder, FSW-GNN, as a replacement in existing neural architectures, the authors show that the resulting models yield improved GED predictions and ranking stability across various benchmarks. Their findings suggest that the geometry of the encoder is a critical factor in enhancing the faithfulness of neural graph matching systems, providing a new perspective on the design of these models.
Methodology
The authors develop a theoretical framework that analyzes the relationship between encoder geometry and GED estimation. They focus on bi-Lipschitz properties of graph encoders and apply FSW-GNN, a bi-Lipschitz WL-equivalent encoder, in various neural GED architectures. The methodology includes theoretical proofs and empirical evaluations across benchmark datasets to validate their claims.
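A bi-Lipschitz encoder keeps embedding distances within constant factors of the target metric, i.e. c1 · d_GED(G, G') ≤ ‖f(G) − f(G')‖ ≤ c2 · d_GED(G, G'). A minimal empirical check of this property on sampled pairs might look as follows (the distance arrays are synthetic placeholders, not FSW-GNN outputs):

```python
import numpy as np

def bilipschitz_constants(emb_dists, graph_dists, eps=1e-9):
    """Estimate c1, c2 with c1*d_G <= d_emb <= c2*d_G over sampled pairs."""
    ratios = emb_dists / np.maximum(graph_dists, eps)
    return ratios.min(), ratios.max()

# Placeholder pairwise distances; in practice d_emb would come from the
# encoder and d_G from (approximate) GED on sampled graph pairs.
rng = np.random.default_rng(0)
d_G = rng.uniform(1.0, 10.0, size=1000)
d_emb = d_G * rng.uniform(0.8, 1.3, size=1000)   # a well-conditioned encoder
c1, c2 = bilipschitz_constants(d_emb, d_G)
print(f"distortion c2/c1 = {c2 / c1:.2f}")       # closer to 1 is better
```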
Results
The implementation of geometry-aware encoders led to significant improvements in GED prediction and ranking metrics across multiple model families. The results indicate that better encoder geometry enhances the conditioning of surrogate quantities used for GED estimation, confirming the theoretical assertions made in the paper.
Implications
The findings suggest that incorporating geometric considerations into the design of neural graph matching systems can lead to more accurate and reliable models. This has potential applications in fields such as molecular retrieval, program analysis, and structured search, where graph similarity is crucial.
A Robust Foundation Model for Conservation Laws: Injecting Context into Flux Neural Operators via Recurrent Vision Transformers
Theory
Efficient ML
Time Series
- Introduces a context-conditioned Flux Neural Operator that leverages Recurrent Vision Transformers for enhanced performance in solving conservation laws.
- The model infers latent numerical flux operators from short observed trajectories, allowing for adaptability without explicit PDE knowledge.
- Demonstrates improved autoregressive stability and robustness compared to traditional PDE foundation models on benchmark problems.
- Preserves the conservative structure of numerical updates, crucial for accurate long-time predictions in nonlinear hyperbolic problems.
Summary
This paper presents a novel architecture that enhances the Flux Neural Operator (Flux NO) by integrating a context injection mechanism based on Recurrent Vision Transformers (ViTs). The proposed model operates as a hypernetwork, which captures solution dynamics over a finite temporal window and encodes them using a recurrent ViT. This allows the model to generate parameters for a context-conditioned neural operator, enabling it to solve conservation laws without needing explicit knowledge of the governing equations or partial differential equation (PDE) coefficients. The authors demonstrate that their method retains the robustness and generalization capabilities of Flux NO while providing reliable numerical solutions across various conservative systems, including those with previously unseen fluxes. The architecture is particularly beneficial for long-time predictions and autoregressive stability, outperforming standard neural operators in terms of out-of-distribution (OOD) robustness and conservation law adherence. The code for the implementation is publicly available.
Methodology
The authors develop a hypernetwork architecture that utilizes a recurrent Vision Transformer to encode solution dynamics from short trajectories. This encoded context is then used to condition a finite-volume flux operator, allowing the model to adaptively learn and predict numerical fluxes for conservation laws without direct access to the governing equations.
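The conservative structure referred to here is the standard finite-volume update, in which cell averages change only through interface fluxes. A minimal sketch with a hypothetical MLP flux conditioned on a context vector (all module names, shapes, and the context source are assumptions) is:

```python
import torch
import torch.nn as nn

class ContextFlux(nn.Module):
    """Hypothetical context-conditioned numerical flux F_{i+1/2}."""
    def __init__(self, d_ctx: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + d_ctx, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, uL, uR, ctx):
        x = torch.cat([uL, uR, ctx.expand(uL.shape[0], -1)], dim=-1)
        return self.net(x).squeeze(-1)

def fv_step(u, ctx, flux, dt, dx):
    """Conservative update: u_i <- u_i - dt/dx * (F_{i+1/2} - F_{i-1/2})."""
    uL, uR = u[:-1, None], u[1:, None]
    F = flux(uL, uR, ctx)                 # interior interface fluxes
    u_new = u.clone()
    u_new[1:-1] = u[1:-1] - dt / dx * (F[1:] - F[:-1])
    return u_new                          # boundary cells left untouched

d_ctx = 16
ctx = torch.randn(1, d_ctx)               # stand-in for the recurrent ViT context
x = torch.linspace(0.0, 3.14, 100)
u = torch.sin(x)
u = fv_step(u, ctx, ContextFlux(d_ctx), dt=1e-3, dx=float(x[1] - x[0]))
```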
Results
The proposed model demonstrates superior performance in terms of robustness and generalization on one-dimensional conservation law benchmarks and a diffusive Burgers-type problem. It effectively maintains the conservative structure of numerical updates, leading to stable long-time predictions and improved performance on unseen flux scenarios.
Implications
This work has significant implications for scientific computing and numerical simulations, particularly in fields requiring the solution of conservation laws. The ability to adaptively learn from context can enhance the efficiency and accuracy of simulations across various physical regimes, potentially benefiting applications in fluid dynamics, climate modeling, and other areas governed by PDEs.
Regime-Conditioned Evaluation in Multi-Context Bayesian Optimization
Optimization
Theory
- Existing benchmarks in Bayesian Optimization often fail to account for regime variables, leading to unreliable performance rankings.
- The Portable Regime Score (PRS) is introduced as a method to quantify and predict the impact of regime variables on algorithm performance.
- The REGIMEPLANNER framework demonstrates the practical application of PRS, outperforming traditional acquisition strategies in various benchmarks.
- A significant portion of the literature does not vary critical parameters, which skews the reported effectiveness of algorithms.
Summary
This paper addresses the limitations of existing benchmarks in Bayesian Optimization (BO) by introducing the concept of regime-conditioned evaluation. The author critiques the prevalent practice of averaging treatment effects across hidden regime variables, which leads to unreliable performance rankings in the literature. A comprehensive audit of 40 papers reveals that 98% do not control for the budget-to-action ratio (B/|A|), resulting in misleading conclusions. The paper proposes the Portable Regime Score (PRS), which quantifies the regime based on B/|A| and the Spearman rank correlation between prior means and observed outcomes. The PRS is shown to predict the performance of different acquisition strategies across various contexts effectively. The REGIMEPLANNER, a proposed framework, utilizes PRS to adaptively switch between acquisition strategies, achieving superior performance in hyperparameter optimization tasks. The findings highlight the necessity of considering regime variables in BO evaluations to ensure accurate measurement and comparison of algorithms.
Methodology
The paper employs a hierarchical modeling approach to analyze the impact of regime variables on Bayesian Optimization performance. It introduces the Portable Regime Score (PRS) to quantify the regime and uses it to inform the REGIMEPLANNER framework, which adaptively selects acquisition strategies based on observed conditions. The methodology includes extensive empirical evaluations across multiple benchmarks, including GDSC2 and HPO-B, to validate the effectiveness of PRS and REGIMEPLANNER.
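The summary specifies the two ingredients of the PRS but not how they are combined, so the sketch below simply computes both quantities side by side; the pairing is an assumption for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def portable_regime_score(budget, n_actions, prior_means, observed):
    """Hypothetical PRS inputs: (B/|A|, Spearman(prior, observed))."""
    b_over_a = budget / n_actions
    rho, _ = spearmanr(prior_means, observed)
    return b_over_a, rho

rng = np.random.default_rng(0)
prior = rng.normal(size=50)
obs = prior + rng.normal(scale=0.5, size=50)   # an informative-prior regime
print(portable_regime_score(budget=100, n_actions=50,
                            prior_means=prior, observed=obs))
```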
Results
The results indicate that the PRS can accurately predict the performance of acquisition strategies in 74.7% of cases across 79 conditions. The REGIMEPLANNER outperformed Greedy and UCB strategies in all 16 hyperparameter optimization search spaces at a budget of 100, achieving a mean improvement of +0.103 Hit@1. The study also found that 19% of conditions fell into an equivalence zone where performance differences were negligible, yet such differences are still reported as significant findings in publications.
Implications
The findings suggest that incorporating regime variables into Bayesian Optimization evaluations can lead to more reliable and actionable insights. This has implications for both academic research and practical applications in optimization tasks, where understanding the context and conditions of algorithm performance is crucial for decision-making.
Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction
NLP
Large Language Models
Theory
- Introduces a novel method for hallucination detection in LLMs by treating them as dynamical systems.
- Utilizes a differential error score to distinguish between factual and hallucinated responses in a single pass.
- Achieves state-of-the-art performance across multiple benchmarks with reduced resource requirements.
- Incorporates a calibration mechanism for user-specific detection preferences.
Summary
This paper addresses the challenge of detecting hallucinations in Large Language Models (LLMs), which often generate plausible but factually incorrect content. Traditional detection methods are computationally expensive and rely on external knowledge retrieval or multiple sampling passes. The authors propose a novel approach that treats LLMs as black-box dynamical systems, projecting LLM responses into a high-dimensional manifold. By employing Koopman operator theory, they fit transition operators for factual and hallucinated responses, defining a differential residual score based on prediction errors. This method allows for low-cost hallucination detection in a single-sample pass, significantly reducing resource overhead. The authors also introduce a preference-aware calibration mechanism to optimize classification thresholds based on user requirements. Extensive testing across three benchmarks demonstrates that their method achieves state-of-the-art performance while minimizing computational costs.
Methodology
The authors project LLM responses into a high-dimensional manifold and characterize the resulting vector sequences as observable realizations of the model's latent state-space dynamics. They apply Koopman operator theory to fit transition operators for both factual and hallucinated regimes, defining a differential residual score based on prediction errors. The approach allows for one-sample detection without the need for external knowledge or multiple sampling.
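A stripped-down version of this fit is ordinary least squares for a linear transition operator on embedded response trajectories, with the score taken as the residual difference between the hallucinated-regime and factual-regime operators; the embedding, trajectory shapes, and score sign below are assumptions.

```python
import numpy as np

def fit_transition_operator(trajs):
    """Least-squares K with x_{t+1} ≈ K x_t, stacking all trajectories."""
    X = np.hstack([t[:, :-1] for t in trajs])   # columns x_t, shape (d, T-1)
    Y = np.hstack([t[:, 1:] for t in trajs])    # columns x_{t+1}
    return Y @ np.linalg.pinv(X)

def residual(K, traj):
    return np.linalg.norm(traj[:, 1:] - K @ traj[:, :-1])

def differential_score(K_fact, K_hall, traj):
    """Positive when the factual operator predicts the response better."""
    return residual(K_hall, traj) - residual(K_fact, traj)

# Stand-ins for embedded response trajectories of shape (d, T).
rng = np.random.default_rng(0)
fact = [rng.normal(size=(8, 20)) for _ in range(10)]
hall = [rng.normal(size=(8, 20)) for _ in range(10)]
K_f, K_h = fit_transition_operator(fact), fit_transition_operator(hall)
print(differential_score(K_f, K_h, fact[0]))
```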
Results
The proposed method outperforms existing black-box hallucination detection techniques across three data benchmarks (FELM, HaluEval, and WikiBio), achieving state-of-the-art results while requiring significantly fewer resources. The differential prediction score effectively captures the transition between factual and hallucinated outputs.
Implications
This research has significant implications for the deployment of LLMs in high-stakes domains such as healthcare, law, and finance, where the reliability of generated content is critical. The low-cost and efficient detection method can enhance the autonomy of LLMs by reducing reliance on external verification.
From Video-to-PDE: Data-Driven Discovery of Nonlinear Dye Plume Dynamics
Computer Vision
Theory
Interpretability
- Development of a video-to-PDE pipeline for extracting models from dye-plume dynamics.
- Utilization of weak-form regression to mitigate issues with noisy video data.
- Introduction of a robust model selection protocol based on forward-rollout performance.
- The selected PDE model outperforms traditional advection-diffusion models.
Summary
This paper presents a novel video-to-PDE pipeline designed to extract continuum models from video recordings of dye-plume dynamics. The authors address challenges associated with uncalibrated image intensity data and the instability of direct numerical differentiation on noisy frames. The pipeline first converts grayscale video data into a normalized scalar field, isolates bulk drift from intrinsic spreading, and identifies an effective transport law using weak-form sparse regression. The methodology includes a robust model selection process that prioritizes geometric admissibility and forward-rollout performance over traditional regression residuals. The resulting reduced model demonstrates superior performance compared to standard advection-diffusion models, retaining a positive Laplacian coefficient and allowing for a Cole–Hopf linearization. The framework effectively illustrates how uncalibrated visual data can lead to compact, predictive, and interpretable continuum models when discovery, calibration, and uncertainty quantification are treated as distinct stages.
Methodology
The methodology involves converting video data into a normalized scalar field, isolating bulk drift using intensity-weighted centroids, and applying weak-form sparse regression to identify transport laws. The model selection process incorporates diagnostics for coefficient robustness, including bootstrap methods and front diagnostics, while an inverse physics-informed network refines coefficients against forward rollouts.
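The regression step belongs to the SINDy family of methods. The sketch below shows a generic sequentially thresholded least-squares solver over a candidate term library; the library columns and threshold are placeholders, and the weak-form integration against test functions that the paper relies on is omitted for brevity.

```python
import numpy as np

def stlsq(Theta, dudt, threshold=0.1, iters=10):
    """Sequentially thresholded least squares: sparse xi with Theta @ xi ≈ du/dt."""
    xi = np.linalg.lstsq(Theta, dudt, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if big.any():
            xi[big] = np.linalg.lstsq(Theta[:, big], dudt, rcond=None)[0]
    return xi

# Placeholder library evaluated on flattened space-time samples; in the real
# pipeline the columns might be [u_x, u_xx, u*u_x, |grad u|^2, ...].
rng = np.random.default_rng(0)
Theta = rng.normal(size=(500, 6))
true_xi = np.array([0.0, 0.5, -1.0, 0.0, 0.0, 0.0])  # diffusion + advection
dudt = Theta @ true_xi + 0.01 * rng.normal(size=500)
print(stlsq(Theta, dudt).round(2))
```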
Results
The selected reduced model is a nonlinear-gradient transport law that significantly outperforms traditional advection-diffusion models on held-out frames. It retains a positive Laplacian coefficient and admits a Cole–Hopf reduction to a linear advection-diffusion equation, demonstrating the effectiveness of the proposed pipeline.
Implications
The findings suggest that the proposed framework can be applied to various physical processes captured in video, enabling the extraction of interpretable and predictive models from uncalibrated visual data. This approach has potential applications in fluid dynamics, environmental monitoring, and other fields where visual data is prevalent.
Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades
Large Language Models
Optimization
Efficient ML
- Developed a decision-theoretic framework for analyzing LLM cascades, linking cost and quality through optimization.
- Characterized the cost-quality frontier as a pointwise envelope over pairwise cascades, significantly reducing costs.
- Established first-order conditions that account for model confidence scores and their impact on expected quality.
- Demonstrated that a pre-generation router can outperform traditional cascade policies, highlighting structural cost advantages.
Summary
This paper addresses the cost-quality tradeoff in deploying large language models (LLMs) through the use of model cascades, where a cheaper model handles low-confidence queries and an expensive model is used for high-confidence ones. The author critiques existing methods that treat the deferral threshold as an empirical hyperparameter, lacking a formal characterization of the cost-quality frontier. To address this, a decision-theoretic framework based on constrained optimization and duality is developed. For two-model cascades, the framework establishes piecewise concavity of the cost-quality frontier and reciprocal shadow prices that link budget and quality constraints. The author characterizes the achievable frontier from a pool of models as the pointwise envelope over pairwise cascades, identifying switching points where optimal pairs change. The framework is validated across five benchmarks with eight models from five providers, revealing that the pairwise envelope outperforms full fixed chains and optimized subsequence cascades. A lightweight pre-generation router also shows superior performance, suggesting that cascade performance is limited by structural costs rather than a lack of intermediate stages.
Methodology
The paper employs a decision-theoretic framework grounded in constrained optimization and duality. It analyzes two-model cascades and derives first-order conditions that reflect the relationship between model confidence scores and expected quality. The framework is validated through empirical experiments on multiple benchmarks.
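The deterministic threshold cascade underlying this analysis is straightforward to state in code: the cheap model answers unless its confidence falls below a threshold τ, and sweeping τ traces an empirical cost-quality curve. The scores, costs, and confidence model below are synthetic stand-ins.

```python
import numpy as np

def cascade_frontier(conf, q_cheap, q_exp, c_cheap, c_exp, taus):
    """Mean quality and cost of a two-model threshold cascade for each tau."""
    points = []
    for tau in taus:
        defer = conf < tau                     # escalate low-confidence queries
        quality = np.where(defer, q_exp, q_cheap).mean()
        cost = c_cheap + defer.mean() * c_exp  # cheap model always runs first
        points.append((cost, quality))
    return points

rng = np.random.default_rng(0)
conf = rng.uniform(size=1000)
q_cheap = (rng.uniform(size=1000) < 0.4 + 0.5 * conf).astype(float)
q_exp = (rng.uniform(size=1000) < 0.85).astype(float)
for cost, qual in cascade_frontier(conf, q_cheap, q_exp, 1.0, 10.0,
                                   taus=[0.0, 0.25, 0.5, 0.75, 1.0]):
    print(f"cost={cost:5.2f}  quality={qual:.3f}")
```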
Results
The pairwise envelope effectively captures the deterministic threshold-cascade frontier, outperforming both full fixed chains and optimized subsequence cascades. The lightweight pre-generation router exceeded the best cascade policy on four out of five datasets, indicating that structural cost limitations primarily affect cascade performance.
Implications
The findings suggest that optimizing model selection strategies can lead to significant cost reductions in LLM deployment. The framework provides a structured approach for practitioners to navigate the cost-quality tradeoff, potentially influencing future research and applications in efficient model deployment.
Robustness of Graph Self-Supervised Learning to Real-World Noise: A Case Study on Text-Driven Biomedical Graphs
Graph Learning
- Introduction of NATD-GSSL framework for robust GSSL on noisy graphs.
- Development of a dual-graph evaluation protocol for assessing GSSL performance.
- Empirical analysis reveals variability in robustness across GSSL methods and GNN architectures.
- Bidirectional GNN architectures are more effective for noisy graphs compared to unidirectional ones.
Summary
This paper addresses the robustness of Graph Self-Supervised Learning (GSSL) methods when applied to noisy, text-driven biomedical graphs, a scenario largely unexplored in existing literature. The authors introduce Noise-Aware Text-Driven Graph Self-Supervised Learning (NATD-GSSL), a unified framework that integrates automatic graph construction, refinement, and GSSL for unsupervised term typing. The study employs a dual-graph evaluation protocol, contrasting a noisy graph derived from the MedMentions corpus with a clean reference graph from the Unified Medical Language System (UMLS). The results indicate variability in robustness across different GSSL methods and Graph Neural Network (GNN) architectures. Relation reconstruction is found to be highly sensitive to noise, while feature reconstruction demonstrates greater robustness. The paper reveals that contrastive objectives are less affected by noise but depend on alignment with downstream tasks. Additionally, the architecture of GNNs significantly influences performance, with bidirectional relational message-passing designs outperforming unidirectional ones on noisy graphs. Overall, NATD-GSSL achieves up to a 7% improvement over pretrained language model baselines, providing practical insights for applying GSSL to real-world noisy graphs.
Methodology
The authors developed the NATD-GSSL framework, which combines automatic graph construction, refinement strategies, and GSSL. They implemented a dual-graph evaluation protocol to quantitatively assess the impact of noise by comparing a noisy graph from MedMentions with a clean UMLS graph. Various GSSL methods were evaluated using different GNN architectures and pretext tasks to analyze robustness.
Results
The study found that relation reconstruction is sensitive to noise and requires well-defined schemas, while feature reconstruction is more robust, achieving performance similar to clean graphs. Contrastive learning objectives were less impacted by noise but depended on task alignment. Bidirectional GNN architectures performed better on noisy graphs, while unidirectional architectures excelled on clean graphs. Overall, the NATD-GSSL framework provided a 7% improvement in term typing compared to pretrained language models.
Implications
The findings suggest that GSSL can be effectively applied to real-world noisy graphs, which is crucial for knowledge extraction and ontology construction in various domains. The insights gained from this study can guide future research in enhancing the robustness of GSSL methods and adapting them to noisy data environments.
A Self-Attentive Meta-Optimizer with Group-Adaptive Learning Rates and Weight Decay
Optimization
Time Series
NLP
- MetaAdamW integrates self-attention for dynamic modulation of learning rates and weight decay.
- The optimizer uses a meta-learning objective to train the attention module effectively.
- It extends homoscedastic uncertainty weighting with task-specific priorities for better loss balancing.
- MetaAdamW outperforms standard AdamW across multiple tasks, improving performance and reducing training time.
Summary
This paper introduces MetaAdamW, a novel optimizer that enhances the standard AdamW by incorporating a self-attention mechanism to dynamically adjust learning rates and weight decay for different parameter groups. Traditional adaptive optimizers apply uniform hyperparameters across all parameters, which can lead to suboptimal performance due to the heterogeneous optimization dynamics of different layers. MetaAdamW addresses this by utilizing a lightweight Transformer encoder to compute group-specific scaling factors based on statistical features such as gradient norms and correlations. The authors propose a meta-learning objective that combines gradient alignment, loss decrease, and generalization gap to train the attention module effectively. A significant contribution is the extension of homoscedastic uncertainty weighting (HUW) with task-specific priorities, allowing for automatic loss balancing guided by domain knowledge. The optimizer was evaluated across five diverse tasks, demonstrating consistent improvements over AdamW in validation loss, accuracy, and perplexity, while also reducing training time in certain scenarios. Ablation studies confirm the effectiveness of the proposed components, showcasing the generalizability of MetaAdamW across various architectures and tasks.
Methodology
MetaAdamW employs a lightweight Transformer encoder to extract features from parameter groups and compute scaling factors for learning rates and weight decay. It utilizes a meta-learning framework with a composite objective that includes gradient alignment, validation loss decrease, and generalization gap. The method also incorporates priority-injected homoscedastic uncertainty weighting to enhance regularization.
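A stripped-down version of the group-wise modulation might look like the following: per-group statistics are embedded, passed through a small self-attention encoder, and mapped to multiplicative factors for learning rate and weight decay. The feature choice, network sizes, and scaling range are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class GroupScaler(nn.Module):
    """Hypothetical self-attention module emitting per-group (lr, wd) multipliers."""
    def __init__(self, n_feats: int = 2, d_model: int = 32):
        super().__init__()
        self.embed = nn.Linear(n_feats, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(d_model, 2)

    def forward(self, feats):                     # feats: (1, n_groups, n_feats)
        h = self.encoder(self.embed(feats))
        return torch.sigmoid(self.head(h)) * 2.0  # multipliers in (0, 2)

def group_features(param_groups):
    """Illustrative per-group statistics: log gradient norm and log weight norm."""
    rows = []
    for g in param_groups:
        gn = sum((p.grad.norm() ** 2 for p in g["params"] if p.grad is not None),
                 torch.tensor(0.0))
        pn = sum((p.norm() ** 2 for p in g["params"]), torch.tensor(0.0))
        rows.append([torch.log1p(gn).item(), torch.log1p(pn).item()])
    return torch.tensor(rows).unsqueeze(0)

# Illustrative wiring: rescale the base hyperparameters once per step.
net = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
opt = torch.optim.AdamW(
    [{"params": net[0].parameters()}, {"params": net[2].parameters()}], lr=1e-3)
scaler = GroupScaler()
loss = net(torch.randn(8, 16)).pow(2).mean()
loss.backward()
scales = scaler(group_features(opt.param_groups))[0]
for g, (s_lr, s_wd) in zip(opt.param_groups, scales.tolist()):
    g["lr"], g["weight_decay"] = 1e-3 * s_lr, 0.01 * s_wd
opt.step()
```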
Results
MetaAdamW achieved lower validation loss and perplexity in Transformer-based tasks, with improvements of up to 4.26% and 4.12%, respectively, while reducing training time by up to 17.11%. In non-Transformer architectures, it provided accuracy gains of up to 11.08%. The optimizer also effectively mitigated issues related to premature early stopping in various scenarios.
Implications
The development of MetaAdamW suggests that adaptive optimization can be significantly improved by considering the unique dynamics of parameter groups, which could lead to better performance in a wide range of machine learning tasks. This approach may be particularly beneficial in scenarios where training time and convergence are critical factors.
Bandit Learning in General Open Multi-agent Systems
Theory
Optimization
- Introduces a unified framework for bandit learning in open multi-agent systems, addressing limitations of existing models.
- Defines new concepts such as pre-training degree and stability to capture the complexities of dynamic agent populations.
- Develops certified global-UCB learning methodologies with provable regret bounds.
- Demonstrates that regret is influenced by both the pre-training degree of new agents and the stability of the system.
Summary
This paper addresses the challenges of bandit learning in open multi-agent systems (OMAS), where agents can enter and exit over time, leading to non-stationarity and complex dynamics. Existing bandit learning frameworks often assume a closed system with fixed agents, which limits their applicability in real-world scenarios where agent populations are dynamic. The author formulates a unified open-system bandit problem that accommodates heterogeneous rewards and general agent arrival patterns. New concepts are introduced, such as the pre-training degree of new agents, stability measures, and global dynamic regret, which help capture the complexities of these systems. The paper presents certified global-UCB learning methodologies with provable guarantees, revealing that entry uncertainty impacts regret linearly through the pre-training degree, while in stable regimes, regret is influenced by the time required to identify a persistent optimal arm and the agent patterns. Lower bounds are established to demonstrate the tightness of these dependencies in challenging instances.
Methodology
The paper formulates a generalized bandit problem for open systems, introducing new metrics to quantify the impact of agent dynamics and reward heterogeneity. It develops certified global-UCB learning algorithms with theoretical guarantees on regret bounds, analyzing the effects of agent entry and stability on learning performance.
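For intuition, the sketch below runs a plain UCB rule on globally shared statistics while the number of active agents varies each round; it omits the paper's reward heterogeneity, pre-training degrees, and certification machinery, and serves only as a baseline illustration of the open-system setting.

```python
import numpy as np

def global_ucb(means, horizon, arrivals, c=2.0, rng=None):
    """Shared-statistics UCB with a varying number of active agents per round."""
    if rng is None:
        rng = np.random.default_rng(0)
    K = len(means)
    counts, sums, regret = np.zeros(K), np.zeros(K), 0.0
    for t in range(horizon):
        for _ in range(arrivals[t]):              # each active agent pulls once
            ucb = np.where(counts > 0,
                           sums / np.maximum(counts, 1)
                           + np.sqrt(c * np.log(t + 2) / np.maximum(counts, 1)),
                           np.inf)
            a = int(np.argmax(ucb))
            counts[a] += 1
            sums[a] += rng.normal(means[a], 1.0)
            regret += max(means) - means[a]
    return regret

arrivals = np.random.default_rng(1).integers(1, 4, size=200)  # open population
print(global_ucb([0.1, 0.5, 0.9], horizon=200, arrivals=arrivals))
```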
Results
The results indicate that the regret scales linearly with the pre-training degree of new agents and is governed by the time needed to identify a stable optimal arm in stable regimes. The established lower bounds confirm the tightness of the proposed regret bounds in challenging scenarios.
Implications
This research has significant implications for applications in dynamic environments, such as e-commerce, distributed systems, and agent-based AI systems, where the ability to adapt to changing agent populations and reward structures is crucial for optimizing decision-making and resource allocation.
The Predictive-Causal Gap: An Impossibility Theorem and Large-Scale Neural Evidence
Theory
- The predictive-causal gap is a structural limit in predictive representation learning.
- Optimal encoders often track environmental dynamics rather than system dynamics.
- The gap intensifies with higher dimensionality, leading to significant misalignment.
- Operational grounding can partially suppress the gap but does not fully recover causal fidelity.
Summary
This paper investigates a systematic failure in predictive representation learning, termed the predictive-causal gap. The author demonstrates that across 2695 neural network configurations trained on linear-Gaussian dynamics, the optimal encoder tends to track the environment rather than the intended system. The mean causal fidelity, which measures the fraction of encoder sensitivity allocated to the system's degrees of freedom, is found to be 0.49, with only 2.5% of configurations exceeding 0.70. The gap worsens with increasing dimensionality, with the optimal encoder becoming causally blind at high dimensions while achieving significantly lower prediction errors. The author proves that this phenomenon is not merely an optimization artifact but a structural property of the predictive objective, particularly when the environment's dynamics are slower or less noisy than those of the system. The study includes a large-scale neural network experiment and a nonlinear dynamics analysis, confirming that unconstrained predictors often learn representations dominated by environmental factors. The findings highlight the limitations of predictive objectives in capturing causal structures, suggesting that operational grounding is necessary to mitigate the predictive-causal gap, although it cannot fully recover causal fidelity without clear system-environment boundaries.
Methodology
The author conducts a theoretical analysis proving an impossibility theorem regarding predictive risk minimization, followed by large-scale neural network experiments across various configurations of linear-Gaussian dynamics. Additionally, a nonlinear dynamics analysis is performed using GRU predictors on a Duffing oscillator coupled with a hidden environment.
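The fidelity statistic can be illustrated as the share of encoder input-sensitivity that falls on the system coordinates. The sketch below computes this from the encoder Jacobian; the block split of the input and the squared-norm weighting are assumptions about the paper's exact definition.

```python
import torch

def causal_fidelity(encoder, x, n_system: int):
    """Fraction of Jacobian mass on the system block of the input."""
    J = torch.autograd.functional.jacobian(encoder, x)  # (d_out, d_in)
    mass_sys = (J[:, :n_system] ** 2).sum()
    return (mass_sys / (J ** 2).sum()).item()

# Toy encoder over an input whose first 4 dims are "system", rest "environment".
enc = torch.nn.Sequential(torch.nn.Linear(12, 16), torch.nn.Tanh(),
                          torch.nn.Linear(16, 3))
x = torch.randn(12)
print(f"causal fidelity: {causal_fidelity(enc, x, n_system=4):.2f}")
```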
Results
The study finds that the mean causal fidelity of the optimal neural network encoder is 0.49, with a maximum of 0.91, despite achieving 99.3% lower prediction error than the optimal linear encoder. In high-dimensional settings, causal fidelity collapses to approximately 10^-8, while the predictive-causal gap grows to 92%. Under environment shifts, unconstrained models exhibit a median 1.82× MSE inflation compared to grounded models, which remain at 1.00×.
Implications
The results suggest that predictive objectives without operational grounding fail to approximate causal recovery, which has significant implications for self-supervised representation learning, world models, and the scaling of machine learning paradigms.
Probabilistic Classification and Uncertainty Quantification of Sahara Desert Climate Using Feedforward Neural Networks
Time Series
- Introduction of a probabilistic framework for climate classification using feedforward neural networks.
- Application of the model to the Sahara Desert, utilizing extensive climate data over a 30-year period.
- Comparison of the ANN-based probabilistic classification with traditional Köppen-Trewartha classification.
- Identification of significant fluctuations in climate probabilities, contributing to understanding desertification.
Summary
This paper addresses the limitations of traditional deterministic climate classification systems, specifically the Köppen-Trewartha (KT) classification, which fails to account for uncertainties in climate categorization. The authors propose a novel framework utilizing feedforward artificial neural networks (ANNs) for probabilistic climate classification, enhancing the understanding of transitional climate zones. The study focuses on the Sahara Desert, analyzing climate data from over 400,000 space-time locations between 1960 and 1989. The ANN model is trained on the first 11 years of data, and its classification capabilities are evaluated for both short- and long-term stability and accuracy. The results indicate that the probabilistic approach provides a more nuanced understanding of climate zones compared to the KT classification, particularly in identifying areas experiencing significant changes in climate probabilities. Additionally, fluctuation analysis methods are employed to illustrate the temporal evolution of climatic zones, offering insights into broader desertification trends.
Methodology
The authors implemented a feedforward artificial neural network (ANN) for probabilistic climate classification, training the model on climate data from the Sahara Desert. The model's performance was assessed through short- and long-term classification capabilities, and fluctuation analysis methods were used to analyze temporal changes in climate zones.
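The core mechanism is an ordinary softmax classifier whose full output distribution, rather than its argmax, is retained; predictive entropy then serves as one possible uncertainty score. A minimal sketch (features, class count, and layer sizes are illustrative) is:

```python
import torch
import torch.nn as nn

# Minimal probabilistic climate classifier: a feedforward net with softmax
# outputs gives per-class probabilities, and entropy gives an uncertainty score.
model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 5))
x = torch.randn(8, 4)                    # e.g. temperature/precipitation stats
probs = torch.softmax(model(x), dim=-1)  # probabilistic class membership
entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
hard = probs.argmax(-1)                  # deterministic KT-style assignment
```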
Results
The ANN model outperformed the traditional Köppen-Trewartha classification by providing a probabilistic classification that accounts for uncertainties. It successfully identified transitional climate zones and highlighted areas with significant changes in climate probabilities, offering a more dynamic understanding of climate patterns in the Sahara Desert.
Implications
The findings suggest that probabilistic classification methods can enhance climate science by providing more accurate and uncertainty-aware categorizations. This approach can inform agricultural planning, hydrological studies, and broader climate change assessments, particularly in regions vulnerable to desertification.
Designing a double deep reinforcement learning selection tool for resilient demand prediction
Reinforcement Learning
Time Series
Optimization
- Introduction of a double deep reinforcement learning architecture for dynamic forecasting model selection.
- Development of an average reward-based early stopping technique to reduce training time.
- Empirical evaluation against state-of-the-art methods using diverse datasets.
- Demonstration of the proposed approach's robustness in varying data conditions.
Summary
This paper addresses the challenges of demand forecasting in supply chain management, particularly in the context of changing data dynamics due to recent global events. The authors propose a novel architecture utilizing double deep reinforcement learning (DDRL) to automate the selection of forecasting models from a committee of models based on real-time data characteristics. The DDRL agent processes historical demand data and forecasted values through various neural network architectures, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and feedforward neural networks (FFNNs), to enhance decision-making. Additionally, a new early-stopping technique based on average reward convergence is introduced to optimize training time. The proposed approach is empirically evaluated using grocery sales and snack demand datasets, demonstrating its robustness and effectiveness compared to existing state-of-the-art methods. The findings highlight the importance of adaptable forecasting solutions in improving supply chain efficiency and reducing costs associated with inaccurate demand predictions.
Methodology
The methodology involves the design of a double deep reinforcement learning architecture that selects forecasting models based on historical demand and forecasted values. The architecture integrates CNNs, RNNs, and FFNNs for feature extraction. An empirical study is conducted using grocery sales and snack demand datasets to assess the performance of the proposed approach against existing methods.
Results
The experimental results indicate that the proposed DDRL-based selection tool outperforms traditional forecasting methods and automatic model selection techniques, showcasing its effectiveness in adapting to different data characteristics and improving prediction accuracy.
Implications
The findings suggest that implementing adaptive forecasting models can significantly enhance supply chain management by improving demand prediction accuracy, thereby reducing costs associated with overstocking and missed sales opportunities. This research opens avenues for further exploration in automated forecasting solutions across various industries.
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
Large Language Models
Optimization
Theory
- DAPRO is the first dynamic budget allocation framework for multi-turn LLM evaluations.
- It provides distribution-free, finite-sample coverage guarantees without requiring conditional independence assumptions.
- The framework yields tighter coverage bounds by scaling with the mean censoring weight.
- Experiments show DAPRO outperforms static allocation methods in terms of coverage and variance.
Summary
This paper addresses the challenge of evaluating large language models (LLMs) in multi-turn conversational settings, particularly focusing on the computational expense involved in predicting rare events like jailbreaks or task completions. The authors introduce Dynamic Allocation via PRojected Optimization (DAPRO), a novel framework that dynamically allocates computational budgets for evaluating LLMs, improving upon static budget allocation methods. DAPRO provides theoretical guarantees for coverage and is capable of yielding tighter bounds on the time-to-event metrics, such as the time-to-unsafe-sampling. The framework adapts censoring times during conversations, optimizing resource utilization and minimizing variance in estimates of evaluation metrics. Experimental results demonstrate that DAPRO consistently achieves better coverage rates and lower variance compared to static methods, making it a significant advancement in the evaluation of LLMs under budget constraints.
Methodology
The authors developed DAPRO, which treats budget allocation as a sequential decision-making process. It dynamically updates censoring times based on ongoing interactions while ensuring compliance with a global budget constraint. The framework employs a novel coverage bound that improves upon previous methods by focusing on the mean censoring weight rather than the worst-case scenario.
Results
Comprehensive experiments across various tasks, including adversarial jailbreaks and toxic content generation, demonstrated that DAPRO achieved coverage rates closer to the nominal level with significantly lower variance compared to static budget allocation methods. This indicates a more efficient use of computational resources and improved reliability in safety evaluations.
Implications
The findings suggest that DAPRO can enhance the evaluation of LLMs in high-stakes applications, such as healthcare and customer service, where safety and utility assessments are critical. The dynamic budget allocation approach may lead to more effective monitoring and risk assessment strategies for LLM deployments.
From Drops to Grid: Noise-Aware Spatio-Temporal Neural Process for Rainfall Estimation
Time Series
Multimodal
- Introduces DropsToGrid, a Neural Process-based method for rainfall densification.
- Combines temporal sequences from PWS with radar context for improved accuracy.
- Utilizes multi-modal attention and translation-equivariant fusion for effective spatio-temporal reasoning.
- Demonstrates superior performance over traditional and deep learning baselines in real-world evaluations.
Summary
This paper presents DropsToGrid, a novel Neural Process-based method designed to enhance rainfall estimation by integrating sparse observations from private weather stations (PWS) with spatial context derived from radar data. Traditional rainfall measurement methods often suffer from low resolution and biases, making it difficult to accurately capture localized rainfall dynamics. DropsToGrid addresses these challenges by utilizing a combination of temporal sequences from noisy PWS data and radar-derived spatial information. The model employs multi-scale feature extraction, temporal attention, and multi-modal fusion to produce dense rainfall fields while explicitly quantifying uncertainty. Evaluations on real-world datasets indicate that DropsToGrid significantly outperforms existing operational and deep learning baselines, providing accurate high-resolution rainfall maps even with limited station data and across different regions. This work represents a significant advancement in the field of rainfall estimation, particularly in its ability to handle irregular, noisy inputs and produce well-calibrated uncertainty estimates.
Methodology
DropsToGrid employs Neural Processes to learn a stochastic representation of rainfall from irregular and noisy PWS observations, guided by radar data. The model incorporates multi-scale feature extraction, temporal attention mechanisms, and multi-modal attention for effective integration of spatial and temporal dependencies.
Results
The evaluation results show that DropsToGrid generates accurate high-resolution rainfall estimates and well-calibrated uncertainty maps, outperforming both operational estimators and deep learning models. The model's performance remains robust even with sparse data and in cross-regional scenarios.
Implications
The findings suggest that DropsToGrid can significantly improve rainfall estimation for applications in weather forecasting, water management, and disaster mitigation, particularly in areas with limited observational data. Its ability to quantify uncertainty also enhances decision-making processes in these domains.
Non-Myopic Active Feature Acquisition via Pathwise Policy Gradients
Reinforcement Learning
Optimization
Efficient ML
- Introduces NM-PPG, a new method for non-myopic AFA using pathwise policy gradients.
- Utilizes a continuous relaxation of the acquisition process to enable end-to-end optimization.
- Implements a straight-through rollout scheme to improve alignment between training and deployment.
- Stabilizes optimization with entropy regularization and staged temperature sharpening.
Summary
This paper introduces a novel method for Active Feature Acquisition (AFA) called Non-Myopic Pathwise Policy Gradients (NM-PPG). AFA is crucial in scenarios where acquiring features is costly, and the learner must decide which features to obtain adaptively. The authors formulate AFA as a partially observable Markov decision process (POMDP) and propose NM-PPG to optimize the acquisition policy. The method employs a continuous relaxation of the acquisition process, allowing for pathwise gradients through the entire acquisition trajectory, which mitigates the high variance typically associated with standard score-function policy gradients. Additionally, a straight-through rollout scheme is developed to align training with deployment by combining hard feature acquisitions in the forward pass with soft relaxation in the backward pass. The optimization process is stabilized using entropy regularization and staged temperature sharpening. Experimental results on synthetic and real-world datasets indicate that NM-PPG outperforms existing state-of-the-art AFA methods, demonstrating its effectiveness in achieving a better accuracy-cost trade-off.
Methodology
The authors formulate AFA as a POMDP and introduce NM-PPG, which uses a continuous relaxation of the acquisition process to compute pathwise gradients. They develop a straight-through rollout scheme to reconcile the training and testing phases and employ entropy regularization and temperature sharpening to stabilize the optimization process.
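The straight-through scheme can be sketched with the standard estimator: binary acquisitions in the forward pass, gradients through the sigmoid relaxation in the backward pass. The relaxation and threshold below are generic choices, not necessarily the paper's.

```python
import torch

def straight_through_acquire(logits, temperature=1.0):
    """Hard 0/1 acquisition forward, soft-relaxation gradient backward."""
    soft = torch.sigmoid(logits / temperature)
    hard = (soft > 0.5).float()
    return hard + soft - soft.detach()   # value == hard, grad flows via soft

logits = torch.randn(5, requires_grad=True)
mask = straight_through_acquire(logits, temperature=0.5)
features = torch.randn(5)
loss = ((mask * features).sum() - 1.0) ** 2   # stand-in downstream loss
loss.backward()
print(mask.detach(), logits.grad)             # binary mask, nonzero gradients
```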
Results
Experiments show that NM-PPG achieves more stable performance than existing non-myopic AFA methods, outperforms myopic methods on datasets with non-myopic structures, and remains consistent with myopic baselines on datasets where myopic acquisition suffices.
Implications
The proposed method can be applied in various fields where feature acquisition is costly, such as medical decision support, recommender systems, and interactive troubleshooting, enabling more efficient and effective predictive modeling.
Investigating Trustworthiness of Nonparametric Deep Survival Models for Alzheimer's Disease Progression Analysis
Theory
Interpretability
Time Series
- Introduction of two novel fairness metrics for nonparametric deep survival models.
- Comprehensive evaluation pipeline for assessing model performance in AD progression.
- Significant bias found in deep survival models concerning sensitive attributes.
- Emphasis on the importance of fairness and interpretability in survival analysis.
Summary
This paper addresses the critical need for reliable modeling of Alzheimer's Disease (AD) progression using nonparametric deep survival models (NDSMs). The authors highlight the importance of survival analysis in predicting the timing of disease transitions, which is essential for effective patient care. Despite the advancements in deep learning for survival tasks, there is a notable lack of studies focusing on AD, particularly regarding the fairness and bias inherent in these models. The authors propose two novel fairness metrics—Time-Dependent Concordance Impurity and Kaplan-Meier Fairness—to assess bias related to sensitive attributes such as sex, race, and education. They present a comprehensive evaluation pipeline that examines model performance in terms of discrimination, calibration, fairness, and interpretability. The study reveals that while NDSMs can significantly aid clinical decision-making, they often exhibit considerable bias, indicating the need for further research in this area. The findings underscore the importance of incorporating fairness considerations into the development of predictive models for AD progression.
Methodology
The authors developed a rigorous evaluation pipeline to assess the suitability of nonparametric deep survival models for real-world AD progression tasks. This included the introduction of new fairness metrics and an extensive analysis of model bias across various demographic factors. The study utilized real-world clinical data to validate the performance and fairness of the models.
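The paper's Kaplan-Meier Fairness metric is not spelled out in this summary; as one plausible reading, the sketch below computes group-wise Kaplan-Meier survival curves and reports the maximum vertical gap between them (the gap definition and the synthetic data are assumptions).

```python
import numpy as np

def km_curve(times, events, grid):
    """Kaplan-Meier estimate S(t) on a common grid (events: 1 = observed)."""
    S = np.ones_like(grid, dtype=float)
    for i, t in enumerate(grid):
        s = 1.0
        for u in np.unique(times[(events == 1) & (times <= t)]):
            d = ((times == u) & (events == 1)).sum()  # deaths at u
            n = (times >= u).sum()                    # at risk just before u
            s *= 1.0 - d / n
        S[i] = s
    return S

def km_fairness_gap(times, events, groups, grid):
    """Hypothetical fairness score: max vertical gap between group KM curves."""
    curves = [km_curve(times[groups == g], events[groups == g], grid)
              for g in np.unique(groups)]
    return max(np.abs(a - b).max()
               for i, a in enumerate(curves) for b in curves[i + 1:])

rng = np.random.default_rng(0)
times = rng.exponential(scale=5.0, size=200)
events = rng.uniform(size=200) < 0.8
groups = rng.integers(0, 2, size=200)
grid = np.linspace(0, 10, 50)
print(f"KM fairness gap: {km_fairness_gap(times, events, groups, grid):.3f}")
```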
Results
The study found that while nonparametric deep survival models demonstrate strong discriminative performance, they also exhibit significant biases related to sex, ethnicity, and education. The proposed fairness metrics effectively quantified these biases, revealing critical areas for improvement in model development.
Implications
The findings suggest that while deep learning models can enhance clinical decision-making for Alzheimer's care, addressing bias is crucial to ensure equitable treatment across diverse patient populations. The proposed metrics and evaluation framework can guide future research in developing fairer and more reliable predictive models for AD and potentially other medical conditions.
Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize
Theory
- The memorization versus reasoning outcome in Transformers is determined within a critical training window.
- Weight decay applied during a specific training phase can yield high OOD accuracy comparable to full training weight decay.
- The timing of regularization is more important than its magnitude for achieving reasoning solutions.
- The critical window's position is influenced by initialization scale, with smaller scales leading to a reduced basin of attraction for reasoning.
Summary
This paper investigates the critical windows of complexity control in Transformers, specifically how they determine whether to adopt reasoning or memorization strategies during training. The author identifies that the fate of a Transformer model is not solely dependent on static hyperparameters like initialization scale and weight decay, but rather on the timing of these parameters during training. Through experiments on a controlled compositional task, the study reveals that applying weight decay within a specific training window significantly enhances out-of-distribution (OOD) accuracy. Key findings include that weight decay applied for a single 25% training window can yield comparable OOD accuracy to full training weight decay, and that the timing of regularization is more crucial than its magnitude. The paper also highlights a sharp transition in performance based on the timing of weight decay application, and how the initialization scale affects the critical window's position. The findings suggest a need to revise existing practices for inducing reasoning solutions in Transformers, emphasizing the importance of timing in complexity control.
Methodology
The study employs a controlled compositional task to analyze the effects of weight decay and initialization scale on the training dynamics of small Transformers. It utilizes per-step OOD accuracy measurements and systematic comparisons across multiple training trajectories to characterize the temporal structure of complexity control.
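Operationally, window-restricted regularization is just a per-step toggle on the optimizer's weight-decay coefficient, as in the minimal sketch below (the model, window boundaries, and decay value are illustrative):

```python
import torch

model = torch.nn.Linear(32, 32)                      # stand-in Transformer
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.0)
total_steps, wd = 10_000, 0.1
window = range(int(0.25 * total_steps), int(0.50 * total_steps))  # one 25% slice

for step in range(total_steps):
    for group in opt.param_groups:                   # decay only inside window
        group["weight_decay"] = wd if step in window else 0.0
    x = torch.randn(8, 32)
    loss = (model(x) - x).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```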
Results
The results indicate that applying weight decay within a defined training window leads to significantly higher OOD accuracy (up to 0.93) compared to applying it outside this window (as low as 0.15). The study also finds that the critical window's onset can be shifted by as little as 100 optimization steps, resulting in a drastic change in performance. Additionally, the research shows that larger initialization scales lead to earlier critical windows, while smaller scales shrink the basin of attraction for reasoning solutions.
Implications
These findings have significant implications for the training of Transformers, suggesting that practitioners should focus on the timing of regularization rather than just its magnitude. This could lead to improved generalization capabilities in compositional tasks and inform future research on the dynamics of learning in deep networks.
Transformed Latent Variable Multi-Output Gaussian Processes
Theory
Efficient ML
Time Series
- Introduction of T-LVMOGP, a scalable framework for MOGPs.
- Utilization of a Lipschitz-regularized neural network for mapping inputs and latent variables.
- Integration of stochastic variational inference for efficient training.
- Demonstrated superior performance in predictive accuracy and efficiency over existing methods.
Summary
The paper introduces the Transformed Latent Variable Multi-Output Gaussian Processes (T-LVMOGP), a novel framework designed to enhance the scalability and expressiveness of Multi-Output Gaussian Processes (MOGPs) when dealing with high-dimensional output spaces. Traditional MOGPs face significant computational challenges due to their cubic complexity concerning the number of outputs, which limits their applicability in domains such as climate modeling and spatial transcriptomics. T-LVMOGP addresses these challenges by employing a flexible multi-output deep kernel that maps inputs and output-specific latent variables into an embedding space using a Lipschitz-regularized neural network. This approach allows for the effective capture of intricate inter-output dependencies without imposing rigid structural constraints. The authors utilize stochastic variational inference within the Sparse Variational GP framework, enabling the model to scale efficiently to large datasets. Empirical evaluations demonstrate that T-LVMOGP outperforms existing MOGP baselines in terms of both predictive accuracy and computational efficiency across various benchmarks.
Methodology
The T-LVMOGP framework constructs a multi-output deep kernel by mapping inputs and output-specific latent variables into an embedding space using a Lipschitz-regularized neural network. It employs stochastic variational inference within the Sparse Variational GP framework, allowing for mini-batch training and accommodating non-Gaussian likelihoods. Regularization techniques such as residual connections and spectral normalization are applied to prevent overfitting.
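The kernel construction can be sketched as a base RBF kernel applied to the image of a spectral-normalized network that consumes an input concatenated with an output-specific latent; the layer sizes, latent dimension, and base kernel below are assumptions, and the surrounding sparse variational GP machinery is omitted.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

class DeepMOKernel(nn.Module):
    """k((x, j), (x', j')) = RBF(f([x, h_j]), f([x', h_j'])) with Lipschitz f."""
    def __init__(self, d_in: int, n_outputs: int, d_latent: int = 4,
                 d_emb: int = 16):
        super().__init__()
        self.latents = nn.Embedding(n_outputs, d_latent)  # output-specific h_j
        self.f = nn.Sequential(
            spectral_norm(nn.Linear(d_in + d_latent, 32)), nn.ReLU(),
            spectral_norm(nn.Linear(32, d_emb)))
        self.log_ls = nn.Parameter(torch.zeros(()))

    def forward(self, x1, j1, x2, j2):
        e1 = self.f(torch.cat([x1, self.latents(j1)], dim=-1))
        e2 = self.f(torch.cat([x2, self.latents(j2)], dim=-1))
        sq = torch.cdist(e1, e2) ** 2
        return torch.exp(-0.5 * sq / torch.exp(self.log_ls) ** 2)

k = DeepMOKernel(d_in=3, n_outputs=100)
x = torch.randn(5, 3)
j = torch.randint(0, 100, (5,))
K = k(x, j, x, j)          # (5, 5) cross-output covariance block
```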
Results
The empirical results indicate that T-LVMOGP significantly outperforms traditional MOGP baselines in predictive accuracy and computational efficiency, particularly in high-dimensional settings such as climate modeling with over 10,000 outputs and zero-inflated spatial transcriptomics data.
Implications
The T-LVMOGP framework has the potential to advance the application of Gaussian Processes in various fields requiring the modeling of complex inter-output dependencies, such as environmental science, genomics, and other domains with high-dimensional output data.
Normalized Architectures are Natively 4-Bit
Large Language Models
Efficient ML
Optimization
- nGPT architecture is natively robust to 4-bit quantization, requiring no additional overhead fixes.
- Robustness arises from coherent signal accumulation rather than noise suppression.
- Training dynamics under the hypersphere constraint promote distributed alignments across dimensions.
- Empirical validation shows lower relative error and stability across diverse model configurations.
Summary
This paper presents nGPT, a transformer architecture that constrains weights and hidden representations to the unit hypersphere, demonstrating inherent robustness to low-precision (4-bit) arithmetic. The authors argue that this architectural choice eliminates the need for common interventions like randomized Hadamard transforms and dynamic per-tensor scaling, which are typically required to maintain model quality during low-bit training. The study validates the effectiveness of nGPT on various model sizes, including a 1.2B dense model and hybrid Mamba-Transformer models with up to 30B parameters. The key finding is that the hypersphere constraint enhances the coherence of signal accumulation across dimensions, leading to a higher effective signal-to-noise ratio (SNR) and a flatter loss landscape. This suggests that the architecture itself can be designed to be quantization-ready, shifting the focus from post-hoc quantization fixes to intrinsic architectural robustness.
Methodology
The authors conducted a layer-wise structural analysis on a 3.6B parameter transformer using NVFP4 quantization. They compared the performance of nGPT against standard transformer architectures to assess the impact of the hypersphere constraint on signal coherence and quantization noise.
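A toy harness for the kind of relative-error measurement described might look as follows; note that NVFP4 is a block-scaled FP4 format, so the symmetric per-tensor integer fake-quantization here is a simplification, and the unit-norm projection only mimics nGPT's hypersphere constraint.

```python
import torch

def fake_quant_4bit(w):
    """Symmetric per-tensor 4-bit fake quantization (a stand-in for NVFP4)."""
    scale = w.abs().max() / 7.0               # signed int4 grid: -7..7
    return torch.clamp((w / scale).round(), -7, 7) * scale

def rel_error(w):
    return ((fake_quant_4bit(w) - w).norm() / w.norm()).item()

w = torch.randn(256, 256)
w_unit = w / w.norm(dim=-1, keepdim=True)     # nGPT-style unit-norm rows
print(f"raw rows: {rel_error(w):.4f}   unit-norm rows: {rel_error(w_unit):.4f}")
```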
Results
The results indicate that nGPT maintains stability and achieves lower relative error compared to standard transformers across various architectures, including a 1.2B dense model and hybrid Mamba-Transformer configurations. The hypersphere constraint leads to a higher effective SNR, allowing for more effective signal accumulation during training.
Implications
The findings suggest that future model designs can prioritize architectural features that enhance quantization robustness, potentially leading to more efficient training and deployment of large language models at low precision. This could be particularly valuable as the field moves towards increasingly low-bit arithmetic.
SNAPO: Smooth Neural Adjoint Policy Optimization for Optimal Control via Differentiable Simulation
Reinforcement Learning
Optimization
Theory
- SNAPO integrates a neural policy within differentiable simulators for optimal control.
- The framework allows for the computation of exact gradients in a single adjoint pass.
- SNAPO demonstrates significant speedups in sensitivity analysis compared to traditional methods.
- The approach is validated across three diverse domains with rapid training times.
Read more
SNAPO: Smooth Neural Adjoint Policy Optimization for Optimal Control via Differentiable Simulation
Summary
The paper introduces SNAPO (Smooth Neural Adjoint Policy Optimization), a novel framework designed to optimize sequential decision-making under uncertainty in various real-world applications, such as gas storage, pension fund management, and pharmaceutical manufacturing. SNAPO integrates a neural policy within a known, differentiable simulator, allowing for the replacement of hard constraints with smooth approximations. This enables the computation of exact gradients of the objective with respect to all policy parameters and inputs in a single adjoint pass, significantly improving efficiency over traditional methods like dynamic programming and black-box reinforcement learning. The authors demonstrate SNAPO's effectiveness across three domains, achieving rapid training times and producing multiple sensitivities at no additional cost. The framework's ability to handle high-dimensional state spaces while providing precise gradient information positions it as a powerful tool for optimal control problems.
Methodology
SNAPO employs a differentiable simulation approach, embedding a neural network policy within a known simulator. It replaces hard constraints with smooth approximations and utilizes reverse-mode automatic differentiation to compute gradients efficiently. This allows for simultaneous sensitivity analysis and policy training in a single backward pass.
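A toy sketch of the recipe under stated assumptions (the softplus constraint surrogate and the linear dynamics are invented): a hard action clamp is replaced by a smooth bound, and one reverse (adjoint) pass through the differentiable rollout yields gradients with respect to every policy parameter and the initial state at once:

```python
import torch

def smooth_clip(u, lo, hi, beta=10.0):
    """Differentiable surrogate for the hard constraint clamp(u, lo, hi)."""
    sp = torch.nn.functional.softplus
    return lo + sp(u - lo, beta) - sp(u - hi, beta)

policy = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(),
                             torch.nn.Linear(16, 1))
x0 = torch.tensor([[1.0]], requires_grad=True)

# Toy differentiable simulator: linear dynamics with bounded control.
x, cost = x0, 0.0
for _ in range(50):
    u = smooth_clip(policy(x), -1.0, 1.0)
    x = 0.95 * x + 0.1 * u
    cost = cost + (x ** 2).sum() + 0.01 * (u ** 2).sum()

# One reverse pass: exact gradients w.r.t. every policy parameter *and*
# the initial state, simultaneously.
cost.backward()
print(x0.grad, next(policy.parameters()).grad.shape)
```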
Results
The results show that SNAPO can train policies in under a minute for gas storage, achieve 6.5x–200x speedup in sensitivity computation for pension fund management, and produce regulatory sensitivity metrics in 74.5 milliseconds for pharmaceutical manufacturing. All sensitivities are generated at a cost proportional to one reverse pass, regardless of the number of sensitivities computed.
Implications
SNAPO has significant implications for industries requiring optimal control under uncertainty, such as energy management, finance, and pharmaceuticals. Its ability to efficiently compute gradients and sensitivities can enhance decision-making processes and regulatory compliance in these fields.
Do Neural Operators Forget Geometry? The Forgetting Hypothesis in Deep Operator Learning
Theory
- Introduction of the Geometric Forgetting Hypothesis, highlighting the loss of geometric information in deep operator architectures.
- Demonstration of systematic geometric information decay through layer-wise probing in spectral and attention-based operators.
- Identification of a structural limitation in transformer-based operators, termed the Geometric Shortcut, which leads to feature collapse under late geometry injection.
- Proposal of a Geometry Memory Injection mechanism that effectively restores geometric information flow with minimal architectural changes.
Read more
Do Neural Operators Forget Geometry? The Forgetting Hypothesis in Deep Operator Learning
Summary
This paper investigates the performance of neural operators on irregular geometries, revealing a critical limitation in their architecture termed the Geometric Forgetting Hypothesis. The authors argue that as the depth of neural operator architectures increases, they progressively lose access to domain geometry due to the Markovian structure of operator layers and their reliance on global mixing mechanisms. This geometric forgetting leads to degradation in accuracy, stability, and generalization capabilities. The authors validate their hypothesis through layer-wise geometric probing, demonstrating systematic loss of geometric fidelity in both spectral and attention-based operators. To address this issue, they propose a lightweight Geometry Memory Injection mechanism that restores geometric constraints at intermediate depths, effectively mitigating the forgetting phenomenon. This intervention highlights a structural requirement for geometric retention in transformer-based operators, emphasizing the necessity of early geometric integration rather than treating it as a mere design choice.
Methodology
The authors employed layer-wise geometric probing and spectral analysis to investigate the propagation of geometric information through deep neural operators. They introduced a Geometry Memory Injection mechanism to counteract geometric forgetting and conducted control studies to differentiate between intrinsic and extrinsic geometric memory.
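A minimal sketch of what a geometry re-injection layer could look like (a hypothetical module; the paper's exact mechanism may differ): the original domain coordinates are projected and added back to intermediate features so depth does not erase the geometry:

```python
import torch
import torch.nn as nn

class GeometryMemoryInjection(nn.Module):
    """Sketch: fuse the original point coordinates back into hidden
    features at an intermediate operator layer."""
    def __init__(self, hidden_dim, coord_dim=2):
        super().__init__()
        self.proj = nn.Linear(coord_dim, hidden_dim)

    def forward(self, h, coords):
        return h + self.proj(coords)   # restore geometric information flow

h = torch.randn(4, 1024, 128)      # (batch, mesh points, features)
coords = torch.rand(4, 1024, 2)    # irregular-domain coordinates
print(GeometryMemoryInjection(128)(h, coords).shape)
```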
Results
The study found that both spectral and attention-based neural operators exhibit a systematic loss of geometric fidelity as depth increases, leading to decreased accuracy and generalization. The Geometry Memory Injection mechanism was shown to effectively restore geometric information flow, improving model performance and stability.
Implications
The findings suggest that future neural operator designs must prioritize the retention of geometric information throughout the architecture. This has implications for the development of more robust models capable of generalizing to irregular geometries, which are common in real-world applications such as fluid dynamics and material science.
Bridging Input Feature Spaces Towards Graph Foundation Models
Graph Learning
- Introduces ALL-IN, a method for transferring knowledge across graph datasets with varying input features.
- Utilizes covariance-based statistics to create robust node representations independent of original feature spaces.
- Demonstrates theoretical invariance properties of node-covariance operators to permutations and transformations.
- Achieves strong empirical performance on diverse tasks without the need for architecture changes or retraining.
Read more
Bridging Input Feature Spaces Towards Graph Foundation Models
Summary
This paper addresses the challenge of transferring knowledge across graph datasets with heterogeneous input features, which has hindered the development of graph foundation models. The authors introduce ALL-IN, a novel method that projects node features into a shared random space and constructs representations using covariance-based statistics. This approach allows for the elimination of dependence on the original feature space, making the computed node-covariance operators invariant to permutations and orthogonal transformations of input features. The theoretical foundation of ALL-IN ensures robustness against dimensional mismatches and preserves task-relevant information. Empirical results demonstrate that ALL-IN achieves strong performance across diverse node- and graph-level tasks on unseen datasets with new input features, without requiring architecture changes or retraining. This work presents a promising direction for creating input-agnostic, transferable graph models, potentially advancing the field of graph learning significantly.
Methodology
The ALL-IN method projects node features into a common high-dimensional space using a stochastic projection matrix. It then computes an empirical node-covariance matrix based on these projections, capturing pairwise node similarities. This covariance matrix serves as a graph operator within a graph neural network (GNN), allowing for robust representation learning that is invariant to changes in feature semantics and dimensionality.
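A compact sketch of the described pipeline (dimensions and names are illustrative): features are randomly projected into a common space, and the node-by-node covariance of the projections becomes the graph operator, so datasets with incompatible feature dimensions yield operators of identical shape:

```python
import numpy as np

def node_covariance_operator(X, proj_dim=256, seed=0):
    """ALL-IN-style sketch: project node features into a shared random
    space, then use node-by-node covariance as the graph operator."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # A new dataset with a different feature dimension simply draws a
    # matching random projection into the same target space.
    P = rng.standard_normal((d, proj_dim)) / np.sqrt(proj_dim)
    Z = X @ P
    Z = Z - Z.mean(axis=1, keepdims=True)
    return (Z @ Z.T) / proj_dim       # n x n node-covariance matrix

# Two datasets with incompatible feature spaces map to same-shaped operators.
C_a = node_covariance_operator(np.random.randn(10, 7))
C_b = node_covariance_operator(np.random.randn(10, 300))
print(C_a.shape, C_b.shape)  # (10, 10) (10, 10)
```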
Results
Empirical evaluations show that ALL-IN significantly improves transfer performance across various node- and graph-level tasks on datasets with new input features. The method demonstrates strong generalization capabilities, confirming its effectiveness in overcoming the challenges posed by heterogeneous input features.
Implications
The findings suggest that ALL-IN could facilitate the development of graph foundation models that are more adaptable and capable of generalizing across diverse datasets and tasks. This could lead to broader applications in areas such as social network analysis, biological network modeling, and other domains where graph data is prevalent.
Mean Mode Screaming: Mean–Variance Split Residuals for 1000-Layer Diffusion Transformers
Generative Models
Multimodal
Computer Vision
- Characterization of Mean Mode Screaming (MMS) and its impact on training stability in ultra-deep DiTs.
- Introduction of Mean–Variance Split (MV-Split) Residuals to address the mean-dominated collapse state.
- Demonstration of MV-Split's effectiveness in preventing collapse and improving convergence rates compared to existing methods.
- Successful training of a 1000-layer DiT, showcasing the architecture's scalability and stability.
Read more
Mean Mode Screaming: Mean–Variance Split Residuals for 1000-Layer Diffusion Transformers
Summary
This paper addresses a critical issue in scaling Diffusion Transformers (DiTs) to extreme depths, specifically the phenomenon termed Mean Mode Screaming (MMS). MMS leads to a mean-dominated collapse state where token representations become homogenized, suppressing variation and causing training instability. The authors identify the mechanisms behind MMS, revealing that it is triggered by a spike in mean-coherent gradient components, which can occur even during stable training. To mitigate this issue, the paper introduces Mean–Variance Split (MV-Split) Residuals, which apply a separate gain to the centered residual updates while dampening the mean path. This approach stabilizes training without the convergence costs associated with existing isotropic gating methods. The authors validate their method through experiments on a 400-layer DiT, demonstrating that MV-Split prevents collapse events and converges faster than traditional methods. Furthermore, they successfully scale their architecture to a 1000-layer DiT, confirming its stability at extreme depths.
Methodology
The authors conducted mechanistic audits to identify the causes of MMS in ultra-deep DiTs. They proposed MV-Split Residuals, which combine centered residual updates with a leaky trunk-mean replacement. The methodology involved training a 400-layer DiT and a 1000-layer DiT, comparing the performance of MV-Split against traditional isotropic gating methods like LayerScale.
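One plausible reading of the split, sketched below (taking the cross-token mean is our assumption, and the leaky trunk-mean replacement is not shown): the block output is separated into mean and centered components with independent gains, so the mean path that drives the collapse can be damped on its own:

```python
import torch

def mv_split_residual(x, f_out, gain_centered=1.0, gain_mean=0.1):
    mean = f_out.mean(dim=1, keepdim=True)   # cross-token mean component
    centered = f_out - mean                  # variance-carrying component
    # Damping only the mean path targets the collapse without the blanket
    # down-scaling of isotropic gates such as LayerScale.
    return x + gain_centered * centered + gain_mean * mean

x = torch.randn(2, 16, 64)       # (batch, tokens, dim)
f_out = torch.randn(2, 16, 64)   # output of an attention/MLP block
print(mv_split_residual(x, f_out).shape)
```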
Results
The experiments showed that MV-Split effectively removed collapse events in the 400-layer DiT and converged faster than the baseline methods. In the 1000-layer DiT, the architecture remained stably trainable, validating the proposed method's effectiveness at extreme depths.
Implications
The findings suggest that MV-Split Residuals could be a crucial advancement for training ultra-deep generative models, potentially leading to more robust and scalable architectures in various applications, including text-to-image generation and other multimodal tasks.
MixINN: Accelerating Plant Breeding by Combining Mixed Models and Deep Learning for Interaction Prediction
Optimization
- MixINN combines mixed models and deep learning for improved prediction of genotype-environment interactions.
- The approach addresses the limitations of linear models by capturing nonlinear relationships in crop yield predictions.
- Evaluation on a corn multi-environment trial dataset showed significant improvements in identifying high-yielding genotypes.
- MixINN achieved a 5.8% increase in average yield for the top 20% of corn genotypes, with further improvements in specific environments.
Read more
MixINN: Accelerating Plant Breeding by Combining Mixed Models and Deep Learning for Interaction Prediction
Summary
The paper introduces MixINN, a novel approach that integrates mixed models with deep learning to enhance the prediction of genotype-environment interactions (G×E) in plant breeding. As climate change alters growing conditions, accurately predicting crop performance becomes critical for food security. The authors highlight the limitations of current methods, which often rely on linear models that fail to capture the nonlinear nature of G×E. MixINN addresses this by first using a factor-analytic mixed model to isolate high-quality genetic and environmental effects, followed by deep learning models to predict these effects for new genotypes in varying environments. The method was evaluated using a large-scale corn multi-environment trial dataset from the US, demonstrating significant improvements in predicting the ranking of corn genotypes. MixINN outperformed existing models, achieving a 5.8% increase in average yield for the top 20% of genotypes, which further improved to 7.2% when targeting specific environments. This work showcases the potential of AI in accelerating the development of climate-resilient crops, ultimately contributing to global food security.
Methodology
The methodology involves fitting a factor-analytic mixed model to decompose training samples into genetic effects, environmental effects, interaction effects, and noise. Individual deep learning models are then trained to predict these effects for new genotypes and environments, leveraging the structured correlation induced by the experimental design.
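As a stand-in for the factor-analytic mixed model (which is considerably richer), a plain two-way decomposition shows the shape of the pipeline's first stage:

```python
import numpy as np

# Illustrative decomposition: y[g, e] = mu + genetic[g] + environment[e]
# + interaction[g, e] + noise. Plain means substitute here for the
# factor-analytic mixed-model fit.
rng = np.random.default_rng(0)
Y = rng.normal(10.0, 1.0, size=(50, 8))      # genotypes x environments yield
mu = Y.mean()
gen = Y.mean(axis=1) - mu                    # genetic main effects
env = Y.mean(axis=0) - mu                    # environmental main effects
gxe = Y - mu - gen[:, None] - env[None, :]   # interaction + noise residual
# Stage 2 (not shown): train deep models to predict gen and gxe for *new*
# genotypes from marker data, then rank candidates per environment.
print(gen.shape, env.shape, gxe.shape)
```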
Results
MixINN demonstrated superior performance in identifying the top 20% of productive corn genotypes, leading to a 5.8% higher average yield compared to existing methods. This improvement increased to 7.2% when specifically targeting certain growing environments.
Implications
The findings suggest that MixINN can significantly enhance the efficiency of plant breeding programs, particularly in the context of climate change. By improving the prediction of crop performance, this approach has the potential to accelerate the development of climate-adapted crops, thereby contributing to global food security.
Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management
Large Language Models
Efficient ML
- PBKV predicts future agent invocations to optimize KV-cache management in dynamic workflows.
- The system employs hierarchical eviction and conservative prefetching to enhance cache reuse and mitigate prediction errors.
- PBKV achieves up to 1.85× speedup over LRU and 1.26× over KVFlow in experimental benchmarks.
- The predictor's performance is robust to errors, ensuring graceful degradation.
Read more
Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management
Summary
This paper addresses the challenge of managing Key/Value (KV) caches in dynamic workflows involving Large Language Model (LLM)-based agents. Existing methods either manage caches at the agent level, missing out on reuse opportunities within workflows, or at the workflow level, assuming static agent sequences. The proposed system, PBKV (Prediction-Based KV-Cache Management), predicts future agent invocations based on historical data and current context, allowing it to estimate the reuse potential of cache entries effectively. PBKV conservatively utilizes these predictions during cache eviction and prefetching to mitigate the impact of prediction errors. Experimental results demonstrate that PBKV achieves significant speedups in workflow latency and cache hit rates compared to traditional methods like LRU and the state-of-the-art KVFlow, particularly in dynamic settings. The framework's design principles, including hierarchical eviction and conservative prefetching, ensure robust performance even with imperfect predictors, making it a promising solution for efficient cache management in multi-agent systems.
Methodology
The authors developed PBKV, which integrates multi-step prediction with complementary signal fusion to forecast agent invocations. The cache management employs hierarchical eviction strategies to reclaim low-value cache entries and conservative prefetching to optimize GPU memory usage. The system was evaluated on dynamic and static workflow benchmarks to assess performance improvements in latency and cache hit rates.
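A toy sketch of prediction-aware eviction (data structures and field names are invented): entries whose agents are predicted least likely to be invoked again are reclaimed first, with recency as a tie-breaker so behavior degrades toward LRU when predictions err:

```python
import heapq
import time

def evict(cache, predicted_reuse, bytes_needed):
    """Reclaim low-value entries: unlikely-to-reuse first, then least
    recently used (the LRU fallback keeps degradation graceful)."""
    heap = [(predicted_reuse.get(agent, 0.0), entry["last_used"], agent)
            for agent, entry in cache.items()]
    heapq.heapify(heap)                      # lowest score popped first
    freed = 0
    while heap and freed < bytes_needed:
        _, _, agent = heapq.heappop(heap)
        freed += cache.pop(agent)["size"]
    return freed

cache = {"planner": {"size": 512, "last_used": time.time()},
         "critic":  {"size": 256, "last_used": time.time() - 30}}
print(evict(cache, {"planner": 0.9, "critic": 0.1}, bytes_needed=200))
```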
Results
PBKV demonstrated up to 1.85× speedup in average workflow latency and improved KV-cache hit rates by up to 2.55× compared to LRU. In static workflows, PBKV outperformed KVFlow by up to 1.26× in latency and 1.39× in cache hit rates. The predictor's design ensures stable performance even under prediction errors.
Implications
The findings suggest that PBKV can significantly enhance the efficiency of multi-agent systems in real-world applications, where dynamic workflows are common. This could lead to faster processing times and reduced computational costs in various domains utilizing LLMs.
Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs
Computer Vision
Large Language Models
Efficient ML
- Introduction of Delta-Code Generation for NAS, focusing on generating compact diffs instead of full models.
- Significant reduction in output length (75-85%) while preserving competitive accuracy metrics.
- Evaluation of three different LLMs across six diverse datasets, showcasing the robustness of the approach.
- Demonstration of the method's ability to maintain structural integrity and improve existing architectures.
Read more
Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs
Summary
This paper introduces Delta-Code Generation, a novel approach to Neural Architecture Search (NAS) that leverages large language models (LLMs) to generate compact diffs (deltas) for refining existing neural architectures instead of generating complete models from scratch. Traditional methods of NAS are computationally expensive and often produce verbose code. The proposed method fine-tunes LLMs using Low-Rank Adaptation (LoRA) on curated architectures from the LEMUR dataset, employing MinHash-Jaccard novelty filtering to ensure structural diversity. The authors evaluate three 7B-class LLMs—DeepSeek-Coder, Qwen2.5-Coder, and Mistral—across six datasets, demonstrating that their delta-based approach significantly reduces output length (by 75-85%) while maintaining competitive first-epoch accuracy. The results indicate that delta generation is a token-efficient and multi-domain alternative to full-model synthesis, with implications for improving the efficiency of NAS processes.
Methodology
The methodology involves fine-tuning LLMs using LoRA on curated architectures from the LEMUR dataset. The authors utilize MinHash-Jaccard novelty filtering to maintain diversity in the generated architectures. The evaluation is conducted across six datasets using a standardized 22-cycle protocol, comparing the performance of three 7B-class LLMs in generating architectural diffs.
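MinHash-Jaccard filtering is a standard technique; a self-contained sketch of how such a novelty filter could gate generated architectures (threshold and tokenization are hypothetical):

```python
import hashlib

def minhash(tokens, n_hashes=64):
    """MinHash signature: the estimated Jaccard similarity of two token
    sets is the fraction of matching signature slots."""
    return [min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for i in range(n_hashes)]

def est_jaccard(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def is_novel(candidate_tokens, corpus_sigs, threshold=0.7):
    """Keep a generated architecture only if it is not too similar to
    any training architecture (hypothetical threshold)."""
    sig = minhash(candidate_tokens)
    return all(est_jaccard(sig, s) < threshold for s in corpus_sigs)

arch_a = "conv3x3 relu conv3x3 pool fc".split()
arch_b = "conv3x3 relu conv3x3 pool fc softmax".split()
print(est_jaccard(minhash(arch_a), minhash(arch_b)))  # high: near-duplicate
```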
Results
The results show that all three LLMs achieved closely matched metrics, with DeepSeek-Coder achieving a valid generation rate of 75.3% and a mean first-epoch accuracy of 65.8%. The delta-based approach outperformed the full-generation baseline, which had a valid generation rate of 50.6% and a mean accuracy of 42.3%. Specifically, on CIFAR-10, the best first-epoch accuracies were 85.5% for Mistral, 85.2% for DeepSeek, and 80.6% for Qwen, all significantly higher than previous benchmarks.
Implications
The findings suggest that delta-based generation can streamline the NAS process, making it more efficient and less resource-intensive. This approach can be applied across various domains, enhancing the capabilities of LLMs in architecture generation and potentially leading to more effective neural network designs.
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
NLP
Large Language Models
Optimization
- MDN introduces a chunkwise parallel algorithm that preserves causality while enhancing training efficiency.
- The model leverages stepwise momentum to improve representation robustness and retrieval performance.
- Extensive experiments show MDN outperforms existing models like Transformers, Mamba2, and GDN.
- The paper provides a novel perspective on momentum-based updates as a second-order dynamical system.
Read more
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
Summary
This paper presents Momentum DeltaNet (MDN), a novel approach to enhance the efficiency of Linear Attention (LA) mechanisms in large language models (LLMs) by integrating a chunkwise parallel algorithm with a stepwise momentum rule. Traditional LA models, while reducing the computational complexity of self-attention, struggle with rapid information decay and suboptimal convergence due to naive Stochastic Gradient Descent (SGD) updates. The authors propose a solution that geometrically reorders update coefficients to maintain strict temporal causality while allowing for efficient parallel computation. By analyzing the momentum-based recurrence as a second-order dynamical system, they introduce stable gating constraints that improve the model's performance. The implementation leverages Triton kernels, achieving training throughput comparable to existing linear models like Mamba2 and KDA. Extensive experiments on models with 400M and 1.3B parameters demonstrate that MDN consistently outperforms strong baselines, including Transformers, across various downstream tasks.
Methodology
The authors develop a chunkwise parallel algorithm for stepwise momentum in Linear Attention, reordering update coefficients geometrically to enable efficient parallel computation. They analyze the momentum-based recurrence as a second-order dynamical system, leading to the design of stable gating constraints. The implementation is optimized using Triton kernels for improved training throughput.
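A naive sequential reference for a momentum delta rule, under our reading of the summary (the paper's contribution is the chunkwise parallel form, which this sketch deliberately does not attempt):

```python
import torch

def momentum_delta_step(S, M, k, v, beta=0.5, mu=0.9):
    # Delta-rule update: an SGD-like step on the recall error ||S k - v||.
    g = beta * torch.outer(v - S @ k, k)
    M = mu * M + g        # stepwise momentum accumulation
    S = S + M             # second-order recurrence in the state S
    return S, M

d = 8
S, M = torch.zeros(d, d), torch.zeros(d, d)
for _ in range(16):
    k = torch.nn.functional.normalize(torch.randn(d), dim=0)
    v = torch.randn(d)
    S, M = momentum_delta_step(S, M, k, v)
print((S @ k - v).norm())  # recall residual for the most recent pair
```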
Results
MDN demonstrates consistent performance improvements over strong baselines, including Transformers, Mamba2, and GDN, across various downstream evaluation benchmarks. The experiments conducted on models with 400M and 1.3B parameters show that MDN achieves training efficiency comparable to competitive linear models.
Implications
The findings suggest that integrating momentum-based optimization techniques into Linear Attention can significantly enhance the performance of large language models, potentially leading to more effective and efficient models for long-context tasks in NLP.
Can Attribution Predict Risk? From Multi-View Attribution to Planning Risk Signals in End-to-End Autonomous Driving
Computer Vision
Robotics
Interpretability
- Attribution can be used as a predictive signal for planning risk in autonomous driving.
- A novel coarse-to-fine attribution method is proposed for analyzing multi-view inputs.
- Three statistics derived from attribution maps effectively quantify decision-level risks.
- Experiments show strong correlation between attribution statistics and planning risks.
Read more
Can Attribution Predict Risk? From Multi-View Attribution to Planning Risk Signals in End-to-End Autonomous Driving
Summary
This paper explores the potential of attribution methods to predict planning risks in end-to-end autonomous driving systems. Traditional approaches to risk assessment in autonomous driving often rely on auxiliary monitoring models or textual explanations, which are decoupled from the planning process and fail to provide visual evidence for trajectory generation. The authors propose a hierarchical attribution framework that utilizes a coarse-to-fine region attribution strategy across six camera views to analyze decision-level risks. They derive three key statistics from the attribution maps—attribution entropy, within-camera spatial variance, and cross-camera Gini coefficient—to quantify reliance on visual evidence and predict planning risks. Experiments conducted on the nuScenes dataset with various autonomous driving models (BridgeAD, UniAD, and GenAD) demonstrate that these statistics correlate with planning risks, achieving significant results in collision detection and trajectory error prediction. The findings suggest that attribution can serve as a predictive signal for planning risk, enhancing the interpretability and safety of autonomous driving systems.
Methodology
The authors developed a hierarchical attribution framework that employs a coarse-to-fine search strategy across six camera views to derive attribution maps. They used L2 trajectory consistency as the objective for attribution and extracted three statistics—attribution entropy, within-camera spatial variance, and cross-camera Gini coefficient—to assess reliance on visual evidence and predict planning risks. The framework was validated using ridge regression for trajectory error and logistic regression for collision prediction on the nuScenes dataset.
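The three statistics have standard definitions; a sketch of one way to compute them from a stack of per-camera attribution maps (the paper's exact normalizations may differ):

```python
import numpy as np

def risk_statistics(attr):
    """attr: per-camera attribution maps, shape (6, H, W), values >= 0."""
    p = attr / attr.sum()                                  # joint distribution
    entropy = -(p[p > 0] * np.log(p[p > 0])).sum()         # attribution entropy
    within_var = attr.reshape(6, -1).var(axis=1).mean()    # within-camera spatial variance
    cam = np.sort(attr.reshape(6, -1).sum(axis=1))         # per-camera mass, ascending
    n = len(cam)
    gini = (2 * np.arange(1, n + 1) - n - 1) @ cam / (n * cam.sum())  # cross-camera Gini
    return entropy, within_var, gini

attr = np.abs(np.random.randn(6, 32, 32))
print(risk_statistics(attr))
```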
Results
The proposed attribution statistics achieved Spearman correlations of 0.30 ± 0.07 with trajectory error and an AUROC of 0.77 ± 0.04 for collision detection across three autonomous driving models. The method demonstrated strong generalization capabilities, maintaining an AUROC of 0.77 ± 0.09 on held-out scenes. Additionally, using an alternative attribution method (RISE) yielded a weaker but consistent signal, with an AUROC of 0.67 ± 0.11.
Implications
The findings suggest that integrating attribution methods into autonomous driving systems can enhance interpretability and safety by providing insights into decision-making processes. This could lead to improved trust in autonomous systems and better risk management strategies in real-world applications.
RVPO: Risk-Sensitive Alignment via Variance Regularization
Reinforcement Learning
Large Language Models
Optimization
- RVPO addresses constraint neglect in multi-objective RLHF by penalizing inter-reward variance.
- The LogSumExp operator is shown to effectively act as a smooth variance penalty.
- RVPO improves performance on HealthBench and maintains accuracy on GPQA-Diamond without late-stage degradation.
- The framework is validated across multiple reward signals and tool-calling scenarios.
Read more
RVPO: Risk-Sensitive Alignment via Variance Regularization
Summary
The paper introduces Reward-Variance Policy Optimization (RVPO), a novel framework designed to address the issue of constraint neglect in multi-objective reinforcement learning from human feedback (RLHF) methods. Traditional critic-less RLHF approaches, which aggregate rewards using arithmetic means, can overlook critical low-magnitude constraints by allowing high-magnitude rewards to mask failures in other objectives. RVPO shifts the focus from merely maximizing the sum of rewards to maximizing the consistency of those rewards by penalizing inter-reward variance. The authors demonstrate that the LogSumExp (SoftMin) operator can serve as an effective smooth variance penalty. The framework is evaluated on two paradigms: rubric-based medical and scientific reasoning with multiple LLM-judged reward signals and tool-calling with rule-based constraints. Results indicate that RVPO significantly improves adherence to bottleneck constraints and overall performance, outperforming existing methods like GDPO while maintaining competitive accuracy.
Methodology
The authors propose RVPO, which modifies the reward aggregation process by incorporating a variance penalty through the LogSumExp operator. This approach allows for a risk-sensitive optimization that prioritizes consistency in reward signals. The methodology is evaluated using two distinct paradigms: one involving rubric-based evaluations with multiple concurrent rewards and another focusing on rule-based tool-calling tasks.
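The LogSumExp (SoftMin) aggregator itself is standard; a small sketch shows why it resists masking, using two reward vectors with identical means:

```python
import torch

def softmin_reward(rewards, tau=0.5):
    # Smooth lower envelope: tends to min(r) as tau -> 0, and for finite
    # tau behaves like the mean minus a spread penalty.
    return -tau * torch.logsumexp(-rewards / tau, dim=-1)

balanced = torch.tensor([0.7, 0.7, 0.7])
masked = torch.tensor([1.4, 0.7, 0.0])  # same mean, one failed constraint
print(balanced.mean(), masked.mean())   # identical arithmetic means
print(softmin_reward(balanced), softmin_reward(masked))  # masked is penalized
```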
Results
RVPO demonstrated a significant improvement in overall scores on HealthBench (0.261 vs. 0.215 for GDPO at 14B, p < 0.001) and maintained competitive accuracy on GPQA-Diamond. The framework effectively mitigated constraint neglect, leading to better adherence to critical low-magnitude constraints while avoiding late-stage training collapse.
Implications
The findings suggest that RVPO can enhance the training of large language models by ensuring that critical constraints are not neglected during optimization. This has potential applications in various fields where multi-objective optimization is crucial, such as healthcare, safety-critical systems, and any domain requiring robust multi-objective alignment.
Hypothesis generation and updating in large language models
Large Language Models
NLP
Theory
- LLM predictions are well captured by a two-parameter Bayesian model, but with systematic biases favoring narrower hypotheses.
- A strong-sampling assumption leads to an implicit Occam's razor effect in hypothesis generation.
- There is a robust evaluation-generation gap, with LLMs selecting more accurate hypotheses during evaluation than during generation.
- LLMs generalize poorly to hypothesis domains not covered by observed examples, indicating limitations in their inference capabilities.
Read more
Hypothesis generation and updating in large language models
Summary
This paper investigates how large language models (LLMs) generate and update hypotheses based on sparse examples, using a controlled setting known as the number game. The study aims to understand the inference capabilities of LLMs and how closely they align with optimal Bayesian reasoning. The number game involves inferring hypotheses from a few positive integers, allowing for the exploration of inductive biases in LLMs. The authors measure the posterior over hypotheses using three complementary probes: posterior prediction, hypothesis evaluation, and hypothesis generation. The findings reveal that LLM behavior is often well described by a two-parameter Bayesian model, with systematic biases toward narrower hypotheses driven by a strong-sampling assumption. Additionally, there is a notable evaluation-generation gap, where LLMs select more accurate hypotheses during evaluation but generate simpler, rule-like hypotheses. The results indicate that LLMs struggle with generalization beyond observed examples, highlighting limitations in their ability to perform scientific inference effectively.
Methodology
The authors utilize the number game framework to analyze LLM hypothesis generation and updating. They measure the posterior over hypotheses using three probes: posterior prediction, hypothesis evaluation, and hypothesis generation. The study compares LLM behavior with an optimal Bayesian model and human behavior, assessing how hypotheses change with additional examples and how consistent they are across different measurements.
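The number game's Bayesian reference model is classical; a minimal version with the strong-sampling size principle (likelihood 1/|h|^n) shows why narrower hypotheses win as consistent examples accrue:

```python
import numpy as np

# Strong sampling: each example is assumed drawn uniformly from the true
# hypothesis, so likelihood = 1/|h|^n and narrow hypotheses dominate.
hypotheses = {
    "even":        set(range(2, 101, 2)),
    "powers of 2": {2, 4, 8, 16, 32, 64},
    "1..100":      set(range(1, 101)),
}
data = [2, 8, 16]
prior = {h: 1 / len(hypotheses) for h in hypotheses}
post = {h: prior[h] * np.prod([1 / len(s) if x in s else 0.0 for x in data])
        for h, s in hypotheses.items()}
Z = sum(post.values())
for h, p in post.items():
    print(f"{h:12s} {p / Z:.3f}")  # "powers of 2" dominates despite "even" also fitting
```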
Results
The study finds that LLM predictions align closely with a simple Bayesian fit, but with biases that affect their hypothesis generation. LLMs tend to favor narrower hypotheses and show a significant difference between hypothesis evaluation and generation. When exposed to examples from a limited domain, LLMs struggle to generalize effectively to a broader hypothesis space, indicating a lack of stable latent hypotheses.
Implications
The findings suggest that while LLMs can assist in hypothesis generation and updating, their limitations in generalization and inference may hinder their effectiveness in scientific contexts. Understanding these biases can inform the development of more robust models that better emulate Bayesian reasoning.
Towards Scalable One-Step Generative Modeling for Autoregressive Dynamical System Forecasting
Generative Models
Time Series
Efficient ML
- Introduction of MeLISA, a stochastic autoregressive surrogate model that requires only one function evaluation per forecast block.
- Development of Window-Consistency MeanFlow for non-trivial one-step generative forecasting using masked temporal context.
- Implementation of Time Increment Consistency to enforce long-horizon temporal correlations and mixing behavior.
- Demonstrated superior performance of MeLISA on high-resolution benchmarks compared to traditional neural operators.
Read more
Towards Scalable One-Step Generative Modeling for Autoregressive Dynamical System Forecasting
Summary
This paper presents MeanFlow Long-term Invariant Spatiotemporal Consistency Autoregressive Models (MeLISA), a novel approach for scalable one-step generative modeling in the context of autoregressive dynamical system forecasting. Traditional methods, such as neural operators, often struggle with long-horizon predictions due to error accumulation and distribution shifts, particularly in turbulent regimes. MeLISA addresses these challenges by utilizing a blockwise stochastic transition kernel that allows for efficient forecasting without the need for latent encoders or iterative diffusion solvers. The model incorporates two key innovations: Window-Consistency MeanFlow, which extends single-frame generation to a window-conditioned spatiotemporal generator, and Time Increment Consistency, which enforces long-horizon temporal correlations. Evaluations on high-resolution benchmarks demonstrate that MeLISA outperforms existing neural operator baselines in both short-term accuracy and long-horizon statistical metrics, while maintaining competitive inference speeds.
Methodology
MeLISA employs a blockwise stochastic transition kernel for autoregressive forecasting, integrating Window-Consistency MeanFlow and Time Increment Consistency to enhance long-horizon statistical accuracy. This approach allows for direct generation in pixel space, eliminating the need for latent encoders and multi-step solvers, thus improving rollout efficiency.
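A schematic of the blockwise one-NFE rollout (the stand-in "model" below is a toy; the real generator is a window-conditioned MeanFlow network):

```python
import torch

def rollout(model, context, n_blocks, block_len):
    """Each forecast block costs a single network evaluation (one NFE),
    conditioned on a window of past frames plus injected noise, instead
    of an iterative diffusion solve."""
    frames = [context]
    for _ in range(n_blocks):
        window = torch.cat(frames, dim=1)[:, -context.shape[1]:]  # temporal context
        noise = torch.randn(window.shape[0], block_len, window.shape[2])
        frames.append(model(window, noise))       # one NFE per block
    return torch.cat(frames[1:], dim=1)

# Toy stand-in "model": any map (window, noise) -> next block of states.
model = lambda w, z: w[:, -z.shape[1]:] * 0.9 + 0.1 * z
ctx = torch.randn(2, 8, 64)                       # (batch, time, space)
print(rollout(model, ctx, n_blocks=5, block_len=4).shape)  # (2, 20, 64)
```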
Results
MeLISA outperformed neural operator baselines in short-term forecasting accuracy and long-horizon statistical metrics, such as energy spectra and turbulent kinetic energy, on benchmarks including extended 2D Kolmogorov flow and turbulent channel-flow slices. The model achieved comparable or faster inference speeds while maintaining strong parameter efficiency with variants ranging from 3.7 to 150 million parameters.
Implications
The advancements presented in MeLISA could significantly enhance the efficiency and accuracy of simulations in various scientific domains, including fluid dynamics, weather forecasting, and other applications requiring high-dimensional dynamical system modeling.
Validity-Calibrated Reasoning Distillation
NLP
Large Language Models
Efficient ML
- VCRD treats reasoning distillation as local learning-signal allocation rather than trajectory imitation.
- The framework evaluates the local validity of teacher and student proposals to modulate distillation updates.
- VCRD preserves teacher guidance while adapting supervision to the quality of local reasoning.
- The method shows improved performance across mathematical reasoning, code generation, and instruction-following tasks.
Read more
Validity-Calibrated Reasoning Distillation
Summary
This paper introduces Validity-Calibrated Reasoning Distillation (VCRD), a novel framework for transferring multi-step reasoning capabilities from large language models (LLMs) to smaller models. Traditional reasoning distillation methods rely on static teacher-student hierarchies and trajectory imitation, which misaligns with the nature of reasoning where intermediate steps are often under-specified. VCRD reframes reasoning distillation as a problem of local learning-signal allocation rather than strict path alignment. It evaluates the relative local validity of the teacher's and student's proposed next-step actions under the same context, allowing for a dynamic and context-dependent supervision mechanism. This approach preserves the structural guidance of the teacher while adapting the strength of the distillation update based on local reasoning quality. The authors demonstrate that VCRD consistently outperforms existing distillation baselines across various benchmarks, indicating that effective reasoning distillation is better achieved through principled, locally calibrated learning signal allocation rather than rigid imitation.
Methodology
The authors propose a framework that allocates token-level supervision based on the relative local validity of teacher and student proposals. At each decision point, both models generate candidate next steps, which are evaluated by an auxiliary judge. The relative validity of these proposals determines the strength of the distillation update, allowing for three regimes: parity, attenuation, and amplification.
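A hypothetical weighting rule illustrating the three regimes (the paper's calibration is surely more refined): the per-step distillation loss is scaled by the teacher's local advantage, giving parity at equal validity, attenuation when the student is locally better, and amplification when the teacher is:

```python
import torch

def vcrd_weight(r_student, r_teacher, floor=0.1, cap=2.0):
    delta = r_teacher - r_student           # teacher's local advantage
    return (1.0 + delta).clamp(min=floor, max=cap)

# Judge-assigned local rewards for teacher/student next-step proposals.
r_t = torch.tensor([0.8, 0.5, 0.8])
r_s = torch.tensor([0.8, 0.7, 0.2])         # step 2: student better; step 3: worse
per_step_kl = torch.tensor([0.3, 0.3, 0.3]) # base distillation losses
weights = vcrd_weight(r_s, r_t)
print(weights)   # [1.0, 0.8, 1.6]: parity, attenuation, amplification
print((weights * per_step_kl).sum())
```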
Results
VCRD consistently outperformed strong distillation baselines across multiple benchmarks, demonstrating that effective reasoning distillation relies on locally calibrated learning signals rather than strict trajectory imitation. The empirical analysis revealed that in 28.8% of decision points, the student model achieved equal or higher local rewards compared to the teacher, highlighting the variability in local step quality.
Implications
The findings suggest that reasoning distillation can be significantly improved by focusing on local validity rather than global trajectory alignment. This has implications for developing more efficient and capable AI systems that leverage smaller models without sacrificing reasoning performance.
Knowledge-Free Correlated Agreement for Incentivizing Federated Learning
Federated Learning
Theory
Efficient ML
- KFCA provides a knowledge-free incentive mechanism for federated learning, avoiding the need for ground truth.
- The mechanism is strongly truthful under a categorical-world condition, mitigating vulnerabilities present in previous methods.
- KFCA enables real-time reward computation, making it applicable to decentralized and blockchain-based FL systems.
- Empirical evaluations show KFCA significantly reduces reward computation costs compared to traditional methods like Shapley value estimators.
Read more
Knowledge-Free Correlated Agreement for Incentivizing Federated Learning
Summary
This paper introduces Knowledge-Free Correlated Agreement (KFCA), a novel mechanism designed to incentivize client contributions in federated learning (FL) without the need for ground truth, public test sets, or distribution knowledge. The authors address the limitations of existing methods, particularly the Correlated Agreement (CA) mechanism, which is vulnerable to label-flipping attacks and requires extensive computational resources for report correlation. KFCA operates under a categorical-world condition, ensuring strong truthfulness and eliminating the label-flipping equilibrium of CA. The mechanism allows for real-time reward computation, making it suitable for decentralized and blockchain-based FL systems. The authors evaluate KFCA through federated large language model (LLM) adapter tuning and a real-world printed circuit board (PCB) inspection task, demonstrating its efficiency and effectiveness in environments lacking clear ground truth.
Methodology
The authors propose a multi-task peer-prediction (MTPP) mechanism that rewards clients based on the correlation of their reports without requiring a global estimation of report correlations. KFCA is designed to be strongly truthful, ensuring that honest reporting maximizes expected rewards. The mechanism is evaluated in two instantiations: KFCA-D, which uses a shared public dataset, and KFCA-QP, which relies on quantized parameter updates.
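For orientation, a sketch of the vanilla correlated-agreement scoring shape that KFCA builds on (KFCA's knowledge-free construction differs precisely in how it avoids estimating the correlation structure):

```python
import numpy as np

def ca_style_score(reports_i, reports_j, rng):
    """Reward agreement on the *same* task above baseline agreement on
    unrelated tasks, so honest correlated reports outscore random or
    uninformative ones."""
    matched = (reports_i == reports_j).mean()
    shuffled = (reports_i == rng.permutation(reports_j)).mean()
    return matched - shuffled

rng = np.random.default_rng(0)
truth = rng.integers(0, 3, size=200)          # latent task answers
honest_i = np.where(rng.random(200) < 0.8, truth, rng.integers(0, 3, 200))
honest_j = np.where(rng.random(200) < 0.8, truth, rng.integers(0, 3, 200))
random_k = rng.integers(0, 3, size=200)
print(ca_style_score(honest_i, honest_j, rng))  # positive: correlated honesty
print(ca_style_score(honest_i, random_k, rng))  # near zero: uninformative
```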
Results
KFCA achieves significantly lower reward-computation costs compared to traditional methods, demonstrating its feasibility in real-world applications. The empirical evaluations confirm that KFCA effectively incentivizes honest reporting and client participation in federated learning tasks, even in the absence of ground truth.
Implications
The development of KFCA has significant implications for the future of federated learning, particularly in decentralized environments where traditional incentive mechanisms may fail. Its applicability to blockchain-based systems could enhance the robustness and scalability of federated learning applications across various domains.
Directional Consistency as a Complementary Optimization Signal: The GONO Framework
Optimization
- Identifies the direction-loss decoupling phenomenon in optimization, where directional consistency does not guarantee loss convergence.
- Introduces GONO, an optimizer that adapts momentum based on directional alignment, improving performance in oscillation detection.
- Proves GONO matches Adam's convergence rate while providing a more effective mechanism for handling gradient direction.
- Empirical validation shows GONO's effectiveness on standard datasets like MNIST and CIFAR-10.
Read more
Directional Consistency as a Complementary Optimization Signal: The GONO Framework
Summary
This paper explores the decoupling of directional alignment and loss convergence in deep learning optimization, revealing that existing optimizers like Adam, SGD, and RMSprop do not effectively utilize temporal consistency in gradient directions. The author introduces GONO (Gradient-Oriented Norm-Adaptive Optimizer), which adapts the momentum coefficient based on the consecutive cosine similarity of gradients, enhancing momentum during directional consistency and reducing it during oscillation. The paper demonstrates that GONO retains Adam's convergence rate while providing a more nuanced approach to optimization by treating directional alignment as a first-class signal. Empirical results show that GONO achieves superior oscillation detection and remains competitive with AdamW across various datasets, establishing the practical benefits of incorporating directional consistency into optimization strategies.
Methodology
The paper employs theoretical analysis to establish the direction-loss decoupling phenomenon and introduces GONO, which modifies Adam's momentum coefficient based on the consecutive cosine similarity of gradients. The methodology includes empirical validation on benchmark datasets to assess the performance of GONO against traditional optimizers.
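The core rule is easy to state; a sketch under the assumption that cosine similarity is mapped linearly onto a momentum range (the paper's exact mapping may differ):

```python
import torch

def gono_beta1(grad, prev_grad, beta_lo=0.5, beta_hi=0.99):
    # Consecutive-gradient cosine similarity as the directional signal.
    cos = torch.nn.functional.cosine_similarity(
        grad.flatten(), prev_grad.flatten(), dim=0)
    # Map cos in [-1, 1] linearly onto [beta_lo, beta_hi] (our assumption).
    return beta_lo + (beta_hi - beta_lo) * (cos + 1) / 2

g_prev = torch.randn(100)
g_aligned = g_prev + 0.1 * torch.randn(100)   # consistent direction
g_oscillating = -g_prev                       # sign-flipping oscillation
print(gono_beta1(g_aligned, g_prev))          # near beta_hi: boost momentum
print(gono_beta1(g_oscillating, g_prev))      # at beta_lo: damp momentum
```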
Results
GONO achieves an F1 score of 1.00 in oscillation detection, significantly outperforming magnitude-based detection methods (F1 = 0.45). It demonstrates competitive accuracy on MNIST (98.15%) and CIFAR-10 (43.14%), and reaches 75.44% with ResNet-18, confirming its practical applicability in deep learning optimization.
Implications
The findings suggest that incorporating directional consistency into optimization strategies can lead to more effective training processes in deep learning, potentially improving convergence rates and model performance. GONO serves as a complementary tool to existing optimizers, enhancing their capabilities in handling complex optimization landscapes.
Ensuring Reliability in Programming Knowledge Tracing: A Re-evaluation of Attention-augmented Models and Experimental Protocols
Theory
- Empirical performance gains in PKT models can be sensitive to implementation details and experimental design.
- Improper ordering of student attempts can lead to data leakage and inflated performance estimates.
- A controlled evaluation protocol is proposed to ensure consistent and fair benchmarking of PKT models.
- The performance advantage of attention-augmented models is diminished under controlled settings.
Read more
Ensuring Reliability in Programming Knowledge Tracing: A Re-evaluation of Attention-augmented Models and Experimental Protocols
Summary
This paper addresses the reliability of Programming Knowledge Tracing (PKT) models, particularly those that utilize attention-based feature modeling combined with RNN-based sequential prediction. The authors argue that while these models have shown strong empirical performance, their reliability is often compromised by subtle implementation details and experimental design choices. They identify specific issues related to attention dimension settings and the ordering of student attempts, which can lead to violations of temporal causality and overly optimistic performance estimates. To tackle these challenges, the authors propose a controlled evaluation protocol that includes a grid search for hyperparameter selection and consistent application across cross-validation folds. They analyze the impact of assignment characteristics and maximum sequence length on model performance. Upon re-evaluating PKT models using the CodeWorkout dataset under these controlled conditions, the authors find that the performance gap between attention-augmented models and traditional DKT is significantly reduced, suggesting that increased architectural complexity does not always yield superior results. The study emphasizes the need for reliable and reproducible evaluation methods in PKT research.
Methodology
The authors conducted a systematic analysis of PKT models by implementing a controlled evaluation protocol. They selected hyperparameters through grid search on a designated fold and applied these settings uniformly across all folds during cross-validation. They also examined the effects of assignment characteristics and maximum sequence length on model performance.
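The protocol itself is easy to express in code; a sketch with a mock evaluator (all names hypothetical):

```python
from itertools import product
import numpy as np

def controlled_cv(train_eval, folds, grid):
    """Pick hyperparameters once, on a designated tuning fold, then apply
    the same setting to every remaining fold. Re-tuning per fold would
    leak test signal into model selection."""
    tune, held_out = folds[0], folds[1:]
    best = max(grid, key=lambda hp: train_eval(hp, tune))
    return best, [train_eval(best, f) for f in held_out]

grid = [dict(zip(("lr", "dim"), v)) for v in product([1e-3, 1e-4], [64, 128])]
rng = np.random.default_rng(0)
mock_eval = lambda hp, fold: rng.random()     # stand-in for fit + score
best, scores = controlled_cv(mock_eval, folds=list(range(5)), grid=grid)
print(best, np.mean(scores))
```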
Results
The re-evaluation of PKT models on the CodeWorkout dataset revealed that the performance advantages of attention-augmented models were often overstated when proper experimental protocols were applied. The findings indicated that traditional DKT models could perform comparably or even better than more complex models under specific conditions.
Implications
This work highlights the importance of rigorous evaluation protocols in machine learning research, particularly in educational contexts. It suggests that researchers should be cautious about overinterpreting performance gains attributed to architectural innovations without considering the impact of experimental design.
A Regulatory Governance Framework for AI-Driven Financial Fraud Detection in U.S. Banking: Integrating OCC, SR 11-7, CFPB, and FinCEN Compliance Requirements for Model Development, Validation, and Monitoring Lifecycles
Interpretability
- Introduces the RGF-AFFD framework integrating multiple regulatory compliance requirements for AI in fraud detection.
- Demonstrates the performance of an LSTM+XGBoost ensemble model with a ROC-AUC of 0.9289.
- Addresses the critical gap in existing literature regarding unified regulatory compliance for AI models in finance.
- Provides a Regulatory Digital Twin meta-model for continuous compliance monitoring.
Read more
A Regulatory Governance Framework for AI-Driven Financial Fraud Detection in U.S. Banking: Integrating OCC, SR 11-7, CFPB, and FinCEN Compliance Requirements for Model Development, Validation, and Monitoring Lifecycles
Summary
This paper addresses the fragmented regulatory landscape faced by U.S. financial institutions deploying AI-based fraud detection systems. It introduces the Regulatory Governance Framework for AI-Driven Financial Fraud Detection (RGF-AFFD), a three-tier governance architecture that integrates compliance requirements from the OCC, SR 11-7, CFPB, and FinCEN into a cohesive model development, validation, and monitoring lifecycle. The framework is empirically supported by a multi-study program and validated through performance benchmarks using the IEEE-CIS and ULB datasets. The study benchmarks six architectures; the LSTM+XGBoost ensemble achieves a ROC-AUC of 0.9289 and an F1 score of 0.6360, with a benefit-cost ratio of 6:1. The paper emphasizes the importance of regulatory compliance in AI deployment, highlighting that even high-performing models may not be deployable if they fail to meet regulatory standards. The RGF-AFFD is positioned as a replicable governance template that can extend beyond U.S. regulations to international frameworks, offering a pathway for financial institutions to navigate compliance while leveraging advanced AI techniques for fraud detection.
Methodology
The study employs a multi-study empirical program to develop the RGF-AFFD framework, benchmarking six AI architectures on two large datasets. It conducts ablation studies, temporal drift analysis, SHAP interpretability assessments, and fairness evaluations to ensure compliance with regulatory standards.
Results
The LSTM+XGBoost ensemble model achieved a ROC-AUC of 0.9289 and an F1 score of 0.6360, demonstrating strong performance in fraud detection. The model also exhibited a benefit-cost ratio of 6:1, indicating its economic viability. The analysis revealed that network/account-linkage features were significant indicators of fraud, and the framework successfully translates performance metrics into regulatory compliance scores.
Implications
The RGF-AFFD framework provides a structured approach for U.S. financial institutions to deploy AI-driven fraud detection systems while ensuring compliance with complex regulatory requirements. Its design can serve as a model for other countries, facilitating the global adoption of AI in finance while addressing regulatory challenges.
Diversity Curves for Graph Representation Learning
Graph Learning
- Introduction of diversity curves for size-aware graph representation learning.
- Demonstration of improved expressivity through edge contraction coarsening.
- Diversity curves outperform traditional methods in various applications.
- Method provides interpretable and scalable graph representations.
Read more
Diversity Curves for Graph Representation Learning
Summary
This paper addresses the challenge of comparing graph-level representations, particularly when graphs of different sizes are sampled from the same distribution. The authors introduce 'diversity curves', a novel approach that tracks the structural diversity of graphs across coarsening levels, allowing for interpretable and scalable size-aware graph representations. By employing edge contraction coarsening, the authors demonstrate that their method enhances expressivity and leads to more powerful graph-level representations compared to traditional structural descriptors. The utility of diversity curves is showcased through various applications, including clustering and visualizing simulated graphs, distinguishing single-cell graph geometries, comparing molecular graph structures, and characterizing geometric shapes. The results indicate that diversity curves outperform existing unsupervised baseline methods across these tasks, highlighting their effectiveness in handling geometric tasks and structural comparisons.
Methodology
The authors propose a method that utilizes edge contraction coarsening to derive hierarchical graph descriptors, tracking the spread of graphs as a novel isometry invariant. This approach allows for the construction of diversity curves that capture the structural diversity of graphs across different scales, making them interpretable and directly comparable.
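An illustrative construction of a diversity curve via random edge contraction (degree spread is only a stand-in statistic; the paper's isometry invariant is richer):

```python
import networkx as nx
import numpy as np

def diversity_curve(G, levels=5, stat=None, seed=0):
    """Coarsen a graph by repeated edge contraction and record a
    structural statistic at each level."""
    rng = np.random.default_rng(seed)
    stat = stat or (lambda H: float(np.std([d for _, d in H.degree()])))
    curve = [stat(G)]
    H = G.copy()
    for _ in range(levels):
        target = max(1, H.number_of_nodes() // 2)   # halve nodes per level
        while H.number_of_nodes() > target and H.number_of_edges() > 0:
            edges = list(H.edges())
            u, v = edges[rng.integers(len(edges))]
            H = nx.contracted_edge(H, (u, v), self_loops=False)
        curve.append(stat(H))
    return curve

print(diversity_curve(nx.erdos_renyi_graph(64, 0.1, seed=1)))
```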
Results
The experiments show that diversity curves excel in clustering simulated graph distributions, distinguishing between different types of single-cell graphs, and characterizing geometric graphs across varying sizes. The proposed method consistently outperforms several unsupervised baseline methods, demonstrating its effectiveness in geometric tasks.
Implications
The findings suggest that diversity curves can significantly enhance the analysis and comparison of graph datasets in various fields, including biology (e.g., single-cell analysis) and chemistry (e.g., molecular graph comparison). The method's ability to provide interpretable representations could facilitate better understanding and insights in graph-based applications.
Graph-SND: Sparse Aggregation for Behavioral Diversity in Multi-Agent Reinforcement Learning
Reinforcement Learning
Graph Learning
Theory
- Graph-SND provides a sparse aggregation method for measuring behavioral diversity in MARL.
- It reduces the computational cost of SND from quadratic to linear or constant time, depending on the graph structure used.
- The method maintains the semantics of SND while enabling efficient diversity control.
- Empirical results show significant improvements in metric computation time and diversity tracking.
Read more
Graph-SND: Sparse Aggregation for Behavioral Diversity in Multi-Agent Reinforcement Learning
Summary
This paper introduces Graph-SND, a novel approach to measuring behavioral diversity in multi-agent reinforcement learning (MARL) systems. Traditional System Neural Diversity (SND) relies on a complete graph to calculate pairwise distances among agents, resulting in a quadratic computational cost relative to the number of agents. Graph-SND addresses this bottleneck by employing a weighted average over the edges of an arbitrary graph, allowing for a more efficient computation. The authors present three regimes for Graph-SND: (1) using a complete graph recovers SND exactly, (2) a fixed sparse graph provides a localized diversity measure with linear cost, and (3) random edge samples yield an unbiased estimator with concentration properties. The paper includes theoretical proofs for various properties of Graph-SND, including recovery, non-negativity, and concentration bounds. Empirical evaluations on the VMAS benchmark demonstrate that Graph-SND can track full SND while significantly reducing computational costs, achieving a 10x reduction in metric time during a 500-iteration PPO run. Additionally, the method shows promise in diversity control applications, maintaining performance while reducing costs by approximately 9.5x. Overall, Graph-SND offers a scalable alternative to traditional SND, enhancing both passive measurement and active diversity control in MARL.
Methodology
The authors define Graph-SND as a sparse aggregation layer that utilizes a weighted average over the edges of a graph to compute behavioral diversity. They analyze the theoretical properties of Graph-SND, including recovery and concentration bounds, and validate the approach through experiments on the VMAS benchmark and PettingZoo environments.
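The estimator is simple to state; a sketch comparing the exact complete-graph SND with a random-edge-sample estimate (embedding vectors stand in for behavioral distances between agent policies):

```python
import numpy as np

def graph_snd(policies, edges, weights=None):
    """Graph-SND sketch: average pairwise behavioral distance over the
    edges of a chosen graph instead of all n*(n-1)/2 agent pairs."""
    if weights is None:
        weights = np.ones(len(edges)) / len(edges)
    d = [np.linalg.norm(policies[i] - policies[j]) for i, j in edges]
    return float(np.dot(weights, d))

n, dim = 32, 16
rng = np.random.default_rng(0)
policies = rng.standard_normal((n, dim))

full = [(i, j) for i in range(n) for j in range(i + 1, n)]  # complete graph: exact, quadratic
sampled = [tuple(rng.choice(n, 2, replace=False)) for _ in range(64)]  # unbiased, constant cost
print(graph_snd(policies, full), graph_snd(policies, sampled))
```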
Results
Graph-SND successfully tracks full SND with a 10x reduction in computational cost during a 500-iteration PPO run. In experiments with random d-regular expanders, the method achieves a near-perfect recovery of SND values. Additionally, it demonstrates effective diversity control in DiCo scenarios, preserving performance while significantly reducing per-call metric costs.
Implications
Graph-SND has the potential to enhance the efficiency of multi-agent reinforcement learning systems by providing a scalable method for measuring and controlling behavioral diversity. This can lead to improved robustness and specialization in agent behaviors, making it applicable in various domains such as robotics, game AI, and cooperative systems.
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
NLP
Large Language Models
- Memory Inception (MI) is a training-free method for steering LLMs using latent KV banks.
- MI provides a better control-drift trade-off compared to traditional prompting and outperforms CAA.
- The method allows for mid-conversation behavior shifts without rewriting the visible transcript.
- MI improves performance on structured reasoning tasks while significantly reducing KV storage needs.
Read more
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
Summary
This paper introduces Memory Inception (MI), a novel method for steering large language models (LLMs) that operates in latent attention space. Unlike traditional prompting, which clutters the model's cache with guidance tokens at every layer, and activation steering, which is less effective for structured reminders, MI selectively injects text-derived key-value (KV) banks at specific layers. This approach allows for more efficient memory usage and better control over the model's behavior. The authors evaluate MI across three scenarios: personality-steering tasks, mid-dialogue behavior shifts, and structured reasoning tasks. MI demonstrates a superior control-drift trade-off compared to prompting and consistently outperforms CAA (Contrastive Activation Addition) in various tasks. It also supports mid-conversation behavior updates without altering the visible transcript, achieving high alignment in the Qwen3 model. Furthermore, MI shows significant improvements in structured reasoning tasks like HARDMath and PHYSICS, while reducing KV storage requirements by up to 118 times. Overall, MI positions itself as a powerful tool for persistent and structured guidance in LLMs.
Methodology
The methodology involves encoding reminder content into latent KV banks and selectively attaching these banks at chosen attention layers during the decoding process. This selective allocation allows the model to access relevant reminders without cluttering the entire prompt cache.
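A toy single-head attention step showing the injection mechanics (the real method selects layers and encodes reminders with the model itself):

```python
import torch
import torch.nn.functional as F

def attend_with_kv_bank(q, k, v, bank_k=None, bank_v=None):
    """At a chosen layer, prepend a precomputed latent KV bank so the
    model attends to the reminder without its tokens ever appearing in
    the visible transcript."""
    if bank_k is not None:
        k = torch.cat([bank_k, k], dim=0)   # (bank + context, d)
        v = torch.cat([bank_v, v], dim=0)
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

d, ctx, bank = 32, 10, 4
q = torch.randn(1, d)
k, v = torch.randn(ctx, d), torch.randn(ctx, d)
bank_k, bank_v = torch.randn(bank, d), torch.randn(bank, d)  # encoded reminder
out_plain = attend_with_kv_bank(q, k, v)
out_steered = attend_with_kv_bank(q, k, v, bank_k, bank_v)   # injected at this layer only
print(out_plain.shape, (out_plain - out_steered).norm())
```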
Results
MI outperformed traditional prompting and CAA in personality-steering tasks, achieving the best control-drift trade-off. It also excelled in structured reasoning tasks, showing improvements in HARDMath and PHYSICS while reducing KV storage by up to 118 times. Additionally, MI facilitated mid-dialogue behavior shifts effectively.
Implications
The introduction of MI has significant implications for enhancing the control and efficiency of LLMs, particularly in applications requiring persistent guidance, such as virtual assistants, educational tools, and interactive storytelling. Its ability to manage memory usage effectively opens new avenues for deploying LLMs in resource-constrained environments.
Matrix-Decoupled Concentration for Autoregressive Sequences: Dimension-Free Guarantees for Sparse Long-Context Rewards
Large Language Models
Reinforcement Learning
Theory
- Introduction of the Matrix-Decoupled Concentration (MDC) framework to address concentration bounds in autoregressive sequences.
- Establishment of a McDiarmid-type inequality that prevents scalar collapse and guarantees dimension-free variance proxies.
- Demonstration of the framework's ability to recover optimal transport constants for Markov chains and establish bounds for causal trees.
- Proof of stability in long-context generation for LLMs by preserving coordinate-wise sparsity of sensitivity vectors.
Read more
Matrix-Decoupled Concentration for Autoregressive Sequences: Dimension-Free Guarantees for Sparse Long-Context Rewards
Summary
This paper addresses the challenges of establishing tight concentration bounds for autoregressive sequences in Large Language Models (LLMs), particularly in the context of sparse long-context rewards. The author identifies two main bottlenecks in existing frameworks: the issue of scalar collapse, which inflates variance proxies to suboptimal levels, and the mismatch of causal structures that renders certain methods inapplicable to autoregressive settings. To overcome these challenges, the paper introduces the Matrix-Decoupled Concentration (MDC) framework, which utilizes a causal interdependence matrix to maintain the structural integrity of dependencies and sensitivities. This framework allows for the derivation of a McDiarmid-type inequality that guarantees dimension-free O(1) variance proxies for sparse rewards, thus providing a rigorous mathematical foundation for evaluating autoregressive LLMs and enhancing the stability of long-context reasoning.
Methodology
The methodology involves the development of the MDC framework, which encodes the dependency structure into a causal interdependence matrix. The author derives a McDiarmid-type inequality that uses matrix-vector multiplication to determine variance proxies, thereby avoiding the scalar collapse seen in classical methods. The framework is applied to both homogeneous Markov chains and causal trees to demonstrate its generality and sharpness.
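The argument is easier to picture with the generic shape of such a bound. The LaTeX fragment below is a sketch only: the exact constants and the definition of the causal interdependence matrix M are the paper's.

```latex
% Generic shape of a matrix-decoupled McDiarmid-type bound:
\[
  \Pr\bigl( |f(X) - \mathbb{E} f(X)| \ge t \bigr)
  \;\le\; 2 \exp\!\left( - \frac{2 t^2}{\lVert M c \rVert_2^2} \right),
\]
% where c is the coordinate-wise sensitivity vector of f. Classical
% McDiarmid uses \sum_i c_i^2, which grows as O(N) over a length-N
% sequence; if c is sparse and M has bounded rows, \lVert M c \rVert_2^2
% stays O(1), which is the dimension-free variance proxy claimed above.
```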
Results
The MDC framework successfully provides dimension-free O(1) variance proxies for sparse long-context rewards, eliminating the O(N) inflation typical of classical bounds. It also recovers optimal constants for Markov chains and establishes order-optimal bounds for causal trees, thereby enhancing the theoretical understanding of autoregressive LLMs.
Implications
The findings have significant implications for the design and evaluation of autoregressive models in machine learning, particularly in applications involving long-context reasoning and sparse rewards. The MDC framework can improve the stability and reliability of LLMs, making them more effective in practical applications such as natural language processing and reinforcement learning.
Bilinear Mamba-Koopman Neural MPC for Varying Dynamics
Optimization
Reinforcement Learning
Robotics
- Introduces Bilinear Mamba-Koopman Neural MPC to address limitations of existing Koopman-based models.
- Allows for control-dependent coupling in latent dynamics, enhancing adaptability to time-varying conditions.
- Maintains convexity while adding minimal parameters and enabling efficient Sequential Convex Programming.
- Empirical results show improved forecasting accuracy and training stability in time-varying environments.
Read more
Bilinear Mamba-Koopman Neural MPC for Varying Dynamics
Summary
This paper introduces the Bilinear Mamba-Koopman Neural Model Predictive Control (MPC) framework, which enhances the traditional Koopman-based neural MPC by allowing control-dependent coupling in latent dynamics. The authors argue that existing models, while preserving convexity, limit adaptation to time-varying dynamics due to a conditional independence constraint that prevents the system operator from depending on current control inputs. The proposed bilinear extension retains the advantages of the original model while introducing minimal additional parameters (less than 1%) through a low-rank structure. This allows for exact model Jacobians, facilitating efficient Sequential Convex Programming (SCP) with guaranteed convergence under standard assumptions. Empirical evaluations on benchmarks such as CartPole and RSCP demonstrate that the bilinear model consistently matches or improves forecasting accuracy, stabilizes training in time-varying conditions, and shows graceful degradation under stale-plan execution. The findings suggest that incorporating control-dependent latent dynamics significantly enhances the robustness of MPC in varying operational conditions.
Methodology
The authors propose a bilinear extension of the Mamba-Koopman dynamics, which allows the effective operator to adapt to current control inputs. This is achieved through a low-rank parameterization that maintains the model's efficiency. The methodology includes the development of an iterative MPC algorithm that utilizes exact model Jacobians for Sequential Convex Programming, ensuring convergence and stability.
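A minimal sketch of one latent step is below, assuming the common bilinear form with low-rank factors; the names U and V are illustrative, and the paper's exact parameterization may differ.

```python
import numpy as np

def bilinear_step(z, u, A, B, U, V):
    """One latent step of control-dependent Koopman dynamics:

        z' = A z + B u + sum_i u_i * (U_i V_i) z

    Each N_i = U_i @ V_i is kept low-rank, so the bilinear terms add few
    parameters. For any fixed u the map is linear in z, which is what
    keeps each subproblem in sequential convex programming convex.
    """
    z_next = A @ z + B @ u
    for i, ui in enumerate(u):
        z_next += ui * (U[i] @ (V[i] @ z))
    return z_next

rng = np.random.default_rng(0)
nz, nu, r = 16, 2, 2                     # latent dim, control dim, rank
A = 0.1 * rng.standard_normal((nz, nz))
B = rng.standard_normal((nz, nu))
U = 0.1 * rng.standard_normal((nu, nz, r))
V = 0.1 * rng.standard_normal((nu, r, nz))
z1 = bilinear_step(np.ones(nz), np.array([0.5, -0.2]), A, B, U, V)
```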
Results
The empirical evaluation reveals that the bilinear model performs comparably or better than existing models in forecasting accuracy across various scenarios. In particular, it shows significant gains in the RSCP time-varying task, stabilizes training, and maintains performance under stale-plan execution, outperforming traditional models in robustness.
Implications
The findings suggest that the proposed bilinear approach can be effectively applied in industrial control systems where dynamics are subject to change, enhancing the performance and reliability of real-time control applications. This could lead to advancements in fields such as robotics, chemical processing, and power systems.
Diffusion-Based Posterior Sampling: A Feynman-Kac Analysis of Bias and Stability
Generative Models
Theory
Optimization
- Introduces a Feynman-Kac framework to analyze bias and stability in diffusion-based posterior samplers.
- Derives an exact bias formula for DPS, identifying regions of over- and under-sampling.
- Reinterprets STSL as a corrective measure to reduce bias by guiding trajectories toward low-uncertainty areas.
- Quantifies the instability in low-temperature regimes and characterizes early guidance-stopping as a heuristic.
Read more
Diffusion-Based Posterior Sampling: A Feynman-Kac Analysis of Bias and Stability
Summary
This paper investigates the theoretical underpinnings of diffusion-based posterior samplers, particularly focusing on their bias and stability in the context of inverse problems. The authors introduce a Feynman-Kac analysis to characterize the bias inherent in these samplers, even when using exact prior scores. They derive a pointwise Radon-Nikodym weight that relates the distribution produced by the Diffusion Posterior Sampling (DPS) method to the true posterior, revealing areas of over- and under-sampling. The paper also discusses the STSL sampler as an auxiliary drift that helps mitigate bias by steering trajectories toward low-uncertainty regions. Additionally, the authors quantify the instability issues associated with low-temperature regimes in DPS, which often necessitate early guidance-stopping techniques. By providing a unified analysis, the paper clarifies the sources of bias, explains existing corrective measures, and offers insights for designing stable variants of diffusion-based samplers.
Methodology
The authors employ a Feynman-Kac representation to derive a Radon-Nikodym weight that connects the DPS output to the true posterior. They analyze the spatially varying reaction term of the DPS to identify bias and propose an auxiliary drift for the STSL sampler to mitigate this bias. The stability of the DPS is examined through the lens of numerical analysis, particularly focusing on the forward-Euler integration method.
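For readers who want the object under analysis spelled out, the guidance approximation that DPS makes is shown below in its standard form (from the DPS literature, not this paper's notation).

```latex
% DPS replaces the intractable likelihood score with one evaluated at the
% Tweedie estimate of the clean sample:
\[
  \nabla_{x_t} \log p(y \mid x_t)
  \;\approx\;
  \nabla_{x_t} \log p\bigl( y \mid \hat{x}_0(x_t) \bigr),
  \qquad
  \hat{x}_0(x_t) = \frac{x_t + (1 - \bar{\alpha}_t)\, s_\theta(x_t, t)}
                        {\sqrt{\bar{\alpha}_t}} .
\]
% The paper's Feynman-Kac weight quantifies how far the law induced by
% this substitution sits from the true posterior; its spatially varying
% reaction term marks the over- and under-sampled regions.
```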
Results
The analysis reveals that the DPS method exhibits systematic bias even under ideal conditions, with specific regions identified where the sampling is skewed. The introduction of the STSL auxiliary drift is shown to effectively reduce this bias. Furthermore, the paper quantifies the instability in low-temperature scenarios, demonstrating that early guidance-stopping can be interpreted as a weighted prior adjustment.
Implications
The findings have significant implications for the design and application of diffusion-based samplers in various fields, including generative modeling and inverse problems in scientific and medical contexts. By understanding and mitigating bias and instability, practitioners can improve the reliability and accuracy of sampling methods.
Exact Dual Geometry of SOC-ICNN Value Functions
Optimization
Theory
- SOC-ICNNs provide an exact value-function representation for second-order cone programs.
- The dual viewpoint allows for the recovery of geometric properties such as subgradients and local curvature.
- The paper presents a structured readout mechanism for extracting first-order information from dual solutions.
- Numerical experiments confirm the effectiveness of the proposed methods and their applicability in real-world scenarios.
Read more
Exact Dual Geometry of SOC-ICNN Value Functions
Summary
This paper investigates the exact first-order and local second-order geometry of Second-Order Cone Input Convex Neural Networks (SOC-ICNNs) from a dual perspective. SOC-ICNNs enhance traditional ReLU-based Input Convex Neural Networks (ICNNs) by incorporating quadratic and conic modules, allowing for an exact representation as value functions of second-order cone programs (SOCPs). The authors derive geometric primitives such as supporting slopes, subdifferentials, directional derivatives, and local Hessians directly from optimal dual variables, providing a more explicit and analytically tractable approach to SOC-ICNN inference compared to traditional black-box automatic differentiation methods. The paper also includes numerical experiments that validate the derived formulas and a tutorial on implementing a complete white-box inference loop. The findings highlight the potential for SOC-ICNNs to facilitate more precise optimization and inference processes in machine learning applications.
Methodology
The authors develop a dual representation for SOC-ICNNs, focusing on extracting geometric information from optimal dual solutions. They analyze the first-order and second-order properties of the networks, addressing issues of structural degeneracy and local curvature through a series of mathematical derivations and numerical validations.
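The dual-readout idea can be demonstrated on a toy SOCP value function; this is a generic stand-in, not the paper's SOC-ICNN structure, and the equality duals recover the gradient only up to the solver's sign convention, which the finite-difference check resolves.

```python
import numpy as np
import cvxpy as cp

# Toy SOCP value function V(b) = min { c^T x : A x = b, ||x||_2 <= 1 }.
rng = np.random.default_rng(0)
n, m = 6, 2
A = rng.standard_normal((m, n))
c = rng.standard_normal(n)

def solve(b):
    x = cp.Variable(n)
    eq = A @ x == b
    prob = cp.Problem(cp.Minimize(c @ x), [eq, cp.norm(x, 2) <= 1.0])
    prob.solve()
    return prob.value, eq.dual_value

b0 = np.array([0.3, -0.2])
val, lam = solve(b0)

# First-order readout: the equality duals give the gradient of V at b0,
# with no automatic differentiation through the solver.
eps = 1e-5
fd = np.array([(solve(b0 + eps * e)[0] - val) / eps for e in np.eye(m)])
print("dual:", lam, "finite diff:", fd)
```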
Results
The study successfully demonstrates that first-order information can be extracted directly from dual solutions, and establishes a clear relationship between the dual structure and the geometric properties of SOC-ICNNs. The results include a unified dual representation, exact characterization of subdifferentials, and closed-form expressions for gradients and Hessians in nondegenerate regions.
Implications
The findings have significant implications for the design and application of convex neural networks, particularly in optimization tasks where precise geometric information is crucial. The ability to perform white-box inference can enhance the interpretability and reliability of neural network models in various domains, including control systems and decision-making processes.
Adaptive Learning Strategies for AoA-Based Outdoor Localization: A Comprehensive Framework
Theory
Optimization
Efficient ML
- Proposes an adaptive framework for AoA-based localization suitable for varying dataset sizes.
- Achieves 100% accuracy in distinguishing LoS and NLoS regions using a hierarchical offline learning approach.
- Implements an online learning strategy that maintains high accuracy with small datasets and low forgetting rates.
- Demonstrates the potential for robust localization in outdoor wireless environments with low-latency solutions.
Read more
Adaptive Learning Strategies for AoA-Based Outdoor Localization: A Comprehensive Framework
Summary
This paper presents a comprehensive framework for angle-of-arrival (AoA)-based outdoor localization, particularly in the context of 5G/B5G and 6G networks. The authors highlight the importance of localization for applications such as intelligent transportation and smart cities, and they address the challenges posed by varying dataset sizes for training localization models. The proposed framework includes two adaptive learning strategies: one for large datasets, employing a hierarchical offline learning approach that distinguishes between line-of-sight (LoS) and non-line-of-sight (NLoS) regions, and another for small datasets, utilizing an online learning framework with incremental tree-based models and few-shot learning techniques. The offline strategy achieves high accuracy, including 100% in distinguishing LoS from NLoS regions and 99.82% for trajectory classification in LoS. The online strategy, using the aggregated Mondrian Forest, achieves approximately 94% accuracy for trajectory classification in both regions with low forgetting rates. The results show that robust, low-latency localization is achievable outdoors without extensive dataset collection.
Methodology
The methodology consists of two main strategies: an offline learning framework for large datasets that employs hierarchical classification to differentiate between LoS and NLoS regions, and an online learning framework for small datasets that utilizes incremental tree-based models and few-shot learning to adapt to new classes with limited data.
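The offline, hierarchical part of the pipeline has a simple two-stage structure, sketched below; the random-forest models and the label names are placeholders, not the paper's choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class HierarchicalAoALocalizer:
    """Stage 1 separates LoS from NLoS; stage 2 classifies trajectories
    within each region with its own per-region model."""

    def __init__(self):
        self.region_clf = RandomForestClassifier(random_state=0)
        self.traj_clf = {r: RandomForestClassifier(random_state=0)
                         for r in ("los", "nlos")}

    def fit(self, X, region, traj):
        region, traj = np.asarray(region), np.asarray(traj)
        self.region_clf.fit(X, region)
        for r, clf in self.traj_clf.items():
            mask = region == r
            clf.fit(X[mask], traj[mask])
        return self

    def predict(self, X):
        region = self.region_clf.predict(X)
        traj = np.empty(len(X), dtype=object)
        for r, clf in self.traj_clf.items():
            mask = region == r
            if mask.any():
                traj[mask] = clf.predict(X[mask])
        return region, traj
```

The online strategy swaps the batch forests for incremental tree models that are updated one sample at a time, which is what keeps accuracy high on small datasets.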
Results
The offline learning framework achieved 100% accuracy in distinguishing LoS and NLoS regions, 99.82% accuracy for trajectory classification in LoS, and approximately 98% in NLoS. The online learning framework, using the aggregated Mondrian Forest, achieved around 94% accuracy for trajectory classification in both regions with low forgetting rates ranging from 0.0248 to 0.0427.
Implications
The findings suggest that adaptive learning strategies can significantly enhance localization accuracy in dynamic outdoor environments, making them applicable for critical use cases in smart cities and autonomous systems. This approach reduces the reliance on large datasets, facilitating more efficient deployment of localization technologies.
HUGO-CS: A Hybrid-Labeled, Uncertainty-Aware, General-Purpose, Observational Dataset for Cold Spray
Optimization
- HUGO-CS contains 4,383 cold-spray experiments, a 30x increase over previous datasets.
- The HUGO framework combines LLM-based automated extraction with manual refinement for accuracy.
- A Hierarchical Risk Mitigation strategy is implemented to balance labeling efficiency and accuracy.
- The dataset includes extensive post-processing to standardize and normalize data for usability.
Read more
HUGO-CS: A Hybrid-Labeled, Uncertainty-Aware, General-Purpose, Observational Dataset for Cold Spray
Summary
The paper introduces HUGO-CS, a comprehensive dataset comprising 4,383 cold-spray experiments derived from 1,124 sources, significantly surpassing previous datasets in both size and feature richness. Cold spraying is a solid-state manufacturing process that requires careful optimization of numerous interdependent parameters, yet existing literature lacks a large-scale, machine-readable dataset to facilitate this. The authors developed a Hybrid-labeled, Uncertainty-aware, General-purpose, Observational (HUGO) extraction framework that combines automated labeling using Large Language Models (LLMs) with manual refinement to ensure accuracy. The framework employs a Hierarchical Risk Mitigation (HRM) strategy to manage labeling efficiency and accuracy, allowing for high-risk outputs to be reviewed manually while low-risk outputs are auto-labeled. Post-processing steps include the consolidation of categorical descriptors and normalization of units, resulting in a high-quality subset of 1,765 hand-labeled experiments. The dataset is intended for benchmarking, error analysis, and enhancing predictive modeling in cold spray applications, and is made publicly available under a CC-BY license.
Methodology
The authors developed the HUGO extraction framework, which integrates automated labeling via Large Language Models with manual review processes. This hybrid approach is designed to efficiently extract and label data from scientific literature while minimizing errors. The framework also includes a Hierarchical Risk Mitigation strategy to prioritize manual review of high-risk outputs and standardizes data through extensive post-processing.
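The routing step of the HRM strategy reduces to a threshold on a per-record risk score. The sketch below is a schematic, assuming a scalar score in place of the paper's actual risk criteria; the field names are hypothetical.

```python
def route(records, risk_score, threshold=0.5):
    """Auto-accept low-risk LLM extractions; queue high-risk ones for
    manual review. `risk_score` maps a record to a number, and records
    at or above `threshold` go to the manual queue."""
    auto, manual = [], []
    for rec in records:
        (manual if risk_score(rec) >= threshold else auto).append(rec)
    return auto, manual

# Example: flag records where the LLM left required fields empty.
records = [{"gas_temp_K": 773, "pressure_MPa": 3.0},
           {"gas_temp_K": None, "pressure_MPa": 4.5}]
auto, manual = route(records, lambda r: sum(v is None for v in r.values()))
```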
Results
The HUGO-CS dataset comprises 4,383 experiments with 144 features, providing a robust resource for cold spray research. Of these, 1,765 experiments are hand-labeled, ensuring high-quality data for benchmarking and analysis. The dataset's comprehensive nature allows for improved modeling and understanding of cold spray processes.
Implications
HUGO-CS serves as a foundational dataset for researchers in the field of cold spray technology, enabling enhanced process optimization, predictive modeling, and meta-analysis. Its availability can facilitate advancements in manufacturing techniques and contribute to the development of more efficient cold spray applications.
On the Architectural Complexity of Neural Networks
Theory
Efficient ML
- Introduces a hierarchical combinatorial framework for neural networks that models tensor operations explicitly.
- Analyzes the evolution of architectural complexity in DNNs over the past 40 years, showing that architectural breakthroughs coincide with increases in complexity.
- Identifies unexplored classes of higher complexity architectures and provides a dataset of 3,028 novel architectures.
- Demonstrates that new architectures can achieve high efficiency with fewer parameters compared to existing models.
Read more
On the Architectural Complexity of Neural Networks
Summary
This paper introduces a unified theoretical framework for the rigorous analysis and systematic construction of deep neural networks (DNNs), addressing a gap in existing theories by explicitly modeling the structure of tensor operations. The framework allows for two main objectives: analyzing the evolution of architectural complexity in DNNs over the past 40 years and automatically constructing novel architectures based on new types of tensor operations. The authors find a correlation between groundbreaking architectures and increases in architectural complexity, identifying several unexplored classes of higher complexity architectures. They compile a dataset of over 3,000 higher complexity architectures, which they publicly release, demonstrating that many of these architectures exhibit remarkable parameter and depth efficiency, outperforming existing models like MobileNetV2 with significantly fewer parameters.
Methodology
The authors develop a hierarchical combinatorial framework that reinterprets multidimensional arrays as specific types of rank-3 cells built from a base set of real-valued variables. They analyze existing architectures and systematically generate new ones via novel tensor operations, using the framework to explore the space of architectural complexity.
Results
The analysis reveals that significant architectural innovations correspond to increases in various types of architectural complexity. The generated dataset of 3,028 architectures shows that many of these new designs outperform established models like MobileNetV2 in terms of efficiency, using only 10% of the parameters.
Implications
This work has implications for the design and optimization of neural network architectures, potentially leading to more efficient models that can be applied across various domains in machine learning. The framework and dataset can serve as a foundation for future research in neural architecture search and the exploration of higher complexity models.
Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven
Theory
Efficient ML
Federated Learning
- Two RHTs provide a uniform O(d^(-1/2)) approximation to Gaussian distributions for scalar quantization.
- The performance of modern quantization schemes like DRIVE and QUIC-FL can be improved using RHTs.
- Three RHTs are necessary for effective Vector Quantization to ensure weak correlation among coordinate blocks.
- A linear-time check allows for dynamic adjustment of RHT usage based on input characteristics.
Read more
Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven
Summary
This paper addresses the limitations of using Uniform Random Rotations (URRs) in quantization, particularly for gradient compression and model weight quantization. The authors propose Randomized Hadamard Transforms (RHTs) as a more efficient alternative, proving that composing two RHTs on any d-dimensional input vector yields marginal distributions within a uniform O(d^(-1/2)) of a standard Gaussian, measured in Kolmogorov and 1-Wasserstein distances. The paper applies these results to modern compression schemes like DRIVE and QUIC-FL, showing that two RHTs match the performance of URRs. For Vector Quantization (VQ), however, the authors prove that three RHTs are necessary to ensure weak correlation among coordinate blocks, which is crucial for effective quantization. Finally, they introduce a linear-time check on input moments that dynamically adapts the number of RHTs to the input, optimizing performance without sacrificing the theoretical guarantees.
Methodology
The authors analyze the performance of RHTs by composing them and studying the distributions of the transformed vectors. They employ Berry-Esseen-type probabilistic inequalities to bound how quickly these distributions converge to a Gaussian. The analysis is modular, separating general results from their application to specific quantization schemes.
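A single RHT is x -> H D x, with H the normalized Hadamard matrix and D a random diagonal sign matrix; composing them is just repeating the step with fresh signs. The sketch below uses a spike input, the classic case where one RHT fails to Gaussianize (the specific example is ours, not the paper's).

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform, O(d log d);
    len(x) must be a power of 2."""
    x = x.astype(float).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            x[i:i + h] = a + x[i + h:i + 2 * h]
            x[i + h:i + 2 * h] = a - x[i + h:i + 2 * h]
        h *= 2
    return x / np.sqrt(len(x))

def rht(x, rng):
    """One randomized Hadamard transform: random sign flip, then H."""
    return fwht(rng.choice([-1.0, 1.0], size=len(x)) * x)

rng = np.random.default_rng(0)
d = 1 << 10
spike = np.zeros(d); spike[0] = np.sqrt(d)  # adversarial for one RHT
one = rht(spike, rng)   # coordinates are exactly +/- 1: far from Gaussian
two = rht(one, rng)     # a second RHT makes the marginals near-Gaussian
```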
Results
The paper establishes that two RHTs yield a significant improvement in the distribution of scalar quantization, matching the performance of URRs. For VQ, the necessity of three RHTs is proven to ensure decorrelation among coordinate blocks. The authors also provide a method for dynamically determining the number of RHTs needed based on input characteristics, enhancing computational efficiency.
Implications
The findings suggest that RHTs can be effectively utilized in various quantization applications, particularly in federated learning and distributed systems, where efficient computation is critical. The ability to dynamically adjust the number of RHTs based on input data can lead to more robust and efficient quantization methods.