AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 58
Update frequency: 8h
Days of history: 7
CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction
NLP
Large Language Models
- CALIDIST introduces a behavior-centric approach to calibrating LLMs by measuring their stability against distractions.
- The method quantifies prediction instability and confidence stability to adaptively scale confidence scores.
- Experiments across seven benchmarks and six LLMs show that CALIDIST outperforms traditional calibration methods, reducing average ECE from 23% to 7%.
- The findings suggest that a model's susceptibility to distractions is a strong predictor of its accuracy.
Summary
The paper introduces CALIDIST, a novel post-hoc calibration method for Large Language Models (LLMs) that emphasizes the importance of behavioral robustness to distractions in assessing model confidence. Traditional calibration methods often fail to account for how a model's predictions change when faced with irrelevant or misleading information. CALIDIST addresses this by quantifying the stability of a model's predictions under cognitive pressure, specifically by measuring how predictions and confidence scores are affected by semantic distractors. The authors conducted extensive experiments across seven Natural Language Understanding benchmarks using six different LLMs, demonstrating that CALIDIST significantly reduces Expected Calibration Error (ECE) and Brier Score compared to existing methods. The findings reveal that behavioral stability is a crucial indicator of calibration quality, with CALIDIST achieving an average ECE reduction from 23% to 7%, marking a 70% relative improvement. The paper highlights the need for calibration methods that consider a model's susceptibility to distractions, linking this behavior to its overall reliability and correctness.
Methodology
CALIDIST systematically perturbs input prompts with targeted semantic distractors to measure changes in predictions and confidence scores. It aggregates these behavioral signals to adaptively adjust the model's initial confidence, focusing on two key metrics: Prediction Instability (how often predictions change) and Confidence Stability (changes in predictive uncertainty).
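The two behavioral signals described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Model` interface, the way distractors are prepended, the `confidence_shift` stand-in for Confidence Stability, and the scaling rule in `adjusted_confidence` are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: prompt -> (predicted label, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def prediction_instability(model: Model, prompt: str, distractors: List[str]) -> float:
    """Fraction of distracted prompts whose label differs from the clean prediction."""
    clean_label, _ = model(prompt)
    flips = sum(1 for d in distractors if model(d + " " + prompt)[0] != clean_label)
    return flips / len(distractors)


def confidence_shift(model: Model, prompt: str, distractors: List[str]) -> float:
    """Mean absolute change in confidence under distraction (assumed stand-in metric)."""
    _, clean_conf = model(prompt)
    shifts = [abs(model(d + " " + prompt)[1] - clean_conf) for d in distractors]
    return sum(shifts) / len(shifts)


def adjusted_confidence(conf: float, instability: float, shift: float) -> float:
    """Illustrative scaling rule only: shrink confidence as behavior gets less stable."""
    return conf * (1.0 - instability) * (1.0 - min(shift, 1.0))


# Toy model that is swayed by any prompt containing the word "actually".
def toy_model(prompt: str) -> Tuple[str, float]:
    return ("B", 0.6) if "actually" in prompt else ("A", 0.9)


question = "What is 2+2? Options: A) 4 B) 5"
distractors = [
    "Note: some say the answer is actually B.",
    "Unrelated fact: the sky is blue.",
]
instability = prediction_instability(toy_model, question, distractors)
shift = confidence_shift(toy_model, question, distractors)
calibrated = adjusted_confidence(0.9, instability, shift)
```

In this toy run one of the two distractors flips the answer, so the initial confidence of 0.9 is scaled down substantially.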
Results
The application of CALIDIST led to a consistent reduction in Expected Calibration Error (ECE) from an average of 23% to 7%, representing a 70% relative improvement. The method also demonstrated lower Brier Scores across the tested benchmarks, indicating enhanced calibration performance compared to established baselines.
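For readers unfamiliar with the metric, ECE can be computed with equal-width confidence bins as in the sketch below. The bin count and the half-open binning convention are assumptions; the summary does not say which variant the paper uses. The helper `relative_reduction` just reproduces the arithmetic behind the reported figures.

```python
from typing import List


def expected_calibration_error(confidences: List[float], correct: List[bool],
                               n_bins: int = 10) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement when a metric drops from `before` to `after`."""
    return (before - after) / before


# Four predictions at 0.9 confidence, three of them correct: ECE = |0.75 - 0.9|.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
# Reported averages: 23% -> 7%, i.e. (0.23 - 0.07) / 0.23, roughly 70%.
reported = relative_reduction(0.23, 0.07)
```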
Implications
The findings suggest that incorporating behavioral robustness into calibration methods can significantly improve the reliability of LLMs in high-stakes applications. This approach could lead to safer and more trustworthy AI systems, particularly in domains where accurate confidence estimation is critical.
Non-Negative Matrix Factorization for Event Data
Time Series
- EventNMF operates directly on continuous-time event data without preprocessing, preserving fine-grained temporal features.
- The model utilizes a Poisson process framework with non-negative B-spline basis for intensity factorization.
- Efficient parameter estimation is achieved through multiplicative updates.
- EventNMF is validated on synthetic and real-world datasets, demonstrating its effectiveness in various applications.
Summary
This paper introduces EventNMF, a novel continuous-time non-negative matrix factorization (NMF) model designed specifically for event data, which consists of instantaneous events emitted by entities over time. Traditional applications of NMF to event data typically involve preprocessing steps such as binning or smoothing, which can obscure important temporal features and entity-level heterogeneities. EventNMF addresses these issues by modeling each entity's events as a Poisson process, where the intensity function is represented as a non-negative linear combination of latent temporal factors, decomposed using a non-negative B-spline basis. The paper provides a mathematically principled framework that is easy to implement and computationally efficient. The authors derive efficient multiplicative updates for parameter estimation and demonstrate the effectiveness of EventNMF through evaluations on synthetic datasets and real-world applications, including neuronal spike train recordings, earthquake event catalogs, and social interaction data in a primary school setting. The results indicate that EventNMF can uncover interpretable temporal patterns without the biases introduced by traditional binning methods.
Methodology
The methodology involves modeling event data as Poisson processes, where the intensity of events is represented as a non-negative linear combination of latent factors. These factors are decomposed using a non-negative B-spline basis, allowing for the direct analysis of event times without prior binning or smoothing. The authors derive efficient multiplicative updates for estimating model parameters.
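The modeling idea can be illustrated with a small log-likelihood sketch. It assumes degree-1 B-splines (hat functions) on evenly spaced knots as the non-negative basis, and the names `w_row` (one entity's loadings) and `H` (factor coefficients on the basis) are hypothetical; the paper's actual basis order, parameterization, and multiplicative updates are not reproduced here.

```python
import math
from typing import List


def hat_basis(t: float, n_knots: int, horizon: float) -> List[float]:
    """Degree-1 B-spline (hat) basis on evenly spaced knots over [0, horizon]."""
    h = horizon / (n_knots - 1)
    return [max(0.0, 1.0 - abs(t - k * h) / h) for k in range(n_knots)]


def hat_integrals(n_knots: int, horizon: float) -> List[float]:
    """Exact integral of each hat function over [0, horizon]."""
    h = horizon / (n_knots - 1)
    return [h / 2 if k in (0, n_knots - 1) else h for k in range(n_knots)]


def intensity(t: float, w_row: List[float], H: List[List[float]], horizon: float) -> float:
    """lambda_i(t) = sum_r w_row[r] * sum_k H[r][k] * b_k(t), all terms non-negative."""
    basis = hat_basis(t, len(H[0]), horizon)
    return sum(w * sum(hk * bk for hk, bk in zip(H[r], basis))
               for r, w in enumerate(w_row))


def log_likelihood(events: List[float], w_row: List[float],
                   H: List[List[float]], horizon: float) -> float:
    """Poisson-process log-likelihood: sum of log-intensities minus integrated intensity."""
    integrals = hat_integrals(len(H[0]), horizon)
    compensator = sum(w * sum(hk * ik for hk, ik in zip(H[r], integrals))
                      for r, w in enumerate(w_row))
    return sum(math.log(intensity(t, w_row, H, horizon)) for t in events) - compensator


# One entity, one latent factor with constant intensity 1 on [0, 2], two events.
ll = log_likelihood([0.5, 1.5], [1.0], [[1.0, 1.0, 1.0]], horizon=2.0)
```

Because the event times enter the likelihood directly, no binning or smoothing step is needed before fitting.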
Results
The results show that EventNMF successfully identifies interpretable temporal patterns in synthetic datasets with known latent factors and performs effectively on real-world data, including neuronal spike train recordings, earthquake event catalogs, and social interaction data. The model outperforms traditional binned-count approaches, highlighting its ability to retain essential temporal information.
Implications
The implications of this research extend to various fields such as neuroscience, seismology, and social network analysis, where understanding temporal patterns in event data is crucial. EventNMF provides a robust tool for researchers to analyze complex event-driven datasets without the drawbacks of traditional preprocessing methods.
Mitigating the Curse of Dimensionality in Uniform Convergence of Deep Neural Networks via Smooth Activations
Theory
- Smoothly activated DNNs provide stronger uniform convergence guarantees compared to ReLU networks.
- The paper establishes the first theoretical lower bound for the uniform convergence of ReLU FNNs, demonstrating their limitations.
- A comprehensive theoretical framework for smooth DNNs is developed, including pseudo-dimension bounds and approximation guarantees.
- Uniform convergence rates are derived for smooth DNN estimators across various statistical contexts.
Summary
This paper presents a theoretical framework addressing the uniform convergence of deep neural networks (DNNs) with smooth activations, highlighting the limitations of standard ReLU networks. While ReLU networks achieve optimal rates in the L2(P) norm for nonparametric regression tasks, they suffer from the curse of dimensionality in uniform convergence. The authors establish a theoretical lower bound demonstrating this limitation and propose smoothly activated DNNs as a solution. The paper introduces novel pseudo-dimension bounds, non-asymptotic approximation guarantees, and Hölder-norm bounds for these models. The results indicate that smooth DNNs can effectively mitigate the curse of dimensionality by leveraging the low-dimensional hierarchical structure of target functions. The findings are supported by simulation studies and a real-world application, positioning smooth DNNs as a robust alternative to ReLU networks for statistical learning tasks requiring uniform guarantees.
Methodology
The authors analyze the uniform convergence of smoothly activated DNNs by establishing theoretical bounds and guarantees. They derive pseudo-dimension bounds and approximation error bounds for Sobolev functions and hierarchical composition models. The analysis includes both feedforward and residual architectures, providing a comprehensive framework for understanding the uniform convergence properties of smooth DNNs.
Results
The paper demonstrates that smooth DNNs can achieve non-asymptotic uniform convergence rates that mitigate the curse of dimensionality. The theoretical framework includes bounds for pseudo-dimension and approximation errors, leading to uniform convergence guarantees for various regression tasks, including Huber, least-squares, quantile, and logistic regression.
Implications
The findings suggest that smoothly activated DNNs are a viable alternative to traditional ReLU networks, particularly in applications requiring reliable uniform convergence. This has implications for statistical learning tasks, individualized treatment recommendations, and the construction of valid confidence bands.
Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving
Large Language Models
Efficient ML
NLP
- Tangram addresses inefficiencies in multi-turn LLM serving caused by non-uniform KV caches.
- The system employs three core techniques to optimize memory management and scheduling.
- Experimental results show a throughput improvement of up to 2.6× without sacrificing model accuracy.
- Tangram's implementation is publicly available, promoting further research and application.
Summary
The paper introduces Tangram, a novel serving system designed to enhance the efficiency of multi-turn Large Language Model (LLM) serving by addressing the challenges posed by non-uniform Key-Value (KV) caches. As multi-turn interactions require the retention of extensive dialogue history, the linear growth of KV caches leads to significant GPU memory and bandwidth constraints. Traditional uniform KV compression methods often compromise accuracy, while non-uniform approaches, though superior, introduce systemic inefficiencies such as memory fragmentation and scheduling complexities. Tangram tackles these issues through three innovative techniques: Deterministic Budget Allocation, which assigns static memory footprints to attention heads to eliminate dynamic scheduling overhead; Head Group Page, which clusters attention heads with similar retention needs into independent page tables to maximize memory reclamation; and Ahead-of-Time Load Balancing, which precomputes optimal GPU workload distributions to ensure uniform utilization. Experimental results demonstrate that Tangram achieves up to 2.6× throughput improvement over existing systems while maintaining model accuracy, making it a significant advancement in LLM serving efficiency.
Methodology
Tangram employs a holistic framework that integrates deterministic memory scheduling, decoupled paging architecture, and ahead-of-time load balancing. It utilizes stable KV cache retention patterns to optimize memory allocation and scheduling, thereby reducing overhead and improving GPU utilization.
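To make the Head Group Page idea more concrete, the sketch below groups attention heads whose static token budgets round up to the same number of KV pages, so each group could share one page table. The grouping criterion, `page_size`, and the function name are assumptions for illustration; Tangram's actual clustering rule is not described in this summary.

```python
import math
from collections import defaultdict
from typing import Dict, List


def group_heads_by_pages(head_budgets: List[int], page_size: int) -> Dict[int, List[int]]:
    """Map 'pages needed' -> list of head indices with that static footprint."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for head, budget in enumerate(head_budgets):
        pages = math.ceil(budget / page_size)  # static footprint of this head
        groups[pages].append(head)
    return dict(sorted(groups.items()))


# Five heads with non-uniform retention budgets (in tokens), 64-token pages.
groups = group_heads_by_pages([128, 120, 512, 500, 64], page_size=64)
```

Heads within a group have identical page counts, which is one simple way a fixed per-head budget can avoid fragmentation from mixed-size allocations.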
Results
Tangram demonstrated a throughput increase of up to 2.6× compared to existing LLM serving systems while fully preserving the accuracy of the model. The implementation effectively mitigated issues related to memory fragmentation and scheduling complexities.
Implications
The advancements presented in Tangram could significantly enhance the scalability and efficiency of AI assistants and other applications requiring multi-turn interactions, enabling more responsive and resource-efficient LLM serving in real-world scenarios.
Two-Way Is Better Than One: Bidirectional Alignment with Cycle Consistency for Exemplar-Free Class-Incremental Learning
Theory
Efficient ML
Computer Vision
- Introduces BiCyc, a bidirectional cycle consistency approach for EFCIL.
- Addresses systematic bias in existing one-directional projection methods.
- Proves that cycle loss minimizes classification perturbations and stabilizes old-class decisions.
- Demonstrates substantial improvements in accuracy and reduction of forgetting in EFCIL benchmarks.
Summary
This paper addresses the challenge of exemplar-free class-incremental learning (EFCIL), where models must learn new tasks without retaining past data, leading to representation drift and catastrophic forgetting. The authors propose a novel approach called BiCyc, which utilizes bidirectional projector alignment with a cycle-consistency objective. This method optimizes two mappings—one from old to new classes and another from new to old—during task training, allowing for co-evolution of transport and representation. Theoretical analysis demonstrates that minimizing the cycle loss stabilizes classification log-odds and reduces bias towards recent classes. Empirical results show that BiCyc significantly mitigates forgetting and enhances accuracy across standard EFCIL benchmarks, outperforming existing methods while maintaining competitive performance in pretrained settings.
Methodology
The BiCyc method involves learning two projections—A (old to new) and D (new to old)—with stop-gradient gating to prevent retroactive updates. A cycle-consistency objective is employed to ensure that the mappings behave nearly inversely on the data support, thus addressing the issues of asymmetry and post-hoc mismatches found in previous methods. Theoretical foundations are established to link cycle loss minimization to stability in classification log-odds.
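A cycle-consistency objective of this kind can be sketched with two linear maps. Treating A and D as plain matrices, using a squared-error penalty, and averaging over both directions are assumptions for this example; the stop-gradient gating and the training loop are omitted.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def squared_error(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cycle_loss(A: Matrix, D: Matrix, old_feats: List[Vector],
               new_feats: List[Vector]) -> float:
    """Mean squared cycle error: D(A(x)) should return x, and A(D(y)) should return y."""
    forward = sum(squared_error(matvec(D, matvec(A, x)), x) for x in old_feats)
    backward = sum(squared_error(matvec(A, matvec(D, y)), y) for y in new_feats)
    return (forward + backward) / (len(old_feats) + len(new_feats))


A = [[2.0, 0.0], [0.0, 2.0]]          # old -> new (toy scaling map)
D_inverse = [[0.5, 0.0], [0.0, 0.5]]  # exact inverse of A
D_identity = [[1.0, 0.0], [0.0, 1.0]]  # not an inverse of A

loss_consistent = cycle_loss(A, D_inverse, [[1.0, 0.0]], [[0.0, 1.0]])
loss_inconsistent = cycle_loss(A, D_identity, [[1.0, 0.0]], [[0.0, 1.0]])
```

The loss vanishes only when the two maps invert each other on the sampled features, which is the "nearly inverse on the data support" behavior described above.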
Results
The BiCyc approach shows a marked reduction in forgetting and improved accuracy on standard EFCIL benchmarks. It outperforms state-of-the-art methods in from-scratch settings and remains competitive in pretrained fine-grained scenarios, demonstrating its effectiveness in mitigating representation drift.
Implications
The findings suggest that bidirectional alignment and cycle consistency can significantly enhance the performance of models in continual learning scenarios, particularly in applications where data retention is not feasible. This could be particularly useful in fields such as robotics, autonomous systems, and any domain requiring adaptive learning without access to historical data.
Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference
NLP
Large Language Models
Theory
- Introduces PRECISE, a framework for bias-corrected ranking evaluation using LLMs.
- Achieves unbiased estimates for hierarchical metrics like Precision@K.
- Demonstrates a 21% reduction in standard error when augmenting human annotations with LLM judgments.
- Successfully identifies the best system variant in a production setting, leading to increased sales.
Summary
This paper presents a novel framework called PRECISE, which extends Prediction-Powered Inference (PPI) to provide bias-corrected estimates for ranking evaluation metrics by integrating a small set of human-labeled data with a larger set judged by a large language model (LLM). The authors address the challenge of systematic biases in LLM judgments that can distort evaluation metrics. By statistically correcting for these biases, the framework achieves unbiased estimates even when the LLM has a flawed error profile. The methodology is particularly applicable to hierarchical metrics like Precision@K, where human annotations are per-document while the metric is computed per-query. The authors demonstrate that augmenting a small number of human annotations with LLM judgments significantly reduces the standard error of estimates. In practical applications, the framework successfully identified the best variant among three system options in a production environment, leading to a notable increase in daily sales and click-through rates. The results indicate that the proposed method not only enhances the reliability of ranking evaluations but also has the potential to be generalized to other metrics that aggregate fine-grained judgments.
Methodology
The methodology involves extending Prediction-Powered Inference (PPI) to combine a small gold set of human annotations with a large set of LLM-annotated data. The bias-correction mechanism adjusts for the systematic errors of the LLM, allowing for unbiased metric estimates. The approach also reformulates the output space for hierarchical metrics to make computation tractable.
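The core bias-correction step can be illustrated with the standard prediction-powered estimate of a mean: average the LLM judgments on the large set, then add the average human-minus-LLM gap measured on the small gold set. This is a sketch of basic PPI only; PRECISE's reformulation for hierarchical metrics such as Precision@K is not shown, and the argument names are hypothetical.

```python
from typing import List


def ppi_mean(llm_on_unlabeled: List[float], llm_on_gold: List[float],
             human_on_gold: List[float]) -> float:
    """Prediction-powered mean: LLM average plus a rectifier from the gold set."""
    llm_average = sum(llm_on_unlabeled) / len(llm_on_unlabeled)
    rectifier = sum(h - l for h, l in zip(human_on_gold, llm_on_gold)) / len(human_on_gold)
    return llm_average + rectifier


# The LLM marks 75% of the large set relevant, but on the gold set it
# over-predicts relevance by 0.5 on average, so the estimate is pulled down.
estimate = ppi_mean(
    llm_on_unlabeled=[1.0, 1.0, 1.0, 0.0],
    llm_on_gold=[1.0, 1.0],
    human_on_gold=[1.0, 0.0],
)
```

The rectifier is what lets the estimate stay unbiased even when the LLM judge has systematic errors, as long as the gold set is a random sample.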
Results
The application of the PRECISE framework on the ESCI benchmark showed a reduction in standard error from 4.45 to 3.50 for Precision@4 estimates, representing a 21% relative reduction. In a production A/B test, the framework accurately ranked system variants, leading to a 407 basis point increase in daily sales and a 571 basis point increase in click-through rate.
Implications
The findings suggest that the PRECISE framework can significantly improve the reliability of ranking evaluations in various applications, particularly in settings where human evaluation is costly. The methodology can be adapted for other metrics that require aggregation of detailed judgments, potentially transforming evaluation practices in machine learning and information retrieval.
Diffusion Models for Adaptive Sequential Data Generation
Generative Models
Time Series
Theory
- Introduction of a new diffusion model framework (AD-Seq) for adapted sequential data generation.
- Ensures that generated data respects temporal information flow and causal structure.
- Develops a novel score-matching objective for scalable parallel training.
- Provides statistical learning theory guarantees for the proposed framework.
Summary
This paper addresses the challenge of generating realistic synthetic sequential data using diffusion models while preserving the temporal dependence and adaptiveness required in various applications such as finance, healthcare, and operations research. The authors propose a novel framework called Adaptive Diffusion for Sequential Data Generation (AD-Seq), which employs a sequential forward-backward diffusion process. This method ensures that each generated data point depends only on previously generated values, thus maintaining the causal structure of the data. The authors introduce a new score-matching objective for efficient parallel training and provide rigorous statistical guarantees for the framework. They validate their approach through empirical experiments on synthetic datasets, including ARMA models and Gaussian processes, demonstrating its effectiveness in constructing mean-variance optimal portfolios. The proposed framework not only facilitates time series generation but also has broader applications in predictive decision-making and statistical inference under information-flow constraints.
Methodology
The authors propose a sequential forward-backward diffusion framework that generates time series data by conditioning on previously generated history. They introduce a score-matching objective that allows for parallel training of score functions, ensuring scalability. The framework is analyzed under a general statistical learning theory that includes guarantees for score approximation, estimation, and distribution estimation.
Results
The empirical validation shows that the AD-Seq framework effectively generates synthetic sequential data that adheres to the required temporal dependencies. The method successfully constructs mean-variance optimal portfolios, demonstrating its practical applicability in financial contexts.
Implications
The AD-Seq framework opens new avenues for generating synthetic data in various fields where temporal information flow is critical. It can be applied in multi-step prediction, predictive decision-making, and statistical inference, making it a valuable tool for researchers and practitioners dealing with sequential data.
Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models
NLP
Large Language Models
Reinforcement Learning
- Introduction of AH-GRPO, which adapts token-level discounting based on entropy to improve training efficiency.
- Development of SA-AH-GRPO, which selectively applies discounting to negative-advantage rollouts, enhancing learning stability.
- SA-AH-GRPO achieves a 3.6× reduction in training variance on the 3B model while maintaining peak accuracy.
- Demonstrated improvements over zero-shot baselines, indicating the effectiveness of the proposed methods.
Summary
This paper introduces two novel extensions to the Group Relative Policy Optimisation (GRPO) algorithm for reinforcement learning in language models: Adaptive-Horizon GRPO (AH-GRPO) and Selective-Advantage AH-GRPO (SA-AH-GRPO). The standard GRPO algorithm treats all token positions and rollouts symmetrically, which can lead to inefficiencies in training. AH-GRPO incorporates an entropy-based discount that adjusts the effective horizon based on the model's uncertainty, while SA-AH-GRPO selectively applies this discount only to negative-advantage rollouts, preserving the learning signal from successful trajectories. The authors benchmark these methods on the GSM8K mathematical reasoning task using two different model sizes (1.5B and 3B parameters). The results demonstrate that SA-AH-GRPO achieves superior performance, with a peak Pass@1 accuracy of 0.858 on the 3B model and a significant reduction in training variance compared to standard GRPO. This work highlights the importance of asymmetric discounting in reinforcement learning for structured generation tasks, suggesting a principled approach to improving model training stability and accuracy.
Methodology
The authors propose two hierarchical extensions to GRPO: AH-GRPO, which uses an entropy-adaptive discount to adjust the effective gradient horizon based on uncertainty, and SA-AH-GRPO, which applies this discount selectively to negative-advantage rollouts. The methods were evaluated on the GSM8K benchmark using two fine-tuned language models, comparing performance metrics such as Pass@1 accuracy and training variance.
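The asymmetry between the two variants can be sketched as a per-token weighting rule. Everything specific here is an assumption for illustration: the linear `entropy_to_gamma` map, the `gamma_min` floor, and the choice to accumulate the discount forward over token positions are not taken from the paper.

```python
from typing import List


def entropy_to_gamma(entropy: float, max_entropy: float, gamma_min: float = 0.9) -> float:
    """Hypothetical map: higher token entropy gives a stronger (smaller) discount factor."""
    frac = min(max(entropy / max_entropy, 0.0), 1.0)
    return 1.0 - (1.0 - gamma_min) * frac


def token_weights(advantage: float, entropies: List[float], max_entropy: float,
                  gamma_min: float = 0.9) -> List[float]:
    """Per-token loss weights: discounted only when the rollout's advantage is negative."""
    if advantage >= 0:
        # Positive-advantage rollouts keep their full learning signal.
        return [1.0] * len(entropies)
    weights, running = [], 1.0
    for h in entropies:
        weights.append(running)
        running *= entropy_to_gamma(h, max_entropy, gamma_min)
    return weights


good = token_weights(1.0, [2.0, 0.0, 2.0], max_entropy=2.0, gamma_min=0.5)
bad = token_weights(-1.0, [2.0, 0.0, 2.0], max_entropy=2.0, gamma_min=0.5)
```

Dropping the `advantage >= 0` branch would apply the discount to every rollout, which corresponds to the symmetric AH-GRPO behavior described above.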
Results
SA-AH-GRPO achieved a peak Pass@1 accuracy of 0.858 on the 3B model and 0.686 on the 1.5B model, with a significant reduction in training variance (0.0246) compared to GRPO. The method also outperformed the zero-shot baseline by 4.9 percentage points on the 1.5B model.
Implications
The findings suggest that asymmetric discounting based on token-level uncertainty can lead to more efficient training of language models, particularly in tasks requiring structured reasoning. This approach could be beneficial for various applications in natural language processing and reinforcement learning.
Steering Vectors are an Adversarial Attack Surface
NLP
Large Language Models
Optimization
- Identification of contrastive steering datasets as a novel attack surface.
- Demonstration of a stealthy data poisoning attack that alters steering vectors.
- Validation of the attack on two open-weight model families and eight model-attribute combinations, raising ASR by 19-51 percentage points over clean vectors.
- Proposal of a defense mechanism that mitigates the attack's effectiveness.
Summary
This paper investigates the vulnerabilities associated with activation steering in Large Language Models (LLMs), a technique that allows users to control model behavior without fine-tuning. The authors reveal that the sharing of steering datasets and precomputed vectors introduces a significant supply-chain threat, where a stealthy data poisoning attack can compromise the integrity of the steering vectors. By subtly altering 4-6% of tokens in the contrastive pairs used to compute these vectors, an attacker can align the resulting vector with an anti-refusal direction, effectively jailbreaking the model while maintaining its intended behavior on benign prompts. The study tests this attack on two open-weight model families and eight model-attribute combinations, where poisoned vectors reach absolute attack success rates (ASR) of 20-55%, an increase of 19-51 percentage points over clean references. Additionally, the authors propose a defense mechanism that can recover approximately 82% of the ASR gap without negatively impacting benign behavior, highlighting the need for robust defenses against such vulnerabilities.
Methodology
The authors employed a GCG-style optimization approach constrained to embedding-space neighbors, utilizing fluency penalties and safe-vocabulary filtering to create poisoned contrastive pairs. They validated the effectiveness of the attack on two open-weight model families and various model-attribute combinations, measuring the attack success rates and the impact on benign prompts.
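To see why the contrastive dataset is an attack surface, it helps to recall a common way steering vectors are built: the difference between mean activations on the positive and negative sides of the pairs. The sketch below uses that difference-of-means construction as an assumption (the paper may use another estimator) with made-up 2-D activations, and measures alignment with a hypothetical anti-refusal direction via cosine similarity.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(positive_acts: List[Vector], negative_acts: List[Vector]) -> Vector:
    """Difference of mean activations over the two sides of the contrastive pairs."""
    pos, neg = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


anti_refusal = [0.0, 1.0]  # hypothetical direction the attacker wants to hit
negatives = [[0.0, 0.0], [0.0, 0.0]]

clean = steering_vector([[1.0, 0.0], [3.0, 0.0]], negatives)
# Poisoned pairs shift activations slightly along the anti-refusal axis.
poisoned = steering_vector([[1.0, 1.0], [3.0, 1.0]], negatives)

clean_alignment = cosine(clean, anti_refusal)
poisoned_alignment = cosine(poisoned, anti_refusal)
```

Because the vector is an average over the dataset, small consistent shifts in the examples move it toward the attacker's direction while its main component stays in place.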
Results
The study found that the poisoned vectors achieved an absolute attack success rate (ASR) of 20-55%, which is an increase of 19-51 percentage points over clean vectors. The defense mechanism proposed was able to recover approximately 82% of the ASR gap without harming the model's performance on benign prompts.
Implications
The findings underscore the importance of securing the supply chain in machine learning applications, particularly in LLMs, where shared datasets can be easily compromised. The proposed defense strategies could inform future research on enhancing model robustness against adversarial attacks.
Adaptive state-action abstractions via rate-distortion
Reinforcement Learning
Robotics
Theory
- Introduces soft state-action abstractions that allow for dynamic granularity adjustment.
- Develops a learning-abstraction decomposition that separates value error into learning and abstraction errors.
- Proposes an adaptive abstraction principle that refines abstractions based on learning progress.
- Validates the framework on tabular control benchmarks, achieving near-optimal performance with lossy compression.
Summary
This paper addresses the challenge of dynamically adjusting the granularity of state-action abstractions in reinforcement learning (RL). Drawing inspiration from how infants learn to walk by first mastering coarse tasks before refining them, the author proposes a principle for refining abstractions based on the relationship between learning error and abstraction error. The key contribution is a performance certificate that decomposes value error into a Bellman residual (representing learning error) and an abstraction error bound (measured by a bisimulation metric). The paper introduces soft state-action abstractions derived from rate-distortion principles, allowing for continuous adjustment of resolution along state and action axes. The proposed adaptive rule suggests that refinement should occur when the learning error becomes comparable to the abstraction error. The framework is validated through experiments on classic tabular control benchmarks and a SysAdmin scaling test, demonstrating that it achieves near-optimal performance even with significant lossy compression of state and action information. This work not only provides a method for adaptive abstraction refinement but also quantifies the compressibility of tasks in terms of state and action information.
Methodology
The methodology involves creating a continuous family of soft state-action abstractions using rate-distortion principles. The author formulates a performance certificate that decomposes value error into a Bellman residual (the learning error) and an abstraction error bound measured by a bisimulation metric. An adaptive rule is established to refine abstractions once the learning error becomes comparable to the abstraction error.
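A soft abstraction in the rate-distortion sense can be sketched with the usual Boltzmann-style assignment p(z|s) proportional to p(z) exp(-beta d(s, z)), where beta trades compression against distortion. The distortion matrix, the prior over abstract states, and the `ratio` threshold in `should_refine` are assumptions for the example rather than the paper's exact construction.

```python
import math
from typing import List


def soft_assignments(distortion: List[List[float]], prior: List[float],
                     beta: float) -> List[List[float]]:
    """p(z|s) proportional to prior[z] * exp(-beta * distortion[s][z])."""
    result = []
    for row in distortion:
        weights = [p * math.exp(-beta * d) for p, d in zip(prior, row)]
        total = sum(weights)
        result.append([w / total for w in weights])
    return result


def should_refine(bellman_residual: float, abstraction_error: float,
                  ratio: float = 1.0) -> bool:
    """Hypothetical trigger: refine once learning error is no larger than
    `ratio` times the abstraction error."""
    return bellman_residual <= ratio * abstraction_error


distortion = [[0.0, 1.0], [1.0, 0.0]]  # two states, two abstract states
coarse = soft_assignments(distortion, [0.5, 0.5], beta=0.0)   # fully merged
sharp = soft_assignments(distortion, [0.5, 0.5], beta=50.0)   # nearly hard
```

Raising beta moves continuously from a single merged abstract state toward a hard one-to-one mapping, which is the kind of resolution knob the adaptive rule would turn.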
Results
The experiments show that the proposed adaptive abstraction method can achieve near-optimal performance in various tabular settings, even under substantial compression of state and action information. The results indicate that the adaptive rule effectively traces meaningful compression-distortion frontiers.
Implications
The findings suggest that reinforcement learning agents can benefit from dynamically adjusting the granularity of their abstractions, leading to more efficient learning processes. This approach could be applied in complex environments where managing state and action information is crucial for optimal performance.
Short paper: Models in the dark -- Rectification and erasure under GDPR in ML supply chains
Theory
- The paper identifies significant challenges in enforcing GDPR rights to rectification and erasure in ML systems.
- It introduces the concept of 'models in the dark,' highlighting issues of transparency and traceability in ML supply chains.
- The authors provide a taxonomy of challenges that impede the effective enforcement of these rights.
- The study emphasizes the need for interdisciplinary approaches to bridge legal and technical aspects of GDPR compliance in ML.
Summary
This paper addresses the challenges of implementing the General Data Protection Regulation (GDPR) rights to rectification and erasure within machine learning (ML) systems, particularly in the context of complex ML supply chains. The authors highlight that existing research has primarily focused on either the legal or technical aspects of these rights, neglecting the interplay between them in real-world ML applications. They introduce the concept of 'models in the dark,' referring to derived models created downstream in ML chains without adequate transparency or traceability. The paper provides a comprehensive survey of the legal, technical, and operational challenges faced in enforcing these rights, revealing that many GDPR requirements remain unmet in practice. By bridging legal and technical literature, the authors aim to establish effective procedures for enforcing data subject rights in ML contexts, ultimately contributing to the development of trustworthy AI.
Methodology
The authors conducted a literature review of existing academic works and guidance from data protection authorities, focusing on the legal and technical challenges of implementing GDPR rights in ML contexts. They analyzed these challenges within the framework of ML supply chains.
Results
The paper outlines a taxonomy of challenges related to the rights to rectification and erasure, demonstrating that many GDPR requirements cannot be technically fulfilled in current ML practices. It also emphasizes the lack of sufficient research addressing the complexities of ML supply chains and their implications for data subject rights.
Implications
The findings of this paper have significant implications for policymakers, researchers, and practitioners in the field of AI and data protection. By highlighting the gaps in current practices and proposing a framework for addressing these issues, the paper aims to foster the development of more transparent and accountable ML systems that comply with GDPR requirements.
Design a Reliable LLM-Integrated Interface for Mortality Forecasting
NLP
Large Language Models
Time Series
- Development of a user-friendly LLM-integrated interface for mortality forecasting.
- Implementation of a three-phase methodology to ensure accuracy, usability, and transparency.
- Demonstration of the effectiveness of LLMs in translating natural language into structured forecasting requests.
- Focus on maintaining statistical rigor while enhancing accessibility for non-technical users.
Summary
This paper addresses the complexities of mortality forecasting, which is crucial for actuarial and policy decision-making but often inaccessible to non-expert users. The author proposes a reliable interface integrated with a large language model (LLM) that enhances usability while preserving statistical power. The methodology consists of three phases: first, a baseline forecasting pipeline is established using the CoMoMo package to reproduce established mortality forecasting results; second, the pipeline is extended to generate multi-step forecasts through rolling-origin evaluation and mean squared error (MSE); and third, a prototype interface is developed that allows users to interact in plain language. This system demonstrates that LLMs can improve accessibility without sacrificing reproducibility, transparency, or actuarial validity in analytical workflows. The research highlights the importance of making mortality forecasting tools user-friendly, particularly for stakeholders in public organizations and private risk management, thereby supporting evidence-based decision-making.
Methodology
The methodology involves three phases: 1) implementing a baseline mortality forecasting pipeline using the CoMoMo package, 2) extending the pipeline for multi-step forecasts with rolling-origin evaluation and MSE, and 3) developing a prototype interface that utilizes a local LLM for natural language interaction.
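The rolling-origin evaluation in phase 2 can be illustrated with a minimal sketch. It does not use the CoMoMo package; the naive forecaster, the function names, and the toy series below are all hypothetical stand-ins chosen only to show how the forecast origin advances and how squared errors are pooled into an MSE.

```python
from statistics import mean


def naive_forecast(history, horizon):
    """Hypothetical stand-in model: repeat the last observed value."""
    return [history[-1]] * horizon


def rolling_origin_mse(series, min_train, horizon, forecast=naive_forecast):
    """Pool squared errors over every forecast origin and average them."""
    squared_errors = []
    # Each origin is the index at which the training window ends.
    for origin in range(min_train, len(series) - horizon + 1):
        train = series[:origin]
        actual = series[origin:origin + horizon]
        predicted = forecast(train, horizon)
        squared_errors.extend((a - p) ** 2 for a, p in zip(actual, predicted))
    return mean(squared_errors)


# Made-up log-mortality-like values, for illustration only.
toy_series = [-4.0, -4.1, -4.2, -4.3, -4.4, -4.5]
mse = rolling_origin_mse(toy_series, min_train=3, horizon=2)
```

Any forecaster with the same `(history, horizon)` signature could be passed in place of the naive one, so the evaluation loop stays independent of the model being assessed.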
Results
The system successfully reproduces established mortality forecasting results and demonstrates that LLMs can facilitate user interaction without compromising the accuracy or transparency of the forecasting process.
Implications
The proposed LLM-integrated interface can significantly enhance the accessibility of mortality forecasting tools for non-experts, thereby supporting better decision-making in actuarial practices and public policy formulation. It emphasizes the importance of transparency and reproducibility in high-stakes analytical workflows.
From Prediction to Self: Developmental Conditions for Agency in Minimal Neural Systems
Theory
Robotics
- Identifies four critical developmental conditions for agency in neural systems.
- Introduces 'agency gain' as a measurable metric for self-awareness in predictive systems.
- Demonstrates that self-aware predictors outperform self-blind predictors in various environments.
- Falsifies 12 hypotheses regarding the development of self-representation.
Summary
This paper investigates the transition from a predictive system to one that recognizes its own causal influence, using a minimal 192-dimensional Gated Recurrent Unit (GRU) model. Through 40 controlled experiments arranged in a developmental sequence, the author identifies four necessary conditions for a system to distinguish self-caused changes from world-caused changes: (1) persistent state that forms stable attractors, (2) a causal action loop linking the system’s output to its input, (3) proprioceptive feedback that makes implicit causal knowledge explicit, and (4) asynchronous awakening, where perceptual learning must consolidate before action learning begins. The study introduces 'agency gain' as a metric to measure the predictive advantage of a system that understands its own actions. The results show that the self-aware predictor consistently outperforms the self-blind predictor in both periodic and chaotic environments, and the agency gain metric remains valid even after the removal of auxiliary components. The paper also presents 12 falsified hypotheses that clarify the boundaries between predictive systems and those that possess self-representation, emphasizing that self-representation is only maintained when it is causally useful.
Methodology
The study employs a minimal GRU architecture and conducts 40 controlled experiments, systematically adding components to the model to observe changes in its ability to distinguish self-caused from world-caused changes. Each experiment is designed to isolate the effects of specific components, allowing for precise attribution of results.
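One plausible reading of 'agency gain' is the difference in prediction error between a self-blind and a self-aware predictor on the same targets. The summary does not give the paper's exact formula, so the definition, the function names, and the numbers in this sketch are assumptions.

```python
from statistics import mean


def mse(predictions, targets):
    """Mean squared error between two equally long sequences."""
    return mean((p - t) ** 2 for p, t in zip(predictions, targets))


def agency_gain(self_blind_preds, self_aware_preds, targets):
    """Assumed definition: error saved by a predictor that sees its own action.

    Positive values mean the self-aware predictor is more accurate.
    """
    return mse(self_blind_preds, targets) - mse(self_aware_preds, targets)


# Made-up predictions, for illustration only.
targets = [1.0, 2.0, 3.0, 4.0]
self_blind = [1.5, 2.5, 2.0, 4.5]
self_aware = [1.1, 2.1, 2.9, 4.0]
gain = agency_gain(self_blind, self_aware, targets)
```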
Results
The findings reveal that the four identified conditions must be satisfied in a strict order for a predictive system to develop self-representation. The self-aware predictor consistently shows lower prediction errors compared to the self-blind predictor, and the agency gain metric remains robust even after the removal of auxiliary components. The study also highlights that alternative approaches to self-representation fail for specific reasons.
Implications
This research provides a foundational understanding of how minimal neural systems can develop agency and self-representation, which could inform future studies in robotics, artificial intelligence, and cognitive science. It also offers insights into the design of systems that can learn to distinguish their own actions from external influences.
Q-GNN: Query-Conditioned Graph Neural Networks with Type Awareness for Knowledge Graph Completion
Graph Learning
- Q-GNN incorporates both query entity and query relation information for enhanced reasoning in KGC.
- The approach utilizes structural context and semantic type to guide message passing and scoring.
- Experiments show that Q-GNN outperforms traditional GNN methods in knowledge graph completion tasks.
- The integration of large language models for entity type inference is a novel aspect of the methodology.
Summary
The paper introduces Q-GNN, a novel approach to Knowledge Graph Completion (KGC) that enhances the reasoning process by incorporating both query entity and query relation information. Traditional Graph Neural Network (GNN) methods primarily utilize the query relation as the guiding signal, neglecting the valuable information contained in the query entity. Q-GNN addresses this limitation by integrating two perspectives: the structural context surrounding the query entity and its semantic type, inferred using a large language model. The structural context is encoded through a dedicated context encoder, which modulates message passing, while the semantic type is incorporated into attention mechanisms and scoring processes. This dual incorporation allows Q-GNN to effectively guide reasoning based on both the query entity and relation, leading to improved performance in predicting missing triplets in knowledge graphs. Experimental results on standard benchmarks validate the effectiveness of Q-GNN, demonstrating its superiority over existing GNN-based methods.
Methodology
Q-GNN employs a two-pronged approach to enhance KGC by integrating query entity information. It utilizes a context encoder to capture the structural context of the query entity and a large language model to infer the semantic type of the entity. This information is then used to modulate message passing and scoring in the GNN framework, allowing for a more nuanced understanding of the query's context.
Results
The experimental results demonstrate that Q-GNN significantly improves the accuracy of predicting missing triplets in knowledge graphs compared to existing GNN-based methods. The incorporation of both structural context and semantic type leads to better representation learning and reasoning capabilities.
Implications
Q-GNN has potential applications in various domains that rely on knowledge graphs, such as recommendation systems, question answering, and drug discovery. By improving KGC, it can enhance the performance of downstream applications that depend on complete and accurate knowledge representations.
Towards Unified and Data-Efficient Prognostics and Health Management with Tabular Foundation Models
Time Series
- Tabular Foundation Models can effectively handle fragmented and irregular industrial time series data.
- The proposed framework allows for in-context learning, reducing the need for extensive retraining.
- Tabular models outperform traditional sequence models and gradient-boosted trees in various PHM tasks.
- Performance is enhanced by constructing representative contexts during data subsampling.
Summary
This paper addresses the challenges in Prognostics and Health Management (PHM) by proposing a framework that utilizes Tabular Foundation Models for analyzing industrial time series data. Traditional PHM approaches often struggle with fragmented, partially observed, and poorly labeled data, which complicates the application of supervised learning techniques. The authors introduce a method to convert raw unit-level signals into tabular formats, allowing for effective in-context learning. This approach enables the models to perform well across various PHM tasks, including diagnostics and prognostics, while being data-efficient. The study compares the performance of tabular foundation models against sequence models, transformer baselines, and gradient-boosted trees under a unified evaluation protocol. The results demonstrate that tabular foundation models achieve superior average ranks in prognostic and diagnostic tasks, particularly in low-data scenarios. Furthermore, the research highlights the importance of constructing representative contexts during subsampling to maintain performance. Overall, the findings suggest that tabular foundation models can serve as a practical and versatile solution for diverse PHM challenges.
Methodology
The authors developed a framework that transforms raw time series data into tabular representations, enabling the use of Tabular Foundation Models. They employed in-context learning techniques to adapt model behavior at inference time without retraining. The performance of these models was evaluated against traditional sequence models, transformer baselines, and gradient-boosted trees using a common evaluation protocol across multiple PHM tasks.
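As a rough illustration of turning raw unit-level signals into rows, the sketch below slides a window over one signal and summarises each window as a flat feature record. The particular features, the window and stride parameters, and the function names are assumptions; the summary does not specify how the authors build their tables.

```python
from statistics import mean, pstdev


def window_to_row(unit_id, window):
    """Summarise one window of raw sensor readings as a flat feature row."""
    return {
        "unit": unit_id,
        "mean": mean(window),
        "std": pstdev(window),
        "min": min(window),
        "max": max(window),
        # Average change per step across the window.
        "slope": (window[-1] - window[0]) / (len(window) - 1),
    }


def series_to_table(unit_id, signal, window_size, stride):
    """Slide a window over the signal and emit one row per position."""
    rows = []
    for start in range(0, len(signal) - window_size + 1, stride):
        rows.append(window_to_row(unit_id, signal[start:start + window_size]))
    return rows


table = series_to_table("unit-1", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                        window_size=3, stride=3)
```

Rows built this way from labeled units could serve as the in-context examples, with rows from a new unit as the queries.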
Results
The study found that tabular foundation models achieved the best average ranks across prognostic and diagnostic tasks. They demonstrated competitive performance in low-data scenarios and maintained effectiveness by preserving temporal context in the tabular representation. The results indicate that the proposed models are robust and adaptable to varying industrial conditions.
Implications
The findings suggest that tabular foundation models can significantly enhance the scalability and efficiency of PHM systems in industrial settings. By reducing the reliance on extensive labeled datasets and allowing for rapid adaptation to changing conditions, this approach can facilitate broader adoption of advanced predictive maintenance strategies.
High-Dimensional Theory of LoRA Fine-Tuning in a Solvable Attention Model
Theory
Efficient ML
Large Language Models
- Introduction of a solvable high-dimensional model for LoRA fine-tuning in attention.
- Derivation of a sharp asymptotic characterization of test error and reconstruction overlap using finite-dimensional order parameters.
- Identification of an effective noise mechanism that quantifies the impact of pre-training quality on fine-tuning performance.
- Discovery of regimes with mismatches between test error and reconstruction overlap due to memorization of pre-training data.
Summary
This paper presents a high-dimensional statistical theory of low-rank adaptation (LoRA) in attention models, focusing on the relationship between pre-training and fine-tuning. The authors introduce a solvable framework where a single-head attention layer is pre-trained on a large dataset and then fine-tuned using a rank-one LoRA update on a smaller dataset. In the high-dimensional limit, both stages are characterized by a finite set of order parameters, allowing for explicit predictions regarding test errors and representation alignment. The analysis reveals that the influence of pre-training on LoRA can be encapsulated by an effective noise term, leading to guidelines for optimal pre-training strategies. Additionally, the authors identify scenarios where there is a discrepancy between test error and representation quality during fine-tuning, and propose applications of their theory to active fine-tuning.
Methodology
The authors develop a two-stage model where a single-head attention layer is pre-trained on a large dataset followed by a rank-one LoRA fine-tuning on a smaller dataset. They utilize high-dimensional statistical analysis to derive asymptotic characterizations and establish relationships between pre-training and fine-tuning performance through order parameters.
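A rank-one LoRA update adds an outer product of two trainable vectors to a frozen weight matrix. The sketch below shows only that arithmetic; which matrix of the attention layer receives the update, the `scale` parameter, and the example values are assumptions not taken from the paper.

```python
def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def lora_rank_one(W, u, v, scale=1.0):
    """Return W + scale * u v^T without modifying the frozen W."""
    delta = outer(u, v)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Frozen pre-trained weights (identity here, purely for readability).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
u = [1.0, 2.0, 0.0]   # trainable
v = [0.5, 0.0, 1.0]   # trainable
W_adapted = lora_rank_one(W, u, v)
trainable = len(u) + len(v)   # 6 values instead of the 9 entries of W
```

For a d-by-d matrix the update trains 2d values rather than d squared, which is the source of LoRA's parameter efficiency.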
Results
The main results include a detailed asymptotic characterization of the LoRA-fine-tuned estimator, explicit predictions for pre-training and fine-tuning test errors, and the identification of an effective noise term that influences fine-tuning performance. The study also highlights conditions under which fine-tuning can lead to discrepancies between test error and representation quality.
Implications
The findings suggest that understanding the interplay between pre-training and fine-tuning can enhance the efficiency of model adaptation in practical applications. The proposed guidelines for optimal pre-training and active fine-tuning can be beneficial for improving performance in various downstream tasks.
Less is MoE: Trimming Experts in Domain-Specialist Language Models
NLP
Large Language Models
Efficient ML
- Fisher importance is a more effective metric for identifying critical dimensions in MoE models compared to existing heuristics.
- Fisher-MoE allows for fine-grained compression at the intermediate dimension level, preserving model performance while reducing size.
- The proposed method significantly improves inference throughput and reduces memory requirements.
- The study highlights the importance of evaluating MoE compression on challenging general-purpose benchmarks rather than solely on commonsense reasoning tasks.
Summary
This paper addresses the challenges associated with the deployment of Mixture-of-Experts (MoE) models, which, while powerful, have a large parameter footprint. Previous compression methods have failed when evaluated on general-purpose benchmarks, primarily due to their coarse granularity of compression at the expert level. The authors identify that important capabilities are distributed across experts but concentrated in a small subset of intermediate dimensions. They propose a novel approach, Fisher-MoE, which utilizes Fisher importance to rank and remove less critical intermediate dimensions rather than entire experts. This fine-grained compression method allows for a significant reduction in model size while maintaining performance. The authors demonstrate that at a 50% compression ratio, Fisher-MoE reduces weight memory by approximately 45% and improves inference throughput by 21%, thus providing a more efficient way to deploy MoE models without sacrificing their capabilities.
Methodology
The authors employ Fisher importance as a metric to evaluate the significance of intermediate dimensions within MoE models. They conduct controlled experiments comparing existing compression methods and demonstrate the effectiveness of Fisher importance in identifying critical dimensions. The Fisher-MoE approach is then implemented to perform fine-grained compression, allowing the removal of less important dimensions while retaining the overall model capability.
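The ranking step can be sketched with a common diagonal empirical Fisher estimate: average the squared gradients over calibration samples, sum them over the weights attached to each intermediate dimension, and keep the highest-scoring dimensions. Whether Fisher-MoE uses exactly this estimator is an assumption, and the gradient values and names below are made up.

```python
def fisher_scores(per_sample_grads):
    """Score each intermediate dimension by its summed mean squared gradient.

    per_sample_grads[s][j] holds the gradients of all weights attached to
    intermediate dimension j for calibration sample s.
    """
    n_samples = len(per_sample_grads)
    n_dims = len(per_sample_grads[0])
    scores = [0.0] * n_dims
    for sample in per_sample_grads:
        for j, grads_j in enumerate(sample):
            scores[j] += sum(g * g for g in grads_j) / n_samples
    return scores


def dims_to_keep(scores, keep_ratio):
    """Indices of the highest-scoring dimensions, in ascending order."""
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Two calibration samples, four intermediate dimensions, two weights each.
grads = [
    [[0.1, 0.0], [1.0, -1.0], [0.0, 0.2], [0.5, 0.5]],
    [[0.0, 0.1], [0.8, -1.2], [0.1, 0.0], [0.4, 0.6]],
]
scores = fisher_scores(grads)
kept = dims_to_keep(scores, keep_ratio=0.5)
```

Pruning then amounts to slicing the kept indices out of each expert's projection matrices, so every expert survives in a narrower form instead of being dropped whole.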
Results
Fisher-MoE achieves a 50% compression ratio while preserving model performance on various benchmarks. The method reduces weight memory by approximately 45% and enhances inference throughput by 21%. The results indicate that the critical capabilities of MoE models are concentrated in a small subset of intermediate dimensions, which can be effectively targeted for compression.
Implications
The findings suggest that employing fine-grained compression techniques can lead to more efficient deployment of large language models, making them more accessible for real-world applications. This approach could be particularly beneficial in resource-constrained environments where memory and computational efficiency are critical.
Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation
NLP
Large Language Models
- LLMs can detect fabricated statistics in isolation but fail to do so during multi-source synthesis.
- Source influence is governed by methodology presentation rather than numeric validity.
- The study identifies a methodology-register gate that affects how models evaluate evidence.
- Prompting-based mitigations do not effectively enhance models' ability to discern valid from fabricated statistics.
Summary
This paper investigates the ability of large language models (LLMs) to evaluate the credibility of sources when synthesizing information from multiple inputs. The authors demonstrate that while LLMs can accurately detect fabricated statistics in isolation, they fail to apply this capability during multi-source synthesis, leading to similar numeric estimates regardless of the validity of the statistics presented. The study identifies a 'methodology-register gate' that influences source evaluation based on the presentation style of the analytical text rather than the actual numeric validity of the claims. This behavioral dissociation was replicated across five different models from three families and in three professional domains. Mechanistic analyses, including causal tracing and linear probes, reveal that while methodology representation is encoded and used across domains, numeric-validity signals are suppressed during synthesis. The authors also explore prompting-based mitigations, which result in blanket skepticism rather than selective discernment. The findings highlight a phenomenon termed 'epistemic alignment,' where models prioritize the appearance of analytical credibility over the substance of evidence quality.
Methodology
The authors conducted factorial behavioral experiments involving five models from three families across three professional domains, analyzing over one million trials. They employed mechanistic analyses such as causal tracing, linear probes, and component-level attribution to investigate how models process and synthesize information from multiple sources.
Results
The results indicate that LLMs treat fabricated statistics as valid during multi-source synthesis, despite being able to flag them in isolation. The methodology-sensitive representation was confirmed to transfer across domains, while numeric-validity signals collapsed to chance during synthesis. The study also found that prompting strategies did not improve the models' ability to discern valid statistics from fabrications.
Implications
The findings suggest that LLMs may inadvertently propagate misinformation by prioritizing the presentation of evidence over its validity. This has significant implications for the deployment of LLMs in critical decision-making contexts, such as medicine and finance, where accurate source evaluation is crucial. The concept of epistemic alignment calls for further research into improving models' ability to evaluate evidence quality.
Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction
Theory
- Introduces the concept of 'two training clocks' to separate fitting from representation simplification.
- Demonstrates that classification loss decreases exponentially while representation simplification occurs on a polynomial time scale.
- Extends findings to ReLU networks, showing a two-stage learning mechanism.
- Provides a rigorous mathematical framework using deep linear networks.
Summary
This paper investigates the phenomenon of 'grokking' in deep learning, where a model achieves low training error but only later discovers a simpler underlying rule that generalizes well to new data. The authors introduce the concept of 'two training clocks' to formalize the separation between two distinct processes during training: fitting the training data and simplifying the learned representation. The first clock, the classifier clock, measures the time required for the model to effectively fit the training labels, while the second clock, the representation clock, tracks the time needed for the internal representation to evolve into a simpler form. The study employs deep linear networks to analyze these processes mathematically, showing that the classification loss can decrease exponentially, so fitting completes on a logarithmic time scale, while representation simplification occurs on a polynomial time scale due to regularization effects. The authors extend their findings to ReLU networks, demonstrating that once activation patterns stabilize, the network behaves like a linear model, allowing for a two-stage learning mechanism where the classifier fits first and the representation continues to simplify later. Experimental results on modular arithmetic tasks support the theoretical claims, highlighting the importance of continued training beyond initial fitting.
Methodology
The authors analyze deep linear networks trained with cross-entropy loss and layerwise weight decay to formalize the two training clocks. They derive conditions for loss reduction and representation simplification, and extend the analysis to ReLU networks through conditional reductions that account for empirical behavior.
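The reduction from ReLU to linear networks rests on a simple fact: for a fixed activation pattern, a ReLU layer acts as a 0/1 mask, so the whole network collapses to a single matrix. The two-layer sketch below checks this numerically; the weights are arbitrary illustrative values, not taken from the paper.

```python
def matvec(M, x):
    """Matrix-vector product."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matmul(A, B):
    """Matrix-matrix product."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu_net(W1, W2, x):
    """Two-layer ReLU network: W2 * relu(W1 * x)."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)


def effective_linear_map(W1, W2, x):
    """Collapse the network to one matrix using the activation pattern at x."""
    pattern = [1.0 if h > 0 else 0.0 for h in matvec(W1, x)]
    masked_W1 = [[p * w for w in row] for p, row in zip(pattern, W1)]
    return matmul(W2, masked_W1)


W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]]
W2 = [[1.0, 2.0, 3.0]]
x = [2.0, 1.0]
A = effective_linear_map(W1, W2, x)   # valid wherever the pattern is unchanged
```

For every input that shares the activation pattern of `x`, `matvec(A, x)` equals `relu_net(W1, W2, x)`, which is why the linear-network analysis applies once patterns stop changing.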
Results
The study finds that the effective classifier can achieve low training loss rapidly, while the representation continues to evolve at a slower pace. Specifically, the classification loss decreases exponentially, so fitting completes on a logarithmic time scale, while representation simplification is governed by a polynomial time scale. Experimental results on modular addition tasks corroborate the theoretical framework, demonstrating the separation of fitting and representation processes.
Implications
The findings suggest that continued training can be beneficial not only for reducing error but also for reshaping the model's learned representations, which may lead to improved generalization on unseen data. This insight could influence training strategies in deep learning, particularly in tasks requiring complex rule learning.
Plug-and-Play Guidance for Discrete Diffusion Models via Gradient-Informed Logit Correction
Generative Models
- Introduction of GILC as a training-free guidance framework for discrete diffusion models.
- Utilization of a Jacobian-free mechanism for stable logit correction, addressing gradient instability.
- Formal connection to policy gradients, enabling handling of non-differentiable objectives.
- Demonstration of state-of-the-art performance in various scientific applications without additional training.
Summary
This paper introduces Gradient-Informed Logit Correction (GILC), a novel framework designed to enhance controllable generation in discrete diffusion models without the need for retraining. Traditional methods for guiding discrete diffusion often face challenges due to high computational costs and gradient instability in high-dimensional spaces. GILC addresses these issues by utilizing a pretrained denoising network as a variational proxy to estimate guidance signals. The framework employs a Jacobian-free mechanism to directly correct clean prediction logits, ensuring stable and effective guidance. GILC is versatile, accommodating both differentiable and non-differentiable reward functions, and operates entirely on existing objective functions without requiring fine-tuning. Extensive experiments across various scientific domains, including DNA and protein sequence generation, demonstrate that GILC achieves state-of-the-art performance, often surpassing traditional fine-tuning approaches while maintaining computational efficiency. The paper highlights the potential of GILC to significantly advance the field of discrete diffusion guidance.
Methodology
The GILC framework employs a variational approach where a pretrained denoising network serves as a proxy for value function estimation. It combines the Gumbel-Softmax trick with a Straight-Through estimator to maintain gradient flow in discrete spaces. The method introduces a Jacobian-free update to correct clean prediction logits, facilitating effective guidance in the generation process.
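The Gumbel-Softmax trick named above can be shown in isolation. The sketch covers only the sampling step and the forward value of a Straight-Through estimator; it is not GILC itself, and the logits, temperature, and function names are illustrative assumptions.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed (soft) sample over the categories."""
    noisy = []
    for logit in logits:
        # Clamp away from 0 so both logarithms stay finite.
        u = max(rng.random(), 1e-12)
        g = -math.log(-math.log(u))          # Gumbel(0, 1) noise
        noisy.append((logit + g) / temperature)
    top = max(noisy)                          # subtract max for stability
    exps = [math.exp(z - top) for z in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def straight_through(soft):
    """Forward value of the Straight-Through estimator: a hard one-hot.

    In an autodiff framework the backward pass would reuse the gradient of
    `soft`; plain Python has no gradients, so only the forward value is shown.
    """
    winner = soft.index(max(soft))
    return [1.0 if i == winner else 0.0 for i in range(len(soft))]


rng = random.Random(0)
soft = gumbel_softmax([2.0, 0.5, -1.0, 0.0], temperature=0.5, rng=rng)
hard = straight_through(soft)
```

Lower temperatures push `soft` toward a one-hot vector, trading smoother gradients for samples that look more like real discrete tokens.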
Results
GILC outperforms existing training-free discrete diffusion guidance methods in both sample quality and computational efficiency. It achieves competitive results compared to fine-tuning-based approaches, establishing state-of-the-art performance for controlled discrete generation tasks across multiple scientific domains.
Implications
The GILC framework has significant implications for various applications requiring controllable generation, such as protein engineering and molecular design. Its training-free nature allows for broader accessibility and adaptability in scientific research and industrial applications, potentially accelerating advancements in these fields.
Flash-WAM: Modality-Aware Distillation for World Action Models
Generative Models
Robotics
Multimodal
- Flash-WAM introduces a modality-aware step-distillation framework for World Action Models.
- The framework adapts consistency functions to match the noise characteristics of video and action modalities.
- Flash-WAM achieves a 23× speedup in inference time, enabling real-time control.
- It preserves high task success rates in simulation benchmarks compared to naive distillation methods.
Summary
The paper introduces Flash-WAM, a novel modality-aware step-distillation framework designed to enhance the efficiency of World Action Models (WAMs) in generating future video and robot actions. Traditional WAMs, while effective, require numerous denoising steps that hinder real-time control. Flash-WAM addresses this by employing a consistency distillation approach tailored to the noise characteristics of each modality: video and action. The framework applies linear-gradient-scaling to the action stream, which operates in a low-noise regime, and a variance-preserving approach to the video stream, which functions in a high-noise regime. This method substantially compresses inference time, reducing per-chunk latency from 8.1 seconds to 348 milliseconds on the RoboTwin 2.0 benchmark and thereby enabling real-time performance. Additionally, Flash-WAM maintains high task success rates, outperforming naive consistency distillation methods that see a drastic drop in performance. The findings suggest that Flash-WAM can effectively bridge the gap between the computational demands of WAMs and the need for real-time robotic control.
Methodology
Flash-WAM employs a consistency distillation framework that differentiates between the video and action modalities based on their respective noise regimes. It uses a linear-gradient-scaling parametrization for the low-noise action stream and a variance-preserving parametrization for the high-noise video stream, allowing for effective training and inference with fewer denoising steps.
Results
The implementation of Flash-WAM on the LingBot-VA model resulted in a reduction of per-chunk latency from 8.1 seconds to 348 milliseconds, achieving a 23× speedup. The model maintained a task success rate of 85.5% on RoboTwin 2.0 and 95.7% on LIBERO, while naive consistency distillation methods dropped to 24% success at the same step budget.
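The reported speedup follows directly from the two latency figures, as the short check below shows. The chunks-per-second figure assumes one chunk per inference call, which the summary does not state.

```python
baseline_latency_s = 8.1      # per-chunk latency before distillation
distilled_latency_s = 0.348   # 348 milliseconds after distillation

# Ratio of the two latencies; rounds to the reported 23x.
speedup = baseline_latency_s / distilled_latency_s

# Assumption: one chunk is produced per inference call.
chunks_per_second = 1.0 / distilled_latency_s
```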
Implications
Flash-WAM's advancements in real-time inference for WAMs could significantly enhance robotic control applications, enabling more responsive and efficient robotic systems in dynamic environments. This approach may also inform future research on modality-aware techniques in other areas of machine learning.
Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation
NLP
Large Language Models
Efficient ML
- Introduces Compress-Distill, a method for compressing reasoning traces before knowledge distillation.
- Compressed traces reduce training tokens to 12-30% of raw traces and speed up training by 2.0-7.6 times.
- While compressed traces improve efficiency, they do not surpass raw traces in downstream accuracy.
- The study includes a detailed analysis of the trade-offs between accuracy and efficiency in knowledge distillation.
Summary
This paper addresses the challenge of efficiently distilling knowledge from reasoning models that produce lengthy chain-of-thought (CoT) traces. The authors propose a novel approach called Compress-Distill, which involves post-hoc compression of reasoning traces generated by large teacher models before they are used to train smaller student models. The study evaluates the effectiveness of this compression across two teacher models (Qwen3.5-397B-A17B and gpt-oss-120B) and various student models. The results indicate that while compressed traces significantly reduce training costs and inference lengths, they do not outperform raw traces in terms of downstream accuracy. The authors provide a comprehensive analysis of the trade-offs between accuracy and efficiency, demonstrating that compressed traces can achieve up to 96% of the accuracy of raw traces while improving per-token efficiency by up to 18 times. This work contributes to the understanding of reasoning-trace compression as a practical trade-off rather than a straightforward enhancement, offering insights into the balance between model performance and computational efficiency.
Methodology
The methodology consists of a three-stage pipeline: (1) Trace Generation, where reasoning traces are generated by teacher models and verified for correctness; (2) Trace Compression, where the correct traces are compressed using instruction-tuned models; and (3) Student Training, where student models are fine-tuned on raw, compressed, or answer-only targets. The evaluation includes a comprehensive grid of experiments across multiple teacher and student models.
Results
The experiments reveal that compressed traces lead to a substantial reduction in training costs and inference lengths, cutting training tokens to 12-30% of the raw-trace count and speeding up training by 2.0-7.6 times. However, raw traces consistently yield higher downstream accuracy across all evaluated scales. The analysis shows that while compressed traces maintain a significant portion of the accuracy of raw traces, they do not exceed it, particularly at the 0.8B scale under LoRA.
Implications
The findings suggest that while trace compression can enhance efficiency in knowledge distillation, it is essential to consider the trade-offs involved. This work has implications for the design of more efficient training pipelines for reasoning models, particularly in applications where computational resources are limited.
Learned Subspace Compression for Communication-Efficient Pipeline Parallelism
Large Language Models
Efficient ML
Optimization
- Introduces MAPL, a method for learnable orthogonal projections in pipeline parallelism.
- Maintains orthogonality during training using Stiefel manifold constraints.
- Allows each pipeline stage to adapt its own compression subspace, enhancing performance.
- Integrates factorized anchor embeddings for efficient activation reconstruction.
Summary
The paper addresses the challenges of communication bottlenecks in pipeline parallelism when training large language models on low-bandwidth networks. It critiques existing methods that use fixed orthogonal projections for compressing activations, which lead to performance degradation and require complex adaptations. The authors propose a novel approach called Manifold Aware Projection Learning (MAPL), which allows each pipeline stage to learn its own low-rank projection while maintaining orthogonality through explicit Stiefel manifold constraints. This method enables the discovery of task-optimal compression subspaces and introduces per-stage factorized anchor embeddings for effective activation reconstruction. The authors also integrate residual vector quantization with a streaming codebook synchronization protocol to further reduce communication overhead. The results demonstrate that MAPL significantly improves the trade-off between performance and compression, achieving high compression rates with minimal performance loss across various LLaMA models ranging from 150M to 1B parameters.
Methodology
The authors developed MAPL, which employs manifold-constrained steepest descent updates to keep projection matrices on the Stiefel manifold during training. This approach allows for the learning of low-rank projections specific to each pipeline stage, avoiding the pitfalls of fixed global subspaces. The method also includes factorized anchor embeddings and a residual vector quantization strategy to optimize communication efficiency.
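Keeping a projection on the Stiefel manifold means its columns stay orthonormal after every update. The sketch below uses a Gram-Schmidt re-orthonormalization as a simple stand-in retraction; it is not the manifold-constrained steepest descent the authors describe, and it omits anchor embeddings and vector quantization. All values and names are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(columns):
    """Gram-Schmidt: map linearly independent columns to orthonormal ones."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            coeff = dot(q, v)
            v = [vi - coeff * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        basis.append([vi / norm for vi in v])
    return basis


def compress(basis, x):
    """Project an activation onto the subspace (what is sent over the wire)."""
    return [dot(q, x) for q in basis]


def reconstruct(basis, code):
    """Map the low-dimensional code back to the original dimension."""
    out = [0.0] * len(basis[0])
    for c, q in zip(code, basis):
        out = [o + c * qi for o, qi in zip(out, q)]
    return out


# Projection columns after a hypothetical gradient step (no longer orthonormal).
stepped = [[1.0, 0.1, 0.0, 0.0], [0.2, 1.0, 0.1, 0.0]]
P = orthonormalize(stepped)
activation = [3.0, -1.0, 0.5, 2.0]
code = compress(P, activation)      # 2 numbers transmitted instead of 4
approx = reconstruct(P, code)
```

Because the columns are orthonormal, compressing `approx` again returns the same `code`, so the receiving stage loses only the component outside the learned subspace.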
Results
MAPL was tested on LLaMA models with parameters ranging from 150M to 1B. The results showed that MAPL achieved high compression rates (up to 16×) with negligible performance degradation, maintaining validation cross-entropy close to uncompressed training. In contrast, existing methods like Subspace Networks exhibited significant performance drops as compression increased.
Implications
The proposed MAPL method has the potential to enhance the training efficiency of large-scale language models in low-bandwidth environments, making it applicable in real-world scenarios where resource constraints are prevalent. This could facilitate more accessible AI model training across diverse hardware setups.
Revisiting Prototype Rehearsal for Exemplar-Free Continual Learning: Manifold-Aware Boundary Sampling with Adaptive Class-Balanced Loss
Computer Vision
- Introduces a manifold-aware approach to prototype rehearsal for EFCIL.
- Proposes Constrained Expansive Over-Sampling (CEOS) to generate boundary-aware synthetic samples.
- Develops an Adaptive Class-Balanced (ACB) loss to address class imbalance during training.
- Demonstrates that the proposed methods outperform traditional prototype rehearsal and compete with drift-compensation techniques.
Summary
This paper addresses the challenges of exemplar-free class-incremental learning (EFCIL), particularly focusing on the limitations of traditional prototype rehearsal methods that lead to catastrophic forgetting. The authors propose a novel approach that enhances prototype rehearsal by incorporating manifold-aware boundary sampling and an adaptive class-balanced loss. They argue that existing methods fail to leverage the geometric relationships between classes and do not adequately address class imbalance during training. The proposed Constrained Expansive Over-Sampling (CEOS) method generates synthetic samples by interpolating old-class prototypes towards their nearest enemy features from new classes, ensuring that the samples remain on the correct side of the decision boundary. Additionally, the Adaptive Class-Balanced (ACB) loss function adjusts the influence of old-class prototypes over time, amplifying their gradients when they are most informative and gradually balancing their contribution as new class data accumulates. The results demonstrate that this redesigned prototype rehearsal framework significantly closes the performance gap with state-of-the-art drift-compensation methods, achieving superior results across multiple EFCIL benchmarks.
Methodology
The authors developed a two-pronged approach: (1) Constrained Expansive Over-Sampling (CEOS) that interpolates old-class prototypes towards nearest enemy features to create synthetic samples that respect class boundaries, and (2) Adaptive Class-Balanced (ACB) loss that dynamically adjusts the contribution of old-class prototypes based on their relevance during training, thereby addressing class imbalance.
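A minimal sketch of the boundary-sampling idea follows. The summary does not state the exact constraint, so this sketch assumes one: the interpolation ratio is capped below 0.5, which keeps each synthetic sample closer to the old-class prototype than to its nearest enemy feature. The function names, the `max_ratio` parameter, and the toy features are hypothetical.

```python
import math
import random

def nearest_enemy(prototype, enemy_features):
    """Return the new-class feature closest to the old-class prototype."""
    return min(enemy_features, key=lambda f: math.dist(prototype, f))

def ceos_sample(prototype, enemy_features, max_ratio=0.4, rng=random):
    """Interpolate the prototype toward its nearest enemy feature.

    Assumption: capping the ratio below 0.5 is what keeps the sample on
    the prototype's side of the midpoint between the two points.
    """
    enemy = nearest_enemy(prototype, enemy_features)
    lam = rng.uniform(0.0, max_ratio)
    return [p + lam * (e - p) for p, e in zip(prototype, enemy)]

rng = random.Random(0)
prototype = [0.0, 0.0]                 # old-class prototype (toy data)
enemies = [[4.0, 0.0], [0.0, 10.0]]    # new-class features (toy data)
sample = ceos_sample(prototype, enemies, rng=rng)
print(sample)
```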
Results
The proposed methods achieved state-of-the-art performance on multiple EFCIL benchmarks, effectively closing the performance gap with recent drift-compensation methods and demonstrating the viability of enhanced prototype rehearsal strategies.
Implications
This research has significant implications for the development of continual learning systems that must operate under strict memory constraints and privacy concerns, enabling more effective learning from new data while retaining knowledge of previously learned classes.
The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models
NLP
Large Language Models
Theory
- Introduces a stereological framework for understanding benchmark coverage in LLMs.
- Identifies a structural blind spot in LLM evaluations that dominates statistical noise.
- Develops a submodular greedy algorithm for optimal benchmark selection, achieving high coverage with fewer benchmarks.
- Empirical analysis shows effective dimensionality of benchmark suites and its implications for model evaluation.
Summary
This paper presents a stereological theory to analyze the benchmark coverage of large language models (LLMs). It establishes that the visible Hausdorff distance between two convex capability profiles that yield the same scores is constrained by a mathematical relationship involving the effective dimensionality (d_eff). The author empirically evaluates three independent leaderboards and finds that their d_eff values range from 2.86 to 4.80, indicating a structural blind spot in LLM evaluations that exceeds the observed score gaps by two orders of magnitude. The paper also introduces a submodular greedy algorithm that identifies a stable core of benchmarks necessary for effective model evaluation, demonstrating that 7 of 12 benchmarks suffice to achieve 90% coverage. Additionally, the author resolves Gardner's Problem 1.5, providing a minimax rate for C^2 support functions in general dimensions. The findings highlight the limitations of current benchmark evaluations and propose methods to improve model assessment through better benchmark selection.
Methodology
The paper employs theoretical analysis and empirical validation to explore benchmark coverage. It uses mathematical theorems to establish bounds on the indistinguishability of model capabilities based on benchmark scores. A submodular greedy algorithm is applied to optimize benchmark selection, and simulations are conducted to assess the stability and transferability of benchmark subsets across different temporal contexts.
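The greedy selection step can be sketched with a generic submodular objective. The paper's coverage function is stereological and is not given in the summary, so this sketch substitutes plain set coverage: each benchmark covers a set of capability items, and the algorithm repeatedly adds the benchmark with the largest marginal gain until a target fraction is covered. The benchmark names and capability sets are hypothetical.

```python
def greedy_select(coverage, target=0.9):
    """Greedy maximization of set coverage (a standard submodular objective).

    coverage maps a benchmark name to the set of capability items it probes.
    Returns the chosen benchmarks in order and the fraction covered.
    """
    universe = set().union(*coverage.values())
    covered, chosen = set(), []
    while len(covered) < target * len(universe):
        # Pick the benchmark with the largest marginal gain; sorted() makes ties deterministic.
        best = max(sorted(coverage), key=lambda b: len(coverage[b] - covered))
        if not coverage[best] - covered:
            break  # nothing left to gain
        chosen.append(best)
        covered |= coverage[best]
    return chosen, len(covered) / len(universe)

# Hypothetical benchmarks and the capability items each one touches.
coverage = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {5, 6, 7, 8, 9},
    "D": {1, 9},
    "E": {10},
}
chosen, fraction = greedy_select(coverage)
print(chosen, fraction)
```

In this toy instance two of the five benchmarks already reach the 90% target, which mirrors the qualitative finding that a small core can carry most of the coverage.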
Results
The study finds that the effective dimensionality of benchmark suites ranges from 2.86 to 4.80, with a structural blind spot significantly larger than observed score gaps. The submodular greedy algorithm identifies a core set of 7 out of 12 benchmarks that achieves 90% coverage, and the model's predictions about irreplaceable evaluations are supported by a statistically significant negative correlation (ρ = -0.69, p = 0.013).
Implications
The findings suggest that current LLM evaluations may be misleading due to the geometric blind spot, calling for a reevaluation of benchmark selection practices. The proposed methods for optimizing benchmark coverage could lead to more accurate assessments of model capabilities, ultimately improving the development and deployment of LLMs.
A Machine Learning-Based Framework for Discovering Huntington's Disease Stages: Integrating Graph Representation Learning and clustering to Uncover Progression Dynamics in Longitudinal Enroll-HD Dataset
Graph Learning
Time Series
Multimodal
- Developed an unsupervised machine learning framework for identifying Huntington's disease stages.
- Utilized graph representation learning to capture temporal relationships in longitudinal clinical data.
- Achieved robust clustering performance with significant distinctions between identified disease stages.
- Demonstrated the potential for a more objective, data-driven approach to HD staging.
Summary
This paper presents an innovative unsupervised machine learning framework designed to identify stages of Huntington's disease (HD) by utilizing graph representation learning and clustering techniques. Traditional clinical staging methods often rely on predefined thresholds and expert assessments, which can obscure intra-stage variability and introduce inter-rater inconsistencies. The proposed framework addresses these limitations by encoding longitudinal clinical data into compact latent representations that capture the temporal relationships among patients. The authors applied K-means++ clustering to these representations, iteratively assessing cluster robustness through stability analysis. The framework was tested on a dataset from the Enroll-HD cohort, involving 302 individuals and 1,477 visits, revealing two primary disease stages and four meaningful stages with statistically significant distinctions. The clustering performance metrics, including a Silhouette score of 0.67 and a Davies–Bouldin index of 0.56, indicate the framework's effectiveness in providing a more objective and data-driven understanding of HD progression compared to existing clinical methods. This approach holds promise for integrating multimodal biomarkers to enhance insights into HD dynamics.
Methodology
The methodology involves encoding longitudinal clinical data into latent representations using graph-based representation learning. K-means++ clustering is then applied to these representations, with an iterative process for determining the optimal number of clusters and conducting stability analysis to assess robustness.
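The clustering step can be sketched in a few lines. This is a generic K-means++ implementation rather than the authors' pipeline: seeding picks each new center with probability proportional to its squared distance from the nearest existing center, and Lloyd iterations then refine the centers. The two-dimensional points stand in for the learned latent representations and are purely illustrative.

```python
import math
import random

def kmeanspp_init(points, k, rng):
    """K-means++ seeding: favor points far from the centers chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign(points, centers), centers

# Toy stand-in for latent embeddings: two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = kmeans(points, k=2)
print(labels)
```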
Results
The framework identified two primary disease stages and four meaningful stages with clear clinical measurement boundaries, achieving a Silhouette score of 0.67, a Davies–Bouldin index of 0.56, and a Calinski–Harabasz score of 453, indicating strong clustering performance and minimal overlap between stages.
Implications
This framework provides a more objective and data-driven foundation for staging Huntington's disease, potentially improving patient grouping, personalized care, and treatment discovery. It also opens avenues for integrating objective multimodal biomarkers to further enhance understanding of disease progression.
Your GFlowNet Secretly Learns an Optimal Transport Plan
Optimization
Generative Models
Graph Learning
- Establishes a theoretical link between GFlowNets and optimal transport problems.
- Demonstrates that minimum-flow GFlowNets can be formulated as linear programming problems.
- Shows that GFlowNets can effectively approximate solutions to graph optimal transport problems.
- Confirms the framework's ability to recover exact OT solutions and its scalability for larger problems.
Summary
This paper establishes a theoretical connection between Generative Flow Networks (GFlowNets) and optimal transport (OT) problems. The authors demonstrate that by fixing the initial flow distribution in a minimum-flow GFlowNet, the objective can be transformed into a Kantorovich OT problem with graph-induced shortest path costs. Consequently, the learned GFlowNet policy encodes an optimal transport plan from a source distribution to a target distribution. The authors also show that sampling trajectories from the minimum-flow GFlowNet recovers the corresponding optimal coupling. This formulation allows for the application of GFlowNet learning techniques to OT problems on large graphs, utilizing edge flows and neural parameterization. Experimental results validate the framework's effectiveness, confirming its ability to approximate exact OT solutions and demonstrating scalability in complex combinatorial spaces.
Methodology
The authors reformulate the minimum-flow GFlowNet learning problem as a linear programming problem. They fix the initial edge-flow distribution to transform the objective into a Kantorovich optimal transport problem. The methodology involves theoretical analysis and empirical validation through experiments comparing GFlowNet outputs with exact OT solutions.
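For reference, the standard discrete Kantorovich problem that the reformulation targets can be written as follows. The notation is an assumption made here, not taken from the paper: $\mu$ is the source distribution, $\nu$ the target distribution, $\pi$ the transport plan, and $d_G(x, y)$ the shortest-path cost between nodes $x$ and $y$ in the graph.

```latex
\begin{aligned}
\min_{\pi \ge 0} \quad & \sum_{x, y} d_G(x, y)\, \pi(x, y) \\
\text{s.t.} \quad & \sum_{y} \pi(x, y) = \mu(x) \quad \text{for all } x, \\
                  & \sum_{x} \pi(x, y) = \nu(y) \quad \text{for all } y.
\end{aligned}
```

Both the objective and the marginal constraints are linear in $\pi$, which is why the problem is a linear program.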
Results
The results indicate that the GFlowNet framework can successfully learn optimal transport plans, achieving transport costs equivalent to those obtained from exact OT solvers. The experiments demonstrate that GFlowNets can scale effectively, approximating solutions in cases where exact methods become intractable.
Implications
The findings suggest that GFlowNets can be a powerful tool for solving optimal transport problems in various applications, including machine learning tasks that involve structured data. This could lead to advancements in fields such as combinatorial optimization, generative modeling, and more efficient sampling techniques.
Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability
Large Language Models
Interpretability
Efficient ML
- Identifies a geometric mismatch between multi-dimensional feature structures and standard SAE assumptions.
- Introduces Subspace-Aware Sparse Autoencoders (SASA) to address feature splitting and redundancy.
- Proves that SASA can uniquely represent entire feature slices with a single group, improving interpretability.
- Demonstrates empirical advantages of SASA over traditional SAEs on large language models.
Summary
This paper addresses the limitations of standard Sparse Autoencoders (SAEs) in mechanistic interpretability of large language models (LLMs) by introducing Subspace-Aware Sparse Autoencoders (SASA). The authors argue that traditional SAEs assume one-dimensional latent features, which leads to feature splitting and redundancy. They demonstrate that this assumption is incompatible with the multi-dimensional nature of features in LLMs, causing inefficiencies in both training and interpretation. SASA replaces single-vector decoders with learned subspaces, enforces block sparsity, and adapts the effective rank of groups using a nuclear-norm regularizer. This approach allows for the representation of entire feature slices with a single group, significantly improving sample complexity and interpretability. Empirical results on models like GPT-2 and Mistral-7B show that SASA reduces feature splitting, enhances monosemanticity, and performs comparably or better than standard SAEs while requiring fewer training tokens.
Methodology
The authors propose SASA, which utilizes learned decoder subspaces instead of single-vector decoders. They implement block sparsity through Top-s group gating and optimize the effective rank of each group with a nuclear-norm regularizer. The methodology includes a theoretical analysis of feature splitting and a comparison of SASA's performance against traditional SAEs on LLM activations.
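Block sparsity through Top-s group gating can be sketched as below. The summary does not say how groups are scored, so this sketch assumes scoring by the L2 norm of each group's latent block; the function name, group size, and activations are hypothetical.

```python
import math

def top_s_group_gate(latents, group_size, s):
    """Keep the s groups with the largest L2 norm and zero out the rest.

    Assumption: groups are contiguous blocks of equal size scored by L2 norm.
    Returns the gated latent vector and the sorted indices of active groups.
    """
    groups = [latents[i:i + group_size] for i in range(0, len(latents), group_size)]
    norms = [math.sqrt(sum(v * v for v in g)) for g in groups]
    ranked = sorted(range(len(groups)), key=lambda i: norms[i], reverse=True)
    keep = set(ranked[:s])
    gated = []
    for i, g in enumerate(groups):
        gated.extend(g if i in keep else [0.0] * len(g))
    return gated, sorted(keep)

# Four groups of two latents each (toy activations).
latents = [0.1, -0.2, 3.0, 4.0, 0.0, 0.5, -2.0, 1.0]
gated, active = top_s_group_gate(latents, group_size=2, s=2)
print(gated, active)
```

Because whole blocks are switched on or off together, a multi-dimensional feature can be carried by one group instead of being split across several independent one-dimensional latents.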
Results
Empirical evaluations on GPT-2 and Mistral-7B show that SASA effectively reduces feature splitting and redundancy, leading to improved interpretability and monosemanticity. SASA matches or exceeds the performance of standard SAEs while training on approximately half the token budget, demonstrating a significant reduction in sample complexity.
Implications
The findings suggest that adopting subspace-aware approaches like SASA can enhance the interpretability of LLMs, making it easier to diagnose model behavior, improve robustness, and ensure alignment with human values. This could lead to more effective applications in areas requiring transparent AI decision-making.
Learning to model pediatric asthma exacerbation from multiple risk factors: a case study in coastal Virginia
Interpretability
- The study highlights the importance of integrating multiple risk factors in modeling pediatric asthma exacerbation.
- Three modeling techniques were compared, emphasizing the trade-off between interpretability and predictive power.
- The novel framework developed allows for the identification of nonlinear interactions among risk factors.
- Consensus across different modeling approaches provides robust insights into the relative risks of asthma exacerbation.
Summary
This paper addresses the modeling of pediatric asthma exacerbation (AE) by analyzing multiple risk factors, including air pollution, meteorological conditions, and socioeconomic factors in coastal Virginia. The study utilized a dataset of acute AE visits to a regional children's hospital from 2018 to 2023, incorporating ambient air pollution measurements and weather data. Three modeling techniques were compared: Generalized Linear Models (GLM) for baseline predictive power and interpretability, Neural Networks (NN) for maximal predictive capability, and a novel framework combining Poisson regression with sparse dictionary learning to identify parsimonious nonlinear interactions. The results indicated that all models provided consensus on the relative risks associated with various exposure variables, highlighting the synergistic interactions influencing AE. This work bridges statistical and machine learning approaches, offering insights that could inform public health interventions aimed at reducing asthma exacerbations in vulnerable populations.
Methodology
The study employed a combination of Generalized Linear Models (GLM), Neural Networks (NN), and a novel framework combining Poisson regression with sparse dictionary learning to model pediatric asthma exacerbation. Data included ambient air pollution measurements, weather data, and socioeconomic factors, focusing on acute AE visits to a children's hospital from 2018 to 2023.
Results
The comparative analysis of the models revealed that all approaches yielded similar estimates of relative risks for asthma exacerbation due to various environmental exposures. The novel framework successfully identified nonlinear interactions, enhancing interpretability while maintaining predictive accuracy.
Implications
The findings suggest that integrating multiple environmental and socioeconomic factors can improve the understanding of asthma exacerbation triggers. This could lead to more effective public health strategies and interventions aimed at reducing asthma-related health issues in children, particularly in urban and disadvantaged communities.
LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?")
Multimodal
- LEVANTE-bench provides a comprehensive dataset for comparing VLMs to children's cognitive performance.
- The benchmark evaluates VLMs across multiple scales, including task difficulty, item difficulty, and trial-level error distributions.
- Larger VLMs show better alignment with human performance at the task level, but struggle with finer details of children's cognitive errors.
- Smaller models may better reflect the cognitive errors of younger children in certain tasks.
Summary
The paper introduces LEVANTE-bench, a benchmark designed to systematically compare vision-language models (VLMs) with children's cognitive abilities across various tasks and populations. Utilizing data from the Learning Variability Network (LEVANTE), the authors assess VLM performance on six cognitive tasks involving children aged 5-12 from three countries. The study reveals that while larger VLMs generally align better with human performance at the task level, their alignment with children's error distributions is inconsistent. Notably, smaller models sometimes better match the errors of younger children, and even the best VLMs struggle with complex reasoning tasks. This work highlights the partial alignment of current VLM architectures with children's cognitive abilities and emphasizes the need for further development in VLMs to accurately model human cognition.
Methodology
The authors constructed LEVANTE-bench using psychometrically validated tasks from the LEVANTE dataset, which includes cognitive tasks spanning math, reasoning, language, and social cognition. They systematically assessed VLMs on these tasks, comparing their performance with that of 1547 children across three countries, focusing on task-level accuracy, item-level alignment, and trial-level error distributions.
Results
The analysis revealed heterogeneous alignment between VLMs and children's cognitive abilities. While larger models performed better on overall task accuracy, their alignment with children's error distributions varied significantly across tasks. Some smaller models matched the errors of younger children more closely, and all models faced challenges with complex reasoning tasks such as matrix reasoning and mental rotation.
Implications
The findings suggest that VLMs can serve as tools for understanding human cognitive development, but current models require enhancements to fully capture the nuances of children's cognitive processes. This research could inform future developments in VLM architectures and their applications in educational and cognitive science contexts.
TS-ICL: A Flexible Time-Indexed Foundation Model for Time Series via In-Context Learning
Time Series
- TS-ICL unifies forecasting and imputation in time series modeling.
- It incorporates covariates using a structured synthetic prior based on DAGs.
- Achieves state-of-the-art performance in zero-shot imputation benchmarks.
- Maintains competitive forecasting capabilities, especially with missing observations.
Summary
The paper introduces TS-ICL, a novel probabilistic In-Context Learning (ICL) encoder-regressor Transformer designed for time series modeling. Unlike existing models that primarily focus on forecasting, TS-ICL addresses the challenges of irregularly and partially observed time series by unifying forecasting and imputation tasks. It formulates time series tasks as timestamp-aligned regression problems, allowing for the incorporation of covariates through a structured synthetic prior based on Directed Acyclic Graphs (DAGs). This approach enables robust zero-shot generalization to unseen dependency structures. The authors demonstrate that TS-ICL achieves state-of-the-art performance in imputation tasks while remaining competitive in forecasting benchmarks, particularly excelling in scenarios with partially observed data. The model operates directly on timestamped observations, providing flexibility in handling missing or irregularly sampled data, which is crucial for practical applications in real-world time series analysis.
Methodology
TS-ICL employs a probabilistic Transformer architecture that treats time series modeling as a time-indexed in-context regression problem. It utilizes a structured synthetic prior to define relationships between targets and covariates, leveraging DAGs for dependency structures. This allows the model to efficiently handle both forecasting and imputation tasks in a single forward pass, accommodating irregular sampling and missing data.
Results
TS-ICL sets a new state-of-the-art in zero-shot imputation, outperforming both task-specific models and traditional time series foundation models (TFMs). It is reported to be up to 50 times faster than existing TFMs during inference while matching their performance in forecasting tasks, particularly under conditions of partial observation.
Implications
The development of TS-ICL has significant implications for real-world time series analysis, where data is often incomplete and irregular. Its ability to jointly forecast and impute missing values while incorporating external covariates makes it a versatile tool for various applications, including finance, healthcare, and environmental monitoring.
Zero-Copy Semantic Contagion: An In-Memory Streaming Architecture for Evolving Attention Graphs
Time Series
Graph Learning
- Introduces a streaming architecture that captures cross-company information propagation in financial markets.
- Achieves rapid ingestion and inference times, making it suitable for real-time applications.
- Demonstrates significant improvements in prediction precision over traditional methods.
- Highlights the critical role of dynamic graph structures in detecting financial contagion.
Summary
This paper addresses the limitations of traditional per-ticker forecasting models in financial time-series analysis, which often overlook cross-company propagation of information. The author proposes a novel heterogeneous Rust-Python streaming architecture that utilizes a continuous-time graph to model cross-company attention driven by textual news data. The architecture features a zero-copy Rust edge for rapid ingestion of news records, achieving parsing times of approximately 100 nanoseconds and scanning times of around 1.2 microseconds. The inference process employs a multivariate Neural Hawkes Process with continuous-time LSTM states to propagate directed excitation among companies, while an adaptive pruning rule maintains computational efficiency during dynamic neighborhood updates. The system demonstrates an end-to-end processing latency of about 13 milliseconds per news record on a standard CPU. Evaluation on a dataset of 638 articles across 47 tickers reveals a precision lift of 1.70 times over random predictions and 3.36 times over a same-sector baseline at the 90th percentile next-day return threshold. The results underscore the importance of the dynamic attention network in capturing cross-company signals, as removing the graph topology results in a collapse of precision to zero.
Methodology
The methodology involves a two-part architecture: a zero-copy Rust edge for fast ingestion of news data and a multivariate Neural Hawkes Process for inference. The Rust component parses news records and scans tickers with minimal latency, while the Neural Hawkes Process employs continuous-time LSTM states and bilinear latent projections to model directed excitation among companies. An adaptive edge-pruning rule is implemented to ensure efficient graph updates during streaming.
Results
The proposed system achieves an end-to-end processing latency of approximately 13 milliseconds per incoming news record. Evaluation on the FNSPID corpus indicates a precision lift of 1.70 times over random predictions and 3.36 times over a same-sector heuristic at the 90th percentile next-day return threshold. The removal of the graph topology leads to a complete loss of predictive precision, confirming the effectiveness of the dynamic attention network.
Implications
This research has significant implications for financial forecasting, particularly in understanding how news events affect multiple companies simultaneously. The architecture can be applied to real-time trading systems and risk management frameworks, enhancing the ability to predict market movements based on news sentiment and cross-company relationships.
Sharp First-Order Lower Bounds for Higher-Order Smooth Nonconvex Optimization
Optimization
Theory
- Establishes new dimension-free lower bounds for higher-order smooth nonconvex optimization.
- Achieves matching lower bounds of Ω(ϵ^{-7/4}) and Ω(ϵ^{-5/3}) for Hessian-Lipschitz and third-order smooth functions, respectively.
- Introduces a novel block-chain mechanism for constructing hard instances that preserve smoothness.
- Closes the gap in existing literature regarding lower bounds for first-order oracle complexity.
Summary
This paper investigates the deterministic first-order oracle complexity of locating ϵ-stationary points in smooth nonconvex optimization, particularly under higher-order smoothness conditions. While traditional methods achieve an optimal rate of ϵ^{-2} with Lipschitz gradients, the introduction of higher-order smoothness allows for accelerated rates: specifically, ϵ^{-7/4} with Lipschitz Hessians and ϵ^{-5/3} with Lipschitz third derivatives. The author addresses the previously unresolved issue of matching lower bounds for these accelerated rates by establishing a new dimension-free first-order lower bound applicable to higher-order smooth nonconvex functions. This construction achieves a matching Ω(ϵ^{-7/4}) lower bound in the Hessian-Lipschitz scenario and Ω(ϵ^{-5/3}) in the third-order-smooth case. The proposed lower-bound construction utilizes a block-chain mechanism that maintains the necessary smoothness structure while enforcing blockwise oracle revelation. The findings close the gap between existing upper and lower bounds, providing clarity on the optimal complexity for first-order methods in higher-order smooth optimization contexts.
Methodology
The author develops a new lower-bound construction for higher-order smooth nonconvex functions by employing a block-chain mechanism. This method allows for the enforcement of blockwise oracle revelation while maintaining the necessary smoothness properties. The construction is dimension-free and applicable to various orders of smoothness, providing a comprehensive analysis of the oracle complexity.
Results
The paper presents a new dimension-free first-order lower bound for higher-order smooth nonconvex functions, achieving Ω(ϵ^{-7/4}) in the Hessian-Lipschitz case and Ω(ϵ^{-5/3}) in the third-order-smooth case. These results match the best-known upper bounds, thus resolving a long-standing gap in the literature regarding the optimal first-order oracle complexity.
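To give a feel for what the exponents mean in practice, the short calculation below compares the three rates at a single target accuracy. It ignores all problem-dependent constants (smoothness parameters, initial function gap), so the numbers only indicate relative scaling; the choice of ϵ = 1e-4 is arbitrary.

```python
def oracle_calls(eps, exponent):
    """Scaling eps^(-exponent) of the oracle complexity, constants omitted."""
    return eps ** (-exponent)

# Exponents quoted in the summary for each smoothness assumption.
rates = {
    "Lipschitz gradient": 2.0,
    "Lipschitz Hessian": 7 / 4,
    "Lipschitz third derivative": 5 / 3,
}

eps = 1e-4  # arbitrary target accuracy for illustration
counts = {name: oracle_calls(eps, p) for name, p in rates.items()}
for name, n in counts.items():
    print(f"{name}: ~{n:.3g} gradient evaluations")
```

At this accuracy the Hessian-Lipschitz rate saves roughly a factor of 10 over the ϵ^{-2} rate, and third-order smoothness roughly a factor of 20.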
Implications
The findings have significant implications for the design of first-order optimization algorithms, particularly in scenarios where higher-order smoothness can be exploited. This work enhances the understanding of complexity in nonconvex optimization and may influence future research directions in algorithm development and theoretical analysis.
End-to-End Subgraph Detection with GraphDETR
Graph Learning
- GraphDETR reformulates subgraph detection as a set prediction problem, enhancing efficiency and scalability.
- The model employs a GNN for graph encoding and a transformer decoder for joint prediction of subgraph occurrences.
- GraphDETR supports both exact and approximate matching, broadening the scope of detectable patterns.
- Empirical results show high accuracy in detecting molecular functional groups, achieving an AP100 score of 91.2.
Summary
The paper introduces GraphDETR, a novel deep learning framework for subgraph detection that reformulates the problem as a set prediction task, similar to the DETR model in object detection. Traditional combinatorial methods for subgraph isomorphism are limited by their NP-completeness and are typically applicable only to small patterns or graphs. GraphDETR overcomes these limitations by employing a graph neural network (GNN) to encode the target graph and utilizing a transformer decoder with a fixed set of learnable query vectors to predict all occurrences of query patterns in a single forward pass. This end-to-end training approach, facilitated by bipartite matching, allows for both exact and approximate matching of subgraphs, enabling the detection of diverse patterns such as molecular structures and fuzzy patterns. The framework is evaluated on molecular functional group detection using the ChEMBL dataset, demonstrating strong performance and showcasing the effectiveness of the set prediction paradigm in graph reasoning tasks.
Methodology
GraphDETR encodes target graphs using a graph neural network and employs a transformer decoder with a fixed set of learnable query vectors. The model is trained end-to-end using bipartite matching, allowing it to predict multiple subgraph occurrences simultaneously without duplicates.
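The bipartite matching step used in set prediction can be illustrated on a tiny example. DETR-style training normally uses the Hungarian algorithm; this sketch swaps in a brute-force search over assignments, which gives the same optimum but is only feasible for a handful of queries. The cost matrix is hypothetical, and how GraphDETR actually defines its matching cost is not stated in the summary.

```python
from itertools import permutations

def match_predictions(cost):
    """Minimum-cost one-to-one assignment of queries to ground-truth targets.

    cost[q][t] is the cost of assigning query q to target t; there must be
    at least as many queries as targets. Brute force, for illustration only.
    """
    n_queries, n_targets = len(cost), len(cost[0])
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n_queries), n_targets):
        total = sum(cost[q][t] for t, q in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    return {q: t for t, q in enumerate(best_perm)}, best_total

# Three query slots, two ground-truth subgraph occurrences (toy costs).
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
assignment, total = match_predictions(cost)
print(assignment, total)
```

Query 2 is left unmatched, so during training it would be supervised toward a "no occurrence" prediction. Because each target is matched to exactly one query, the loss discourages duplicate predictions of the same occurrence.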
Results
GraphDETR achieved an average precision score of 91.2 on the ChEMBL dataset for molecular functional group detection, demonstrating its capability to effectively identify diverse subgraph patterns in larger graphs.
Implications
The introduction of GraphDETR has significant implications for various scientific domains, particularly in molecular analysis and network science, where efficient and accurate subgraph detection is crucial for tasks such as reaction prediction and molecular property modeling.
When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training
Reinforcement Learning
Large Language Models
Optimization
- Identifies the issue of divergent anchor bias in existing reinforcement learning methods for LLMs.
- Proposes a new algorithm, Evidence-Calibrated Policy Optimization (ECPO), to improve credit assignment.
- ECPO combines techniques to reduce the impact of statistical noise and improve training stability.
- Demonstrates superior performance of ECPO over existing methods on benchmark tasks.
Summary
This paper addresses the challenges faced by long-horizon large language model (LLM) agents in reinforcement learning, particularly in assigning credit to intermediate decisions when rewards are sparse and delayed. The authors critique existing methods like GiGPO, which attempt to provide denser credit through step-level advantages at repeated anchor states. They identify a significant issue known as 'divergent anchor bias,' where rare but lucky actions receive disproportionately large advantages, leading to instability and oscillations in late-stage training. To overcome this, the authors propose Evidence-Calibrated Policy Optimization (ECPO), a critic-free algorithm that calibrates step-level credit before policy updates. ECPO employs two key techniques: Evidence-Calibrated Action Advantage, which groups rollouts by canonical actions to reduce the impact of low-count estimates, and Variance-Gated Credit Weighting, which mitigates the influence of noisy actions. Experiments conducted on ALFWorld and WebShop demonstrate that ECPO consistently outperforms strong baselines, including GiGPO, achieving significant improvements in success rates while maintaining low computational overhead.
Methodology
The authors developed Evidence-Calibrated Policy Optimization (ECPO), which is a critic-free policy optimization algorithm. ECPO utilizes Evidence-Calibrated Action Advantage to group rollouts by canonical actions and shrink low-count estimates, alongside Variance-Gated Credit Weighting to suppress the influence of noisy actions. This approach allows for more reliable credit assignment during policy updates.
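A rough sketch of the two ideas follows. The summary does not give the paper's formulas, so the shrinkage factor n / (n + k) and the gate 1 / (1 + variance / scale) are assumptions chosen for illustration, as are the function and parameter names.

```python
from collections import defaultdict
from statistics import mean, pvariance

def calibrated_advantages(rollouts, shrink_k=2.0, var_scale=1.0):
    """rollouts: list of (canonical_action, return) pairs from one anchor state.

    Both the shrinkage and the variance gate below are assumed forms,
    not the formulas from the paper.
    """
    by_action = defaultdict(list)
    for action, ret in rollouts:
        by_action[action].append(ret)
    baseline = mean(ret for _, ret in rollouts)
    advantages = {}
    for action, rets in by_action.items():
        n = len(rets)
        raw = mean(rets) - baseline
        shrink = n / (n + shrink_k)  # low-count actions are pulled toward zero
        gate = 1.0 / (1.0 + pvariance(rets) / var_scale)  # noisy actions get less weight
        advantages[action] = shrink * gate * raw
    return advantages

rollouts = [
    ("open drawer", 1.0),  # a single lucky rollout
    ("go to desk", 0.0),
    ("go to desk", 0.0),
    ("go to desk", 1.0),
]
adv = calibrated_advantages(rollouts)
print(adv)
```

The single lucky "open drawer" rollout has a raw advantage of 0.5, but after shrinkage it contributes only about 0.17, which is the kind of damping the summary attributes to ECPO.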
Results
ECPO was evaluated on two environments, ALFWorld and WebShop, using Qwen2.5-1.5B and Qwen2.5-7B. Relative to GiGPO, it raised success by 5.2 points on ALFWorld and 7.3 points on WebShop while adding only 0.1% advantage-computation overhead, which points to more effective and more stable training.
Implications
The findings suggest that improving credit assignment methods in reinforcement learning can lead to more stable and effective training of LLM agents, particularly in complex decision-making environments. This has potential applications in autonomous agents, robotics, and other fields requiring long-horizon planning.
SALT: When More Rollouts Don't Help in Group-Based Policy Optimization and How to Make Them Matter
Reinforcement Learning
Large Language Models
Optimization
- Identifies structural inefficiencies in GRPO-style updates leading to diminishing returns from additional rollouts.
- Introduces SALT, a method that reweights group-relative updates to reduce redundancy and improve learning signals.
- Demonstrates that effective updates can be enhanced without changing the underlying reward model or sampling methods.
- Validates the approach through comprehensive experiments showing consistent performance gains.
Summary
The paper addresses inefficiencies in Group Relative Policy Optimization (GRPO) in reinforcement learning with verifiable rewards (RLVR), where increasing the number of rollouts does not necessarily enhance learning. The authors identify that per-rollout policy-gradient features can become redundant and concentrate into a low-rank geometry, leading to substantial cancellation during aggregation. To mitigate this issue, they propose SALT (Subspace-Adaptive geometry pLug-in componenT), which reweights group-relative updates by estimating a dominant shared subspace from mini-batch gradient geometry. SALT decomposes group-relative coefficients into shared and residual channels, adaptively amplifying the residual channel to enhance effective updates. The effectiveness of SALT is validated through extensive experiments across various reasoning-oriented RLVR benchmarks and model scales, demonstrating improved update geometry and performance without altering the reward model or rollout sampling procedures.
Methodology
The authors propose SALT, which utilizes mini-batch gradient geometry to estimate a dominant shared subspace, decomposing group-relative coefficients into shared and residual components. By adaptively amplifying the residual channel, SALT aims to reduce signed aggregation cancellation and improve the effective update geometry.
Results
SALT was tested across diverse reasoning-oriented RLVR benchmarks and different model scales, showing significant improvements in effective update geometry and overall performance. The experiments confirmed that SALT effectively mitigates the issues of gradient cancellation and redundancy, leading to stronger learning outcomes.
Implications
The findings suggest that optimizing the aggregation of learning signals in reinforcement learning can lead to more efficient training processes, particularly in scenarios where large language models are employed. SALT could be applied to enhance various RL applications, improving their scalability and effectiveness.
Proper Scoring Rules for Right-Censored Survival Data
Theory
Time Series
- Introduces a unified framework for proper scoring rules under right censoring.
- Derives right-censored versions of several established scoring rules.
- Demonstrates that the marginalized score is proper under specific conditions.
- Presents censored engression as a new training method for multivariate survival data.
Summary
This paper addresses the challenge of evaluating probabilistic forecasts in the context of right-censored survival data, where event times are only partially observed. The authors propose a novel framework for proper scoring rules that accounts for the censoring mechanism by first mapping the predictive distribution through this mechanism before applying the scoring rules. This approach leads to the development of localized scores for fixed censoring times and marginalized scores for random censoring. The authors derive right-censored versions of well-known scoring rules such as the CRPS, pinball loss, Brier score, and energy score, demonstrating that the marginalized score is proper under conditional independent censoring. Additionally, they introduce a new training objective called censored engression for multivariate right-censored survival modeling, which allows for flexible modeling of joint event-time distributions. Experimental results show that the proposed scores effectively rank oracle forecasts across various censoring regimes and that censored engression significantly outperforms naive training methods.
Methodology
The authors develop a framework that scores forecasts by applying the censoring mechanism to the predictive distribution before evaluation. They derive localized and marginalized scores based on this approach and extend the engression method to handle right-censored multivariate survival data.
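The "censor the forecast, then score it" idea can be sketched for a fixed censoring time c. The example assumes an exponential forecast and uses the CRPS; both are illustrative choices rather than the paper's setup. The forecast for the event time T is first mapped to a forecast for min(T, c), and the CRPS of that mapped distribution is evaluated at the observed, possibly censored, time.

```python
import math
import random

def censored_crps_exponential(rate, y, c):
    """CRPS of the forecast for min(T, c), with T ~ Exp(rate), at observation y <= c.

    The censored forecast has CDF F(t) = 1 - exp(-rate * t) for t < c and 1 for t >= c,
    so the CRPS integral only runs over [0, c].
    """
    y = min(y, c)
    # Integral of F(t)^2 over [0, y].
    below = (y - 2 * (1 - math.exp(-rate * y)) / rate
             + (1 - math.exp(-2 * rate * y)) / (2 * rate))
    # Integral of (1 - F(t))^2 over [y, c].
    above = (math.exp(-2 * rate * y) - math.exp(-2 * rate * c)) / (2 * rate)
    return below + above

def mean_score(rate, samples, c):
    return sum(censored_crps_exponential(rate, y, c) for y in samples) / len(samples)

rng = random.Random(0)
c = 2.0
# True event times are Exp(1); anything beyond c is recorded as c.
observed = [min(rng.expovariate(1.0), c) for _ in range(5000)]
score_true = mean_score(1.0, observed, c)
score_wrong = mean_score(3.0, observed, c)
print(score_true, score_wrong)
```

Because the score is computed on the censored scale, the forecast matching the data-generating rate receives the lower average score, which is the behaviour a proper score should show.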
Results
The proposed scoring rules correctly rank oracle forecasts across different censoring scenarios, while traditional forecast-dependent scores may lead to ranking reversals. Censored engression shows substantial improvements over naive training methods and performs competitively against likelihood-based benchmarks in synthetic experiments and a clinical case study.
Implications
The findings have significant implications for survival analysis in various fields, including healthcare and risk modeling, where accurate probabilistic forecasting of time-to-event outcomes is crucial. The methods can enhance the evaluation and training of models dealing with censored data, leading to better predictive performance.
The Post-GCN Decade Revisited: Curvature-Stratified Evaluation of Relational Learning
Graph Learning
- Current evaluation methods in relational learning are biased due to reliance on flat leaderboards.
- Intrinsic geometry significantly impacts model performance and should be considered in evaluations.
- The proposed CURVBENCH framework stratifies datasets based on curvature, revealing critical performance insights.
- Model rankings can vary significantly across different curvature regimes, challenging the universality of model effectiveness.
Summary
This paper critiques the current evaluation practices in relational learning, which often rely on flat leaderboards that average performance across diverse datasets, assuming a uniform structure. The authors argue that this approach introduces systematic bias, obscuring geometry-dependent performance variations and leading to misleading conclusions about model generalization. They identify intrinsic geometry as a crucial factor influencing model effectiveness and demonstrate that conventional aggregated metrics mask significant performance trade-offs that become apparent only when datasets are stratified by their geometric properties. To address this issue, the authors propose a curvature-stratified evaluation framework, CURVBENCH, which categorizes datasets into positive, negative, and near-zero curvature regimes. They evaluate 18 models, including Graph Convolutional Networks (GCNs) and Graph Foundation Models (GFMs), across 14 datasets. The findings reveal that model rankings are stable within curvature regimes but can shift dramatically across them, indicating that performance is fundamentally geometry-dependent. The authors also highlight scenarios where GFMs show diminishing returns compared to geometry-aligned GNNs. They advocate for a geometry-aware evaluation protocol that provides more reliable comparisons than traditional benchmarks and release all associated code and datasets to facilitate reproducibility in future research.
Methodology
The authors developed a curvature-stratified evaluation framework, CURVBENCH, which partitions datasets into curvature regimes based on curvature statistics. They evaluated 18 models across 14 datasets, analyzing performance variations within and across these regimes to assess the geometry-dependent nature of model effectiveness.
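As a loose illustration of stratifying graphs by a curvature statistic, the sketch below uses the simplest Forman-Ricci curvature for unweighted graphs, 4 - deg(u) - deg(v) per edge, and buckets a graph by its mean edge curvature. The summary does not say which curvature notion or thresholds CURVBENCH uses, so the curvature choice, the tolerance, and the function names are all assumptions.

```python
from statistics import mean

def forman_curvature(edges):
    """Simplest Forman-Ricci curvature per edge of an unweighted graph: 4 - deg(u) - deg(v)."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}

def curvature_regime(edges, tol=0.5):
    """Bucket a graph by mean edge curvature; `tol` is a hypothetical threshold."""
    avg = mean(forman_curvature(edges).values())
    if avg > tol:
        return "positive"
    if avg < -tol:
        return "negative"
    return "near-zero"

cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]   # every edge has curvature 0
star = [(0, i) for i in range(1, 7)]       # hub-and-spoke, tree-like
print(curvature_regime(cycle), curvature_regime(star))
```

A benchmark built this way would report model rankings separately per bucket instead of averaging over all datasets.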
Results
The evaluation revealed that model rankings are stable within curvature regimes but can shift significantly when moving across different regimes. The study found that GFMs often provide diminishing returns compared to simpler, geometry-aligned GNNs, emphasizing the importance of geometry in model assessment.
Implications
The findings suggest that future evaluations of relational learning methods should incorporate geometric considerations to avoid misleading conclusions. This could lead to more accurate assessments of model capabilities and guide the development of models tailored to specific geometric properties.
OPRD: On-Policy Representation Distillation
Large Language Models
NLP
Optimization
- OPRD is the first method to perform on-policy distillation in the hidden-state space rather than the output space.
- It eliminates sampling variance in gradient estimation, providing a more stable training signal.
- OPRD exposes rich structural information from the teacher's intermediate hidden states, enhancing the supervision signal.
- Empirical results show OPRD outperforms traditional methods on mathematics benchmarks and is more efficient in terms of training speed and memory usage.
Summary
The paper introduces On-Policy Representation Distillation (OPRD), a novel method that enhances on-policy distillation for large language models (LLMs) by shifting the focus from output space to hidden-state space. Traditional on-policy distillation methods, such as sampled-token and top-k variants, primarily supervise the student model by matching next-token log-probabilities, which leads to significant limitations. These include high sampling variance in gradient estimation and the loss of rich structural information contained in the teacher's intermediate hidden states. OPRD addresses these issues by aligning the student's intermediate representations with those of the teacher across selected layers during on-policy rollouts. This approach provides dense and deterministic supervision, effectively eliminating sampling variance and exposing valuable structural information that is typically discarded in output-space objectives. Theoretical analysis demonstrates that OPRD offers a richer supervision signal without additional rollout costs. Empirical results show that OPRD significantly closes the performance gap between student and teacher models on competitive mathematics benchmarks while also improving training speed and reducing memory usage compared to traditional output-space methods.
Methodology
OPRD aligns the student's intermediate hidden representations with the teacher's across selected transformer layers and response positions using a normalized mean-squared error objective during on-policy rollouts. This method bypasses the LM head entirely, allowing for a richer and more deterministic supervision signal.
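A minimal sketch of a hidden-state matching loss is shown below. It assumes that "normalized" means L2-normalizing each hidden vector before taking the squared error, and that student and teacher share a hidden size; the paper may normalize differently or use a projection. All names are hypothetical.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against zero vectors
    return [x / norm for x in vec]

def normalized_mse(student_states, teacher_states):
    """Mean squared error between L2-normalized hidden vectors, averaged over all entries."""
    total, count = 0.0, 0
    for s, t in zip(student_states, teacher_states):
        s_hat, t_hat = l2_normalize(s), l2_normalize(t)
        total += sum((a - b) ** 2 for a, b in zip(s_hat, t_hat))
        count += len(s_hat)
    return total / count

def representation_loss(student_layers, teacher_layers, layer_ids):
    """Average the per-layer loss over the selected layers (no LM head involved)."""
    per_layer = [normalized_mse(student_layers[i], teacher_layers[i]) for i in layer_ids]
    return sum(per_layer) / len(per_layer)

# One selected layer, two response positions, hidden size 2 (toy values).
teacher = {4: [[1.0, 0.0], [0.0, 2.0]]}
student = {4: [[2.0, 0.0], [1.0, 0.0]]}
loss = representation_loss(student, teacher, layer_ids=[4])
print(loss)
```

The loss is a deterministic function of the hidden states on the sampled rollout, so no token-level sampling enters the gradient, which matches the variance argument in the summary.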
Results
OPRD demonstrated superior performance on three competitive mathematics benchmarks (AIME 2024, AIME 2025, AIMO), closing the student-teacher performance gap significantly. It also achieved a 1.44x speed-up in training and reduced actor-update transient memory usage by up to 54% compared to top-k OPD methods.
Implications
The introduction of OPRD opens new avenues for improving the efficiency and effectiveness of large language model training, particularly in applications requiring high accuracy and reduced resource consumption. It suggests a shift in focus towards leveraging intermediate representations for better model performance.
Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents
Reinforcement Learning
Large Language Models
Optimization
- AMC formulates RL for black-box agents as a Bayesian inference problem, allowing optimization without parameter access.
- The method uses Sequential Monte Carlo to sample from the optimal policy, guided by a learned value function.
- Empirical results show AMC outperforms traditional prompting methods and GRPO in various environments.
- The approach highlights the feasibility of applying RL concepts to closed-source agents, expanding their usability.
Summary
The paper introduces Agentic Monte Carlo (AMC), a novel framework designed to optimize black-box agents, particularly those based on large language models (LLMs), which cannot be directly modified due to their proprietary nature. Traditional reinforcement learning (RL) methods require parameter access to compute policy gradients, which is not feasible with black-box models. The authors leverage the equivalence between RL and Bayesian inference to define the optimal policy as a Bayesian posterior over trajectories. AMC employs Sequential Monte Carlo (SMC) methods to sample from this posterior, using a learned value function to guide the agent's behavior without altering the underlying black-box model. The framework is validated across three diverse environments from the AgentGym benchmark, showing that AMC significantly outperforms prompting baselines and even surpasses Group Relative Policy Optimization (GRPO) as computational resources increase. This work demonstrates the potential for principled RL-style optimization of black-box LLM agents, providing a new avenue for enhancing their performance without direct access to their parameters.
Methodology
The authors propose Agentic Monte Carlo (AMC), which utilizes Sequential Monte Carlo (SMC) methods to sample optimal policies for black-box agents. The optimal policy is treated as a Bayesian posterior over trajectories, combining a fixed black-box prior with a likelihood term that encodes optimal behavior. A separate value function is trained to steer the agent towards optimal actions without modifying the black-box model.
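The steering mechanism can be sketched on a toy problem. Everything here is an assumption for illustration: the "black box" is a random walker, the value function is hand-written rather than learned, and the incremental weight exp((V(s') - V(s)) / temperature) is one common choice for value-guided Sequential Monte Carlo, not necessarily the weighting used in the paper.

```python
import math
import random

def black_box_policy(state, rng):
    """Stand-in for a frozen agent: proposes a step of -1 or +1 uniformly."""
    return rng.choice([-1, 1])

def value_fn(state, target=5):
    """Stand-in for a learned value function: closer to the target is better."""
    return -abs(target - state)

def smc_rollout(n_particles, n_steps, temperature, rng):
    states = [0] * n_particles
    for _ in range(n_steps):
        # Propose with the unmodified black-box policy.
        proposed = [s + black_box_policy(s, rng) for s in states]
        # Reweight proposals by their change in value (assumed weighting).
        weights = [math.exp((value_fn(p) - value_fn(s)) / temperature)
                   for s, p in zip(states, proposed)]
        # Resample so that promising particles are duplicated and poor ones dropped.
        states = rng.choices(proposed, weights=weights, k=n_particles)
    return states

rng = random.Random(0)
guided = smc_rollout(n_particles=200, n_steps=10, temperature=0.5, rng=rng)
unguided = [sum(black_box_policy(0, rng) for _ in range(10)) for _ in range(200)]
guided_gap = sum(abs(5 - s) for s in guided) / len(guided)
unguided_gap = sum(abs(5 - s) for s in unguided) / len(unguided)
print(guided_gap, unguided_gap)
```

The policy itself is never changed; only the selection among its own samples is, which is why this style of optimization needs no parameter access.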
Results
AMC was validated on three environments from the AgentGym benchmark: WebShop, SciWorld, and TextCraft. The results demonstrated that AMC consistently outperformed prompting baselines and, with increased computational resources, even surpassed GRPO baselines that require full parameter access.
Implications
The findings suggest that AMC can effectively optimize black-box LLM agents, making it a valuable tool for researchers and practitioners who work with proprietary models. This could lead to improved performance in various applications, including e-commerce, scientific experimentation, and interactive gaming.
Generative Criticality in Large Language Model Temperature Scaling
NLP
Large Language Models
Theory
- Introduction of a statistical-field framework for LLM outputs with defined physical observables.
- Evidence of critical behavior in LLM text generation driven by temperature scaling.
- Independent geometric validation of criticality through the TwoNN intrinsic dimension method.
- Findings are robust across different model scales and prompt categories.
Summary
This paper introduces a statistical-field framework for analyzing text generated by large language models (LLMs), conceptualizing token embeddings as continuous spin variables on a one-dimensional chain. The authors define a susceptibility based on the connected two-point correlator and an order parameter derived from the ensemble-averaged embedding field. By varying the softmax temperature T, they observe a sharp peak in susceptibility near a critical temperature Tc, alongside a rapid change in the order parameter and a collapse onto a single semantic direction below Tc. The intrinsic dimension, estimated using the TwoNN method, supports these findings, revealing a minimum near Tc. The results are consistent across various model scales and prompt categories, suggesting a critical behavior reminiscent of continuous phase transitions. However, the non-equilibrium nature of autoregressive generation indicates the need for further exploration. This framework offers quantitative tools for examining the collective statistical structure of LLM outputs and hints at connections between decoding strategies and critical phenomena.
Methodology
The authors apply statistical field theory and intrinsic dimension estimation to analyze LLM outputs. They define susceptibility and an order parameter to study their behavior as a function of the softmax temperature T. The TwoNN method is employed to estimate the intrinsic dimension of the generated text, identifying critical regions through non-monotonic features.
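For readers unfamiliar with TwoNN, the estimator uses only the ratio of each point's second- to first-nearest-neighbor distance. The sketch below uses the maximum-likelihood form, d = N / sum(log(r2 / r1)), with brute-force neighbor search; the synthetic data (a 2-D plane embedded in 5-D) is an illustrative stand-in for embeddings of generated text.

```python
import math
import random

def twonn_dimension(points):
    """Maximum-likelihood TwoNN estimate of intrinsic dimension (brute-force neighbors)."""
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)

rng = random.Random(0)
# 400 points on a 2-D plane embedded in a 5-D ambient space.
plane = [[rng.random(), rng.random(), 0.0, 0.0, 0.0] for _ in range(400)]
d_hat = twonn_dimension(plane)
print(d_hat)
```

The estimate depends on the geometry of the point cloud rather than the ambient dimension, so tracking it across temperatures gives a check that is independent of the correlator-based observables.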
Results
The analysis reveals a pronounced peak in susceptibility near the critical temperature Tc, with power-law scaling observed. The order parameter shows a rapid transition from a disordered to an ordered phase as T approaches Tc. The intrinsic dimension estimation corroborates these findings, indicating a minimum at Tc, suggesting critical behavior in the generation process.
Implications
The proposed framework can enhance the understanding of LLM outputs, providing insights into the collective statistical structure of generated text. It may also inform the development of more effective decoding strategies and improve the interpretability of LLM behavior.
DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum
Optimization
Theory
Efficient ML
- DP-MacAdam is the first algorithm to combine adaptive clipping and adaptive momentum under differential privacy.
- The algorithm uses a novel bias correction factor for unbiased gradient variance estimation.
- Empirical results show improved performance over DP-SGD, AdaClip, and DP-Adam across various privacy budgets.
- DP-MacAdam does not require manual tuning of clipping thresholds, simplifying its application.
Summary
The paper introduces DP-MacAdam, an algorithm that combines adaptive clipping and adaptive momentum for differentially private training. Standard differentially private stochastic gradient descent (DP-SGD) relies on a fixed gradient clipping threshold, which can hurt performance: a threshold that is too low discards useful gradient information, while one that is too high increases sensitivity and therefore the noise that must be added. The authors address this limitation by combining AdaClip, which adaptively clips gradients based on empirical statistics, with DP-Adam, which applies momentum updates based on gradient estimates. DP-MacAdam uses the same mean and variance estimates for both clipping and momentum, so it incurs no additional privacy cost. The authors also derive a bias correction factor so that the gradient variance estimate remains unbiased despite the noise introduced by differential privacy. Empirical evaluations show that DP-MacAdam outperforms existing state-of-the-art differentially private optimizers on benchmark datasets such as MNIST and CIFAR-10, achieving better model utility without manual tuning of clipping thresholds.
Methodology
The authors develop DP-MacAdam by integrating AdaClip's adaptive clipping strategy with DP-Adam's momentum updates. They derive a bias correction factor to ensure unbiased gradient variance estimates, which are used for both clipping and momentum calculations. The algorithm operates on privatized gradients, ensuring compliance with differential privacy standards.
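Why a bias correction is needed can be seen in a scalar toy example. Squaring a noisy gradient overestimates the squared gradient by the noise variance, and subtracting that variance removes the bias. This is a generic debiasing step shown under simplifying assumptions (scalar gradients, fixed clip value); it is not the paper's correction factor, and the function names are hypothetical.

```python
import random

def privatize(per_example_grads, clip, noise_multiplier, rng):
    """Clip each per-example gradient, average, and add Gaussian noise (DP-SGD style)."""
    clipped = [max(-clip, min(clip, g)) for g in per_example_grads]
    # Noise with std noise_multiplier * clip on the sum becomes this std on the mean.
    noise_std = noise_multiplier * clip / len(clipped)
    noisy_mean = sum(clipped) / len(clipped) + rng.gauss(0.0, noise_std)
    return noisy_mean, noise_std

rng = random.Random(0)
grads = [0.5, -0.2, 0.9, 0.4]  # clean mean is 0.4, so the clean square is 0.16
naive_sum, corrected_sum, trials = 0.0, 0.0, 20000
for _ in range(trials):
    noisy_mean, noise_std = privatize(grads, clip=1.0, noise_multiplier=1.0, rng=rng)
    naive_sum += noisy_mean ** 2
    corrected_sum += noisy_mean ** 2 - noise_std ** 2  # subtract the known noise variance
naive = naive_sum / trials
corrected = corrected_sum / trials
print(naive, corrected)
```

Since the correction only uses the already-privatized gradient and the public noise scale, it costs no extra privacy budget, consistent with the summary's claim that the shared estimates add no privacy cost.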
Results
DP-MacAdam demonstrates superior performance in terms of model utility compared to existing differentially private optimizers like DP-SGD, AdaClip, and DP-Adam on datasets such as MNIST and CIFAR-10. The algorithm achieves this without the need for manual tuning of clipping thresholds, making it more user-friendly.
Implications
The development of DP-MacAdam has significant implications for privacy-preserving machine learning, particularly in applications involving sensitive data. Its ability to improve model performance while maintaining strong privacy guarantees could enhance the adoption of machine learning in fields such as healthcare, finance, and any domain where data privacy is paramount.
Evidence-Guided Neural Architecture Selection under Uncertainty for Subject-Specific Blood Glucose Forecasting
Time Series
- EVIDENT integrates Bayesian training and evidence-based ranking for neural architecture selection.
- The framework identifies the lowest-capacity model that meets validation criteria, improving generalization.
- EVIDENT demonstrates effectiveness in blood glucose forecasting for type 1 diabetes patients.
- The approach outperforms random-search methods by selecting smaller, more consistent architectures.
Summary
The paper addresses the challenge of reliable neural architecture selection for time-series forecasting, particularly in scenarios with limited, noisy, and heterogeneous data. The authors propose a novel framework called EVIDENT (EVidence-based IDEntification of Neural architectures), which integrates Bayesian training, evidence-based ranking, and task-specific validation under uncertainty. This framework systematically evaluates candidate architectures to identify the lowest-capacity model that meets a specified validation criterion. The authors apply EVIDENT to temporal convolutional networks (TCNs) for individualized blood glucose forecasting in type 1 diabetes patients. The results demonstrate that EVIDENT effectively rejects both under- and over-parameterized architectures while identifying models that generalize well to unseen patients. Additionally, when multiple architectures are competitive, the framework supports plausibility-weighted ensemble predictions, enhancing overall predictive performance. Compared to a random-search baseline, EVIDENT consistently identifies smaller architectures with more reliable forecasting capabilities, establishing it as a promising strategy for neural architecture discovery in high-stakes forecasting applications.
Methodology
The methodology involves a systematic exploration of candidate architectures using Bayesian evidence to rank models based on their plausibility. The EVIDENT framework decouples model ranking from validation, allowing for task-specific assessments that account for predictive uncertainty. Temporal convolutional networks (TCNs) are employed for the forecasting task, mapping historical glucose, meal intake, and insulin delivery data to future glucose trajectories.
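One way to read "plausibility-weighted ensemble" is standard Bayesian model averaging: weights proportional to each architecture's evidence under an equal prior. The sketch below assumes exactly that; the log-evidence values and glucose forecasts are made up, and the paper's weighting may differ.

```python
import math

def plausibility_weights(log_evidences):
    """Softmax of log evidences (equal prior over architectures is assumed)."""
    top = max(log_evidences)  # subtract the max for numerical stability
    unnorm = [math.exp(le - top) for le in log_evidences]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def ensemble_forecast(forecasts, log_evidences):
    """Weighted average of per-architecture forecasts at each horizon step."""
    weights = plausibility_weights(log_evidences)
    horizon = len(forecasts[0])
    return [sum(w * f[t] for w, f in zip(weights, forecasts)) for t in range(horizon)]

# Three candidate TCNs; the third has much lower evidence (illustrative numbers).
log_evidences = [-100.0, -100.0, -110.0]
forecasts = [
    [110.0, 120.0],  # predicted glucose in mg/dL at two future steps
    [114.0, 124.0],
    [150.0, 170.0],
]
weights = plausibility_weights(log_evidences)
combined = ensemble_forecast(forecasts, log_evidences)
print(weights, combined)
```

The two competitive architectures share almost all of the weight, so the implausible third model barely moves the combined forecast.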
Results
EVIDENT successfully identifies TCN architectures that generalize well to unseen patients, rejecting both under- and over-parameterized models. The framework's performance is validated against a random-search baseline, showing that it selects smaller architectures with more consistent forecasting performance. The plausibility-weighted ensemble predictions further enhance the predictive accuracy.
Implications
The findings suggest that EVIDENT can be a valuable tool for neural architecture selection in biomedical applications, particularly in scenarios where data is limited and variability is high. This approach may lead to more reliable forecasting models in critical healthcare settings, enabling better decision-making for patient-specific treatments.
Performance Evaluation of GraphCast for Medium-Range Weather Forecasting over Brazil
Time Series
- First regime-stratified benchmark of an MLWP model over Brazil.
- Utilization of a cloud-native evaluation pipeline for harmonizing data.
- Identification of conditions under which GraphCast gains or loses skill relative to traditional models.
- Establishment of operational boundaries for future optimization of AI weather models.
Summary
This paper evaluates the performance of the GraphCast machine learning weather prediction model in comparison to the deterministic ECMWF Integrated Forecasting System (IFS HRES) across four distinct climatic sub-regions in Brazil. The study highlights the lack of regional benchmarks for MLWP models in the Global South, particularly in complex and convective environments. Using a scalable cloud-native pipeline and the WeatherBench-X framework, the authors assess key tropospheric variables (temperature, specific humidity, and geopotential) over different seasonal windows. The results reveal a regime-dependent skill profile, with GraphCast underperforming in the medium range during the austral winter but excelling in the extended range. Conversely, during the austral summer, GraphCast effectively captures large-scale moisture transport while dampening high-frequency variability. The findings establish a baseline for future 'tropicalization' efforts to optimize AI models for regional resilience.
Methodology
The study employs a cloud-native evaluation pipeline built on the WeatherBench-X framework, assessing selected tropospheric variables (T850, Q850, Z500) across four seasonal windows. The performance of GraphCast is compared against the ECMWF IFS HRES model, with statistical metrics such as Root Mean Square Error (RMSE) and Anomaly Correlation Coefficient (ACC) used for verification against operational IFS analysis.
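The two verification metrics can be written out in a few lines. This sketch omits the latitude weighting that gridded evaluations normally apply and uses the uncentered form of ACC, so it is a simplification rather than the WeatherBench-X implementation; the temperature values are invented.

```python
import math

def rmse(forecast, analysis):
    """Root mean square error between forecast and verifying analysis."""
    return math.sqrt(sum((f - a) ** 2 for f, a in zip(forecast, analysis)) / len(forecast))

def acc(forecast, analysis, climatology):
    """Anomaly correlation coefficient: correlation of departures from climatology."""
    f_anom = [f - c for f, c in zip(forecast, climatology)]
    a_anom = [a - c for a, c in zip(analysis, climatology)]
    num = sum(x * y for x, y in zip(f_anom, a_anom))
    den = math.sqrt(sum(x * x for x in f_anom) * sum(y * y for y in a_anom))
    return num / den

# Toy T850 values in kelvin at four grid points.
climatology = [280.0, 282.0, 284.0, 286.0]
analysis = [281.0, 281.0, 286.0, 285.0]
forecast = [282.0, 280.0, 288.0, 284.0]  # right pattern, doubled amplitude
print(rmse(forecast, analysis), acc(forecast, analysis, climatology))
```

Here the forecast has a perfect anomaly pattern (ACC of 1) but a nonzero RMSE because its amplitude is wrong, which is why the two metrics are reported together.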
Results
The evaluation indicates that GraphCast exhibits a regime-dependent skill profile. In the austral winter, it underperforms in medium-range forecasts for Z500 but shows improved performance in the extended range. In contrast, during the austral summer, GraphCast effectively captures large-scale moisture transport while dampening high-frequency variability that affects temperature forecasts.
Implications
The findings have significant implications for enhancing weather forecasting in Brazil, particularly in sectors sensitive to weather conditions such as agriculture, energy, and transportation. The study also sets the stage for future research aimed at optimizing MLWP models for diverse climatic regimes in the Global South.
Cross-Epoch Adaptive Rollout Optimization for RL Post-Training
Reinforcement Learning
Large Language Models
Optimization
- CERO is the first rollout-allocation framework for LLM post-training that optimizes a global rollout budget across epochs.
- The method uses Bayesian estimates of prompt-level rollout value to guide adaptive budgeting.
- CERO demonstrates improved sample efficiency compared to traditional fixed allocation methods.
- Theoretical guarantees provide a strong foundation for the proposed approach.
Summary
This paper introduces CERO, a novel framework for adaptive rollout optimization in reinforcement learning (RL) post-training of large language models (LLMs). Traditional methods allocate a fixed number of rollouts per prompt, which fails to account for the varying training signals provided by different prompts. CERO addresses this inefficiency by formulating the problem as an online resource allocation task with prompt-level diminishing returns. It maintains a Beta posterior over each prompt's success probability and uses this to estimate the value of additional rollouts. The framework couples decisions across prompts and epochs through a shared global budget, optimizing the allocation of rollouts to maximize utility. The authors derive a Fenchel-dual reformulation of the problem and implement an online primal-dual algorithm that updates prompt-level and budget-level variables. Theoretical analysis shows an O(√K) regret bound against offline benchmarks, and empirical results demonstrate that CERO outperforms existing methods like GRPO in various mathematical reasoning tasks, highlighting its effectiveness in improving sample efficiency.
Methodology
CERO employs a Bayesian approach to maintain a Beta posterior over each prompt's success probability, using this to estimate the value of additional rollouts. It formulates the allocation problem as a budgeted online optimization task with diminishing returns and derives a Fenchel-dual reformulation to implement an online primal-dual algorithm. This algorithm updates both prompt-level and budget-level variables through projected online gradient descent.
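The Beta-posterior bookkeeping is easy to sketch. As a stand-in for the paper's utility, which the summary does not specify, the example scores a prompt by the posterior probability that n rollouts produce mixed outcomes (at least one success and one failure), since a group with identical rewards carries no group-relative signal. That proxy and the function names are assumptions.

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate update of a Beta(alpha, beta) posterior over a prompt's success rate."""
    return alpha + successes, beta + failures

def expected_power(a, b, n):
    """E[p^n] for p ~ Beta(a, b), via the product of rising-factorial ratios."""
    out = 1.0
    for k in range(n):
        out *= (a + k) / (a + b + k)
    return out

def prob_mixed_outcomes(alpha, beta, n):
    """Posterior probability that n rollouts are neither all successes nor all failures."""
    return 1.0 - expected_power(alpha, beta, n) - expected_power(beta, alpha, n)

# A fresh prompt with a uniform prior versus one that succeeded 8 times in 8 rollouts.
fresh = (1.0, 1.0)
easy = update_beta(1.0, 1.0, successes=8, failures=0)
for n in (2, 3, 4):
    print(n, prob_mixed_outcomes(*fresh, n), prob_mixed_outcomes(*easy, n))
```

The value grows with n but by smaller increments each time, and it is lower for the prompt the model already solves reliably, which is the diminishing-returns structure an allocator can exploit.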
Results
CERO achieves an O(√K) regret bound against offline allocation benchmarks and consistently outperforms GRPO across various open-weight LLMs and mathematical reasoning tasks, demonstrating its superior sample efficiency.
Implications
The findings suggest that adaptive rollout budgeting can significantly enhance the efficiency of RL post-training for LLMs, potentially leading to better performance in tasks requiring reasoning and decision-making. This approach could be applied to other areas of machine learning where resource allocation is critical.
PAC-Bayesian Adversarially Robust Generalization for Message Passing Graph Neural Networks: A Sensitivity Analysis
Graph Learning
Theory
- Introduces a sensitivity-aware PAC-Bayesian framework for MPGNNs.
- Derives tighter robust generalization bounds by analyzing output Jacobians.
- Utilizes anisotropic Gaussian posteriors to improve KL divergence bounds.
- Reduces complexity terms from hidden-width-dependent to class-dependent.
Summary
This paper addresses the vulnerability of Graph Neural Networks (GNNs) to adversarial attacks, focusing on robust generalization in adversarial settings. The authors extend the PAC-Bayesian framework to derive tighter generalization bounds for Message Passing Graph Neural Networks (MPGNNs). They highlight the limitations of existing analyses that rely on isotropic Gaussian posteriors and full parameter space perturbations, which fail to capture heterogeneous parameter sensitivity. By quantifying the sensitivity of perturbations across different parameter blocks through output Jacobians, the authors construct Jacobian-aligned sensitivity matrices and use anisotropic Gaussian posteriors with optimized covariances. This approach refines the spectral-norm dependence on learned weights, resulting in significantly tighter robust generalization guarantees. The findings provide insights into enhancing adversarial robustness in MPGNN designs.
Methodology
The authors extend the PAC-Bayesian framework by deriving output Jacobians to quantify parameter sensitivity. They construct Jacobian-aligned sensitivity matrices and employ anisotropic Gaussian posteriors with optimized covariances to achieve tighter bounds on KL divergence, thus enhancing the robustness analysis of MPGNNs.
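To see why the posterior covariance matters for the KL term, the standard closed form for the KL divergence between two Gaussians is useful. The choice of an isotropic prior $\mathcal{N}(0, \sigma^2 I)$ and a posterior $\mathcal{N}(w, \Sigma)$ over $d$ parameters is an assumption made for illustration; the paper's exact prior and covariance parameterization may differ.

```latex
\mathrm{KL}\big(\mathcal{N}(w, \Sigma)\,\|\,\mathcal{N}(0, \sigma^2 I)\big)
  = \frac{1}{2}\left[
      \frac{\operatorname{tr}(\Sigma)}{\sigma^2}
      + \frac{\lVert w \rVert_2^2}{\sigma^2}
      - d
      + d \ln \sigma^2
      - \ln \det \Sigma
    \right]
```

Setting $\Sigma = \sigma^2 I$ recovers the familiar isotropic value $\lVert w \rVert_2^2 / (2\sigma^2)$. An anisotropic $\Sigma$ leaves the trace and log-determinant terms free, so the covariance can be chosen per parameter block.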
Results
The proposed methodology yields much tighter robust generalization guarantees for MPGNNs compared to existing approaches. The analysis demonstrates that the complexity terms can be significantly reduced, leading to improved understanding and design of GNNs under adversarial conditions.
Implications
The findings suggest that by refining the sensitivity analysis of GNNs, practitioners can design more robust models against adversarial attacks, which is crucial for applications in social networks, biological modeling, and other graph-related tasks where reliability is paramount.
Event Detection for Parameter-to-KPI Dependency Learning for AI-RAN
Time Series
Interpretability
Optimization
- Introduces a method for event detection to support dependency learning in AI-RAN.
- Develops a synthetic traffic generator to simulate parameter-KPI relationships.
- Demonstrates the effectiveness of a machine-learning pipeline for recovering dependency structures.
- Identifies threshold calibration as a key factor in event detection quality.
Summary
The paper addresses the challenge of managing interactions among multiple AI-driven control functions in next-generation wireless networks, particularly in AI-integrated architectures like AI-RAN and O-RAN. It emphasizes the need for a reliable dependency structure that captures the influence of control parameters on network performance outcomes, represented by key performance indicators (KPIs). The authors propose a methodology for event detection that converts noisy telemetry data into binary indicators of parameter activity and KPI responses. This is crucial because not all fluctuations in data signify genuine control interactions; thus, distinguishing real relationships from background noise is essential. Due to the scarcity of real AI-RAN traffic data with known parameter-KPI relationships, the authors introduce a synthetic closed-loop traffic generator that simulates these dependencies. They evaluate a machine-learning-based pipeline that formulates the conversion of continuous telemetry into binary indicators as a significance-detection problem. Experimental results demonstrate that the proposed method effectively recovers latent dependency structures when the signal is sufficiently distinct from noise, with threshold calibration identified as a critical factor influencing detection quality. This work lays the groundwork for interpretable dependency learning, which is vital for adaptive control systems in AI-RAN.
Methodology
The authors developed a synthetic closed-loop traffic generator to create telemetry data with known parameter-KPI dependencies. They formulated the event detection as a significance-detection problem and employed a machine-learning-based pipeline to convert continuous telemetry into binary event indicators, focusing on distinguishing genuine control interactions from background noise.
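Converting continuous telemetry into binary event indicators can be illustrated with a simple robust threshold. This is a sketch under stated assumptions, not the authors' machine-learning pipeline: the function name `detect_events`, the use of first differences, and the median/MAD z-score with a default threshold of 3.0 are all choices made for this example.

```python
import numpy as np


def detect_events(series, threshold=3.0):
    """Flag time steps whose change is large relative to background variation.

    Returns a 0/1 array of the same length as `series`.
    """
    x = np.asarray(series, dtype=float)
    # Step-to-step change; the first step is defined to have zero change.
    diffs = np.diff(x, prepend=x[0])
    median = np.median(diffs)
    # Median absolute deviation, scaled to match a Gaussian standard deviation.
    mad = np.median(np.abs(diffs - median))
    scale = max(1.4826 * mad, 1e-12)
    z = np.abs(diffs - median) / scale
    return (z > threshold).astype(int)


# Hypothetical KPI trace with a level shift at index 4.
kpi = [0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.0]
events = detect_events(kpi)
print(events.tolist())
```

The `threshold` argument is the knob that the summary calls threshold calibration: set too low, background noise is flagged as events; set too high, genuine parameter or KPI changes are missed.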
Results
The experimental evaluation showed that the proposed pipeline reliably recovers latent dependency structures from noisy telemetry when the signal is adequately separated from background variation. The results underscore the importance of threshold calibration in enhancing event detection quality.
Implications
The findings have significant implications for the design and management of AI-driven control systems in wireless networks, enabling better understanding and monitoring of parameter-KPI relationships, which can lead to improved network performance and conflict mitigation in multi-control-point settings.
Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway
Theory
Optimization
- Discrete Gradient Descent with large step sizes induces pathway re-balancing rather than symmetry breaking.
- Single-path solutions are sharp minima, while balanced solutions across pathways are flatter.
- Training dynamics exhibit two phases: initial symmetry breaking followed by a re-balancing phase due to oscillations.
- The relationship between the number of pathways, depth, and sharpness of minima is theoretically derived.
Summary
This paper investigates the training dynamics of Deep Linear Networks (DLNs) with multi-pathways, focusing on the effects of discrete Gradient Descent (GD) with large step sizes. Previous analyses using Gradient Flow (GF) suggested that DLNs undergo a 'winner-takes-all' specialization, leading to symmetry breaking where features concentrate in single pathways. However, the authors demonstrate that GD with large step sizes leads to a different phenomenon termed 'pathway re-balancing.' They prove that single-path solutions correspond to sharp minima, while distributing signals across pathways results in flatter minima. As training progresses, the network initially exhibits symmetry breaking but later enters a re-balancing phase where oscillations at the Edge of Stability drive the network to redistribute signals across pathways. This work clarifies how depth influences pathway competition and suggests that large-step GD favors shared representations rather than persistent single-pathway dominance, challenging existing assumptions based on GF.
Methodology
The authors analyze the geometry of the loss landscape in deep linear networks with multiple pathways. They derive theoretical results regarding the sharpness of minima based on the number of pathways and network depth, and they explore the dynamics of training under large step sizes using discrete Gradient Descent.
Results
The study proves that balancing signals across multiple pathways reduces sharpness by a factor dependent on the number of pathways and depth. It identifies two distinct training phases: an initial symmetry-breaking phase followed by a re-balancing phase where the network flattens its loss landscape. Additionally, an upper bound on the learning rate is derived to ensure stability during training.
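A scalar toy model gives some intuition for why balanced solutions are flatter. The model is an assumption made for this sketch, not the paper's setting: each pathway is a product of `depth` scalar weights, the network output is the sum of pathway products, and the loss is `0.5 * (f - target) ** 2`. At a zero-loss point the Hessian reduces to the outer product of the gradient of `f`, so the sharpness (top eigenvalue) equals the squared gradient norm.

```python
def sharpness_at_minimum(num_paths, depth, target=1.0):
    """Top Hessian eigenvalue at a zero-loss point of the toy model.

    Assumes the signal is split equally across `num_paths` pathways and that
    all weights inside a pathway are equal.
    """
    # Each pathway product must equal target / num_paths.
    w = (target / num_paths) ** (1.0 / depth)
    # Derivative of a pathway product with respect to one of its weights.
    partial = w ** (depth - 1)
    # Squared gradient norm: num_paths * depth identical partial derivatives.
    return num_paths * depth * partial ** 2


for depth in (2, 3, 4):
    single = sharpness_at_minimum(1, depth)
    balanced = sharpness_at_minimum(4, depth)
    # Classical stability threshold for GD on a quadratic: lr < 2 / sharpness.
    print(depth, single, balanced, 2.0 / single, 2.0 / balanced)
```

In this toy model the balanced solution has sharpness `depth * num_paths ** (2 / depth - 1)`, so the two solutions tie at depth 2 and the balanced one becomes strictly flatter as depth grows. The `2 / sharpness` threshold is the standard quadratic stability condition and may differ from the bound derived in the paper.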
Implications
The findings have implications for understanding the optimization dynamics in deep learning, particularly in multi-pathway architectures. They suggest that the choice of optimization algorithms and learning rates can significantly influence the final network structure and performance, prompting a reconsideration of existing theories based on continuous-time analyses.
Domain-Adapted Small Language Models with Hybrid Post-Processing: Achieving Cost-Efficient, Low-Latency Multi-Label Structured Prediction via LoRA Fine-Tuning on Scarce Data
NLP
Large Language Models
Efficient ML
- Introduces a hybrid framework combining LoRA fine-tuning and rule-based post-processing for structured evaluation tasks.
- Achieves 100% JSON structural validity and high accuracy on compliance evaluations with minimal training data.
- Demonstrates significant cost savings and reduced latency compared to frontier large language models.
- Utilizes targeted hard-negative augmentation to improve model performance on critical decision boundaries.
Summary
This paper presents a novel framework for deploying small language models (specifically LLaMA 3.1 with 8 billion parameters) for domain-specific structured evaluation tasks, particularly in compliance evaluation of conversational transcripts. The authors address the challenges posed by large language models (LLMs) such as high latency, cost, and data privacy concerns. They propose a hybrid approach that combines fine-tuning using Low-Rank Adaptation (LoRA) on a limited dataset of 219 examples with a deterministic rule-based post-processing layer. This method achieves significant improvements in operational efficiency, yielding 100% structural validity in JSON outputs and 83% overall accuracy validated by human experts. The system operates at a cost of $0.013 per evaluation, which is substantially lower than proprietary alternatives, and completes inference in approximately 2 seconds, making it 2-5 times faster than existing LLM APIs. The paper also introduces targeted hard-negative augmentation to enhance model performance on critical decision boundaries, demonstrating that small domain-adapted models can achieve comparable accuracy to larger models while minimizing costs and latency.
Methodology
The methodology involves a two-stage hybrid inference pipeline. The first stage uses a fine-tuned small language model (LLaMA 3.1) to generate structured JSON outputs, while the second stage applies deterministic rule-based post-processing to enforce compliance with operational standards. The model is fine-tuned using LoRA on a small dataset, augmented with hard-negative examples to enhance decision boundary accuracy.
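The second stage, deterministic post-processing of the model's JSON output, might look roughly like the sketch below. The field names, the label set, the fallback label, and the consistency rule are all hypothetical; the summary does not specify the actual schema or rules.

```python
import json

# Hypothetical schema: none of these names come from the paper.
ALLOWED_LABELS = {"pass", "fail", "not_applicable"}
REQUIRED_FIELDS = ("greeting", "disclosure", "overall")


def postprocess(raw_output: str) -> str:
    """Coerce raw model text into a structurally valid JSON evaluation."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}

    cleaned = {}
    for field in REQUIRED_FIELDS:
        # Normalize case and whitespace; unknown or missing labels fall back.
        value = str(parsed.get(field, "")).strip().lower()
        cleaned[field] = value if value in ALLOWED_LABELS else "not_applicable"

    # Hypothetical consistency rule: any failed check forces an overall fail.
    if any(cleaned[f] == "fail" for f in REQUIRED_FIELDS if f != "overall"):
        cleaned["overall"] = "fail"
    return json.dumps(cleaned)


result = postprocess('{"greeting": "Pass", "disclosure": "FAIL", "overall": "pass", "extra": 1}')
print(result)
```

Because the function always emits exactly the required fields with allowed labels, structural validity no longer depends on the model, which is one plausible way a pipeline could reach 100% valid JSON.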
Results
The proposed system achieves 100% structural validity in JSON outputs, 83% overall accuracy in human-validated evaluations, and 100% accuracy on the most critical classification field. It operates with a latency of approximately 2 seconds per evaluation and at a cost of $0.013, resulting in 46-76% savings compared to proprietary alternatives.
Implications
The findings suggest that small language models can be effectively adapted for domain-specific tasks, providing a cost-efficient and privacy-preserving alternative to larger models. This approach can be applied in regulated industries such as finance and healthcare, where compliance and operational standards are critical.
Regret Minimization with Adaptive Opponents in Repeated Games
Theory
Reinforcement Learning
Optimization
- Introduction of Repeated Policy Regret (RP-Regret) as a new metric for regret in repeated games.
- Establishment of necessary conditions for achieving sublinear RP-Regret.
- Development of three algorithms for minimizing RP-Regret in non-convex strategy spaces.
- Demonstration of improved cooperative solutions and higher utility through RP-Regret minimization.
Summary
This paper introduces a novel metric for regret minimization in repeated games, termed Repeated Policy Regret (RP-Regret), which accounts for adaptive opponents that can respond based on the history of play. The authors argue that traditional external regret metrics fail to capture the dynamic nature of interactions in repeated games, leading to suboptimal equilibria. RP-Regret measures the difference between the realized utility and the best-in-hindsight utility, allowing for time-varying comparator strategies and adaptive opponents. The paper establishes necessary conditions for achieving sublinear RP-Regret and proposes three algorithms to minimize it: one utilizing an optimization oracle, another minimizing a convex surrogate of RP-Regret, and a third that directly minimizes RP-Regret under slowly changing opponent strategies. Experiments demonstrate that minimizing RP-Regret can lead to more cooperative outcomes and higher utility in games like Stag-Hunt, highlighting the potential for improved equilibria in repeated game scenarios.
Methodology
The authors propose a new regret metric (RP-Regret) and derive necessary conditions for its minimization. They develop three algorithms: one based on an optimization oracle, one minimizing a convex surrogate, and one for slowly changing strategies. The algorithms are tested through experiments in repeated games.
Results
The paper shows that by minimizing RP-Regret, players can achieve subgame perfect equilibria and higher utility outcomes in repeated games, particularly in cooperative scenarios like Stag-Hunt.
Implications
The findings suggest that incorporating adaptive opponent behavior into regret minimization can lead to better strategies and outcomes in repeated games, with potential applications in algorithmic game theory and multi-agent systems.
Quantifying the Privacy of Counterfactuals by Leveraging Membership Inference Attacks Against Synthetic Data
Theory
Interpretability
- Counterfactuals can be exploited for privacy attacks, similar to synthetic data.
- Membership inference attacks can be conducted on counterfactuals without model access.
- The study bridges the gap between synthetic data privacy research and counterfactual analysis.
- An ensemble MIA is proposed and compared with existing counterfactual distance attacks.
Summary
This paper investigates the privacy implications of counterfactuals in machine learning, particularly focusing on how they can be exploited through membership inference attacks (MIAs). Counterfactuals are often used to explain model decisions by illustrating how changes to user profiles can lead to different outcomes. However, they can also inadvertently reveal sensitive information about the model or its training data. The authors draw parallels between counterfactuals and synthetic data, suggesting that MIAs developed for synthetic data can be adapted to counterfactuals. The study demonstrates that it is possible to conduct MIAs on counterfactuals without direct access to the underlying model, a significant shift from traditional approaches that require model queries. The authors implement an ensemble MIA against counterfactuals generated by various state-of-the-art mechanisms and compare its effectiveness to a counterfactual distance attack specifically designed for counterfactuals. The findings highlight the need for caution among model developers when releasing counterfactuals, as they can lead to privacy breaches.
Methodology
The authors implemented an ensemble membership inference attack (MIA) against counterfactuals generated by various counterfactual generation mechanisms. They compared the effectiveness of this approach with a counterfactual distance attack, which is specifically designed for counterfactuals. The study operates in a no-box setting, meaning that the adversary only has access to the generated counterfactuals without querying the model.
Results
The results indicate that the ensemble MIA is effective in inferring membership from counterfactuals, demonstrating that privacy breaches can occur even without direct model access. This finding suggests that existing protections against MIAs may be insufficient when it comes to counterfactuals.
Implications
The implications of this research are significant for developers and practitioners in machine learning, particularly in high-stakes decision-making contexts. It underscores the importance of considering privacy risks when generating and releasing counterfactuals, as they can inadvertently expose sensitive information about the training data or model.
Representation Learning Enables Scalable Multitask Deep Reinforcement Learning
Reinforcement Learning
Robotics
Efficient ML
- Representation learning is more critical than model-based control for scalable multitask RL.
- MR.Q, a model-free algorithm, integrates predictive objectives and achieves superior performance.
- The approach significantly reduces computational overhead while improving sample efficiency.
- Predictive representation learning is essential for performance, as shown through ablation studies.
Summary
This paper addresses the challenge of scaling reinforcement learning (RL) to diverse multitask settings, arguing that representation learning is the key driver for scalability rather than model-based control. The authors introduce MR.Q, a model-free algorithm that integrates predictive objectives into a scalable actor-critic architecture. This approach demonstrates strong performance across various multitask continuous control tasks, outperforming a recent world-model-based method and several deep RL baselines while reducing computational overhead. The study emphasizes the importance of predictive representation learning, showing that it significantly enhances performance and sample efficiency, particularly in multitask environments. The findings suggest that improving representation quality is crucial for effective multitask learning in deep RL, offering a new perspective on the role of representation in scaling RL methods.
Methodology
The authors developed MR.Q, a model-free reinforcement learning algorithm that incorporates auxiliary predictive objectives into a scalable actor-critic framework. They evaluated MR.Q on a suite of multitask continuous control benchmarks, focusing on sample efficiency and performance under a reduced training budget (10M environment steps). The study included ablation experiments to assess the impact of predictive objectives on performance.
Results
MR.Q outperformed a recent world-model-based method (Newt) and various deep RL baselines across multitask benchmarks. It demonstrated improved wall-clock efficiency and sample efficiency, with stronger transfer capabilities to unseen tasks. The results indicated that predictive objectives are crucial for maintaining performance, especially at larger model scales.
Implications
The findings suggest that focusing on representation quality can lead to more efficient and scalable multitask learning in deep reinforcement learning. This could have applications in robotics, game playing, and other domains where multitask learning is essential.
Scaling Laws for Behavioral Foundation Models over User Event Sequences
Optimization
Theory
Efficient ML
- The compute-optimal event embedder size is approximately 2% of total parameters across various compute budgets.
- Behavioral scaling initially favors data-heavy training but approaches the Chinchilla heuristic at higher compute levels.
- The evaluation metric is integral to the scaling law, affecting optimal configurations for batch size and negative sampling.
- Negative sampling becomes a memory constraint at higher compute budgets rather than a compute constraint.
Summary
This paper investigates the scaling laws applicable to behavioral foundation models, which are increasingly used in various domains such as recommendations, payments, and fraud detection. The authors focus on a two-part model architecture consisting of a feature-based event embedder and a decoder-only transformer. Through approximately 600 experimental runs, they explore how different parameters, including the size of the embedder, batch size, model/data allocation, and negative sampling, affect model performance across a range of compute budgets (10^15–10^19 training FLOPs). The findings reveal that a small embedder (approximately 2% of total parameters) is optimal across all tested budgets, as it is more cost-effective per step and handles repeated items more efficiently than the contextualizer. The study also highlights that the compute-optimal data-to-parameter ratio shifts from being data-heavy at low compute to aligning with the Chinchilla heuristic at higher compute levels. Furthermore, the choice of evaluation metric significantly influences the scaling laws, indicating that practitioners must consider metric-dependent strategies when optimizing their models. Overall, the paper provides valuable insights into the calibration of behavioral foundation models, offering guidelines for practitioners in the field.
Methodology
The authors conducted a series of experiments using a two-part behavioral-model architecture, consisting of a feature-based event embedder and a decoder-only transformer. They varied key parameters across a wide range of compute budgets (10^15–10^19 FLOPs) and evaluated the impact of these variations on model performance using a shared evaluation pipeline.
Results
The study found that a small embedder is optimal for all tested budgets, with the data-to-parameter ratio decreasing significantly as compute increases. The critical batch size and optimal negative count were shown to depend on the chosen evaluation metric, indicating that the scaling laws are not uniform across different metrics. At the highest compute levels, the active constraint shifted from compute to memory for negative sampling.
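As a rough back-of-the-envelope aid, the two headline numbers can be turned into a sizing helper. Several assumptions go into this sketch and none are confirmed by the summary: training compute is approximated as 6 × parameters × tokens (a common estimate for dense transformers), the Chinchilla heuristic is taken as roughly 20 tokens per parameter, and events are counted as tokens. Since the summary reports that low-compute runs favor more data than this heuristic, the helper is only indicative at the high end of the budget range.

```python
import math


def chinchilla_style_split(compute_flops, tokens_per_param=20.0, embedder_fraction=0.02):
    """Split a compute budget into parameters and tokens under C ~ 6 * N * D.

    All defaults are assumptions for illustration, not fitted values from the paper.
    """
    # Solve C = 6 * N * (tokens_per_param * N) for N.
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return {
        "params": n_params,
        "tokens": n_tokens,
        # Roughly 2% of total parameters go to the event embedder.
        "embedder_params": embedder_fraction * n_params,
    }


plan = chinchilla_style_split(1e19)
print({name: f"{value:.3e}" for name, value in plan.items()})
```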
Implications
The findings suggest that practitioners should carefully consider the size of the embedder and the choice of evaluation metrics when training behavioral foundation models. The insights into scaling laws can help optimize model performance and resource allocation in real-world applications.
Reactive Flux Matching: Mechanism Discovery and Adaptive Sampling of Rare Events
Theory
Optimization
- Flux Matching provides a variational characterization of reactive path ensembles through current velocity and scalar potential.
- The framework is robust against non-Markovian projections, unlike traditional committor-based methods.
- It offers data-driven reaction coordinates that enhance adaptive sampling methods.
- Numerical validation shows its applicability across different molecular systems.
Summary
This paper introduces a novel framework called Flux Matching, which aims to facilitate the discovery of mechanisms and adaptive sampling of rare events in various scientific domains, including computational chemistry and climate science. The authors propose a method to extract mechanistic insights from reactive trajectory data by learning two complementary objects: a current velocity field (u) that delineates dominant reaction pathways, and a scalar potential (h) that serves as a data-driven reaction coordinate. Both u and h are derived from a weighted Helmholtz–Hodge decomposition of the reactive current and are estimated through regression from trajectory data, requiring no prior knowledge of the underlying dynamics. Unlike traditional committor-based methods, the proposed framework maintains its validity under non-Markovian projections, making it applicable to high-dimensional systems. The level sets of h provide adaptive interfaces for enhanced sampling techniques, enabling iterative refinement of the sampling process. The authors validate their approach through numerical experiments on various molecular systems, demonstrating its effectiveness in generating current velocity trajectories and calculating rate constants.
Methodology
The methodology involves a weighted Helmholtz–Hodge decomposition of the reactive current to derive the current velocity and scalar potential. The framework employs regression techniques to estimate these quantities directly from reactive trajectory data, minimizing quadratic functionals over the reactive path ensemble. This approach allows for the extraction of mechanistic insights without requiring knowledge of the underlying dynamics.
Results
The authors successfully demonstrated the Flux Matching framework on various systems, including the Müller–Brown potential and molecular systems like Alanine Dipeptide (ADP) and AIB9. The results indicated that the learned current velocity trajectories effectively captured the dominant reaction pathways, and the scalar potential provided a reliable reaction coordinate for adaptive sampling.
Implications
The implications of this work extend to various fields where understanding rare transitions is crucial, such as computational chemistry, biology, and climate science. The ability to derive data-driven reaction coordinates and improve sampling techniques can lead to more efficient simulations and deeper insights into complex systems.
Generalized TV–ℓp Structured Priors for Bayesian T1 Mapping
Theory
Optimization
Computer Vision
- Introduction of a generalized TV–ℓp prior for Bayesian T1 mapping.
- Demonstrated properness of the prior and its effectiveness in uncertainty quantification.
- Evaluation against multiple existing methods shows superior performance in terms of reliability and accuracy.
- Results indicate reduced uncertainty and improved spatial coherence in T1 maps.
Summary
This paper presents a novel family of structured spatial priors that integrates total variation (TV) with ℓp norms for Bayesian T1 mapping in MRI. The proposed TV–ℓp prior is shown to be a proper prior distribution, enhancing spatial coherence and smoothness in parameter estimation. The authors employ a Bayesian regression framework, utilizing the No-U-Turn Sampler (NUTS) for posterior inference, allowing for effective uncertainty quantification. The methodology is evaluated against maximum-likelihood estimation and various Bayesian alternatives, including uniform, Gamma, and bounded TV priors. Experiments conducted on synthetic brain and cardiac T1 mapping datasets, as well as a real in-vivo breast T1 mapping dataset, demonstrate that the TV–ℓp prior leads to more concentrated posterior densities, reduced uncertainty, lower variance, and smaller negative bias in estimates. This approach significantly improves the reliability of T1 mapping, making it a robust tool for medical imaging applications.
Methodology
The authors developed a Bayesian regression framework incorporating a generalized TV–ℓp prior, utilizing the No-U-Turn Sampler (NUTS) for posterior inference. The method was compared with maximum-likelihood estimation and various Bayesian priors through experiments on synthetic and real datasets.
Results
The TV–ℓp prior resulted in more concentrated posterior densities, indicating reduced uncertainty. It consistently achieved lower variance and smaller negative bias compared to other methods, leading to more reliable T1 mapping estimates.
Implications
The proposed method enhances the accuracy and reliability of T1 mapping in MRI, which could improve diagnostic capabilities in clinical settings. The approach also provides a framework for uncertainty quantification in other medical imaging applications.
Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs
Large Language Models
Optimization
Efficient ML
- Discovery of a dominant-layer phenomenon in ZO fine-tuning, where tuning a single layer can recover or exceed full-model performance.
- The dominant layer is task-agnostic but model-specific, identified efficiently through activation outlier analysis.
- Perturbation effects propagate effectively through the dominant layer, enhancing optimization signals under ZO updates.
- Dominant-layer ZO fine-tuning shows improved performance and training speed compared to existing methods.
Summary
This paper investigates zeroth-order (ZO) optimization for fine-tuning large language models (LLMs), revealing that a single decoding layer, termed the 'dominant layer', dominates the adaptation process. The authors demonstrate that fine-tuning this dominant layer alone can match or surpass the performance of full-model ZO fine-tuning across various LLM families and tasks. The dominant layer is task-agnostic yet model-specific, and it coincides with the first activation-outlier layer in the pre-trained model. The study further explores how perturbation effects propagate under ZO optimization, attributing the dominant layer's effectiveness to its high sensitivity to perturbations and its early position in the residual stream, which amplifies the optimization signals during forward-only updates. Extensive experiments on LLaMA2-7B and Qwen3-8B across nine benchmarks validate these findings, showing that dominant-layer ZO fine-tuning improves performance while achieving significant training speedups.
Methodology
The authors conducted a systematic layer-wise analysis of multiple LLMs, fine-tuning one layer at a time while freezing others. They identified the dominant layer through a simple inference-only analysis of activation outliers, which indicated the layer's perturbation sensitivity and its position in the model's architecture. The study also involved extensive experiments to compare the performance of dominant-layer ZO fine-tuning against full-model and LoRA-based ZO fine-tuning.
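A forward-only update restricted to one layer can be sketched with a two-point (SPSA-style) gradient estimate, the kind of estimator used by MeZO-like methods. The toy quadratic loss, the layer names, and the hyperparameters are assumptions for illustration; a real setup would perturb the weights of one transformer layer and evaluate the language-model loss.

```python
import numpy as np


def zo_step(params, layer, loss_fn, rng, eps=1e-3, lr=1e-2):
    """One two-point zeroth-order update applied to a single layer only."""
    z = rng.standard_normal(params[layer].shape)
    original = params[layer].copy()

    # Two forward passes with the chosen layer perturbed in opposite directions.
    params[layer] = original + eps * z
    loss_plus = loss_fn(params)
    params[layer] = original - eps * z
    loss_minus = loss_fn(params)

    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    params[layer] = original - lr * projected_grad * z
    return params


def toy_loss(params):
    # Hypothetical stand-in for a model loss: sum of squared weights.
    return sum(float(np.sum(w ** 2)) for w in params.values())


rng = np.random.default_rng(0)
params = {f"layer_{i}": np.ones(4) for i in range(3)}
initial_loss = toy_loss(params)
for _ in range(200):
    zo_step(params, "layer_1", toy_loss, rng)
final_loss = toy_loss(params)
print(initial_loss, final_loss)
```

Only `layer_1` is ever perturbed or updated, so the random direction `z` lives in a much smaller space than a full-model perturbation, and the other layers stay frozen.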
Results
The experiments demonstrated that dominant-layer ZO fine-tuning improved average performance by 0.86% over full-model ZO fine-tuning and 0.61% over LoRA-based ZO fine-tuning. Additionally, it achieved training speedups ranging from 1.12× to 4.52× compared to full-model MeZO.
Implications
The findings suggest that focusing on a single dominant layer can significantly reduce training costs while maintaining model performance. This insight could lead to more efficient ZO optimization strategies and enhance understanding of layer-wise contributions in LLMs.
Learning to Route LLMs from Implicit Cost-Performance Preferences via Meta-Learning
Large Language Models
Optimization
NLP
- Introduction of a perceptive LLM routing paradigm that learns user preferences through interaction.
- Development of MetaRouter, a meta-learning framework for preference-aware LLM routing.
- Demonstrated superior performance of MetaRouter over existing routing methods.
- High efficiency in learning user preferences and adaptability to different LLMs.
Summary
This paper addresses the challenge of efficiently routing queries to large language models (LLMs) based on user-specific cost-performance preferences. The authors introduce a novel perceptive LLM routing paradigm that learns users' implicit preferences through minimal interaction, overcoming limitations of existing methods that rely on manual configurations or retraining for each user. The proposed MetaRouter framework employs meta-learning to treat distinct user preferences as separate tasks within a contextual bandit framework. During the meta-training phase, the router adapts to diverse preference profiles, allowing it to infer user preferences from pairwise comparisons of LLM responses. This approach enables personalized routing decisions that optimize the selection of LLMs based on individual user needs. Experimental evaluations demonstrate that MetaRouter outperforms strong baselines in both in-distribution and out-of-distribution tasks, showcasing its efficiency in learning user preferences, robustness to changes in routable LLMs, and scalability for multi-model routing.
Methodology
The authors formulated the routing decision as a contextual bandit problem, treating distinct user preferences as different tasks. MetaRouter employs meta-learning to train on diverse preference profiles, allowing for rapid adaptation. User feedback is collected through pairwise comparisons of LLM responses, which is then used to infer a latent preference representation that informs routing decisions.
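Inferring a cost-performance preference from pairwise comparisons can be illustrated with a much simpler stand-in than MetaRouter's meta-learned latent representation. The sketch below assumes a one-parameter utility `quality - lam * cost`, a Bradley-Terry (logistic) choice model, and a grid search over `lam`; all of these, along with the function names and numbers, are assumptions made for this example.

```python
import math


def log_likelihood(lam, comparisons):
    """Log-likelihood of observed choices under utility = quality - lam * cost."""
    total = 0.0
    for (quality_a, cost_a), (quality_b, cost_b), a_preferred in comparisons:
        diff = (quality_a - lam * cost_a) - (quality_b - lam * cost_b)
        prob_a = 1.0 / (1.0 + math.exp(-diff))
        total += math.log(prob_a if a_preferred else 1.0 - prob_a)
    return total


def infer_cost_sensitivity(comparisons, grid):
    """Pick the grid value of lam that best explains the comparisons."""
    return max(grid, key=lambda lam: log_likelihood(lam, comparisons))


def route(candidates, lam):
    """Choose the model name with the highest estimated utility."""
    return max(candidates, key=lambda name: candidates[name][0] - lam * candidates[name][1])


# Each comparison: (quality, cost) of response A, of response B, and whether A won.
comparisons = [
    ((0.9, 1.0), (0.7, 0.1), False),  # user accepted lower quality for lower cost
    ((0.9, 1.0), (0.2, 0.1), True),   # but not when the quality drop was large
]
lam = infer_cost_sensitivity(comparisons, grid=[0.0, 0.25, 0.5, 0.75, 1.0])
choice = route({"large": (0.9, 1.0), "small": (0.6, 0.1)}, lam)
print(lam, choice)
```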
Results
MetaRouter outperformed strong baselines on both in-distribution and out-of-distribution tasks. It learned user preferences efficiently and remained robust to changes in the set of routable LLMs, which suggests it is suited to real-world deployment.
Implications
The findings suggest that personalized LLM routing can enhance user experience by optimizing the balance between performance and cost. This approach could be applied in various domains where LLMs are utilized, particularly in environments with diverse user needs and constraints.
What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning
Robotics
- Introduction of A4D, a framework for affordance-based reasoning in robot planning.
- Mapping of visual observations into a functional latent space to enhance generalizability.
- Significant improvements in inference accuracy for both existing and new affordances.
- Incorporation of an uncertainty-aware affordance discovery mechanism.
What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning
Summary
This paper addresses the limitations of existing robot planning systems that rely on appearance-based reasoning, which often fails to generalize to novel interactions because it does not capture the task-relevant functionalities of objects. The authors introduce A4D, a framework that maps visual observations into a functional latent space structured around affordances, such as 'movable' or 'supportable'. This approach allows robots to reason about how objects can be used in tasks rather than only identifying them by appearance. A4D also incorporates an affordance discovery mechanism that expands the latent space to accommodate unseen scenarios, enhancing the system's adaptability. Evaluated across various planning tasks, A4D shows significant improvements in inference accuracy and efficiency, achieving 94% accuracy on existing affordances and over 90% on new affordances with less than 10% of the original training data. The framework also enables 100x faster inference, making it suitable for real-time applications.
Methodology
The authors developed A4D, which utilizes a shared functional latent space to map visual observations and affordances. The framework employs proximity measures in this latent space to infer functionalities and quantify uncertainty in affordance inference. Additionally, it selectively triggers an affordance discovery mechanism when existing affordances are insufficient.
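The proximity-based inference and uncertainty-triggered discovery described above can be illustrated with a small sketch. It is not the A4D implementation: the summary does not specify the proximity measure or the uncertainty estimate, so cosine similarity to hypothetical affordance prototype vectors, a softmax over those similarities, and a normalized-entropy threshold are all assumptions here, as are the function names, the `temperature` and `max_entropy` parameters, and the two-dimensional toy embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def infer_affordance(embedding, prototypes, temperature=0.1, max_entropy=0.5):
    """Return (best affordance, normalized entropy, whether to trigger discovery)."""
    names = list(prototypes)
    if len(names) < 2:
        raise ValueError("need at least two affordance prototypes")
    sims = [cosine(embedding, prototypes[name]) for name in names]
    # Softmax over similarities; subtracting the max keeps exp() well-behaved.
    top = max(sims)
    weights = [math.exp((s - top) / temperature) for s in sims]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Entropy normalized to [0, 1]: 0 means confident, 1 means maximally unsure.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    normalized = entropy / math.log(len(names))
    best = names[probs.index(max(probs))]
    # Assumed trigger rule: high uncertainty means known affordances do not fit.
    return best, normalized, normalized > max_entropy

# Toy prototypes in a 2-D "functional latent space" (illustrative values only).
prototypes = {"movable": [1.0, 0.0], "supportable": [0.0, 1.0]}

clear = infer_affordance([0.9, 0.1], prototypes)
ambiguous = infer_affordance([0.7, 0.7], prototypes)
print(clear)
print(ambiguous)
```

An observation close to one prototype is assigned that affordance with low uncertainty, while an observation equidistant from all prototypes yields high uncertainty and, under this assumed rule, would trigger affordance discovery.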
Results
A4D achieved 94% inference accuracy on existing affordances, outperforming state-of-the-art methods by over 20 percentage points. For new affordances, inference accuracy improved from approximately 70% to over 90% with less than 10% of the original training data. The framework also enabled 100x faster inference, facilitating real-time planning.
Implications
The findings suggest that A4D can significantly enhance robot planning systems by enabling them to understand and utilize object functionalities more effectively. This could lead to more adaptable and efficient robotic systems capable of handling a wider range of tasks and interactions in dynamic environments.