AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 67
Update frequency: 8h
Days of history: 7
A Stationary-Distribution Theory for Triplet-Based Plateau Search in Random Forest Ensemble-Size Selection
Theory
Optimization
- Introduces a stationary-distribution theory for ensemble-size selection in Random Forests.
- Models the ensemble size selection process as a birth-death Markov chain.
- Demonstrates that the central ensemble size fluctuates around a stationary regime.
- Derives key scaling relationships for the stationary center and variance of ensemble size.
Summary
This paper addresses the challenge of selecting the optimal number of trees in Random Forests, a critical parameter that balances predictive performance and computational cost. The authors introduce a stationary-distribution theory to model the ensemble size selection process as a birth-death Markov chain on a geometric grid. They analyze the behavior of the central ensemble size, $B_t$, which fluctuates around a stationary regime rather than converging to a deterministic value. The study derives equilibrium equations for both the original and a modified update rule, revealing that the stationary center $B^*$ scales as $O(\varepsilon^{-2})$ as ε approaches zero. The variance of the ensemble size is also characterized, indicating that the leading relative spread is independent of ε and influenced by the scale factor and update rule. This work interprets the plateau-based tuning of Random Forests as a stochastic process, providing insights into the dynamics of hyperparameter optimization without the need for a fixed upper bound on the number of trees.
Methodology
The authors model the ensemble size as a birth-death Markov chain on a geometric grid, deriving the stationary distribution through local balance. They utilize a centered folded-normal approximation to obtain equilibrium equations for the update rules and analyze the implications of these equations on the ensemble size's behavior.
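As a rough illustration of the local-balance step, the sketch below computes the stationary distribution of a small birth-death chain on a geometric grid. The grid parameters (`B0`, `r`) and the transition probabilities (`up`, `down`) are hypothetical placeholders chosen for readability; they are not the paper's update rule.

```python
def stationary_birth_death(up, down):
    """Stationary distribution of a finite birth-death chain.

    up[k]   = P(state k -> k+1)
    down[k] = P(state k+1 -> k)
    Local balance gives pi[k] * up[k] == pi[k+1] * down[k].
    """
    weights = [1.0]
    for p, q in zip(up, down):
        weights.append(weights[-1] * p / q)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical geometric grid of ensemble sizes B_k = B0 * r**k.
B0, r, n_states = 10, 2, 6
grid = [B0 * r**k for k in range(n_states)]

# Hypothetical probabilities: growing is likelier when the forest is small,
# shrinking is likelier when it is large.
up = [0.8, 0.6, 0.4, 0.2, 0.1]
down = [0.1, 0.2, 0.4, 0.6, 0.8]

pi = stationary_birth_death(up, down)
mean_size = sum(p * b for p, b in zip(pi, grid))
```

Under these assumed rates the chain does not settle on one grid point; the mass concentrates on the middle sizes, which is the kind of stationary regime the summary describes.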
Results
The study finds that the stationary center of the ensemble size scales as $O(\varepsilon^{-2})$ and that the variance is $O(\varepsilon^{-4})$. The leading relative spread is shown to be independent of ε, controlled by the scale factor and update rule, indicating a robust framework for understanding the dynamics of Random Forest ensemble-size selection.
Implications
This research provides a theoretical foundation for optimizing the number of trees in Random Forests, potentially leading to more efficient hyperparameter tuning methods that adaptively select ensemble sizes based on performance metrics. It may enhance the reliability of variable importance measures derived from Random Forests, particularly in high-dimensional settings.
Fork-Think with Confidence
NLP
Large Language Models
Efficient ML
- Introduces a new decide-first-then-think paradigm for LLM reasoning.
- Demonstrates significant reductions in token consumption and runtime compared to traditional methods.
- Shows that later forking points can lead to improved generation quality.
- Combines Fork-think with existing techniques like early stopping for enhanced performance.
Summary
The paper introduces a novel approach to reasoning in large language models (LLMs) called 'Fork-think with confidence', which deviates from the traditional think-first-then-decide paradigm. Instead, it adopts a decide-first-then-think strategy, where the model first identifies promising forking points based on confidence from a single seeding path before generating multiple continuations. This method aims to reduce overgeneration and improve efficiency in reasoning tasks. The authors conduct experiments across three models and three reasoning benchmarks, demonstrating that Fork-think can reduce token consumption by up to 30% and runtime by up to 57%, while achieving comparable or superior performance to existing parallel thinking methods. The analysis reveals that later forking points yield better results, suggesting that the decide-first-then-think paradigm is a promising direction for enhancing LLM reasoning capabilities without the need for extensive retraining or warm-up phases.
Methodology
The authors propose the Fork-think method, which first identifies forking points using model confidence from a single seeding path. After determining these points, the model samples multiple continuations and aggregates them for the final response. The methodology includes systematic evaluations across three models and datasets, along with ablation studies to analyze the impact of different factors on performance.
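The decide-first-then-think flow can be sketched as follows. This is only an illustration under stated assumptions: the summary does not say how confidence maps to a forking point, so the sketch forks at the least-confident token, and it aggregates by majority vote. `pick_fork_point`, `fork_and_aggregate`, and the `continue_fn` callback (a stand-in for sampling a continuation from a model) are hypothetical names.

```python
from collections import Counter


def pick_fork_point(confidences, min_position=0):
    """Index of the least-confident token at or after min_position.

    Assumption: low confidence marks a promising fork. The first such
    index wins on ties.
    """
    candidates = range(min_position, len(confidences))
    return min(candidates, key=lambda i: confidences[i])


def fork_and_aggregate(seed_tokens, confidences, continue_fn,
                       n_branches=4, min_position=0):
    """Decide the fork point first, then sample continuations from it."""
    k = pick_fork_point(confidences, min_position)
    prefix = seed_tokens[:k]  # shared prefix kept from the seeding path
    answers = [continue_fn(prefix, branch) for branch in range(n_branches)]
    # Assumption: the final response is the majority answer.
    return Counter(answers).most_common(1)[0][0]
```

Raising `min_position` restricts forking to later tokens, which is one way to probe the reported effect of later forking points.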
Results
Fork-think reduces token consumption by up to 30% and runtime by up to 57% compared to traditional parallel thinking methods. The results indicate that the new paradigm not only maintains but can enhance performance, particularly by leveraging later forking points for better generation outcomes.
Implications
The findings suggest that adopting a decide-first-then-think approach could lead to more efficient reasoning in LLMs, potentially reducing computational costs and improving performance in complex reasoning tasks. This could have significant implications for applications requiring high-quality reasoning without extensive retraining.
From Search to Synthesis: Training LLMs as Zero-Shot Workflow Generators
Large Language Models
Reinforcement Learning
Optimization
- MetaFlow enables zero-shot workflow generation by learning task-level patterns instead of instance-specific solutions.
- The approach consists of a two-stage training process: supervised fine-tuning followed by reinforcement learning with execution feedback.
- MetaFlow demonstrates strong generalization capabilities, performing well on both trained and untrained tasks.
- The model achieves competitive performance against state-of-the-art baselines across multiple benchmarks.
Summary
This paper introduces MetaFlow, a novel approach to automatic workflow generation using large language models (LLMs). Traditional methods either generate instance-specific solutions or require extensive re-optimization for new tasks, limiting their applicability. MetaFlow addresses these challenges by framing workflow generation as a meta-learning problem, allowing the model to learn to compose solution strategies based on task descriptions and operator sets. The training process consists of two stages: first, supervised fine-tuning on synthetic workflow data, followed by reinforcement learning with verifiable rewards (RLVR) that utilizes execution feedback to enhance performance across various problem instances. The results demonstrate that MetaFlow not only produces effective workflows for trained tasks but also exhibits strong zero-shot generalization capabilities for untrained tasks and novel operator sets. The model achieves performance comparable to state-of-the-art methods in question answering, code generation, and mathematical reasoning, showcasing its potential for broader applications in automating complex tasks.
Methodology
MetaFlow employs a two-stage training paradigm, beginning with supervised fine-tuning on synthetic workflow data to establish foundational knowledge of task-operator relationships, followed by reinforcement learning that leverages execution feedback to refine workflow generation. This approach allows for the synthesis of workflows directly from task descriptions and operator sets, facilitating zero-shot generation.
Results
MetaFlow achieves performance on par with state-of-the-art methods in in-domain tasks while demonstrating remarkable zero-shot generalization on out-of-domain tasks and novel operator sets. The model effectively generates structured workflows that are robust and reusable across various problem instances.
Implications
The development of MetaFlow has significant implications for automating complex tasks across diverse domains, reducing the reliance on expert knowledge for workflow design and enhancing productivity. Its ability to generalize to new tasks and operator sets opens up possibilities for broader applications in fields such as software development, data analysis, and artificial intelligence.
Signed-Permutation Coordinate Transport for RMSNorm Transformers
Large Language Models
Theory
Interpretability
- RMSNorm transformers require a signed-permutation gauge for accurate coordinate transport.
- Sign-marginalized Hungarian matching improves matching accuracy compared to raw signed-correlation methods.
- Coordinate-preserving transport outperforms endpoint matching in recovering cross-run coordinates.
- Signed transport maintains the trajectory of optimizer states, unlike permutation-only methods.
Summary
This paper addresses the challenges of coordinate-indexed object transport in RMSNorm transformers, which are increasingly used in large language model (LLM) workflows. The author demonstrates that the native discrete gauge for RMSNorm models is a signed-permutation gauge ($B_d$), as opposed to the permutation gauge ($S_d$) used in LayerNorm models. This distinction is crucial because permutation-only alignment is insufficient for RMSNorm architectures. The paper introduces a sign-marginalized Hungarian matching approach that overcomes a significant failure mode in raw signed-correlation matching, which is limited by the fraction of positive signs in the true gauge. The author shows that using coordinate-preserving transport significantly improves the recovery of cross-run coordinates during fine-tuning, achieving 91.1% recovery compared to 60.3% with endpoint matching. Additionally, the paper highlights that signed transport preserves the training trajectory of optimizers like AdamW, while permutation-only transport diverges. The findings emphasize the importance of gauge specification for reproducibility and interpretability in coordinate-indexed artifacts.
Methodology
The author employs a theoretical framework to analyze the native gauges of RMSNorm and LayerNorm models, proving the necessity of using signed-permutation gauges. The paper introduces a sign-marginalized Hungarian matching algorithm for coordinate alignment and conducts experiments to compare the effectiveness of coordinate-preserving transport versus endpoint matching in fine-tuning scenarios.
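A minimal sketch of matching coordinates up to a signed permutation is given below. It rests on two assumptions not taken from the paper: sign marginalization is approximated by scoring each pair with the absolute correlation, and an exhaustive search over permutations stands in for the Hungarian algorithm so the example needs only the standard library (for realistic widths one would use an assignment solver such as `scipy.optimize.linear_sum_assignment`). The function names are hypothetical.

```python
from itertools import permutations


def correlation(a, b):
    """Pearson correlation of two equal-length traces."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def match_signed_permutation(run_a, run_b):
    """Match coordinates of run_a to run_b up to permutation and sign.

    run_a, run_b: one activation trace (list of floats) per coordinate.
    Returns (perm, signs) with run_a[i] ~ signs[i] * run_b[perm[i]].
    """
    d = len(run_a)
    corr = [[correlation(run_a[i], run_b[j]) for j in range(d)]
            for i in range(d)]
    # Score with |corr| so a flipped coordinate is not penalized.
    # Exhaustive search is O(d!): only viable for tiny d.
    best = max(permutations(range(d)),
               key=lambda p: sum(abs(corr[i][p[i]]) for i in range(d)))
    signs = [1 if corr[i][best[i]] >= 0 else -1 for i in range(d)]
    return list(best), signs
```

Scoring with the raw signed correlation instead would push negatively signed coordinates toward wrong partners, which is consistent with the failure mode the summary attributes to raw signed-correlation matching.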
Results
The study finds that using sign-marginalized matching leads to a significant improvement in structural permutation accuracy. In fine-tuning experiments, coordinate-preserving transport recovers 91.1% of coordinates at 1500 steps, while endpoint matching only recovers 60.3%. The results also indicate that signed transport preserves the optimizer's training trajectory, contrasting with the divergence seen in permutation-only transport.
Implications
The findings suggest that adopting signed-permutation transport methods can enhance the performance and interpretability of LLM workflows. This has implications for the development of more robust and reproducible machine learning models, particularly in applications involving coordinate-indexed artifacts.
Transformers as Bayesian In-Context Experimenters: Smoothness-Adaptive Efficient ATE Estimation
Theory
Efficient ML
Optimization
- Introduces Bayesian in-context experimenters using transformer architectures for ATE estimation.
- Utilizes a mixture-of-experts approach to adaptively handle unknown outcome smoothness.
- Proves that the transformer can learn the history-to-propensity mapping through supervised pretraining.
- Demonstrates empirical success in mimicking Bayesian-Neyman allocation behavior and improving ATE precision.
Summary
This paper explores the use of transformer architectures as Bayesian in-context experimenters for adaptive experimentation aimed at estimating average treatment effects (ATE). The authors propose a novel framework where transformers are trained to imitate a Bayesian posterior Neyman teacher, which updates beliefs about potential outcomes based on experimental history. The key innovation lies in the incorporation of a mixture-of-experts (MoE) architecture to adaptively manage unknown outcome smoothness, allowing the transformer to concentrate on the most relevant experts as data accumulates. The methodology involves using attention-based sufficient statistics and projected gradient descent to implement Bayesian updates efficiently. The authors demonstrate that this approach can achieve near-oracle performance in ATE estimation, balancing bias and variance effectively. Empirical results show that the transformer model can accurately mimic the behavior of the Bayesian teacher and significantly improve ATE precision compared to traditional methods, achieving smoothness-adaptive minimax rates across various unseen smoothness levels.
Methodology
The authors develop a framework where transformers are trained to imitate a Bayesian posterior Neyman teacher. They utilize attention-based sufficient statistics and projected gradient descent for efficient Bayesian updates. A mixture-of-experts architecture is employed to adaptively manage the complexity of potential outcomes based on smoothness, allowing the model to focus on the most relevant experts as data is collected. The training process involves supervised empirical risk minimization over finite pretraining trajectories.
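For orientation, the classical Neyman allocation that the teacher targets can be written in a few lines. This sketch covers only the textbook two-arm case with known outcome standard deviations and a plain difference-in-means estimator; the Bayesian posterior updates, the mixture of experts, and the AIPW estimator from the paper are not modeled. Function names are hypothetical.

```python
def neyman_propensity(sd_treated, sd_control):
    """Treatment probability proportional to the arm's outcome std dev."""
    return sd_treated / (sd_treated + sd_control)


def diff_in_means_variance(p, sd_treated, sd_control, n):
    """Variance of the difference in means when n*p units are treated
    and n*(1-p) units are controls."""
    return sd_treated ** 2 / (n * p) + sd_control ** 2 / (n * (1 - p))


# Example: the treated arm is three times noisier than the control arm.
p_star = neyman_propensity(3.0, 1.0)
var_neyman = diff_in_means_variance(p_star, 3.0, 1.0, n=100)
var_uniform = diff_in_means_variance(0.5, 3.0, 1.0, n=100)
```

In this example the Neyman propensity is 0.75 and the variance drops from 0.20 under uniform randomization to 0.16, which equals (3 + 1)^2 / 100.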
Results
The experiments confirm that the transformer model can accurately imitate the Bayesian teacher's behavior, achieving smoothness-adaptive minimax rates across seven unseen smoothness levels. The trained design transformer effectively reproduces Bayesian-Neyman allocation patterns and reduces AIPW ATE estimation mean squared error relative to uniform randomization, approaching the performance of oracle Neyman allocation.
Implications
This work has significant implications for adaptive experimentation in clinical trials and online platforms, where efficient ATE estimation is crucial. The proposed transformer-based framework can automate the process of treatment assignment, improving the precision of experimental outcomes while reducing the need for manual engineering.
SP-CACW: Convergence-Aware Client Weighting for Selfish Personalized Learning
Federated Learning
- Introduces the SP-CACW framework for selfish personalized learning in federated settings.
- Convergence-aware client weighting minimizes the target client's convergence error.
- Provides convergence-rate guarantees, particularly for cluster-structured problems.
- Demonstrated effectiveness on multiple datasets, outperforming existing methods.
Summary
The paper introduces SP-CACW, a novel framework for selfish personalized learning in federated learning environments. Traditional federated learning optimizes a global average objective, which can be ineffective for clients with significantly different data distributions. The authors focus on selfish personalization, where a target client aims to minimize its own risk using peer gradients while avoiding negative transfer. SP-CACW employs a convergence-aware client-weighting mechanism that selects aggregation weights by minimizing an upper bound on the target client's convergence error. This approach allows for the assignment of zero weight to harmful peers and effectively balances peer bias against stochastic variance. The authors provide convergence guarantees under specific assumptions and demonstrate the effectiveness of SP-CACW on datasets such as MNIST, CIFAR-100, and LEAF Shakespeare, showing competitive or improved performance over existing personalized and clustering-based methods.
Methodology
The SP-CACW framework determines client aggregation weights by minimizing an upper bound on the convergence error of a designated target client. This involves analyzing the trade-off between peer bias and stochastic variance, allowing the framework to assign zero weight to peers that could negatively impact the target client's learning process.
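The bias-variance trade-off behind the weighting can be illustrated with a toy stand-in bound. The objective below, (sum_i w_i * b_i)^2 + sum_i w_i^2 * v_i, is an assumption made for this sketch and is not the bound derived in the paper; `b_i` is a hypothetical per-peer bias and `v_i` a hypothetical gradient variance. Weights are found by projected gradient descent on the probability simplex.

```python
def project_to_simplex(v):
    """Euclidean projection onto {w : w_i >= 0, sum_i w_i = 1}."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        css += uj
        t = (css - 1.0) / j
        if uj - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def toy_bound(w, bias, var):
    """Stand-in objective: squared weighted bias plus weighted variance."""
    wb = sum(wi * bi for wi, bi in zip(w, bias))
    return wb ** 2 + sum(wi * wi * vi for wi, vi in zip(w, var))


def fit_weights(bias, var, steps=5000):
    """Minimize toy_bound over the simplex by projected gradient descent."""
    n = len(bias)
    w = [1.0 / n] * n
    # Step size 1/L, where L upper-bounds the largest Hessian eigenvalue.
    lr = 1.0 / (2.0 * (sum(b * b for b in bias) + max(var)))
    for _ in range(steps):
        wb = sum(wi * bi for wi, bi in zip(w, bias))
        grad = [2 * wb * bi + 2 * wi * vi for wi, bi, vi in zip(w, bias, var)]
        w = project_to_simplex([wi - lr * gi for wi, gi in zip(w, grad)])
    return w


# Client 0 is the target (no bias); client 1 is a similar peer;
# client 2 is a harmful peer with large bias.
weights = fit_weights(bias=[0.0, 0.5, 5.0], var=[1.0, 1.0, 1.0])
```

With these toy numbers the harmful peer ends at weight zero while the similar peer keeps a weight near 4/9, because averaging with it reduces variance by more than its bias costs.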
Results
The SP-CACW framework was evaluated on MNIST, CIFAR-100, and LEAF Shakespeare datasets, where it consistently showed competitive performance compared to strong personalized and clustering baselines. The results indicate that SP-CACW effectively improves convergence rates for the target client while managing the risks associated with peer contributions.
Implications
The findings suggest that SP-CACW can enhance personalized learning in federated environments, particularly for clients with distinct data distributions. This has practical applications in scenarios such as inter-silo networks and public data exploitation, where clients can benefit from collaborative learning without compromising their individual objectives.
Fisher-Routed Mixture of Experts for Federated Class-Incremental Learning
Federated Learning
- Introduces FedFMX to address challenges in federated class-incremental learning.
- Develops FRES and AES modules for expert routing and selection based on stability-plasticity trade-offs.
- Employs routing-aware regularization to promote balanced expert utilization.
- Demonstrates superior performance over existing methods on multiple benchmark datasets.
Summary
This paper introduces the Fisher-Routed Mixture of Experts for Federated Class-Incremental Learning (FedFMX), a novel framework designed to tackle the unique challenges of federated learning (FL) in class-incremental scenarios. The authors identify three primary challenges: capacity conflict and catastrophic forgetting due to model overloading, heterogeneity from non-IID data, and synchronized class misalignment among clients. FedFMX addresses these issues through adaptive expert specialization, routing each sample to a subset of experts that optimize knowledge acquisition and retention. The framework incorporates a Fisher-Routed Expert Scoring (FRES) module to assess expert importance based on stability and plasticity, and an Adaptive Expert Selection (AES) module to dynamically determine expert subsets. Additionally, a routing-aware regularization (RAR) strategy is employed to ensure load balance and efficient training. The theoretical convergence rate of the method is proven to be $O(T^{-1})$. Extensive experiments on benchmark datasets such as CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that FedFMX outperforms state-of-the-art methods, showcasing its effectiveness in mitigating catastrophic forgetting and enhancing collaboration among heterogeneous clients.
Methodology
The methodology involves a modular approach where each sample is routed to a dynamically selected subset of experts. The FRES module quantifies expert suitability using Fisher information to balance stability and plasticity, while the AES module formulates expert selection as a cooperative game. The RAR strategy is implemented to enhance routing robustness and mitigate biased expert usage.
Results
The experiments conducted on CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets show that FedFMX consistently outperforms state-of-the-art methods in terms of accuracy and robustness against catastrophic forgetting, validating the effectiveness of the proposed framework.
Implications
The findings suggest that FedFMX can significantly improve federated learning applications in dynamic environments where clients continuously encounter new classes, making it suitable for real-world scenarios such as mobile applications and IoT systems.
Low-dimensional topology of deep neural networks
Theory
- Layer-skipping in ResNets is as powerful as attention mechanisms in transformers for changing linking numbers.
- Feedforward networks with monotonic activations are less expressive than ResNets and transformers.
- Nonmonotonic activations can enhance the expressivity of feedforward networks, allowing them to match the capabilities of more complex architectures.
- The study highlights the importance of low-dimensional topology in understanding and designing neural network architectures.
Summary
This paper explores the low-dimensional topology of deep neural networks, specifically focusing on layered models such as feedforward networks, ResNets, and transformers, while constraining each layer to a width of $d = 3$ (representation space $\mathbb{R}^3$). By doing so, the authors aim to analyze how topological invariants, particularly linking numbers, evolve through the layers of the networks. The study reveals that the layer-skipping feature in ResNets is as effective as the attention mechanism in transformers for altering linking numbers. Furthermore, it establishes a hierarchy of expressivity among different architectures, demonstrating that ResNets and transformers outperform feedforward networks with monotonic activations, while nonmonotonic activations can elevate feedforward networks to the same expressivity level as the former. The authors argue that low-dimensional topology can serve as a valuable framework for guiding the design of AI architectures and generalize their findings to higher dimensions. The paper provides both theoretical proofs and empirical experiments to support these insights.
Methodology
The authors employed a theoretical approach to analyze the topological properties of neural networks by fixing the width of layers to d = 3. They focused on the linking number as a topological invariant to study the effects of different architectural features, including layer-skipping, attention mechanisms, and activation functions. Empirical experiments were conducted to validate the theoretical insights, particularly on linked data structures like Hopf links and higher-dimensional linked spheres.
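To make the invariant concrete, the sketch below estimates the linking number of two closed curves in three dimensions with a discretized Gauss linking integral and applies it to a Hopf link. The discretization (chord midpoints, 120 samples per circle) and the helper names are choices made for this illustration, not taken from the paper.

```python
import math


def circle(center, u, v, n=120):
    """Sample n points of the closed curve center + cos(a)*u + sin(a)*v."""
    return [
        tuple(center[k] + math.cos(a) * u[k] + math.sin(a) * v[k]
              for k in range(3))
        for a in (2 * math.pi * i / n for i in range(n))
    ]


def linking_number(curve_a, curve_b):
    """Midpoint-rule estimate of the Gauss linking integral
    (1 / 4pi) * sum over segment pairs of r . (da x db) / |r|^3."""
    total = 0.0
    na, nb = len(curve_a), len(curve_b)
    for i in range(na):
        p, p2 = curve_a[i], curve_a[(i + 1) % na]
        da = [p2[k] - p[k] for k in range(3)]
        ma = [(p2[k] + p[k]) / 2 for k in range(3)]
        for j in range(nb):
            q, q2 = curve_b[j], curve_b[(j + 1) % nb]
            db = [q2[k] - q[k] for k in range(3)]
            mb = [(q2[k] + q[k]) / 2 for k in range(3)]
            r = [ma[k] - mb[k] for k in range(3)]
            cross = [da[1] * db[2] - da[2] * db[1],
                     da[2] * db[0] - da[0] * db[2],
                     da[0] * db[1] - da[1] * db[0]]
            dist = math.sqrt(sum(x * x for x in r))
            total += sum(r[k] * cross[k] for k in range(3)) / dist ** 3
    return total / (4 * math.pi)


# Hopf link: a unit circle in the xy-plane and a unit circle in the
# xz-plane that passes through the first circle's center.
ring_a = circle((0, 0, 0), (1, 0, 0), (0, 1, 0))
ring_b = circle((1, 0, 0), (1, 0, 0), (0, 0, 1))
# The same second ring moved far away, so the two are unlinked.
ring_far = circle((5, 0, 0), (1, 0, 0), (0, 0, 1))
```

The sign of the result depends on the orientation chosen for each curve, so only its absolute value is meaningful here: close to 1 for the Hopf link and close to 0 for the separated rings.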
Results
The paper demonstrates that width-3 feedforward networks with monotonic activations cannot separate linked manifolds, establishing a fundamental limitation in their expressivity. In contrast, architectures that incorporate folding mechanisms, such as ResNets and transformers, can effectively unlink these manifolds, leading to improved classification performance. The experiments confirmed that models with folding capabilities achieved higher accuracy on synthetic datasets and CIFAR-10.
Implications
The findings suggest that incorporating topological considerations into neural network design could lead to more effective architectures, particularly for tasks involving complex data structures. This approach may enhance the understanding of how different architectural features contribute to model performance and guide future innovations in AI.
Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation
NLP
Large Language Models
Optimization
- Introduces a set-valued prediction framework for multi-agent routing from natural language prompts.
- Develops a WildChat-derived benchmark with 3,000 prompts and a controlled agent catalog for evaluation.
- Demonstrates that supervised routing methods outperform nearest-neighbor and zero-shot approaches.
- Highlights the effectiveness of a weighted routing layer in improving utility in constrained settings.
Summary
This paper presents a novel approach to multi-agent routing framed as a set-valued prediction problem, addressing the need for selecting multiple agents based on natural language prompts while considering execution costs. The authors introduce a benchmark derived from WildChat, comprising 3,000 prompts and a fixed catalog of 12 agents, with AI-assisted heuristic labels for multi-label evaluation. The evaluation protocol incorporates various set-level metrics, latency considerations, and a cost-aware routing framework. The study compares several routing methods, including nearest-neighbor matching, supervised multi-label classification, and a fine-tuned encoder, alongside a zero-shot LLM baseline. The findings demonstrate that supervised routers significantly outperform simpler methods, with the fine-tuned encoder achieving the highest accuracy in unconstrained settings. The research highlights the importance of balancing prediction quality and execution costs, particularly through the use of a weighted routing layer that enhances utility when combined with strong supervised scorers. Overall, the benchmark and evaluation framework established in this work facilitate reproducible studies on accuracy-cost trade-offs in multi-agent routing scenarios.
Methodology
The authors constructed a benchmark from real user prompts, assigning heuristic labels under a fixed agent catalog. They evaluated multiple routing methods, including KNN, supervised multi-label classification, and a fine-tuned encoder, using a shared set-evaluation protocol that incorporates set-level metrics and a cost-aware selection policy.
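Set-level scoring and a cost-aware selection policy can be sketched as below. The metrics are standard (Jaccard overlap and set F1), but the greedy budgeted policy, the threshold, and the agent names and costs are hypothetical and are not the paper's weighted routing layer.

```python
def jaccard(pred, gold):
    """Intersection over union of predicted and reference agent sets."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)


def set_f1(pred, gold):
    """Harmonic mean of set precision and set recall."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def select_agents(scores, costs, budget, threshold=0.5):
    """Greedy policy: take agents in descending score order, keeping
    those above the threshold whose cost still fits in the budget."""
    chosen, spent = set(), 0.0
    for name in sorted(scores, key=lambda a: (-scores[a], a)):
        if scores[name] < threshold:
            break
        if spent + costs[name] <= budget:
            chosen.add(name)
            spent += costs[name]
    return chosen


# Hypothetical router scores and per-call costs for four agents.
scores = {"search": 0.9, "code": 0.8, "math": 0.6, "image": 0.2}
costs = {"search": 1.0, "code": 3.0, "math": 1.0, "image": 2.0}
picked = select_agents(scores, costs, budget=3.0)
```

Here the budget excludes the expensive `code` agent despite its high score, so the selected set is `search` and `math`; against a reference set of `search` and `code` this scores a Jaccard of 1/3 and a set F1 of 0.5.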
Results
The results indicate that supervised routing methods significantly outperform simpler techniques like nearest-neighbor matching and zero-shot LLM routing. The fine-tuned encoder achieved the highest accuracy in unconstrained settings, while the linear multilabel model served as the strongest practical baseline. In constrained scenarios, the weighted routing layer improved performance, particularly when applied to strong supervised scorers.
Implications
This research provides a framework for developing more efficient multi-agent systems that can effectively balance the trade-offs between accuracy and execution costs. The benchmark and evaluation methods can be utilized for further studies in multi-agent routing, enhancing the design of AI systems that require tool selection based on user queries.
Discovering Collaboration from Novelty: Random Network Distillation for Clustered Federated Learning
Federated Learning
- Introduces a lightweight clustering approach using Random Network Distillation for Federated Learning.
- Decouples clustering from the main training loop, reducing computational and communication costs.
- Enables autonomous discovery of client groups based on local novelty estimates.
- Demonstrates effectiveness on computer vision benchmarks with non-IID data.
Summary
This paper addresses the challenges of Federated Learning (FL) under non-independent and identically distributed (non-IID) data, which can degrade model performance. The authors propose a novel approach called Random Network Distillation (RND) for Clustered Federated Learning (CFL), which allows for the discovery of client groups based on data similarity without the need for extensive computational resources or raw data sharing. By training compact RND models on local data, clients can estimate their prediction errors as novelty signals, facilitating the identification of meaningful clusters prior to the main training phase. This decoupling of clustering from the learning process enhances efficiency and adaptability in large-scale distributed systems, where the number of clusters and collaboration structures are not predetermined. The method is validated on computer vision benchmarks with synthetic noise to simulate feature skewness, demonstrating its effectiveness in separating devices into relevant clusters before model training.
Methodology
The authors leverage Random Network Distillation (RND) to train compact models on local data, which are then used to compute prediction errors as novelty signals. These signals help in estimating inter-device similarity for clustering. The clustering phase is performed independently of the main learning task, allowing for periodic re-evaluation as data distributions shift over time.
Results
The proposed RND-based clustering approach successfully identified meaningful client clusters in simulated non-IID conditions, leading to improved model training outcomes. The results indicate that the method effectively separates devices into relevant groups, enhancing the overall performance of Federated Learning in heterogeneous environments.
Implications
This research has significant implications for Federated Learning applications, particularly in scenarios with diverse client data distributions. The ability to autonomously discover collaboration structures can lead to more efficient and effective model training in privacy-sensitive environments, such as mobile devices and IoT systems.
Fora: From Weight-Space to Function-Space Protection in Capability-Preserving Fine-Tuning
NLP
Large Language Models
Efficient ML
- Fora reformulates capability preservation as function-space protection, focusing on activation subspaces rather than weight geometry.
- The method employs a unique update mechanism that combines high-capacity and controlled spectral channels to protect capabilities.
- Fora consistently outperforms traditional fine-tuning methods in preserving existing capabilities while adapting to new tasks.
- The study highlights the importance of projecting onto capability-derived directions rather than weight-derived directions for effective adaptation.
Summary
The paper introduces Fora (Function-space Orthogonal Residual Adaptation), a novel approach to fine-tuning large language models that aims to preserve existing capabilities while adapting to new tasks. Traditional full fine-tuning methods risk eroding the model's pre-existing capabilities due to unconstrained gradient updates. Existing methods to mitigate this issue rely on proxies such as parameter distances or output matching, which do not directly address the activation directions that define a capability. The authors argue that capabilities should be characterized by the activation subspace they induce rather than the singular geometry of the weight matrix. To implement this, they derive a protected subspace from the capability itself using label-free calibration inputs to estimate the principal directions of input-activation covariance. The Fora update combines a high-capacity branch that is structurally barred from reading capability function directions with a narrow spectral channel for controlled plasticity. The method is validated across three settings on the Qwen3-1.7B model, demonstrating superior performance in preserving capabilities while adapting to new tasks compared to weight-space projection and standard regularization techniques.
Methodology
The authors developed Fora by estimating the principal directions of input-activation covariance from label-free calibration inputs. They constructed a right projector to protect the activation subspace and combined it with a left projector derived from the weight's singular value decomposition (SVD). The update mechanism is defined as $\Delta W = P_U M P_Q + U_2 D_\delta V_2^\top$, where $P_U$ and $P_Q$ serve to block reads and writes to sensitive directions, respectively.
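The read-blocking idea behind the right projector can be illustrated in isolation. The sketch below is a simplification under stated assumptions: it protects a single direction (the leading eigenvector of the uncentered calibration covariance, found by power iteration) and omits the left projector and the spectral channel entirely. Function names and the calibration data are hypothetical.

```python
def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def top_direction(samples, iters=200):
    """Leading eigenvector of the uncentered input covariance,
    estimated by power iteration."""
    d, n = len(samples[0]), len(samples)
    cov = [[sum(s[i] * s[j] for s in samples) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        v = matvec(cov, v)
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v


def protect_read(update, v):
    """Return update @ (I - v v^T), so inputs along v produce no change."""
    d = len(v)
    out = []
    for row in update:
        along = sum(row[k] * v[k] for k in range(d))
        out.append([row[j] - along * v[j] for j in range(d)])
    return out


# Hypothetical calibration activations, concentrated along (1, 1, 0).
calibration = [[3.0, 3.0, 0.1], [-2.0, -2.0, 0.2],
               [1.0, 1.0, -0.1], [4.0, 4.0, 0.0]]
v = top_direction(calibration)

raw_update = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]  # unconstrained update
safe_update = protect_read(raw_update, v)
```

After projection the update returns (numerically) zero for inputs along the protected direction while still acting on inputs orthogonal to it, which is the sense in which the branch cannot read that direction.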
Results
Fora was tested in three scenarios: learning COGS while preserving translation, learning GSM8K while preserving translation, and learning translation while preserving math. In all cases, Fora maintained high accuracy and low perplexity, outperforming weight-space projection and standard regularization methods. For instance, while preserving translation, the perplexity remained at 4.39, close to the baseline of 4.35, demonstrating effective capability preservation.
Implications
The findings suggest that fine-tuning methods can be significantly improved by focusing on the activation subspaces that define capabilities, which could lead to more robust models in applications requiring multitasking and capability retention. This approach may be particularly beneficial in scenarios where models need to adapt to new tasks without losing previously learned skills.
Estimating Supply Incrementality in Two-sided Marketplaces: A Causal Machine Learning Approach
Theory
- The paper introduces a causal machine learning approach to estimate supply incrementality in two-sided marketplaces.
- It combines double machine learning with a hierarchical Bayesian framework to address endogeneity and substitution effects.
- The methodology utilizes geospatial measures to enhance feature construction and improve model accuracy.
- The results indicate that the model can effectively estimate the impact of additional supply on total bookings across different listing segments.
Summary
This paper addresses the challenge of estimating supply incrementality in two-sided marketplaces, specifically using the Airbnb platform as a case study. The authors propose a causal machine learning approach that combines double/debiased machine learning with a hierarchical Bayesian framework to estimate the causal relationship between additional supply and marketplace outcomes, such as total bookings. The methodology leverages geospatial measures of product segment similarity to construct informative features for the model. The paper highlights the importance of understanding how additional supply impacts different listing segments, as this can inform marketplace strategies and optimize supply-demand balance. The authors demonstrate that their model provides plausible estimates of marketplace returns to additional supply and exhibits strong out-of-sample performance, making it applicable to various two-sided marketplaces beyond Airbnb.
Methodology
The authors employ a two-stage double machine learning framework. In the first stage, they model historical variations in supply and bookings to mitigate endogeneity. In the second stage, they use residuals from the first stage to isolate the causal impact of supply on bookings. The model incorporates geospatial similarity measures to enhance predictive power and utilizes a hierarchical Bayesian framework to integrate prior knowledge with new data.
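The two-stage residual-on-residual idea can be sketched as follows. This is a generic partialling-out estimator on synthetic data, not the authors' model: the nuisance models are plain least squares, the data-generating process and the names `supply`, `bookings`, and `true_effect` are made up for illustration, and the hierarchical Bayesian layer and geospatial features are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, true_effect = 4000, 5, 2.0

# Synthetic data: controls X drive both supply (treatment) and bookings (outcome).
X = rng.normal(size=(n, p))
supply = X @ rng.normal(size=p) + rng.normal(size=n)
bookings = true_effect * supply + X @ rng.normal(size=p) + rng.normal(size=n)

def ols_predict(X_train, y_train, X_test):
    """Fit least squares with an intercept and predict on held-out rows."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

# Stage 1 with two-fold cross-fitting: residualize treatment and outcome on X.
folds = np.array_split(rng.permutation(n), 2)
res_t, res_y = np.zeros(n), np.zeros(n)
for i, test_idx in enumerate(folds):
    train_idx = folds[1 - i]
    res_t[test_idx] = supply[test_idx] - ols_predict(X[train_idx], supply[train_idx], X[test_idx])
    res_y[test_idx] = bookings[test_idx] - ols_predict(X[train_idx], bookings[train_idx], X[test_idx])

# Stage 2: regress outcome residuals on treatment residuals.
theta_hat = (res_t @ res_y) / (res_t @ res_t)
```

Cross-fitting keeps the nuisance predictions out-of-sample, which is what lets the second-stage coefficient be read as the effect of supply after the shared drivers in `X` are removed.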
Results
The model provides plausible estimates of the returns to additional supply in the Airbnb marketplace, demonstrating strong out-of-sample performance. It effectively captures heterogeneous treatment effects across different listing segments, revealing how various factors influence supply incrementality.
Implications
The findings can help marketplace operators optimize supply strategies by identifying which segments benefit most from additional listings. The methodology can be applied to other two-sided marketplaces, enhancing their understanding of supply-demand dynamics and improving overall marketplace efficiency.
On the Convergence of Self-Improving Online LLM Alignment
Reinforcement Learning
Large Language Models
Optimization
- Introduction of SAIL-RevKL, a regularized version of the SAIL algorithm to improve convergence properties.
- Proof of the Polyak-Lojasiewicz condition for the regularized objective, ensuring global convergence.
- Demonstration of near-linear sample complexity for achieving desired accuracy.
- Empirical validation showing superior performance of SAIL-RevKL over the original SAIL.
Summary
This paper presents the Self-Improving Alignment (SAIL) algorithm, which addresses the challenges of distribution shift in online reinforcement learning from human feedback (RLHF) by transforming a bilevel optimization problem into a more efficient single-level formulation. Despite its empirical success, the original SAIL lacks formal convergence guarantees due to the non-strongly concave nature of its objective function. The authors propose a regularized version, SAIL-RevKL, which incorporates a reverse Kullback-Leibler (KL) divergence penalty to enhance the optimization landscape. The main theoretical contribution is the proof that this regularized objective satisfies the Polyak-Lojasiewicz (PL) condition within a bounded parameter space, leading to global convergence guarantees and near-linear sample complexity. Empirical evaluations on MuJoCo benchmarks and LLM alignment tasks demonstrate that SAIL-RevKL outperforms the original SAIL, confirming its effectiveness and stability in practical applications.
Methodology
The authors analyze the SAIL algorithm through first-order optimization techniques, identifying issues with the Hessian's curvature that prevent strong concavity. They introduce a reverse KL divergence regularization to modify the objective function, proving that this adjustment leads to a well-conditioned optimization problem. They then establish convergence guarantees using a projected stochastic gradient ascent scheme under specific structural assumptions.
Results
The proposed SAIL-RevKL algorithm satisfies the PL condition, leading to global linear convergence of the objective gap. The sample complexity to achieve accuracy $\varepsilon$ is shown to be $\tilde{O}(\varepsilon^{-1} \log(\varepsilon^{-1}))$. Empirical results indicate that SAIL-RevKL significantly outperforms the vanilla SAIL on both continuous-control benchmarks and LLM alignment tasks.
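For reference, the textbook form of the PL condition for a maximization objective, and the linear rate it yields in the simplest deterministic setting. This is the standard statement rather than the paper's theorem: $J$ is assumed $L$-smooth with maximum value $J^{*}$, $\mu > 0$ is the PL constant, and the update is plain gradient ascent with step size $1/L$ instead of the projected stochastic scheme the paper analyzes.

```latex
\frac{1}{2}\,\lVert \nabla J(\theta) \rVert^{2} \;\ge\; \mu\,\bigl(J^{*} - J(\theta)\bigr)
\quad\Longrightarrow\quad
J^{*} - J(\theta_{t+1}) \;\le\; \Bigl(1 - \frac{\mu}{L}\Bigr)\bigl(J^{*} - J(\theta_{t})\bigr),
\qquad
\theta_{t+1} = \theta_{t} + \frac{1}{L}\,\nabla J(\theta_{t}).
```

The implication follows from the ascent inequality $J(\theta_{t+1}) \ge J(\theta_t) + \tfrac{1}{2L}\lVert \nabla J(\theta_t) \rVert^{2}$ combined with the PL bound, so the objective gap contracts by a constant factor at every step without any concavity assumption.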
Implications
The findings suggest that incorporating regularization techniques like reverse KL divergence can enhance the performance and reliability of online RLHF algorithms, making them more robust against distribution shifts. This has significant implications for the deployment of large language models in real-world applications where alignment with human preferences is critical.
ScaleAware-JEPA: Latent Representation for Discovery in Multiscale Physical Fields
Theory
Generative Models
Computer Vision
- Introduces ScaleAware-JEPA for constructing latent coordinates in continuous scalar fields.
- Utilizes Constrained Diffusion Decomposition (CDD) to separate fields into scale components.
- Aligns context prediction with the diffusion scale rather than fixed image patches.
- Demonstrates effectiveness in diverse applications, including MHD turbulence and urban structures.
Summary
The paper introduces ScaleAware-JEPA, a novel framework designed to construct dense, label-free latent coordinates for continuous scalar fields, addressing the challenges posed by multiscale structures in scientific data. Traditional self-supervised methods often rely on fixed image coordinates, which misalign with the continuous scale hierarchy inherent in physical fields. ScaleAware-JEPA utilizes Constrained Diffusion Decomposition (CDD) to decompose fields into pixel-registered scale components, allowing for a context footprint that is tied to the diffusion scale of each component. This approach enables the prediction of hidden structures in a manner that reflects the intrinsic scale hierarchy of the field rather than arbitrary patch sizes. The framework demonstrates its effectiveness across various domains, including magnetohydrodynamic (MHD) turbulence, interstellar molecular gas, and urban nighttime-light structures, resulting in the creation of dense structural atlases without the need for labels or predefined segmentation rules. By linking latent predictions to the scale hierarchy, ScaleAware-JEPA facilitates the inspection of complex physical patterns prior to their formal characterization.
Methodology
ScaleAware-JEPA reformulates joint-embedding predictive learning for multiscale continuous fields. It employs Constrained Diffusion Decomposition (CDD) to extract a pixel-registered pyramid of continuous scale components, which informs both the representation and the predictive task. A dense ConvNeXt-style encoder maps the multiscale input to a full-resolution latent field, with masking geometry dictated by the CDD scales.
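To make the notion of a pixel-registered pyramid of scale components concrete, here is a stand-in sketch that uses a difference-of-Gaussians decomposition. It is not Constrained Diffusion Decomposition, whose details the summary does not give; the functions `gaussian_smooth` and `scale_pyramid` and the chosen `sigmas` are hypothetical. What it shares with the description is that every component has the same shape as the input and the components sum back to the original field.

```python
import numpy as np

def gaussian_smooth(field, sigma):
    """Periodic Gaussian smoothing of a 2-D field via the FFT."""
    fy = np.fft.fftfreq(field.shape[0])[:, None]
    fx = np.fft.fftfreq(field.shape[1])[None, :]
    kernel_hat = np.exp(-2.0 * (np.pi * sigma) ** 2 * (fx ** 2 + fy ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(field) * kernel_hat))

def scale_pyramid(field, sigmas):
    """Split a field into band-pass components plus a smooth residual."""
    components, previous = [], field
    for sigma in sigmas:
        smoothed = gaussian_smooth(field, sigma)
        components.append(previous - smoothed)  # structure between two scales
        previous = smoothed
    components.append(previous)                 # coarsest residual
    return np.stack(components)                 # shape: (len(sigmas) + 1, H, W)

rng = np.random.default_rng(0)
field = rng.normal(size=(64, 64))
pyramid = scale_pyramid(field, sigmas=[1.0, 2.0, 4.0, 8.0])
```

In a scale-aware masking scheme, the size of the masked region for each component would then be tied to that component's `sigma` rather than to a fixed patch size.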
Results
The framework successfully constructs latent coordinates that reveal coherent morphological structures across various physical fields. The learned representations map back to meaningful physical patterns, demonstrating the ability to uncover hidden structures without prior knowledge of relevant organizing variables.
Implications
ScaleAware-JEPA has the potential to enhance scientific discovery by providing a robust tool for analyzing complex physical systems. Its ability to generate meaningful latent representations could facilitate new insights in fields such as astrophysics, fluid dynamics, and urban studies, where understanding multiscale interactions is crucial.
Do Models Read What They Write? Causal Registers in Scratchpad Reasoning
Large Language Models
Interpretability
Theory
- Scratchpads can enhance model alignment by exposing intermediate reasoning.
- Models trained to write intermediate states perform better in causal reasoning tasks than those that only report final states.
- Editing internal representations of written states allows for testing whether models compute from those states.
- The study demonstrates that running-state supervision can make scratchpad states causally usable in model computations.
Summary
This paper investigates the effectiveness of scratchpads in machine learning models, particularly in the context of process supervision and intermediate reasoning. The authors propose that for scratchpads to be useful for model alignment, the states written in them must be causally linked to the model's computations. They conduct experiments using a controlled state-tracking task with known update rules, comparing models that report only final states to those that write intermediate states. The key experiment involves editing the internal representation of a written state while keeping the scratchpad text unchanged, allowing the authors to assess whether the model utilizes the modified state in its subsequent computations. The results reveal that models trained to write intermediate states can predict downstream consequences based on edited states, achieving 80% and 91% accuracy across two task variants, while control models remain near baseline. This indicates that the training process can enable models to use scratchpad states as part of their computation, thereby enhancing the interpretability and reliability of their reasoning processes.
Methodology
The authors employed a controlled state-tracking task where models were trained to either report only the final state or to write intermediate states. They utilized activation patching and representation editing techniques to manipulate internal states and assess the impact on model predictions. The evaluation involved editing a written state and observing the model's ability to predict the next state based on this edit.
Results
The state-writing model, specifically Qwen2.5-Coder-7B, demonstrated a strong ability to predict the next state based on edited internal representations, achieving 80% and 91% accuracy across two task variants. In contrast, pretrained and final-answer-only models remained near baseline performance, indicating that the training approach significantly influenced the model's reasoning capabilities.
Implications
The findings suggest that enhancing the causal relationship between written states and model computations can improve the interpretability and reliability of machine learning models. This has potential applications in developing more aligned AI systems that can provide transparent reasoning processes, which is crucial for trust and accountability in AI applications.
Scalar Representations of Neural Network Training Dynamics
Optimization
Theory
- Training trajectories of ANNs can be represented as temporal networks.
- Scalar embeddings preserve critical dynamical features of training dynamics.
- A characteristic time analogous to Lyapunov time captures decorrelation in training trajectories.
- Asymptotic training states exhibit a common statistical distribution.
Summary
This paper explores the training dynamics of artificial neural networks (ANNs) by treating their optimization trajectories as temporal networks. The authors apply scalar embedding techniques to represent these high-dimensional training dynamics in a low-dimensional space. Using a multilayer perceptron (MLP) trained on the MNIST dataset, they demonstrate that the scalar embedding effectively captures key dynamical features, such as sensitivity to initial conditions and the maximum Lyapunov exponent. A characteristic time, analogous to Lyapunov time, is defined to describe the decorrelation of trajectories in the original high-dimensional space. Additionally, the authors analyze the statistical organization of asymptotic training states, finding that the distributions of rescaled spacings conform to a skew lognormal distribution. Overall, the study suggests that scalar low-dimensional embeddings are valuable for understanding and visualizing the optimization dynamics of neural networks.
Methodology
The authors utilize scalar embedding techniques to transform high-dimensional training trajectories of a multilayer perceptron into a low-dimensional representation. They analyze the resulting embeddings to identify dynamical features, define characteristic times, and study the statistical properties of training states.
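As an illustration of how a maximum Lyapunov exponent and the associated characteristic time are read off a scalar trajectory, the sketch below uses the logistic map as a stand-in system. The map, the function name, and the parameters are assumptions for illustration only; the paper's embedding of training trajectories is not reproduced here.

```python
import math

def max_lyapunov_logistic(r=4.0, x0=0.3, n_steps=100_000, burn_in=1_000):
    """Estimate the Lyapunov exponent of x -> r x (1 - x) by averaging log|f'(x)|."""
    x, total = x0, 0.0
    for step in range(n_steps + burn_in):
        if step >= burn_in:
            # |f'(x)| = |r (1 - 2x)|; the floor guards against log(0).
            total += math.log(max(abs(r * (1.0 - 2.0 * x)), 1e-300))
        x = r * x * (1.0 - x)
    return total / n_steps

lam = max_lyapunov_logistic()
lyapunov_time = 1.0 / lam  # steps for a small perturbation to grow by a factor of e
```

A positive exponent means nearby trajectories separate exponentially, and its inverse gives the time scale over which two initially close runs decorrelate.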
Results
The scalar embedding successfully captures the main dynamical features of the training process, including sensitivity to initial conditions and the maximum Lyapunov exponent. The characteristic time defined from the embedded trajectories aligns with the decorrelation time in the original parameter space. The analysis of asymptotic spacings reveals a skew lognormal distribution across different initial conditions.
Implications
The findings suggest that scalar low-dimensional embeddings can enhance the understanding of neural network training dynamics, potentially leading to improved optimization strategies and greater interpretability of neural networks.
Review Residuals: Update-Conditioned Residual Gating for Transformers
Large Language Models
Theory
Efficient ML
- Introduction of Review Residuals, which scale updates based on a learned gate conditioned on the proposed update.
- Demonstration of depth-stability issues with convex gating forms, leading to the adoption of an additive form for stable training.
- Significant performance improvements at larger model sizes, particularly at 590M and 1B parameters, compared to standard and Highway residuals.
- The method shows a trajectory of growing benefits with model size, indicating its potential for large-scale applications.
Summary
This paper introduces Review Residuals, a novel approach to scaling updates in transformer architectures by using an input-dependent gate that evaluates the reliability of proposed updates before committing them. Unlike traditional residual connections that add updates with a fixed coefficient of one, Review Residuals condition the update scaling on both the current state and the proposed update. The study reveals that while a convex form of the gate leads to vanishing gradients and fails to train effectively beyond 20 layers, an additive form preserves identity and trains stably at greater depths. The authors conducted experiments across various model sizes (60M to 1B parameters) and found that Review Residuals do not show significant advantages at smaller scales but outperform both parameter-matched Highway gates and standard residuals at larger scales (590M and 1B parameters), with benefits compounding as model size increases. This suggests that Review Residuals are particularly advantageous for large-scale models, aligning with the goal of enhancing performance in deep learning architectures.
Methodology
The methodology involves creating a new residual update mechanism where each update is scaled by a learned gate that considers both the current state and the proposed update. The experiments were conducted using a multi-seed approach across different model sizes to evaluate the performance of Review Residuals against traditional residual methods.
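The difference between the two gate forms can be sketched in a few lines. The exact parameterization is an assumption: here the gate is a per-channel sigmoid of a linear map of the state and the proposed update, the additive form is taken to be `x + g * update`, and the convex form is the Highway-style `(1 - g) * x + g * update`. The weight names `W_x`, `W_u`, and `b` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(x, update, W_x, W_u, b):
    """Per-channel gate in (0, 1) that sees both the state and the proposed update."""
    return sigmoid(x @ W_x + update @ W_u + b)

def additive_review_residual(x, update, W_x, W_u, b):
    """Additive form (assumed): the skip path is untouched, only the update is scaled."""
    return x + gate(x, update, W_x, W_u, b) * update

def convex_gated_residual(x, update, W_x, W_u, b):
    """Convex (Highway-style) form: the gate also shrinks the skip path."""
    g = gate(x, update, W_x, W_u, b)
    return (1.0 - g) * x + g * update

rng = np.random.default_rng(0)
d = 8
W_x, W_u, b = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1, np.zeros(d)
x = rng.normal(size=(4, d))
zero_update = np.zeros_like(x)

out_additive = additive_review_residual(x, zero_update, W_x, W_u, b)
out_convex = convex_gated_residual(x, zero_update, W_x, W_u, b)
```

With a zero update the additive form returns `x` exactly, while the convex form still rescales the state by `1 - g`; stacking many such layers multiplies those factors together, which is one way to see why the convex form can lose signal with depth.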
Results
The results indicate that Review Residuals do not provide advantages at smaller model sizes (60M) but show statistically significant improvements at 590M and 1B parameters, outperforming both parameter-matched Highway gates and standard residuals. The additive form of the gate maintained stable training across deeper architectures, avoiding vanishing gradient issues.
Implications
The findings suggest that incorporating mechanisms for evaluating update reliability can enhance the performance of large transformer models, making Review Residuals a promising approach for future research and applications in deep learning, particularly in large language models.
AdaJEPA: An Adaptive Latent World Model
Reinforcement Learning
Robotics
Optimization
- AdaJEPA allows for real-time adaptation of latent world models during planning.
- The model utilizes self-supervised signals from observed transitions to improve predictions.
- AdaJEPA shows substantial performance improvements in both in-distribution and out-of-distribution tasks.
- The approach is efficient, requiring only a single gradient step for adaptation.
Summary
The paper introduces AdaJEPA, an adaptive latent world model designed to enhance planning capabilities in high-dimensional environments by allowing for test-time adaptation. Traditional latent world models are typically fixed post-training, which can lead to inaccuracies during planning, especially under distribution shifts. AdaJEPA addresses this limitation by integrating a self-supervised adaptation mechanism within the closed loop of model predictive control (MPC). After executing an action, AdaJEPA uses the observed state transition to update the model, thereby recalibrating it for subsequent planning. This approach allows the model to continuously improve its predictions based on real-time feedback without requiring additional expert demonstrations. The authors demonstrate that AdaJEPA significantly enhances planning success across various goal-reaching tasks, achieving notable improvements even with minimal adaptation steps.
Methodology
AdaJEPA employs a closed-loop model predictive control framework where the model is continuously updated based on the latest observations after executing actions. It integrates a self-supervised learning mechanism that allows the model to adapt its predictions using real-time feedback from the environment, optimizing only a small subset of parameters with each update.
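One act-observe-adapt cycle can be sketched with a toy linear latent model. Everything here is an assumption made for illustration: the predictor is `z_next = A z + B a`, only `A` is adapted (standing in for the small parameter subset), the planner's action is a random placeholder, and the step size is chosen from the input norm.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da = 4, 2

# Hypothetical linear latent dynamics standing in for the learned predictor.
A_true, B_true = rng.normal(size=(dz, dz)) * 0.3, rng.normal(size=(dz, da))
A_model = A_true + 0.2 * rng.normal(size=(dz, dz))  # model is slightly off (shift)
B_model = B_true.copy()

def prediction_error(A, z, a, z_next):
    """Half squared error between the predicted and observed next latent."""
    return 0.5 * np.sum((A @ z + B_model @ a - z_next) ** 2)

def adapt_one_step(A, z, a, z_next, lr):
    """Single gradient step on the prediction error with respect to A only."""
    residual = A @ z + B_model @ a - z_next
    return A - lr * np.outer(residual, z)

# One closed-loop step: act, observe the transition, adapt before replanning.
z = rng.normal(size=dz)
a = rng.normal(size=da)              # placeholder for the planner's chosen action
z_next = A_true @ z + B_true @ a     # observed next latent state

err_before = prediction_error(A_model, z, a, z_next)
lr = 0.5 / (z @ z)                   # step size scaled by the input norm
A_model = adapt_one_step(A_model, z, a, z_next, lr)
err_after = prediction_error(A_model, z, a, z_next)
```

The supervision signal is the observed transition itself, so no expert demonstrations or rewards are needed for the update.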
Results
The experimental results indicate that AdaJEPA significantly outperforms traditional frozen models in various goal-reaching tasks, particularly when training data is limited. The model demonstrates improved planning success rates and robustness to distribution shifts, validating the effectiveness of the adaptive approach.
Implications
AdaJEPA's ability to adapt during deployment has significant implications for real-world applications in robotics and autonomous systems, where environments can change unpredictably. This adaptive capability could lead to more resilient and efficient planning systems that better handle dynamic conditions.
Bridging the Gap Between Latent and Explicit Reasoning with Looped Transformers
NLP
Large Language Models
Efficient ML
- LOTUS is the first latent-CoT method to match explicit CoT performance at the 3B scale.
- The method reduces thought-phase latency by 2.5 to 6.9 times compared to traditional CoT.
- LOTUS employs a looped padded Transformer architecture with parallel supervision on gold CoT tokens.
- The latent space of LOTUS is interpretable, recovering gold reasoning steps and alternative valid steps.
Summary
This paper introduces LOTUS (Looped Transformers with parallel supervision on latents), a novel approach that combines latent and explicit reasoning in language models. Traditional chain-of-thought (CoT) reasoning generates intermediate steps sequentially, which can be computationally expensive. Latent CoT methods, while more efficient, struggle to match the performance of explicit CoT at larger scales. The authors propose using looped Transformers, which reuse weights to increase computational depth without adding parameters, to enhance latent reasoning. LOTUS processes multiple latent blocks in parallel and applies cross-entropy loss on each latent position to align with explicit CoT tokens. This method successfully bridges the performance gap at the 3B parameter scale, achieving comparable accuracy to explicit CoT while significantly reducing thought-phase latency. The results demonstrate that LOTUS not only recovers gold reasoning steps but also reveals alternative valid intermediate steps, indicating the interpretability of its latent space.
Methodology
LOTUS utilizes a looped padded Transformer architecture that processes K latent blocks in parallel for R iterations. It applies cross-entropy loss on each latent position's corresponding gold CoT-step token, ensuring direct supervision. An auxiliary decoder is also introduced for training purposes, further enhancing the model's performance.
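The two ingredients named above, weight reuse across R iterations and a cross-entropy term at every latent position, can be sketched schematically. This is a heavily simplified stand-in: a residual `tanh` layer replaces the Transformer block, each latent block is reduced to a single position, and `W`, `W_out`, and `gold_tokens` are random placeholders.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def looped_block(latents, W, R):
    """Apply the same weights R times: depth grows, parameter count does not."""
    for _ in range(R):
        latents = latents + np.tanh(latents @ W)
    return latents

def per_position_cross_entropy(latents, W_out, gold_tokens):
    """Cross-entropy of each latent position against its gold CoT token."""
    probs = softmax(latents @ W_out)
    return -np.log(probs[np.arange(len(gold_tokens)), gold_tokens])

rng = np.random.default_rng(0)
K, d, vocab, R = 6, 16, 32, 4
W = rng.normal(size=(d, d)) * 0.1
W_out = rng.normal(size=(d, vocab)) * 0.1
latents = rng.normal(size=(K, d))             # K latent positions processed in parallel
gold_tokens = rng.integers(0, vocab, size=K)  # placeholder gold CoT tokens

losses = per_position_cross_entropy(looped_block(latents, W, R), W_out, gold_tokens)
loss = losses.mean()
```

Because all K positions are updated together in each loop, the thinking phase costs R passes rather than one pass per generated reasoning token.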
Results
LOTUS achieves near CoT accuracy on the GSM8K test set at the 3B scale, significantly outperforming prior latent methods and matching explicit CoT performance. The model reduces thought-phase latency by 2.5 to 6.9 times, demonstrating efficiency in reasoning tasks.
Implications
The findings suggest that LOTUS can enhance the efficiency and effectiveness of reasoning in large language models, making it applicable in various NLP tasks that require multi-step reasoning. Its interpretability could also facilitate better understanding and trust in AI systems.
Robust Strategic Classification under Decision-Dependent Cost Uncertainty
Optimization
Theory
- Introduces a two-stage robust optimization framework for strategic classification.
- Models decision-dependent cost uncertainty, reflecting real-world scenarios.
- Demonstrates that awareness of temporal effects can enhance robustness against manipulation.
- Shows that strategic sacrifices in performance can lead to better long-term outcomes.
Summary
This paper addresses the issue of strategic behavior in algorithmic decision-making systems, where individuals manipulate their input data to achieve favorable outcomes. Existing literature on strategic classification often assumes fixed costs for such manipulations, neglecting the reality that these costs can evolve based on past decisions made by classifiers. The authors propose a novel two-stage robust optimization framework that incorporates decision-dependent uncertainty, allowing for a more accurate modeling of how classifier decisions influence future manipulation costs. The framework is validated through numerical experiments using real-world data from college admissions, demonstrating that classifiers can strategically sacrifice some immediate performance to enhance long-term robustness against manipulation. The findings suggest that by accounting for the temporal effects of decisions on manipulation costs, classifiers can effectively shape strategic responses and mitigate gaming behavior over time.
Methodology
The authors formulate the classifier selection as a two-stage robust optimization (TSRO) problem, where the first stage anticipates the influence of classifier decisions on future response costs. They address the complexity of the second stage, which involves nonlinear constraints and objectives, by proposing problem-specific approximations and reformulations. Techniques such as Benders decomposition and Column-and-Constraint Generation are employed to solve the relaxed problem.
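A toy version of the two-stage structure, solved by brute-force enumeration rather than Benders decomposition or Column-and-Constraint Generation. All numbers and function names are made up for illustration; the point is only that the uncertainty set for manipulation costs depends on the first-stage decision.

```python
# Hypothetical toy: pick an acceptance threshold now, knowing that future
# manipulation costs lie in an uncertainty set that depends on that choice.
thresholds = [0.4, 0.5, 0.6, 0.7]

def first_stage_loss(threshold):
    """Immediate classification loss (made-up numbers, minimized at 0.5)."""
    return (threshold - 0.5) ** 2

def cost_uncertainty_set(threshold):
    """Decision-dependent set: stricter thresholds make manipulation costlier."""
    base = 1.0 + 2.0 * threshold
    return [0.8 * base, base, 1.2 * base]

def second_stage_loss(threshold, cost):
    """Loss from gaming: cheaper manipulation lets more agents game the rule."""
    return 0.3 / cost

def worst_case_total(threshold):
    """First-stage loss plus the worst second-stage loss over the uncertainty set."""
    worst = max(second_stage_loss(threshold, c) for c in cost_uncertainty_set(threshold))
    return first_stage_loss(threshold) + worst

robust_choice = min(thresholds, key=worst_case_total)
myopic_choice = min(thresholds, key=first_stage_loss)
```

In this toy the robust choice accepts a slightly worse first-stage loss than the myopic one because raising the threshold shifts the whole cost set upward.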
Results
The numerical experiments reveal that the proposed dependency-aware classifier sacrifices some first-stage performance for enhanced second-stage robustness. Additionally, the method reduces overall loss by altering agents' costs, thereby limiting strategic manipulation. This indicates that incorporating decision-dependent costs can lead to more effective decision-making in algorithmic systems.
Implications
The findings suggest that decision-makers in various fields, such as education and finance, can improve their algorithmic systems by considering the temporal effects of their decisions on manipulation costs. This approach could lead to more robust and fair decision-making processes, ultimately reducing the potential for strategic gaming.
A Bayesian Filtering Approach for Learning Lagrangian Dynamics from Noisy Measurements
Robotics
Theory
Optimization
- Introduces a Bayesian state estimation framework for learning Lagrangian dynamics from noisy measurements.
- Models unknown forces as white Gaussian noise, leading to a stochastic state-space model.
- Utilizes neural networks to parameterize kinetic and potential energies within the Lagrangian formulation.
- Demonstrates improved performance over traditional LNNs and approximate Bayesian filters in numerical experiments.
Summary
This paper presents a novel Bayesian filtering approach to learn the dynamics of physical systems using partial and noisy measurements, specifically through a Lagrangian mechanics framework. The authors parameterize the kinetic and potential energies of the system using neural networks, treating unknown external forces as white Gaussian noise. By applying the Euler–Lagrange equations, they derive a continuous-time stochastic state-space model that captures the system dynamics. The methodology involves jointly estimating the neural network parameters and system states through a maximum-likelihood method, utilizing Gaussian-approximation-based Bayesian filters. The effectiveness of this approach is validated through numerical experiments on pendulum and Duffing oscillator systems, demonstrating improved performance over conventional Lagrangian neural networks (LNNs) and approximate Bayesian filters that rely on known system models. This work addresses the limitations of existing LNN formulations that typically require full-state observations, making it applicable in real-world scenarios where only partial measurements are available.
Methodology
The authors develop a Bayesian estimation framework that combines Lagrangian dynamics with neural network parameterization. They employ approximate Bayesian filtering methods, such as the extended Kalman filter, to jointly estimate system states and learn neural network parameters from noisy, partial observations.
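For readers unfamiliar with the filtering component, the sketch below runs an extended Kalman filter on a pendulum observed only through a noisy angle. It assumes the dynamics are known, whereas the paper learns the energies with neural networks; the noise levels, step size, and function names are illustrative.

```python
import numpy as np

dt, g_over_l = 0.01, 9.81
Q = np.diag([1e-6, 1e-3])    # process noise (unknown forces treated as white noise)
R = np.array([[0.05 ** 2]])  # measurement noise on the angle
H = np.array([[1.0, 0.0]])   # only the angle is observed

def f(state):
    """Euler step of pendulum dynamics; state = (angle, angular velocity)."""
    theta, omega = state
    return np.array([theta + dt * omega, omega - dt * g_over_l * np.sin(theta)])

def jacobian_f(state):
    theta, _ = state
    return np.array([[1.0, dt], [-dt * g_over_l * np.cos(theta), 1.0]])

def ekf_step(mean, cov, measurement):
    # Predict through the nonlinear dynamics, linearized at the current mean.
    F = jacobian_f(mean)
    mean_pred, cov_pred = f(mean), F @ cov @ F.T + Q
    # Update with the noisy angle measurement.
    S = H @ cov_pred @ H.T + R
    K = cov_pred @ H.T @ np.linalg.inv(S)
    mean_new = mean_pred + (K @ (measurement - H @ mean_pred)).ravel()
    cov_new = (np.eye(2) - K @ H) @ cov_pred
    return mean_new, cov_new, cov_pred

rng = np.random.default_rng(0)
true_state, mean, cov = np.array([0.5, 0.0]), np.array([0.0, 0.0]), np.eye(2)
for _ in range(200):
    true_state = f(true_state)
    y = H @ true_state + rng.normal(scale=0.05, size=1)
    mean, cov, cov_pred = ekf_step(mean, cov, y)
```

In a learning setting, the same predict and update recursions would also supply the likelihood terms used to fit the model parameters, with `f` replaced by the learned dynamics.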
Results
The proposed method shows significant improvements in estimating system dynamics compared to conventional LNNs and approximate Bayesian filters. Numerical experiments on pendulum and Duffing oscillator examples validate the effectiveness of the approach, particularly in scenarios with limited and noisy data.
Implications
This research has potential applications in various fields, including robotics, industrial automation, and health technology, where accurate modeling of physical systems is crucial but often hindered by incomplete or noisy data. The approach enhances the ability to learn and predict system dynamics in real-world environments.
Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback
NLP
Large Language Models
Theory
- Introduces a noisy expert model to explain the performance gap between offline and online imitation learning.
- Demonstrates that offline learning from noisy trajectories has exponential sample complexity, while online methods can achieve polynomial complexity.
- Proposes a novel variant of On-Policy Distillation (OPD) that outperforms traditional methods under noisy expert feedback.
- Provides theoretical insights and empirical results supporting the effectiveness of on-policy methods in training language models.
Summary
This paper addresses the gap between theoretical and practical performance of imitation learning (IL) methods, particularly in the context of noisy expert feedback. The authors propose a model where the learner interacts with a noisy version of an expert's policy, which is common in real-world applications such as language model training. They demonstrate that while offline IL methods like Behavior Cloning (BC) are theoretically optimal, they struggle with noisy trajectories, leading to exponential sample complexity. In contrast, the proposed On-Policy Distillation (OPD) method shows polynomial dependence on the horizon, allowing for more efficient learning from noisy experts. The authors introduce a new loss function and provide algorithms that outperform traditional offline methods and existing OPD objectives in both synthetic and natural language tasks. Their findings suggest that on-policy methods can be more effective than supervised fine-tuning when dealing with imperfect expert feedback, providing a theoretical foundation for this observation.
Methodology
The authors develop a theoretical framework for imitation learning with a focus on noisy expert feedback. They analyze the sample complexity of offline versus online learning and propose a variant of On-Policy Distillation (OPD) that leverages online interactions with a noisy expert. The paper includes the derivation of a new loss function and empirical evaluations on synthetic and natural language tasks to validate their theoretical claims.
Results
The proposed OPD variant significantly outperforms both offline Behavior Cloning and existing OPD objectives in scenarios with high noise levels. The theoretical analysis shows that online interaction with a noisy expert allows for polynomial sample complexity, contrasting with the exponential complexity required for offline learning from noisy data.
Implications
The findings suggest that in practical applications where expert feedback is noisy, on-policy methods like OPD can lead to better performance than traditional offline methods. This has significant implications for training language models and other sequential decision-making systems, where expert demonstrations may not always be perfect.
Sequential sparse Gaussian process quantile regression
Theory
Efficient ML
Optimization
- Introduces a sparse Bayesian quantile regression formulation using Laplace approximation.
- Develops adaptive strategies for inducing-input infilling and data acquisition.
- Combines enrichment mechanisms into a unified sequential algorithm.
- Demonstrates improved computational efficiency and predictive accuracy in numerical experiments.
Summary
This paper presents a novel framework for Gaussian process (GP) quantile regression that addresses significant computational challenges in Bayesian settings, particularly the nonconjugacy of the asymmetric Laplace likelihood and the complexity of posterior inference. The authors introduce a sparse GP approach where the quantile function is represented using a reduced set of inducing variables, and posterior inference is performed via a Laplace approximation. They decompose predictive uncertainty into conditional-prior and posterior-induced variance components, which informs two adaptive mechanisms: inducing-input infilling and data acquisition. These mechanisms are integrated into a sequential algorithm that dynamically allocates computational resources to the most significant sources of uncertainty and controls model complexity. The proposed method demonstrates improved efficiency and accuracy in numerical experiments, outperforming traditional predefined data-acquisition strategies.
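The asymmetric Laplace likelihood mentioned above is tied to the pinball (check) loss: its negative log-likelihood is the summed pinball loss plus a constant, so maximizing it targets the chosen quantile. The sketch below shows this for a constant quantile on synthetic data; the GP prior, inducing variables, and Laplace approximation are not included, and the function names are illustrative.

```python
import numpy as np

def pinball_loss(residual, tau):
    """Check loss: tau * r for r >= 0 and (tau - 1) * r for r < 0."""
    return np.where(residual >= 0, tau * residual, (tau - 1.0) * residual)

def asymmetric_laplace_nll(y, q, tau, sigma=1.0):
    """Negative log-likelihood of the asymmetric Laplace with location q."""
    n = len(y)
    return pinball_loss(y - q, tau).sum() / sigma - n * np.log(tau * (1.0 - tau) / sigma)

rng = np.random.default_rng(0)
y = rng.normal(size=2000)
tau = 0.9

# Minimizing the likelihood over a constant q recovers the empirical tau-quantile.
grid = np.linspace(-3.0, 3.0, 601)
q_hat = grid[np.argmin([asymmetric_laplace_nll(y, q, tau) for q in grid])]
```

The kink of the pinball loss at zero is what makes the likelihood nonconjugate with a Gaussian process prior and motivates an approximate posterior.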
Methodology
The authors employ a sparse Gaussian process framework that utilizes a Laplace approximation for posterior inference. They introduce adaptive strategies for selecting inducing inputs and acquiring additional data based on predictive uncertainty. The sequential algorithm balances the allocation of computational resources between improving the inducing representation and acquiring new training data.
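To make the likelihood concrete: the asymmetric Laplace negative log-likelihood is, up to an additive constant and a scale factor, the pinball (check) loss, so minimizing it over a constant recovers the empirical quantile. The sketch below illustrates only that relationship on synthetic data; it does not implement the sparse GP, the Laplace approximation, or the enrichment strategies, and the function names are illustrative choices rather than anything taken from the paper.

```python
import numpy as np

def pinball_loss(y, q, tau):
    """Check (pinball) loss for quantile level tau in (0, 1)."""
    u = np.asarray(y, dtype=float) - q
    return np.where(u >= 0, tau * u, (tau - 1.0) * u)

def asymmetric_laplace_nll(y, q, tau, scale=1.0):
    """Negative log-density of the asymmetric Laplace distribution.

    Density: tau * (1 - tau) / scale * exp(-pinball_loss(y, q, tau) / scale).
    """
    return pinball_loss(y, q, tau) / scale - np.log(tau * (1.0 - tau) / scale)

# Synthetic data: fit a constant quantile by grid search over the total NLL.
rng = np.random.default_rng(0)
y = rng.normal(size=2000)
tau = 0.9
grid = np.linspace(-3.0, 3.0, 601)
total_nll = np.array([asymmetric_laplace_nll(y, q, tau).sum() for q in grid])
q_hat = grid[np.argmin(total_nll)]  # close to the empirical 0.9 quantile of y
```

The kink of the pinball loss at zero is one reason approximate inference is needed: the likelihood is not conjugate to a Gaussian prior.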
Results
Numerical experiments validate the accuracy of the Laplace approximation and the effectiveness of the adaptive strategies for inducing-input placement and data acquisition. The proposed sequential enrichment strategy shows superior performance compared to traditional predefined approaches, leading to enhanced predictive accuracy and reduced computational costs.
Implications
This work has significant implications for applications requiring quantile regression, such as risk assessment and decision-making under uncertainty. The adaptive nature of the proposed methods allows for more efficient modeling in scenarios with limited data or high uncertainty, making it applicable in various fields including finance, healthcare, and engineering.
C$^{2}$R: Cross-sample Consistency Regularization Mitigates Feature Splitting and Absorption in Sparse Autoencoders
NLP
Large Language Models
Interpretability
- Identifies the lack of cross-sample consistency as a root cause of feature splitting and absorption in Sparse Autoencoders.
- Introduces C2R, a regularization technique that enforces consistent latent selection across samples.
- Demonstrates that C2R significantly mitigates issues of feature splitting and absorption without compromising reconstruction fidelity.
- Provides a theoretical analysis and empirical validation of the proposed method against state-of-the-art baselines.
Summary
This paper addresses the challenges faced by Sparse Autoencoders (SAEs) in interpreting large language models, particularly focusing on issues of feature splitting and absorption. Feature splitting occurs when coherent concepts are fragmented into multiple non-atomic latents, while feature absorption leads to the distortion of general features due to specific latents capturing their activations. The authors argue that these problems arise from inconsistent latent assignments across samples, which is exacerbated by traditional per-sample optimization methods. To mitigate these issues, they propose Cross-sample Consistency Regularization (C2R), a novel approach that encourages consistent representation of semantic features across a batch by penalizing the co-activation of similar latents. The paper provides a theoretical framework for understanding these phenomena and demonstrates that C2R effectively reduces feature splitting and absorption while maintaining reconstruction fidelity. Empirical evaluations show that C2R outperforms existing methods, leading to improved feature hierarchy and interpretability in SAEs.
Methodology
The authors developed C2R based on the geometry of the Minkowski inequality and the strict convexity of the ℓ2 norm. This method penalizes the distribution of semantic information across redundant latents by applying constraints across the batch dimension, thus promoting the consolidation of activations into a single, consistent latent representation.
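The role of the Minkowski (triangle) inequality can be seen in a toy example. Assume, purely for illustration, a penalty that sums over latents the ℓ2 norm of each latent's activations across the batch. This is not the paper's loss, whose exact form is not given in this summary. Under that assumed penalty, representing one feature with a single latent costs less than spreading the same activations over two latents that fire on different samples.

```python
import numpy as np

def batch_l2_penalty(acts):
    """Sum over latents of the l2 norm taken across the batch axis.

    acts: array of shape (batch, n_latents) with non-negative activations.
    This penalty is an illustrative assumption, not the paper's objective.
    """
    return np.linalg.norm(acts, axis=0).sum()

# One semantic feature, active on four samples.
feature = np.array([1.0, 2.0, 0.5, 1.5])

# Consolidated: a single latent carries the feature on every sample.
consolidated = np.stack([feature, np.zeros_like(feature)], axis=1)

# Split: two latents take turns representing the same feature.
split = np.stack([feature * [1, 0, 1, 0], feature * [0, 1, 0, 1]], axis=1)

p_consolidated = batch_l2_penalty(consolidated)
p_split = batch_l2_penalty(split)
```

Both layouts carry the same total activation on every sample, yet the split layout has the larger penalty, because the norm of a sum is at most the sum of the norms, with equality only when the vectors are non-negative multiples of one another.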
Results
The empirical results indicate that C2R effectively reduces the occurrences of feature splitting and absorption compared to existing methods, leading to a more coherent feature hierarchy while preserving the reconstruction fidelity of the Sparse Autoencoders.
Implications
The findings suggest that C2R can enhance the interpretability of features in large language models, making it a valuable tool for tasks requiring clear and reliable latent representations. This has potential applications in causal analysis, circuit discovery, and improving the overall understanding of model behavior.
Mixture-of-Control: State-Aware Fine-Tuning for Transformer-based Models
NLP
Large Language Models
Efficient ML
- MoC integrates local and global control signals for enhanced representation learning.
- The framework allows efficient cross-block communication without significant computational overhead.
- MoC maintains memory efficiency while outperforming traditional state-based fine-tuning methods.
- The approach is architecture-agnostic, applicable to various transformer configurations.
Summary
The paper introduces Mixture-of-Control (MoC), a novel framework for state-based fine-tuning of transformer models that enhances representation learning by integrating local and global control signals. Traditional state-based methods often limit inter-block communication, which hinders representational adaptation and can lead to inefficiencies in memory usage. MoC addresses these limitations by treating block-wise control states as experts in a sparse mixture-of-experts framework, allowing for efficient communication across transformer blocks without incurring significant computational overhead. The authors demonstrate that MoC not only maintains memory efficiency but also outperforms existing state-based methods across various benchmarks in natural language understanding (NLU) and natural language generation (NLG). The framework is architecture-agnostic and supports adaptive information propagation across layers, ultimately improving robustness and adaptation while matching the parameter and memory efficiency of other parameter-efficient fine-tuning (PEFT) methods.
Methodology
The authors propose a lightweight state-based method that couples local and global control signals through a recurrent mixture-of-experts control pathway. Each layer's low-rank control module is treated as an expert, with a shared gate routing representations to a Top-K set of experts, producing a global control signal that is mixed with local control for adaptive information propagation.
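The routing described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the shapes, the softmax over the selected experts, the fixed mixing weight `mix`, and the residual form of the update. The recurrent aspect of the control pathway is not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, n_layers, top_k = 16, 4, 6, 2

# Hypothetical low-rank control module per layer: h -> (h @ A) @ B.
A = rng.normal(scale=0.1, size=(n_layers, d_model, rank))
B = rng.normal(scale=0.1, size=(n_layers, rank, d_model))
W_gate = rng.normal(scale=0.1, size=(d_model, n_layers))  # gate shared by all layers

def local_control(h, layer):
    """Control signal from one layer's own low-rank module."""
    return (h @ A[layer]) @ B[layer]

def global_control(h):
    """Route h to the Top-K layer experts and mix their outputs."""
    logits = h @ W_gate                       # one logit per layer expert
    chosen = np.argsort(logits)[-top_k:]      # indices of the Top-K experts
    weights = np.exp(logits[chosen] - logits[chosen].max())
    weights /= weights.sum()                  # softmax over the chosen experts
    out = sum(w * local_control(h, e) for w, e in zip(weights, chosen))
    return out, chosen

def moc_block(h, layer, mix=0.5):
    """Residual update mixing the local and global control signals."""
    g, chosen = global_control(h)
    return h + (1.0 - mix) * local_control(h, layer) + mix * g, chosen

h = rng.normal(size=d_model)
h_out, chosen = moc_block(h, layer=3)
```

Reusing the per-layer modules as experts means the global signal adds a gate matrix but no new expert parameters, which is one plausible reading of how the method stays parameter-efficient.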
Results
Empirical evaluations across eight pretrained transformer backbones indicate that MoC consistently improves adaptation performance while maintaining strong parameter and memory efficiency compared to existing PEFT baselines.
Implications
The Mixture-of-Control framework has the potential to enhance the efficiency and effectiveness of fine-tuning large transformer models, making it applicable in various NLP tasks and potentially other domains where transformer architectures are utilized.
Adaptive Block Diffusion: Resolving Training-Inference Mismatch in Diffusion Language Models
NLP
Large Language Models
Generative Models
- Introduction of Adaptive Block Diffusion (ABD) to resolve training-inference mismatch in DLMs.
- ABD treats denoising configurations as stochastic variables, optimizing over a full configuration space.
- Guarantees denoising optimality for any inference policy supported during training.
- Demonstrates structural invariance, avoiding off-grid degradation and maintaining performance across scales.
Summary
This paper introduces Adaptive Block Diffusion (ABD), a novel framework designed to address the training-inference mismatch commonly observed in Diffusion Language Models (DLMs). Traditional DLMs are constrained by fixed context structures, which limits their ability to generalize across various token configurations during inference. ABD overcomes this limitation by treating denoising configurations as stochastic variables, allowing the model to be trained over a diverse distribution of prefix-window pairs. This approach ensures that the model maintains structural invariance across different decoding scales, effectively eliminating off-grid degradation and recovering a consistent relationship between block size and perplexity. The authors demonstrate that ABD not only matches but often outperforms fixed-block models at their target configurations while maintaining strong performance across all decoding scales. The findings suggest that ABD can significantly enhance the flexibility and efficiency of DLMs in generating high-quality text.
Methodology
The methodology involves training a single diffusion language model over a stochastic distribution of prefix-window configurations. This is achieved by optimizing denoising risk across various configurations, ensuring that the model is structurally invariant to different decoding strategies. The authors utilize a Radon–Nikodym argument to formalize the alignment between training and inference configurations.
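A minimal sketch of what sampling prefix-window configurations during training could look like, assuming a masked (absorbing-state) diffusion setup. The uniform sampling distribution, the masking scheme, and all names below are assumptions for illustration; the summary does not specify them.

```python
import random

def sample_prefix_window(seq_len, rng):
    """Draw (prefix_len, window_len) with prefix_len + window_len <= seq_len."""
    window_len = rng.randint(1, seq_len)
    prefix_len = rng.randint(0, seq_len - window_len)
    return prefix_len, window_len

def noised_input(tokens, prefix_len, window_len, mask_id, mask_prob, rng):
    """Keep the prefix clean; mask each window token with probability mask_prob."""
    out = list(tokens[: prefix_len + window_len])
    targets = []  # (position, original token) pairs the model must denoise
    for i in range(prefix_len, prefix_len + window_len):
        if rng.random() < mask_prob:
            targets.append((i, out[i]))
            out[i] = mask_id
    return out, targets

rng = random.Random(0)
tokens = list(range(100, 120))
prefix_len, window_len = sample_prefix_window(len(tokens), rng)
noisy, targets = noised_input(tokens, prefix_len, window_len,
                              mask_id=-1, mask_prob=0.5, rng=rng)
```

A fixed-block model would correspond to always returning the same `window_len`; drawing it at random is what exposes the model to the configurations it may meet at inference time.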
Results
Empirical results show that ABD exhibits structural invariance across different decoding scales, recovering a monotonic relationship between block size and perplexity. The model avoids sharp degradation in performance when evaluated outside its training support, demonstrating its ability to generalize effectively. ABD matches or outperforms fixed-block specialists at their target configurations, indicating its robustness and versatility.
Implications
The implications of this work suggest that ABD can significantly improve the performance of diffusion language models in practical applications, particularly in scenarios requiring flexible and efficient text generation. This framework could lead to advancements in natural language processing tasks that benefit from high-quality, parallelizable generation methods.
Contextual Slate GLM Bandits with Limited Adaptivity
Theory
Optimization
Reinforcement Learning
- Introduction of B-SlateGLinCB and RS-SlateGLinCB algorithms for contextual slate bandits.
- Establishment of regret bounds that are independent of the non-linearity parameter κ.
- Demonstration of computational efficiency with polynomial time complexity per round.
- Empirical validation showing superior performance compared to existing limited adaptivity baselines.
Summary
This paper addresses the contextual slate bandit problem with generalized linear rewards under limited adaptivity. The authors propose two algorithms, B-SlateGLinCB and RS-SlateGLinCB, tailored for batched and rarely-switching settings, respectively. In the batched setting, B-SlateGLinCB divides the time horizon into O(log log T) batches, ensuring that each batch's policy relies solely on data from prior batches. Conversely, RS-SlateGLinCB performs O(Nd log T) parameter updates adaptively in the rarely-switching setting. The authors establish regret bounds of O(Nd^{3/2}√T) for B-SlateGLinCB and O(Nd√T) for RS-SlateGLinCB, which notably do not depend on the non-linearity parameter κ, a common factor in GLM bandit algorithms. Both algorithms are computationally efficient, requiring only polynomial time per round despite the exponential number of possible slates. Empirical results demonstrate that these algorithms outperform existing baselines with limited adaptivity and remain competitive with the fully adaptive Slate-GLM-OFU algorithm. Additionally, a modified version of B-SlateGLinCB matches the performance of this baseline. The practical implications of the work are illustrated through a language model selection task, showcasing the algorithms' effectiveness in real-world applications.
Methodology
The authors develop two algorithms: B-SlateGLinCB for the batched setting, which partitions the time horizon into batches, and RS-SlateGLinCB for the rarely-switching setting, which limits parameter updates. They analyze the regret bounds theoretically and validate the algorithms through simulations against existing methods.
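For intuition about how a horizon can be covered with O(log log T) batches, the sketch below uses the grid t_i = T^(1 - 2^(-i)) that is common in the batched-bandit literature. Whether B-SlateGLinCB uses exactly this grid is an assumption; the summary only states the number of batches.

```python
import math

def batch_endpoints(T):
    """End times of each batch on the grid t_i = ceil(T ** (1 - 2 ** -i)).

    The final batch always ends at T. Assumes T >= 2. The grid is a
    standard construction, not confirmed as the paper's choice.
    """
    if T < 2:
        return [T]
    M = max(1, math.ceil(math.log2(math.log2(T)))) + 1
    ends = []
    for i in range(1, M + 1):
        t = T if i == M else min(T, math.ceil(T ** (1.0 - 2.0 ** -i)))
        if not ends or t > ends[-1]:  # drop duplicates for small T
            ends.append(t)
    return ends

ends = batch_endpoints(10_000)  # at most five batches for T = 10,000
```

Early batches are short, so little is lost to a poor initial policy, while later batches are long enough that only a handful of policy updates are needed overall.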
Results
B-SlateGLinCB achieves a regret bound of O(Nd^{3/2}√T), while RS-SlateGLinCB achieves O(Nd√T). Both algorithms demonstrate competitive performance against the state-of-the-art Slate-GLM-OFU algorithm, with a modified B-SlateGLinCB empirically matching its performance.
Implications
The findings suggest that the proposed algorithms can be effectively applied in various online decision-making scenarios, such as dynamic ad optimization and landing page selection, where limited adaptivity is a constraint.
QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents
Large Language Models
Reinforcement Learning
Optimization
- QVal provides a cheap, training-free method to evaluate dense supervision signals for LLM agents.
- The framework allows for direct comparison of different dense supervision methods without conflating results with training engineering factors.
- Benchmarking reveals that simple prompting methods consistently outperform more complex dense supervision techniques.
- QVal is extensible, facilitating the evaluation of new methods and environments.
Summary
The paper introduces QVal, a training-free testbed designed to evaluate dense supervision signals for long-horizon tasks performed by Large Language Model (LLM) agents. Outcome-only rewards are insufficient for guiding LLMs through complex tasks that require numerous actions, which has motivated dense supervision methods that score intermediate actions. However, evaluating these methods typically involves expensive downstream performance assessments that conflate supervision quality with training engineering factors. QVal addresses this issue by measuring how well a method's scores align with Q-values from a strong reference policy, allowing direct comparison of dense supervision methods without any training. The authors benchmark 21 dense supervision methods across four environments and seven methodological families, revealing that simpler prompting techniques often outperform more complex methods. The QVal framework is designed to be extensible, enabling researchers to efficiently iterate on dense supervision approaches before integrating them into training pipelines.
Methodology
QVal constructs a dataset of state-action pairs labeled with reference Q-values from an expert policy. It evaluates the Q-alignment of each dense supervision method by measuring how well the method's scores order the actions relative to these reference Q-values, isolating the quality of the supervision signal from other training confounders.
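One simple way to quantify how well a method's scores order actions relative to reference Q-values is pairwise concordance within a state. The metric below is an illustrative stand-in; the summary does not say which alignment statistic the paper uses, and the toy numbers are made up.

```python
from itertools import combinations

def pairwise_alignment(scores, q_values):
    """Fraction of action pairs ordered the same way by scores and Q-values.

    Pairs tied in the reference Q-values are skipped; ties in the
    method's scores count as half agreement.
    """
    agree, total = 0.0, 0
    for i, j in combinations(range(len(q_values)), 2):
        dq = q_values[i] - q_values[j]
        if dq == 0:
            continue
        ds = scores[i] - scores[j]
        total += 1
        if ds == 0:
            agree += 0.5
        elif (ds > 0) == (dq > 0):
            agree += 1.0
    return agree / total if total else float("nan")

# Toy state with four candidate actions (made-up numbers).
q_ref = [0.9, 0.4, 0.1, 0.4]
method_a = [3.0, 2.0, 1.0, 2.5]   # ranks actions like the reference
method_b = [1.0, 2.0, 3.0, 2.0]   # ranks actions in reverse

align_a = pairwise_alignment(method_a, q_ref)
align_b = pairwise_alignment(method_b, q_ref)
```

Because only the ordering matters, a method is not penalized for using a different score scale than the reference Q-values.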
Results
The evaluation of 21 dense supervision methods across four environments showed significant performance differences, with direct prompting and ranking methods achieving the highest Q-alignment scores. The results indicated strong clustering of performance by methodological family, suggesting that certain approaches are more effective than others across various tasks.
Implications
QVal has the potential to streamline the development and evaluation of dense supervision methods, allowing researchers to refine their approaches before engaging in costly training processes. This could lead to improved performance in long-horizon tasks for LLM agents, enhancing their applicability in real-world scenarios.
Calibration, Not Compilation: Detecting and Repairing Misspecified Probabilistic Programs Written by Language Models
Large Language Models
Theory
Generative Models
- Calibration oracles outperform traditional unit tests in detecting statistical misspecifications in probabilistic programs.
- A benchmark of 14 misspecification types shows that calibration can flag 88% of bugs with a low false positive rate.
- Calibration feedback significantly enhances the repair process for LLM-generated programs compared to unit test feedback.
- A substantial portion of runnable programs generated by LLMs are statistically misspecified, indicating a critical need for calibration.
Summary
This paper addresses the challenge of detecting and repairing misspecified probabilistic programs generated by large language models (LLMs). While LLMs can produce code that compiles and passes unit tests, these programs may still be statistically incorrect. The authors propose a calibration oracle based on the Bayesian workflow, which includes posterior predictive checks, simulation-based calibration, and sampler diagnostics, as a more effective means of verification than traditional unit tests. They create a benchmark of 14 types of misspecifications across 10 model families, demonstrating that the calibration oracle can detect 88% of bugs at a 2% false positive rate, significantly outperforming unit tests. The paper also shows that calibration feedback repairs LLM-generated programs more effectively than unit test feedback. Additionally, the authors find that 15-47% of runnable programs generated by LLMs are statistically misspecified, highlighting the need for calibration in the programming process. Overall, the study emphasizes that correctness in probabilistic programming hinges on calibration rather than mere compilation.
Methodology
The authors developed a calibration oracle based on the Bayesian workflow, which includes simulation-based calibration, posterior predictive checks, and sampler diagnostics. They constructed a benchmark to evaluate the detection capabilities of the oracle against various types of misspecifications and compared the effectiveness of calibration feedback in repairing LLM-generated programs versus traditional unit test feedback.
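Simulation-based calibration, one component of the oracle, can be illustrated on a conjugate toy model where the exact posterior is known. In the sketch below, the prior is N(0, 1) and a single observation is drawn from N(theta, 1), so the exact posterior is N(y/2, 1/2). The "buggy" sampler with a too-small standard deviation is a made-up stand-in for a misspecified program, and the chi-square statistic is just one of several possible uniformity checks.

```python
import numpy as np

def sbc_ranks(posterior_sampler, n_sims=2000, n_draws=19, seed=0):
    """Rank of the true parameter among posterior draws, over many simulations."""
    rng = np.random.default_rng(seed)
    ranks = np.empty(n_sims, dtype=int)
    for s in range(n_sims):
        theta = rng.normal(0.0, 1.0)            # draw from the prior
        y = rng.normal(theta, 1.0)              # simulate one observation
        draws = posterior_sampler(y, n_draws, rng)
        ranks[s] = int(np.sum(draws < theta))   # value in 0..n_draws
    return ranks

def uniformity_chi2(ranks, n_draws=19):
    """Chi-square statistic of the rank histogram against a uniform one."""
    counts = np.bincount(ranks, minlength=n_draws + 1)
    expected = len(ranks) / (n_draws + 1)
    return float(np.sum((counts - expected) ** 2 / expected))

def exact_posterior(y, n, rng):
    return rng.normal(y / 2.0, np.sqrt(0.5), size=n)

def overconfident_posterior(y, n, rng):
    # Hypothetical bug: posterior standard deviation is too small.
    return rng.normal(y / 2.0, 0.3, size=n)

chi2_good = uniformity_chi2(sbc_ranks(exact_posterior))
chi2_bad = uniformity_chi2(sbc_ranks(overconfident_posterior))
```

Both samplers would pass a unit test that only checks output shape or the posterior mean, but the overconfident one piles ranks at the extremes and yields a much larger statistic.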
Results
The calibration oracle achieved an AUC of 0.97 in detecting misspecifications, flagging 88% of bugs at a 2% false positive rate. A reference-free version of the oracle detected 62-78% of bugs without needing a correct program. Calibration feedback raised success rates from 33% to 92% for GPT-5.1 and from 75% to 100% for Claude. The study also found that 15-47% of runnable programs generated by LLMs were statistically misspecified.
Implications
The findings suggest that integrating calibration into the development process of probabilistic programs can enhance the reliability and accuracy of models generated by LLMs. This approach could lead to better statistical modeling practices and improve the deployment of LLMs in generating complex probabilistic programs.
ReGuide: From Test-Time Guidance to Self-Improving Diffusion Policies
Robotics
Reinforcement Learning
Generative Models
- ReGuide repurposes guided rollouts as training data for iterative self-improvement.
- Phase-Conditioned Guidance (PCG) is used to generate reliable corrective rollouts.
- The framework allows for both fine-tuning and retraining of policies using successful guided rollouts.
- ReGuide demonstrates significant performance improvements over existing methods, achieving 1.3–7.7× success rate increases.
Summary
The paper introduces ReGuide, a framework designed to improve behavior-cloned diffusion policies, which are susceptible to covariate shift. Existing methods either expand the training distribution through expert corrections or apply test-time guidance from a learned model. ReGuide instead repurposes guided rollouts as reusable on-policy recovery data, allowing for iterative self-improvement of the policy. It employs Phase-Conditioned Guidance (PCG) to generate corrective rollouts that are phase-specific and only applied in recoverable scenarios. Successful rollouts are then integrated back into the policy through two mechanisms: ReGuide-FT (fine-tuning) and ReGuide-FS (retraining from scratch). This iterative process makes the policy more robust to distribution shift and improves task success rates across various Robomimic tasks.
Methodology
ReGuide employs Phase-Conditioned Guidance (PCG) to create corrective rollouts that are phase-specific and applicable in recoverable scenarios. It integrates successful rollouts back into the policy using two update mechanisms: ReGuide-FT for fine-tuning and ReGuide-FS for retraining from scratch. The process is iterative, allowing the policy to continuously improve based on the quality of the recovery data generated during execution.
Results
ReGuide improved the success rates of base policies by 1.3 to 7.7 times across various Robomimic tasks. It outperformed existing test-time guidance methods like LPB, demonstrating that the guided recovery data is more beneficial than simply increasing the volume of unguided rollouts. The iterative nature of the framework further enhances performance with each cycle of rollout generation and policy update.
Implications
The ReGuide framework has significant implications for robotics and other fields where imitation learning is applied, particularly in scenarios involving long-horizon tasks. By enabling policies to self-improve without the need for continuous expert input, it reduces the reliance on expert demonstrations and enhances the adaptability of robotic systems in dynamic environments.
FedXDS: Leveraging Model Attribution Methods to counteract Data Heterogeneity in Federated Learning
Federated Learning
Interpretability
Efficient ML
- FedXDS is the first approach to leverage XAI for selective data sharing in federated learning.
- The method uses propagation-based attribution to identify and share only task-relevant features.
- FedXDS incorporates differential-metric privacy to enhance privacy guarantees while maintaining utility.
- Experimental results show superior accuracy and faster convergence compared to state-of-the-art methods.
Summary
This paper introduces FedXDS, a novel approach that utilizes explainable AI (XAI) methods to address the challenges of data heterogeneity in federated learning (FL). Federated learning allows multiple clients to collaboratively train models without sharing raw data, but it often suffers from performance degradation when client data are non-IID, that is, not independent and identically distributed across clients. The authors propose using feature attribution techniques to identify and selectively share the most relevant data elements among clients, thereby mitigating the effects of data heterogeneity. By employing propagation-based attribution methods, FedXDS identifies task-relevant features through a single backward pass, enabling efficient data sharing while maintaining privacy. The method incorporates differential-metric privacy techniques to ensure that sensitive information is protected, achieving strong privacy guarantees while enhancing model performance. Experimental results demonstrate that FedXDS consistently outperforms existing methods in terms of accuracy and convergence across various client numbers and heterogeneity settings, while also providing robustness against membership inference and feature inversion attacks.
Methodology
The authors developed FedXDS by integrating propagation-based feature attribution methods to identify relevant features for model predictions. This selective sharing of features is combined with differential-metric privacy techniques to protect sensitive information while improving model performance. The method involves a single backward pass over local datasets to generate attribution maps that highlight the most relevant input features.
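The selection step can be sketched with gradient-times-input attribution on a linear model, where the gradient of the logit with respect to the input is simply the weight vector. This is a simplified stand-in for the propagation-based attribution the summary describes, and the privacy noise step is omitted entirely. The function names and the top-k rule are assumptions for illustration.

```python
import numpy as np

def gradient_times_input(w, x):
    """Attribution of the logit w @ x + b to each input feature.

    For a linear model the gradient with respect to x is w, so one
    backward pass reduces to an elementwise product.
    """
    return w * x

def keep_top_k(x, attribution, k):
    """Zero out all but the k features with the largest absolute attribution."""
    keep = np.argsort(np.abs(attribution))[-k:]
    shared = np.zeros_like(x)
    shared[keep] = x[keep]
    return shared, np.sort(keep)

w = np.array([0.1, -2.0, 0.0, 1.5, 0.3])   # toy model weights
x = np.array([1.0, 1.0, 5.0, 2.0, -1.0])   # one local sample
attr = gradient_times_input(w, x)
shared, kept = keep_top_k(x, attr, k=2)
```

Note that feature 2 has the largest raw value but zero attribution, so it is not shared; only features the model actually relies on leave the client.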
Results
The experimental evaluations indicate that FedXDS achieves higher accuracy and faster convergence than existing federated learning methods across different client configurations and levels of data heterogeneity. The approach also provides strong theoretical and empirical privacy guarantees, demonstrating resilience against membership inference and feature inversion attacks.
Implications
FedXDS has significant implications for the deployment of federated learning systems, particularly in scenarios where data privacy is paramount. By enabling effective data sharing while preserving privacy, this approach can enhance model performance in various applications, such as healthcare, finance, and any domain where sensitive data is prevalent.
Muon learns balanced solutions in matrix factorization without slow saddle-to-saddle dynamics
Optimization
Theory
Efficient ML
- Muon optimizer avoids slow saddle-to-saddle dynamics, leading to faster convergence.
- It remains stable even with learning rates exceeding critical thresholds.
- The optimizer conserves a different matrix quantity than gradient descent, affecting convergence behavior.
- A learning rate schedule is proposed that achieves alignment in only two optimization steps.
Summary
This paper investigates the Muon optimizer's performance in matrix factorization tasks, contrasting its dynamics with those of traditional gradient descent. The authors identify three significant advantages of Muon: it avoids slow saddle-to-saddle dynamics, remains stable with high learning rates, and conserves a different matrix quantity during optimization. The Muon optimizer facilitates rapid convergence by allowing all singular values to grow uniformly, leading to early alignment of weights. The paper also presents a learning rate schedule that achieves near-perfect alignment in just two optimization steps. The findings suggest that Muon can effectively accelerate training in large models, making it a promising alternative to conventional optimization methods in matrix factorization.
Methodology
The authors analyze the Muon optimizer's dynamics in matrix factorization problems through theoretical derivations and empirical experiments. They compare the Muon update rule, which involves orthogonalization of gradients, to traditional gradient descent. The study includes deriving alignment rates and proposing a structured learning rate schedule to optimize convergence.
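The orthogonalization at the heart of the Muon update can be sketched on a two-factor matrix factorization loss. The version below computes the polar factor with an exact SVD and omits momentum; practical Muon implementations typically use momentum and an approximate Newton-Schulz iteration instead, and the problem sizes and learning rate here are arbitrary choices.

```python
import numpy as np

def orthogonalize(G):
    """Polar factor U @ Vt of G: same singular vectors, singular values set to 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def loss_and_grads(W1, W2, Y):
    """Loss 0.5 * ||W2 @ W1 - Y||_F^2 and its gradients for W1 and W2."""
    R = W2 @ W1 - Y
    return 0.5 * np.sum(R ** 2), W2.T @ R, R @ W1.T

def muon_step(W1, W2, Y, lr):
    """One Muon-style step (momentum omitted for brevity)."""
    _, g1, g2 = loss_and_grads(W1, W2, Y)
    return W1 - lr * orthogonalize(g1), W2 - lr * orthogonalize(g2)

rng = np.random.default_rng(0)
d = 8
Y = rng.normal(size=(d, d))
W1 = 1e-3 * rng.normal(size=(d, d))   # small initialization
W2 = 1e-3 * rng.normal(size=(d, d))
W1_new, W2_new = muon_step(W1, W2, Y, lr=0.05)
```

Because every singular value of the update equals one, each step moves all singular directions by the same amount regardless of how small the raw gradient is, which matches the summary's description of singular values growing uniformly.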
Results
The study demonstrates that the Muon optimizer leads to faster alignment of weights compared to gradient descent, with empirical results aligning closely with theoretical predictions. The proposed learning rate schedule effectively achieves alignment in just two steps, significantly enhancing the optimization process in matrix factorization tasks.
Implications
The findings suggest that the Muon optimizer could be a valuable tool for training large models in various applications, including recommender systems and representation learning. Its ability to handle high learning rates and achieve rapid convergence may lead to more efficient training processes in machine learning.
Sequential RC-TGAN: Generating Relational Time Series with Spectral Envelope Loss
Generative Models
Time Series
Optimization
- Introduction of Seq. RC-TGAN, a temporal extension of the RC-TGAN framework.
- Novel spectral envelope loss function that optimizes latent periodic structures in categorical time series.
- Extension of spectral methodology to continuous features using a Gaussian Mixture Model discretization.
- Development of new evaluation metrics for assessing frequency-domain fidelity.
Summary
This paper presents the Sequential RC-TGAN (Seq. RC-TGAN), a novel framework for generating synthetic relational time series data that effectively captures complex temporal dynamics, particularly in categorical time series. Traditional methods, such as one-hot encoding, fail to represent intrinsic frequency-domain features like seasonality and cyclicity. The authors introduce a differentiable loss function based on Spectral Envelope Theory, which allows the generator to optimize the preservation of latent periodic structures through backpropagation. The methodology includes a Variational Gaussian Mixture Model (VGM) discretization strategy to extend spectral envelope regularization to continuous time series. The authors establish a rigorous evaluation standard by simulating categorical time series with known theoretical spectral envelopes, creating a robust benchmark for assessing the frequency-domain fidelity of their generative framework. They also propose two new evaluation metrics: Spectral Density Divergence and Spectral Envelope Divergence. Experimental results demonstrate that Seq. RC-TGAN significantly outperforms existing state-of-the-art systems in reproducing cyclic patterns and long-term seasonality across both categorical and continuous features.
Methodology
The Seq. RC-TGAN framework integrates Spectral Envelope Theory into a conditional sequential GAN architecture. It employs a novel spectral loss function to minimize the distance between the spectral envelopes of real and synthetic data. The framework also utilizes a Variational Gaussian Mixture Model for discretizing continuous features, allowing for the simultaneous capture of frequency-domain features across mixed data types.
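As background for the loss, the spectral envelope of a categorical series at a given frequency is the largest share of variance that any real-valued scaling of the categories can place at that frequency. A rough periodogram-based estimate is sketched below on a synthetic period-3 series. The normalization, the unsmoothed periodogram, and the helper name are assumptions for illustration, and this is not the paper's differentiable loss.

```python
import numpy as np

def spectral_envelope(series, n_categories):
    """Periodogram-based spectral envelope of a categorical series.

    Returns Fourier frequencies in (0, 0.5] and, for each, the largest
    eigenvalue of V^{-1/2} Re(I(w)) V^{-1/2}, where I is the periodogram
    matrix of the indicator vectors and V their sample covariance.
    The last category is dropped so that V is invertible.
    """
    x = np.asarray(series)
    n = len(x)
    Y = (x[:, None] == np.arange(n_categories - 1)[None, :]).astype(float)
    Y -= Y.mean(axis=0)
    V = Y.T @ Y / n
    evals, evecs = np.linalg.eigh(V)
    V_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    D = np.fft.fft(Y, axis=0) / np.sqrt(n)     # DFT of each indicator column
    idx = np.arange(1, n // 2 + 1)
    env = np.empty(len(idx))
    for pos, j in enumerate(idx):
        I_re = np.real(np.outer(D[j], np.conj(D[j])))
        env[pos] = np.linalg.eigvalsh(V_inv_sqrt @ I_re @ V_inv_sqrt)[-1]
    return idx / n, env

rng = np.random.default_rng(0)
n = 300
x = np.arange(n) % 3                           # period-3 cycle over 0, 1, 2
flip = rng.random(n) < 0.1                     # corrupt roughly 10% of entries
x = np.where(flip, rng.integers(0, 3, size=n), x)
freqs, env = spectral_envelope(x, n_categories=3)
peak = freqs[np.argmax(env)]                   # expected at frequency 1/3
```

A spectral loss in the spirit of the paper would compare envelopes like `env` between real and generated sequences; comparing one-hot encodings sample by sample would not expose this periodic structure.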
Results
The experimental results indicate that Seq. RC-TGAN significantly outperforms existing generative models in accurately reproducing cyclic patterns and long-term seasonality in both categorical and continuous features. The proposed metrics effectively evaluate the frequency-domain fidelity of the generated time series.
Implications
The findings suggest that Seq. RC-TGAN can be applied in various domains requiring synthetic data generation, particularly where relational databases and complex temporal dynamics are involved, such as finance, healthcare, and social networks. The integration of frequency-domain analysis into generative models opens new avenues for improving the realism and utility of synthetic data.
Replica Symmetry Breaking and Algorithmic Thresholds in Empirical Risk Minimization under Multi-Index Model
Theory
Optimization
Efficient ML
- Develops a precise understanding of the empirical risk landscape in high-dimensional settings.
- Introduces the Incremental Approximate Message Passing (IAMP) algorithm for empirical risk minimization.
- Characterizes the relationship between training and test errors in the context of high-dimensional asymptotics.
- Demonstrates that the proposed algorithm achieves optimal performance among polynomial-time algorithms.
Summary
This paper addresses the challenges of optimizing high-dimensional non-convex empirical risk functions in machine learning, particularly in the context of the multi-index model. The authors explore the landscape of empirical risk minimization (ERM) and identify the regions accessible to polynomial-time algorithms. They introduce an Incremental Approximate Message Passing (IAMP) algorithm, which is designed to learn models based on m-dimensional projections of data derived from standard Gaussian feature vectors. The paper provides a detailed analysis of the training and test errors achieved by the IAMP algorithm in high-dimensional asymptotics, specifically as the sample size (n) and dimensionality (d) approach infinity while maintaining a constant ratio (n/d → α). The authors assert that the performance of their proposed algorithm is optimal among polynomial-time algorithms, contributing to a deeper understanding of the empirical risk landscape and the algorithmic thresholds for achieving low training and test errors.
Methodology
The authors utilize a theoretical framework based on statistical physics and algorithmic analysis to study the empirical risk minimization problem. They introduce the IAMP algorithm, which incrementally approximates the message passing process to optimize the empirical risk function. The analysis includes characterizing the training error and establishing connections to stochastic optimal control and dual values.
Results
The paper presents results showing that the IAMP algorithm achieves a specific training error that is optimal in the polynomial-time regime. The relationship between training and test errors is characterized, revealing insights into the performance of the algorithm as the dimensionality of the data increases. The findings suggest that the algorithm can effectively navigate the complex landscape of local minima typical in high-dimensional empirical risk functions.
Implications
The results have significant implications for the design of machine learning algorithms, particularly in understanding the limits of polynomial-time optimization methods in high-dimensional settings. The insights gained from this study can inform future research on algorithmic strategies for empirical risk minimization and the theoretical foundations of machine learning.
Predict, Reuse, and Repair: Accelerating Dynamic Sparse Attention for Long-Context LLM Decoding
Large Language Models
Efficient ML
- PRR reduces per-token decoding latency by up to 40% while maintaining accuracy.
- The method exploits temporal locality in DSA selections for efficient prediction and speculation.
- Incremental attention repair allows for efficient correction of missed blocks without recomputing attention from scratch.
- PRR achieves significant speedup over existing DSA methods across multiple LLMs.
Predict, Reuse, and Repair: Accelerating Dynamic Sparse Attention for Long-Context LLM Decoding
Summary
This paper introduces PRR (Predict, Reuse, and Repair), a novel runtime for dynamic sparse attention (DSA) that enhances the decoding efficiency of long-context large language models (LLMs). DSA improves performance by focusing on the top-K key-value (KV) blocks relevant to each query, but it creates a latency bottleneck due to the serialized dependency between selection and attention stages. PRR addresses this issue by leveraging temporal locality in DSA selections to predict likely blocks, allowing speculative attention computation while the selection process is ongoing. The method employs a lightweight exponential moving average (EMA)-based predictor to anticipate the upcoming top-K set and utilizes a profiling-guided speculation budget to manage the speculative workload. Once the actual selected set is known, PRR incrementally repairs any missed blocks using a customized CUDA kernel based on FlashAttention, ensuring that the final output maintains the accuracy of standard DSA. The proposed approach significantly reduces per-token decoding latency by up to 40% across various long-context benchmarks while preserving task accuracy.
Methodology
PRR employs a lightweight EMA-based predictor to anticipate the top-K KV blocks, executes speculative attention in parallel with the selection process, and implements an incremental repair mechanism using a customized CUDA kernel to efficiently incorporate missed blocks into the attention output.
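The predict-and-repair idea can be sketched in NumPy. This is an illustration under stated assumptions, not PRR's CUDA kernel: the function names, the `decay` value, and the choice to keep per-block partial results (dropping speculated blocks that end up unselected) are all hypothetical. The merge step uses the standard log-sum-exp combination of partial softmax results that FlashAttention-style kernels rely on.

```python
import numpy as np

def ema_update(scores_ema, selected, num_blocks, decay=0.8):
    """Blend the previous EMA with a 0/1 indicator of the blocks just selected."""
    hit = np.zeros(num_blocks)
    hit[list(selected)] = 1.0
    return decay * scores_ema + (1.0 - decay) * hit

def predict_blocks(scores_ema, budget):
    """Speculate on the `budget` blocks with the highest EMA score."""
    return set(np.argsort(-scores_ema)[:budget].tolist())

def partial_attention(q, K, V, blocks, block_size):
    """Softmax attention restricted to `blocks`; returns (output, log-sum-exp)."""
    idx = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in sorted(blocks)]
    )
    logits = K[idx] @ q / np.sqrt(q.shape[0])
    m = logits.max()
    w = np.exp(logits - m)
    return (w @ V[idx]) / w.sum(), m + np.log(w.sum())

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results computed over disjoint key sets."""
    lse = np.logaddexp(lse_a, lse_b)
    return np.exp(lse_a - lse) * out_a + np.exp(lse_b - lse) * out_b, lse

def repair(q, K, V, speculated_partials, true_blocks, block_size):
    """Reuse speculated per-block results; compute only the missed blocks."""
    result = None
    for b in sorted(true_blocks):
        part = speculated_partials.get(b)
        if part is None:  # missed block: not speculated, so compute it now
            part = partial_attention(q, K, V, {b}, block_size)
        result = part if result is None else merge(*result, *part)
    return result[0]
```

Because the merge is exact, the repaired output matches attention computed directly over the true top-K set, which is the property the methodology describes as preserving standard DSA accuracy.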
Results
PRR achieves a per-token decoding speedup of 1.42× over Quest and 1.56× over InfLLM-v2 across various long-context benchmarks, demonstrating substantial improvements in efficiency without sacrificing accuracy.
Implications
The findings suggest that PRR can significantly enhance the performance of LLMs in applications requiring long-context processing, such as deep research and multi-step reasoning, making them more practical for real-world use cases.
Optimizing Nursing Care Taxi Dispatch Leveraging Integer Linear Programming Solvers and Machine Learning
Optimization
- Introduction of the Nursing Care Taxi Dispatch (NCTD) problem as a complex variant of the Vehicle Routing Problem (VRP).
- Utilization of a supervised machine learning approach based on the Transformer architecture combined with integer linear programming solvers.
- Effective handling of multiple constraints including wheelchair accessibility and user compatibility.
- Demonstrated improvements in operating time and constraint violation rates compared to existing methods.
Optimizing Nursing Care Taxi Dispatch Leveraging Integer Linear Programming Solvers and Machine Learning
Summary
This paper presents a novel optimization problem termed Nursing Care Taxi Dispatch (NCTD), which is a complex variant of the Vehicle Routing Problem (VRP). The NCTD problem incorporates multiple real-world constraints such as wheelchair accessibility, user compatibility, and strict pick-up and drop-off time windows. Unlike traditional methods that often simplify constraints, this approach addresses the multifaceted needs of nursing care transportation. The authors propose a hybrid methodology that combines supervised machine learning, specifically leveraging the Transformer architecture, with integer linear programming (ILP) solvers. Initially, high-quality solutions are generated using ILP for given inputs, which are then utilized to train the machine learning model. Post-processing techniques are applied to ensure that all constraints are satisfied in the generated paths. The performance of the proposed method is evaluated against existing approaches, focusing on operating time, execution time, and constraint violation rates using real-world facility data. The results indicate that the proposed method achieves a significant reduction in operating time while maintaining low constraint violations, demonstrating its effectiveness in optimizing nursing care taxi dispatch.
Methodology
The authors developed a hybrid optimization approach that first employs integer linear programming (ILP) to generate high-quality solutions for the NCTD problem. These solutions are then used to train a supervised machine learning model based on the Transformer architecture. Post-processing is applied to ensure compliance with all constraints.
Results
The proposed method achieved a reduction in operating time across all problem sizes and regions, with a notable decrease of up to 8% for instances involving fewer than 30 users. The method also maintained a low constraint violation rate compared to existing optimization techniques.
Implications
The findings suggest that the proposed optimization framework can significantly enhance the efficiency of nursing care transportation services, addressing the unique needs of elderly users. This approach may also be applicable to other specialized transportation services requiring complex routing solutions.
Learning Gaussian Graphical Models from a Glauber Trajectory Without Mixing
Graph Learning
Theory
- Introduces a polynomial-time algorithm for learning Gaussian graphical models from Glauber dynamics.
- Establishes a method that does not depend on mixing time, addressing a significant gap in existing approaches.
- Combines local edge testing and robust statistical aggregation to handle temporal dependencies in data.
- Provides theoretical guarantees for the accuracy of the proposed method despite the challenges posed by non-i.i.d. observations.
Learning Gaussian Graphical Models from a Glauber Trajectory Without Mixing
Summary
This paper addresses the challenge of learning the structure of a d-sparse Gaussian graphical model (GGM) from a single trajectory of Glauber dynamics, a scenario common in applications where observations are temporally correlated rather than independent and identically distributed (i.i.d.). The authors present a polynomial-time algorithm that recovers the conditional-independence graph from such a trajectory without relying on mixing time. The algorithm consists of three main components: estimating conditional variances and rescaling the trajectory to the unit-diagonal case, designing a local edge test to extract adjacency information from short update windows, and aggregating local statistics using a robust median-based estimator. The paper shows that the proposed method recovers the graph accurately despite the temporal dependence inherent in the data, filling a gap in the existing literature, which lacked polynomial-time solutions for learning GGMs from non-i.i.d. samples.
Methodology
The methodology involves three key steps: first, estimating conditional variances and normalizing the trajectory to a unit-diagonal form; second, conducting local edge tests to gather adjacency information from short update windows; and third, using a robust median-based estimator to aggregate these local statistics, ensuring accuracy in the presence of temporal dependence.
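To make the data model concrete, the following sketch simulates single-site Glauber dynamics for a GGM with precision matrix `theta` and shows a generic median-of-means aggregator. It is only an illustration: the paper's local edge test is not reproduced, and `glauber_trajectory`, `median_of_means`, and the zero initial state are assumptions made for this example. The update uses the standard Gaussian conditional, with mean -(1/Θ_ii) Σ_{j≠i} Θ_ij x_j and variance 1/Θ_ii.

```python
import numpy as np

def glauber_trajectory(theta, steps, rng):
    """Resample one uniformly chosen coordinate per step from its conditional law."""
    d = theta.shape[0]
    x = np.zeros(d)  # arbitrary (unmixed) starting state
    traj = np.empty((steps, d))
    sites = np.empty(steps, dtype=int)
    for t in range(steps):
        i = rng.integers(d)
        off_diag = theta[i] @ x - theta[i, i] * x[i]  # sum over j != i
        cond_mean = -off_diag / theta[i, i]
        cond_std = 1.0 / np.sqrt(theta[i, i])
        x[i] = cond_mean + cond_std * rng.standard_normal()
        traj[t] = x
        sites[t] = i
    return traj, sites

def median_of_means(values, num_groups):
    """Robust aggregate: median of the means of consecutive groups."""
    groups = np.array_split(np.asarray(values, dtype=float), num_groups)
    return float(np.median([g.mean() for g in groups]))
```

Consecutive states differ in at most one coordinate, which is why the samples are strongly dependent and a robust aggregator such as a median is a natural choice over a plain average.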
Results
The authors prove that their algorithm can accurately recover the conditional-independence graph from a single Glauber trajectory, achieving this in polynomial time without the need for mixing time considerations. This represents a significant advancement over previous methods that required i.i.d. samples or relied on mixing-based reductions.
Implications
The findings have broad implications for fields that utilize Gaussian graphical models, particularly in scenarios where data is collected over time rather than as independent samples. This could enhance the analysis of complex systems in neuroscience, genomics, and other areas where understanding conditional dependencies is crucial.
Improving Certified Robustness via Adversarial Distillation
Theory
Optimization
- Introduction of AD-CERT, a new certified training objective combining adversarial distillation and IBP bounds.
- AD-CERT achieves state-of-the-art certified accuracy on multiple robustness benchmarks.
- Logit-level distillation from a robust teacher is shown to be more effective than clean or feature-space distillation.
- The method provides a better trade-off between certified robustness and standard accuracy.
Improving Certified Robustness via Adversarial Distillation
Summary
This paper presents a novel approach to enhance the certified robustness of neural networks through a method called Adversarial Distillation for Certification (AD-CERT). The authors note that traditional certified training methods, while effective in producing certifiable models, often compromise standard accuracy. Conversely, adversarial training improves empirical robustness but is challenging to certify. The proposed AD-CERT combines adversarial distillation from a robust teacher model with Interval Bound Propagation (IBP) upper bounds, creating a new certified training objective. This method leverages the strengths of both adversarial training and certified training, resulting in improved certified accuracy without sacrificing standard performance. The authors demonstrate that AD-CERT achieves state-of-the-art certified performance across various benchmarks, outperforming existing certified training methods. Additionally, they provide a systematic analysis showing that logit-level distillation is more effective than other distillation approaches in preserving robustness and enhancing certified training outcomes.
Methodology
The authors propose AD-CERT, which integrates adversarial distillation from a robust teacher model with IBP upper bounds. This involves distilling adversarial information at the logit level, creating a scalarization between a teacher-guided lower bound and a certified upper bound on the robust loss. The methodology includes extensive experimental evaluations across standard certified training benchmarks to validate the performance of AD-CERT.
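A forward-only NumPy sketch of the two ingredients may help. Interval Bound Propagation through affine and ReLU layers is standard; the way the two terms are combined here (`combined_loss` with weight `lam`, a KL distillation term on logits) is an assumption for illustration, since the exact AD-CERT objective is not given in this summary.

```python
import numpy as np

def ibp_affine(lo, hi, W, b):
    """Propagate an axis-aligned box through x -> W x + b."""
    mid, rad = (lo + hi) / 2.0, (hi - lo) / 2.0
    new_mid = W @ mid + b
    new_rad = np.abs(W) @ rad
    return new_mid - new_rad, new_mid + new_rad

def ibp_logit_bounds(x, eps, layers):
    """Bounds on the logits for all inputs within an L-infinity ball of radius eps."""
    lo, hi = x - eps, x + eps
    for k, (W, b) in enumerate(layers):
        lo, hi = ibp_affine(lo, hi, W, b)
        if k < len(layers) - 1:  # ReLU on hidden layers only
            lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)
    return lo, hi

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def certified_ce(lo, hi, label):
    """Cross-entropy on worst-case logits: an upper bound on the robust loss."""
    worst = hi.copy()
    worst[label] = lo[label]
    return float(-log_softmax(worst)[label])

def distill_kl(teacher_logits, student_logits):
    """KL(teacher || student) computed on logits (logit-level distillation)."""
    log_p, log_q = log_softmax(teacher_logits), log_softmax(student_logits)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))

def combined_loss(lo, hi, label, teacher_logits, student_adv_logits, lam):
    """Hypothetical scalarization between the distillation and certified terms."""
    distill = distill_kl(teacher_logits, student_adv_logits)
    return (1.0 - lam) * distill + lam * certified_ce(lo, hi, label)
```

Setting `lam` to 1 recovers plain IBP training and `lam` to 0 recovers pure distillation, so the weight moves between the two regimes the summary contrasts.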
Results
AD-CERT achieves state-of-the-art certified accuracy improvements of up to 5.40 percentage points over existing methods. The experimental results indicate that the proposed approach effectively combines the benefits of adversarial training and certified training, leading to models that are both robust and certifiable.
Implications
The findings suggest that AD-CERT could be applied in safety-critical domains where certified robustness is essential, such as autonomous driving and medical diagnostics. The improved trade-offs between certified robustness and standard accuracy may lead to more reliable AI systems in real-world applications.
OTCache: Optimal Transport for Geometry-Aware Caching in Diffusion Models
Generative Models
Optimization
Efficient ML
- OTCache provides a training-free framework for accelerating diffusion model sampling.
- It overcomes limitations of existing caching methods by modeling schedule evolution as a smooth trajectory in policy space.
- The framework achieves significant acceleration in sampling while improving fidelity.
- Experiments validate OTCache on multiple image and video generation models.
OTCache: Optimal Transport for Geometry-Aware Caching in Diffusion Models
Summary
The paper introduces OTCache, a framework that accelerates diffusion model sampling through training-free prediction of caching schedules. Traditional graph-based caching methods optimize shortest-path objectives but often fail in low-NFE (Number of Function Evaluations) scenarios because they assume that per-step errors are additive and independent. OTCache addresses this limitation by modeling caching schedules as a smooth evolution in policy space, inspired by Optimal Transport (OT). The framework operates in three stages: first, it generates a high-fidelity reference schedule using a graph-based method under a conservative budget; second, it conducts a lightweight anchor search in a low-budget setting using Optuna optimization; and third, it predicts schedules for target budgets through quantile interpolation between the reference and anchor policies. Experimental results show that OTCache achieves 4.5×, 4.7×, and 3.66× speedups on FLUX.1, Qwen-Image, and HunyuanVideo, respectively, while also improving generation fidelity compared to existing caching methods.
Methodology
OTCache employs a three-stage process: (1) it establishes a high-fidelity reference schedule using a graph-based caching method; (2) it performs a lightweight anchor search for low-budget settings via Optuna optimization; (3) it predicts target-budget schedules through quantile interpolation in Wasserstein space, leveraging the geometric structure of Optimal Transport.
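The third stage can be illustrated with a one-dimensional sketch. In one dimension, the Wasserstein geodesic between two distributions is obtained by linearly interpolating their quantile functions. The representation below is an assumption made for this example: a schedule is treated as the sorted list of step indices at which the model is fully evaluated, and the interpolation weight `t` is tied linearly to the budget. OTCache's actual policy representation may differ.

```python
import numpy as np

def interpolate_schedule(reference, anchor, target_budget):
    """Predict a schedule for `target_budget` full evaluations.

    `reference` is a high-budget schedule and `anchor` a low-budget one;
    both are lists of step indices where the model is fully evaluated.
    """
    n_ref, n_anc = len(reference), len(anchor)
    if n_ref == n_anc:
        raise ValueError("reference and anchor need different budgets")
    # t = 0 at the reference budget, t = 1 at the anchor budget.
    t = (n_ref - target_budget) / (n_ref - n_anc)
    levels = np.linspace(0.0, 1.0, target_budget)
    q_ref = np.quantile(np.sort(np.asarray(reference, dtype=float)), levels)
    q_anc = np.quantile(np.sort(np.asarray(anchor, dtype=float)), levels)
    blended = (1.0 - t) * q_ref + t * q_anc  # displacement interpolation in 1D
    # Rounding can merge neighbours, so the result may use fewer steps than budgeted.
    return np.unique(np.rint(blended).astype(int)).tolist()
```

With this construction the endpoints are reproduced: asking for the reference budget returns the reference schedule, and asking for the anchor budget returns the anchor schedule.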
Results
OTCache achieved acceleration factors of 4.5× on FLUX.1, 4.7× on Qwen-Image, and 3.66× on HunyuanVideo, while consistently improving generation fidelity over state-of-the-art caching baselines.
Implications
The findings suggest that OTCache can enhance the deployment of diffusion models in resource-constrained environments, making them more accessible for real-time applications. The approach also opens avenues for further research in optimal transport applications within machine learning.
CoMet: Context and Multiplicity Decomposition for Multimodal Uncertainty Estimation
Multimodal
Large Language Models
NLP
- Introduces CoMet, a method for uncertainty estimation in multimodal large language models.
- Decomposes uncertainty into context-specific and multiplicity-specific components.
- Utilizes a lightweight post-hoc uncertainty module for efficient estimation.
- Demonstrates improved performance on multimodal benchmarks compared to existing methods.
CoMet: Context and Multiplicity Decomposition for Multimodal Uncertainty Estimation
Summary
The paper addresses the challenge of uncertainty estimation in multimodal large language models (MLLMs), which is crucial for reliable decision-making in complex, open-ended tasks. The authors propose a novel method called CoMet, which decomposes uncertainty into two components: context-specific uncertainty, which reflects ambiguity induced by the context (e.g., task or prompt), and multiplicity-specific uncertainty, which captures the number of plausible answers compatible with the input. CoMet employs a lightweight post-hoc uncertainty module that enables efficient estimation without the need for autoregressive answer generation or repeated sampling. The method is evaluated on various multimodal benchmarks, including hallucination detection and multiple-choice visual question answering, demonstrating consistent improvements over existing baselines while maintaining efficiency. The proposed approach provides a more nuanced understanding of uncertainty in MLLMs, which is essential for applications requiring self-evaluation, abstention, and self-improvement.
Methodology
The authors develop a decomposition of predictive uncertainty into context-specific and multiplicity-specific terms. They introduce a binary matching variable to quantify semantic compatibility between inputs and answers, employing Rényi entropy to measure uncertainty in open-ended settings. This approach avoids reliance on model-specific generation capabilities and costly fine-tuning.
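For reference, the Rényi entropy of order α of a discrete distribution p is H_α(p) = log(Σ_i p_i^α) / (1 − α), which tends to the Shannon entropy as α approaches 1. The sketch below computes it with NumPy; how CoMet applies it to the matching variable is not reproduced here, and the function name is chosen only for this example.

```python
import numpy as np

def renyi_entropy(probs, alpha):
    """Rényi entropy (in nats) of a discrete distribution; alpha = 1 gives Shannon."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # zero-probability outcomes contribute nothing
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))
```

A uniform distribution over k answers has entropy log k for every α, while a point mass has entropy 0, so the value tracks how many answers remain plausible.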
Results
CoMet consistently outperforms existing uncertainty estimation methods across various open-ended multimodal benchmarks, showing significant improvements in accuracy and efficiency in uncertainty estimation.
Implications
The proposed method enhances the reliability of multimodal AI systems by providing better uncertainty estimates, which can improve decision-making processes in applications such as self-evaluation, abstention, and hallucination detection.
Probabilistic Inversion with Flow Matching
Generative Models
Optimization
Theory
- Flow Matching is adapted for probabilistic inversion in geophysics, enhancing traditional methods.
- Probabilistic inversion allows for uncertainty assessment without requiring initial guesses or regularization.
- The method is evaluated through case studies, demonstrating its applicability to both simple and complex models.
- Flow Matching bridges the gap between Variational Inference and Diffusion Models in probabilistic modeling.
Probabilistic Inversion with Flow Matching
Summary
This paper presents the application of Flow Matching, a technique from generative AI, to the field of probabilistic inversion in geophysics, particularly in seismic Full-Waveform inversion. The authors adapt the mathematical framework of Flow Matching to probabilistic inversion, which traditionally faces challenges due to the ill-posed nature of inverse problems. The paper outlines the limitations of deterministic inversion methods, which often rely on unique solutions and require significant regularization. In contrast, probabilistic inversion aims to express the non-uniqueness of solutions by finding a probability distribution of possible models given observed data. The authors demonstrate the effectiveness of their approach through two case studies: a simple 2D velocity model and the OpenFWI dataset, showcasing the method's capability to handle complex seismic velocity models without the need for initial guesses or regularization. The results indicate that Flow Matching can efficiently transform a known initial distribution into a target distribution, thereby facilitating the probabilistic inversion process in geophysical applications.
Methodology
The authors utilize Flow Matching, a deep learning technique, to iteratively transform a known initial distribution into a target distribution relevant for probabilistic inversion. This involves training Continuous Normalizing Flows (CNF) and solving an Ordinary Differential Equation (ODE) during inference. The method is evaluated through two case studies: a simple 2D velocity model and the OpenFWI dataset, illustrating its effectiveness in probabilistic inversion scenarios.
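The two pieces named above, a regression target for training and an ODE solve at inference, can be sketched as follows. This assumes the common straight-line probability path and a fixed-step Euler solver; the paper's network, its conditioning on observed seismic data, and its solver are not reproduced, and both function names are hypothetical.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 and the velocity regression target."""
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
    return x
```

As a sanity check, when the target distribution is a single point c, the exact velocity field for the straight-line path is (c − x) / (1 − t), and integrating it with `euler_sample` carries any starting sample to c.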
Results
The application of Flow Matching to the 2D velocity model successfully demonstrated the method's ability to recover possible seismic velocity distributions from observed data. The case studies indicated that the approach could efficiently handle complex seismic models, providing a viable alternative to traditional deterministic inversion methods.
Implications
The findings suggest that Flow Matching can significantly improve the efficiency and accuracy of probabilistic inversion in geophysical applications, potentially leading to better seismic imaging and interpretation. This method could be extended to other areas of geophysics and beyond, where uncertainty quantification is critical.
Evil Spectra: How Optimisers can Amplify or Suppress Emergent Misalignment
NLP
Large Language Models
Optimization
- Optimizer choice significantly influences emergent misalignment severity, with a 7× variation in misalignment rates observed.
- Model size and family have negligible effects on emergent misalignment when using the Adam optimizer.
- Final log training loss is a strong predictor of alignment, but the optimizer's role becomes more critical after extensive training.
- Optimizers that produce flatter singular value spectra in learned weights better preserve alignment.
Evil Spectra: How Optimisers can Amplify or Suppress Emergent Misalignment
Summary
This paper investigates the phenomenon of emergent misalignment (EM) in large language models (LLMs), where fine-tuning on narrowly misaligned tasks can lead to broadly misaligned behavior on unrelated prompts. The authors conduct a systematic study to characterize the sensitivity of EM to various training choices, particularly focusing on the choice of optimizer. Through extensive experiments involving multiple Qwen3 models, optimizers, datasets, and batch sizes, they find that the optimizer choice has the most significant impact on EM severity, with a 7× variation in misalignment rates. Model size and family show negligible effects. The study reveals that final log training loss is a strong predictor of alignment, but after significant training, the optimizer becomes more critical than training loss. The authors also explore the relationship between optimizer-induced singular value spectra and alignment preservation, proposing that optimizers producing flatter spectra better maintain alignment. They validate this by introducing a loss term that encourages a flatter singular value spectrum, which improves alignment for certain optimizers without significantly affecting training loss. Overall, the findings underscore the importance of optimizer selection in managing emergent misalignment in LLMs.
Methodology
The authors performed a systematic study involving a series of sweeps across different Qwen3 model sizes, optimizers, datasets, and batch sizes. They analyzed the relationship between training loss and alignment, tracked alignment dynamics during training, and examined the singular values of learned weights to understand the impact of optimizers on emergent misalignment. They also tested a modified training loss to incentivize flatter singular value spectra.
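One way to quantify how flat a singular value spectrum is, and to turn that into a penalty, is sketched below. The specific measure (variance of the log singular values) and the names `spectrum_flatness_penalty`, `regularized_loss`, and `coeff` are assumptions for illustration; the summary does not state the form of the paper's loss term. The code is forward-only NumPy, so using it in training would require an autodiff framework.

```python
import numpy as np

def spectrum_flatness_penalty(weight, eps=1e-12):
    """Variance of log singular values: 0 when all singular values are equal."""
    s = np.linalg.svd(weight, compute_uv=False)
    return float(np.var(np.log(s + eps)))

def regularized_loss(task_loss, weights, coeff):
    """Task loss plus a weighted sum of flatness penalties over weight matrices."""
    return task_loss + coeff * sum(spectrum_flatness_penalty(W) for W in weights)
```

An orthogonal or identity-like matrix incurs no penalty, while a matrix dominated by a few large singular values is penalized more heavily.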
Results
The study found that optimizer choice has the largest effect on emergent misalignment, with a 7× spread in misalignment rates. Model scale and family had negligible effects on misalignment when using the Adam optimizer. The final log training loss explained 74% of alignment variance, and stratifying by optimizer captured nearly all of the residual variance. The analysis of training dynamics revealed distinct trajectories for each optimizer in loss-alignment space, showing that after significant training the optimizer matters more than training loss. Adding a loss term that flattens singular value spectra improved alignment for certain optimizers without a significant increase in training loss.
Implications
These findings highlight the critical role of optimizer selection in the fine-tuning of large language models, particularly in safety-critical applications. By understanding and mitigating emergent misalignment, practitioners can improve the reliability and alignment of LLMs in various contexts, potentially leading to safer AI systems.
Safe Online Learning via Smooth Safety-Structured Policy Composition
Reinforcement Learning
Robotics
Theory
- AutoSafe integrates safety monitoring and intervention into the action generation process, enhancing smooth learning dynamics.
- The architecture allows for risk-dependent transitions between performance and safety, ensuring continuous interaction.
- Empirical results show AutoSafe provides strong safety assurance while maintaining stable learning dynamics.
- The method is validated in both simulated environments and real-world applications, such as a cart-pole system.
Safe Online Learning via Smooth Safety-Structured Policy Composition
Summary
The paper addresses the challenge of safe online reinforcement learning (RL), where agents must adhere to safety constraints while optimizing their performance. Traditional methods either enforce strict safety through abrupt action interventions, which can disrupt learning, or use soft constraints that allow temporary safety violations, risking safety in critical applications. The authors propose AutoSafe, a novel safety-aware policy architecture that integrates structured safety monitoring and intervention directly into the action generation process. This integration allows for smooth transitions between performance-driven and safety-preserving behaviors, facilitating continuous online interaction and learning. The empirical evaluation across various continuous-control benchmarks demonstrates that AutoSafe achieves strong safety enforcement without compromising learning smoothness. Additionally, the effectiveness of AutoSafe is validated in a real-world cart-pole system, showcasing its practical applicability in safety-critical environments.
Methodology
The authors developed AutoSafe by embedding risk monitoring and safe intervention as structural inductive biases within a differentiable policy composition framework. This approach allows for the backpropagation of learning signals through the safety mechanism, enabling stable policy optimization under safety constraints. The architecture also includes a safe policy prior that guides interventions, providing reliable fallback behaviors and ensuring smoother state trajectories.
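A minimal sketch of a smooth, risk-dependent composition is given below. The sigmoid gate, its `threshold` and `sharpness` parameters, and the convex blend are assumptions chosen to illustrate a differentiable alternative to a hard safety switch; they are not taken from the AutoSafe architecture itself.

```python
import numpy as np

def risk_gate(risk, threshold=0.5, sharpness=10.0):
    """Smooth gate in (0, 1): near 0 for low risk, near 1 for high risk."""
    return 1.0 / (1.0 + np.exp(-sharpness * (risk - threshold)))

def compose_action(perf_action, safe_action, risk, threshold=0.5, sharpness=10.0):
    """Blend the performance action and the safe fallback according to risk."""
    g = risk_gate(risk, threshold, sharpness)
    return (1.0 - g) * np.asarray(perf_action) + g * np.asarray(safe_action)
```

Because the blend is a smooth function of both actions and of the risk estimate, gradients can flow through it, unlike an abrupt override that replaces the action once a threshold is crossed.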
Results
AutoSafe was compared against various safety filter-based approaches and safe reinforcement learning baselines across multiple benchmarks. The results indicated that AutoSafe consistently exhibited stable learning dynamics, achieved strong safety assurance comparable to standard safety filters, and matched or outperformed state-of-the-art safe learning methods in terms of task performance. The real-world cart-pole training experiment further validated its practical effectiveness.
Implications
The proposed AutoSafe architecture has significant implications for the deployment of reinforcement learning agents in safety-critical applications, such as robotics and autonomous systems, where maintaining safety while optimizing performance is crucial. Its ability to ensure smooth learning dynamics while enforcing safety constraints can enhance the reliability and robustness of RL systems in real-world scenarios.
Amplifying Membership Signal Through Chained Regeneration
Generative Models
Multimodal
Theory
- Introduction of MADreMIA, a framework for enhancing membership inference through iterative regeneration.
- Demonstration that chained generations yield stronger membership signals than one-shot methods.
- Identification of 're-members' and 're-non-members' to differentiate between training data and unseen samples.
- Comprehensive evaluations across multiple generative model families showing improved inference efficiency.
Amplifying Membership Signal Through Chained Regeneration
Summary
This paper addresses the critical issue of sample verification in large generative models, which often memorize training data, posing risks for privacy and copyright. The authors introduce MADreMIA, a model-agnostic framework that enhances membership inference attacks (MIA) and dataset inference (DI) through an innovative approach called chained regeneration. Unlike traditional methods that rely on one-shot generations, MADreMIA utilizes iterative trajectories where each output serves as the subsequent input, thereby amplifying the membership signal. The framework distinguishes between 're-members' (training data) and 're-non-members' (unseen data), demonstrating that the former maintains coherence and stability during regeneration, while the latter degrades rapidly. The authors provide comprehensive evaluations across various model families, including image autoregressive models, diffusion models, and language models, showing that MADreMIA significantly improves the robustness of membership signals compared to standard one-shot methods. This work not only highlights the potential of recursive self-generation as a diagnostic tool but also proposes a scalable solution for effective membership and dataset inference without the need for expensive shadow model training.
Methodology
The authors propose a framework that leverages chained generations, where outputs from a generative model are recursively fed back as inputs. This iterative process allows for the amplification of membership signals by measuring the consistency of outputs over multiple generations, distinguishing between samples that were part of the training set and those that were not.
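The chained loop itself is simple to sketch. Here `regenerate` stands in for any model-specific reconstruction call, and `drift_from_original` is one hypothetical stability score; neither is MADreMIA's actual interface or metric.

```python
import numpy as np

def chained_regeneration(x, regenerate, steps):
    """Feed each output back in as the next input; return the whole trajectory."""
    trajectory = [np.asarray(x, dtype=float)]
    for _ in range(steps):
        trajectory.append(np.asarray(regenerate(trajectory[-1]), dtype=float))
    return trajectory

def drift_from_original(trajectory):
    """Mean distance of the regenerated samples from the starting sample."""
    origin = trajectory[0]
    return float(np.mean([np.linalg.norm(x - origin) for x in trajectory[1:]]))
```

Under the paper's hypothesis, a training-set member would yield a low drift because the model reproduces it stably, while an unseen sample would drift further with each step; a threshold on such a score would then act as the membership test.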
Results
The results indicate that MADreMIA significantly enhances the robustness of membership inference signals, with 're-members' exhibiting higher coherence and slower degradation compared to 're-non-members'. The framework shows improved performance across various model families, including image autoregressive models, diffusion models, and language models, outperforming traditional one-shot inference methods.
Implications
The findings suggest that MADreMIA can be a powerful tool for privacy auditing and copyright enforcement in generative models, providing a scalable and efficient method for detecting training data membership. This has significant implications for the responsible deployment of generative AI technologies, particularly in sensitive domains such as healthcare and content creation.
TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning
Reinforcement Learning
- TRIAGE introduces a role-typed credit assignment framework for agentic RL.
- It classifies actions into four semantic roles to improve credit assignment accuracy.
- The framework outperforms standard GRPO and other baseline methods in multiple benchmarks.
- TRIAGE reduces unnecessary actions in successful trajectories, enhancing efficiency.
TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning
Summary
The paper introduces TRIAGE, a novel framework for credit assignment in agentic reinforcement learning (RL) that addresses the limitations of standard Group Relative Policy Optimization (GRPO). Traditional GRPO assigns uniform credit based on final outcomes, which can misrepresent the value of individual actions within a trajectory. TRIAGE enhances this by introducing a role-typed credit assignment system, where a structured judge classifies actions into four categories: decisive progress, useful exploration, no-progress infrastructure, and regression. This classification allows for more nuanced credit assignment, correcting the blind spots of outcome-only credit systems. The authors demonstrate that TRIAGE significantly improves success rates across multiple benchmarks (ALFWorld, Search-QA, and WebShop) compared to GRPO and other baseline methods. The framework not only reduces the number of environment-facing actions but also emphasizes the importance of exploration in the learning process, leading to more efficient agent behavior.
Methodology
TRIAGE employs a structured judge, which uses a bounded local context to classify environment-facing actions into semantic roles. These roles inform a fixed role-conditioned rule that adjusts segment-level process rewards, allowing for a more accurate credit assignment based on the specific contributions of each action within a trajectory.
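The contrast with outcome-only credit can be sketched as follows. The group-relative advantage is the standard GRPO normalization; the `ROLE_ADJUSTMENT` values and the additive rule are hypothetical placeholders, since the summary names the four roles but not the numbers or the exact rule.

```python
import numpy as np

# Hypothetical per-role adjustments; the paper's fixed rule is not given here.
ROLE_ADJUSTMENT = {
    "decisive_progress": 0.5,
    "useful_exploration": 0.1,
    "no_progress_infrastructure": 0.0,
    "regression": -0.5,
}

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize outcome rewards within a sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def segment_credits(trajectory_advantage, roles):
    """Shift the shared trajectory advantage by a role-dependent amount per segment."""
    return [trajectory_advantage + ROLE_ADJUSTMENT[role] for role in roles]
```

For example, within a successful trajectory a segment judged as regression receives less credit than one judged as decisive progress, whereas plain GRPO would assign both the same value.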
Results
TRIAGE demonstrated improved success rates over GRPO across two policy models and three benchmarks (ALFWorld, Search-QA, WebShop). The framework reduced environment-facing actions by 10.4% and 14.8% relative to GRPO, indicating a more efficient decision-making process. Ablation studies confirmed that the gains were primarily due to the role typing mechanism rather than simply adding dense rewards.
Implications
The TRIAGE framework has the potential to enhance the performance of agentic reinforcement learning systems by providing a more nuanced understanding of action contributions. This could lead to more efficient training processes and improved agent behavior in complex environments, making it applicable in various domains such as robotics, autonomous systems, and interactive AI.
Interpretable Inverse Design of Metal-Organic Frameworks with Large Language Model Agents
NLP
Large Language Models
Generative Models
- LLM4MOF enables interpretable design of MOFs without requiring extensive property-labeled datasets.
- The framework operates autonomously through a closed-loop process, refining hypotheses over multiple iterations.
- It successfully identifies high-performing MOFs and generates new structures de novo.
- LLM4MOF outperforms random search and genetic algorithms in efficiency and effectiveness.
Summary
The paper presents LLM4MOF, a novel closed-loop framework that leverages large language model (LLM) agents for the inverse design of metal-organic frameworks (MOFs). Traditional methods for MOF design face challenges due to the vast combinatorial space and the high cost of obtaining property labels. LLM4MOF addresses these issues by employing two agents: one that generates interpretable design hypotheses based on chemical principles and another that translates these hypotheses into constraints for selecting candidate MOFs. The framework operates in two modes: database mode, where it identifies top-performing structures without prior knowledge of property values, and discovery mode, where it proposes new MOFs de novo. The closed-loop process iterates through hypothesis generation, constraint translation, candidate matching, hypothesis testing, and feedback refinement, allowing for efficient exploration of the design space. The results demonstrate that LLM4MOF can outperform traditional search methods, achieving significant performance in various adsorption, separation, and electronic-structure tasks with minimal evaluations.
Methodology
LLM4MOF employs a multi-agent system where one agent generates design hypotheses based on a natural-language objective, and a second agent translates these into constraints for candidate MOF selection. The framework iteratively tests and refines hypotheses through simulation-based evaluations, allowing for both database-driven and discovery-driven design.
Results
LLM4MOF concentrated its search on top-performing MOFs across six different tasks within approximately 400 property evaluations. In discovery mode, it autonomously proposed and validated new MOFs, demonstrating a compact-micropore design principle under live simulation conditions.
Implications
The findings suggest that LLMs can facilitate the interpretable and efficient design of complex materials like MOFs, potentially accelerating the discovery of new materials for applications in gas separation, catalysis, and energy storage without the need for extensive prior datasets.
Golden Hour Divide: Trauma Care Accessibility and Resource Vulnerability in Sri Lanka
- Significant disparities in trauma care accessibility exist across Sri Lanka, particularly in the Northern and Eastern provinces.
- The study introduces a framework using spatial analysis to quantify gaps in emergency care resources.
- Four policy-actionable archetypes of districts are identified based on their healthcare resource availability and clinical needs.
- Improving accessibility by 25% in high-priority areas could reduce the national need-gap by 9.65%.
Summary
This study investigates the accessibility of trauma care in Sri Lanka, focusing on the critical 'Golden Hour' for emergency medical interventions. Despite improvements in pre-hospital services, significant disparities in access to definitive care persist across the country's 25 districts. The authors employ a data-driven framework utilizing national epidemiological data and terrain-aware H3 hexagonal modeling to assess accessibility for seven critical medical conditions. Key metrics include the Spatial Gap (Gd), Need-Gap Index (NGI), and Lethality Ratio (Lr), which collectively highlight systemic inefficiencies in healthcare delivery. The study categorizes districts into four archetypes based on their resource availability and clinical needs, revealing severe service deficits in the Northern and Eastern provinces. The findings suggest that enhancing accessibility in high-priority areas could significantly reduce the national need-gap, providing a roadmap for equitable healthcare distribution. The study emphasizes the importance of specialist availability over mere bed capacity in addressing systemic pressures in the healthcare system.
Methodology
The authors utilized national epidemiological data and terrain-aware H3 hexagonal modeling to analyze accessibility. They defined key metrics such as Spatial Gap (Gd), Need-Gap Index (NGI), and Lethality Ratio (Lr) to evaluate healthcare resource distribution. Unsupervised K-Means clustering was applied to categorize districts into four archetypes based on their healthcare accessibility and resource availability.
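The clustering step can be sketched as follows. This is an illustrative K-Means (Lloyd's algorithm) in plain Python, not the authors' pipeline. The two features per district and all numeric values are synthetic placeholders, and the deterministic farthest-point initialization is an assumption chosen so the example is reproducible.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Deterministic initialization: start at the first point, then add the farthest point each time."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate nearest-center assignment and center recomputation."""
    centers = farthest_point_init(points, k)
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Synthetic (spatial_gap, need_gap) pairs for eight hypothetical districts.
districts = [
    (0.10, 0.10), (0.15, 0.12),  # low gap, low need
    (0.80, 0.85), (0.78, 0.90),  # high gap, high need
    (0.75, 0.15), (0.80, 0.10),  # high gap, low need
    (0.12, 0.80), (0.10, 0.85),  # low gap, high need
]
labels, centers = kmeans(districts, k=4)
```

Each resulting cluster would then be read as one archetype; how the study names and interprets its four archetypes is not specified in this summary.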
Results
The analysis revealed that severe service deficits exist in the Northern and Eastern provinces, with spatial gaps exceeding 70%, making timely care within the Golden Hour nearly impossible. The study found that improving accessibility in prioritized districts could significantly reduce the national need-gap, highlighting the critical role of specialist availability in the healthcare system.
Implications
The findings provide a strategic framework for policymakers to address healthcare inequities in Sri Lanka. By redistributing specialists and improving access in underserved regions, the study aims to enhance emergency care outcomes and reduce preventable mortality.
Few-Step Boltzmann Generators via Scalable Likelihood Flow Maps
Generative Models
Efficient ML
Theory
- SCALLOP introduces a Hutchinson-free likelihood distillation objective, improving scalability and reducing variance.
- The method achieves up to 100× reduction in training variance and faster convergence compared to F2D2.
- SCALLOP demonstrates significant speed improvements in inference time, being 10× faster than normalizing flows.
- Empirical results show consistent performance gains in both molecular and image domains.
Summary
This paper introduces SCAlable LikeLihood distillation of flOw maPs (SCALLOP), a novel approach to generative modeling that enhances the efficiency of likelihood estimation in flow-based models. Building on the F2D2 framework, SCALLOP replaces the traditional likelihood distillation loss with a more scalable and lower-variance objective that does not rely on Hutchinson's trace estimator. This innovation allows for faster convergence and improved performance in generating high-quality samples while significantly reducing training variance. The authors demonstrate SCALLOP's effectiveness in molecular science applications and on image datasets, achieving up to 10× speedup in inference time compared to existing methods, while maintaining competitive performance. The results indicate that SCALLOP not only streamlines the training process but also enhances the expressivity and efficiency of generative models, addressing a critical gap in the current literature on likelihood evaluation.
Methodology
SCALLOP builds on the F2D2 framework by reformulating the likelihood distillation loss to create a scalable, vectorized objective. This new approach eliminates the reliance on Hutchinson's estimator, allowing for efficient training and inference. The model distills both sampling and likelihood computation into a single neural network evaluation, enhancing expressivity and reducing computational overhead.
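For context, the sketch below shows the standard Hutchinson trace estimator that SCALLOP is said to avoid. It is not SCALLOP's objective. The estimator uses the identity tr(A) = E[vᵀAv] for random probe vectors v with independent ±1 entries; the small matrix `A` and all function names are illustrative assumptions.

```python
import random

def exact_trace(matrix):
    """Sum of the diagonal entries."""
    return sum(matrix[i][i] for i in range(len(matrix)))

def hutchinson_trace(matrix, num_probes, rng):
    """Monte Carlo trace estimate: the average of v^T A v over Rademacher probes v."""
    n = len(matrix)
    total = 0.0
    for _ in range(num_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        total += sum(v[i] * mv[i] for i in range(n))
    return total / num_probes

rng = random.Random(0)
A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, -1.0],
     [0.0, -1.0, 4.0]]

# A single probe is unbiased but noisy: off-diagonal entries add random sign terms.
one_probe = [hutchinson_trace(A, 1, rng) for _ in range(5)]
# Averaging many probes shrinks the variance at proportional extra cost.
many_probes = hutchinson_trace(A, 2000, rng)
```

The noise comes entirely from the off-diagonal terms, so a training loss built on few-probe estimates inherits that variance. This is the cost that a Hutchinson-free objective is described as removing.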
Results
SCALLOP significantly outperforms its predecessor, F2D2, with up to 100× reduction in training variance and faster convergence. In empirical tests, SCALLOP achieves up to 10× faster inference than state-of-the-art Boltzmann Generators while providing better performance metrics across molecular and image datasets.
Implications
The advancements presented in SCALLOP have the potential to revolutionize applications in molecular science and other fields requiring efficient likelihood estimation. By improving the speed and accuracy of generative models, SCALLOP can facilitate enhanced sampling methods and free energy estimations, which are critical in scientific research and industrial applications.
A Transferable Learned Temporal Prior for Transmission Reconstruction and Decision-Relevant Uncertainty in Real Outbreak Labels
Time Series
Graph Learning
Theory
- Introduces a transferable temporal prior for outbreak transmission reconstruction.
- Demonstrates significant performance improvement over traditional parametric baselines.
- Identifies a high level of uncertainty in epidemiological transmission labels.
- Shows that retaining uncertain transmission links alters source prioritization.
Summary
This paper addresses the challenge of outbreak transmission reconstruction, which often relies on deterministic epidemiological timing and transmission labels that have not been systematically evaluated. The author trained a logistic regression temporal prior using data from eleven disease families, locking all parameters before accessing any target outbreak data. This locked prior was then applied to a strict Andes virus (ANDV) parent-ranking benchmark consisting of 29 tasks. The results demonstrated that the locked prior achieved a mean reciprocal rank (MRR) of 0.571 and a Top-1 accuracy of 37.9%, significantly outperforming the best source-trained parametric baseline. Additionally, a phylogenetic audit of 75 mpox inter-host pairs revealed that a substantial proportion (54.67%) were genomically unresolved or unsupported, indicating that uncertainty in transmission labels is prevalent. The study further showed that retaining uncertain edges in transmission graphs can alter source prioritization for interventions. Overall, the findings highlight the importance of recognizing and quantifying uncertainty in outbreak data, suggesting that deterministic treatment of uncertain transmission labels can lead to misleading conclusions in outbreak reconstruction.
Methodology
The study employed logistic regression to train a temporal prior on a multi-disease benchmark, locking parameters before accessing target outbreak data. The performance was evaluated using a strict parent-ranking benchmark for ANDV, and a phylogenetic audit was conducted on mpox inter-host pairs to assess label reliability.
Results
The locked temporal prior achieved an MRR of 0.571 and Top-1 accuracy of 37.9%, outperforming the best source-trained parametric baseline. The phylogenetic audit revealed that 54.67% of inter-host pairs were genomically unresolved, and retaining uncertain edges in transmission graphs shifted source prioritization significantly.
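The two ranking metrics quoted above can be computed as in this short sketch. The example ranks are hypothetical and are not the paper's data. The final two lines are only an arithmetic check: the reported percentages are what 11 of 29 tasks and 41 of 75 pairs would give, although the summary does not state those counts.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where rank is the 1-based position of the true parent."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def top_k_accuracy(ranks, k=1):
    """Fraction of tasks whose true parent is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the true parent in three ranking tasks.
hypothetical_ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(hypothetical_ranks)   # (1 + 1/2 + 1/4) / 3
top1 = top_k_accuracy(hypothetical_ranks)        # 1 of 3 tasks

# Arithmetic check against the reported percentages (counts are inferred, not stated).
top1_from_counts = round(100 * 11 / 29, 1)
unresolved_from_counts = round(100 * 41 / 75, 2)
```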
Implications
The findings suggest that recognizing and quantifying uncertainty in epidemiological data is crucial for accurate outbreak reconstruction and intervention prioritization. This approach could enhance outbreak response strategies and improve public health decision-making.
Depth Exploration for LLM Decoding
NLP
Large Language Models
Efficient ML
- DEX replaces single-depth selection with parallel exploration of multiple candidate depths.
- The method preserves lossless equivalence to standard autoregressive decoding while reducing computational costs.
- Empirical results show DEX outperforms existing depth-adaptive methods and achieves competitive throughput.
- The concept of Earliest Available Depth (EAD) is introduced to quantify token readiness.
Summary
The paper introduces Depth Exploration Decoding (DEX), a novel lossless decoding algorithm for autoregressive large language models (LLMs) that addresses inefficiencies in existing depth-adaptive decoding methods. Traditional approaches select a single exit depth for token generation, which can lead to wasted computations or incorrect predictions. DEX improves upon this by exploring multiple candidate depths in parallel, validating them against the final-depth output, and committing only the final-depth token. This method allows for a more efficient use of computational resources by reducing the risk of premature exits and enabling the reuse of valid branches. The authors define the concept of Earliest Available Depth (EAD) to quantify token readiness along the depth axis, highlighting the limitations of previous selection-based strategies. Through empirical evaluations, DEX demonstrates superior performance compared to existing depth-selection baselines and achieves competitive throughput against speculative and distributed decoding methods, particularly as the granularity of explored depths increases.
Methodology
The DEX algorithm employs an expand–commit–collapse procedure. It expands candidate branches across multiple depth stages, commits only the token validated by the final-depth model, and collapses the exploration lattice to retain only reusable branches. This approach allows for efficient parallel depth exploration while maintaining the integrity of autoregressive decoding.
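One way to make the depth-readiness idea concrete is sketched below. The summary does not give a formal definition of Earliest Available Depth, so the working definition here (the shallowest candidate depth whose prediction already equals the final-depth token) is an assumption. The function names, depths, and tokens are illustrative as well.

```python
def earliest_available_depth(tokens_by_depth):
    """tokens_by_depth: (depth, token) pairs in increasing depth; the last pair is the final depth.

    Returns the shallowest depth whose token matches the final-depth token (assumed definition).
    """
    final_token = tokens_by_depth[-1][1]
    for depth, token in tokens_by_depth:
        if token == final_token:
            return depth

def classify_exit(tokens_by_depth, chosen_depth):
    """Label a single-depth exit as 'incorrect', 'exact', or 'wasted' relative to the final output."""
    final_token = tokens_by_depth[-1][1]
    if dict(tokens_by_depth)[chosen_depth] != final_token:
        return "incorrect"  # exited before the token matched the final-depth output
    ead = earliest_available_depth(tokens_by_depth)
    return "exact" if chosen_depth == ead else "wasted"  # 'wasted' means extra layers were run

# Hypothetical per-depth predictions for one token position.
predictions = [(8, "the"), (16, "cat"), (24, "cat"), (32, "cat")]
ead = earliest_available_depth(predictions)
outcomes = {d: classify_exit(predictions, d) for d, _ in predictions}
```

A selector that commits to one depth must guess this value per token. Exploring several depths in parallel and validating against the final-depth token removes the guess, which is the contrast the summary draws.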
Results
DEX outperformed representative depth-selection baselines and achieved competitive end-to-end throughput against speculative and distributed decoding methods. The performance improved as the explored depths became finer, indicating a scalable advantage in exploiting the depth axis of LLM decoding.
Implications
The findings suggest that DEX could significantly enhance the efficiency of LLM decoding in practical applications, reducing latency and computational costs. This could lead to faster response times in applications such as conversational agents, content generation, and other NLP tasks that rely on LLMs.
Interface-Aware Neural Newton Preconditioning for Robust Cohesive Zone Model Simulations
Optimization
Theory
- Introduction of IA-NNP to enhance CZM simulation robustness.
- Preservation of original traction-separation laws while improving convergence.
- Development of two solver-level implementations for effective preconditioning.
- Demonstrated improved performance in numerical benchmarks over traditional methods.
Summary
This paper introduces the Interface-Aware Neural Newton Preconditioner (IA-NNP) aimed at improving the robustness of Cohesive Zone Model (CZM) simulations, which are critical for modeling interface fractures in aerospace composite structures. Traditional implicit quasi-static finite element analyses face challenges such as negative interface tangents and solution jumps during cohesive softening, leading to convergence issues. Existing solutions often modify the effective constitutive response or increase computational costs. The IA-NNP reformulates manual Newton-Raphson (NR) modifications into a learned, state-dependent correction that preserves the original traction-separation law and enhances convergence without altering the underlying cohesive law. The authors developed two solver-level implementations: IA-NNP-Init for providing learned initial guesses and IA-NNP-NL for applying nonlinear preconditioning during iterations. The method utilizes interface graph features to encode relevant variables and is designed to be bounded and confidence-gated. Numerical experiments demonstrate that IA-NNP significantly improves convergence in challenging increments, enhances branch recovery, and reduces failure rates compared to standard NR methods, while maintaining the accuracy of the force-displacement response.
Methodology
The methodology involves the development of the IA-NNP, which reformulates manual NR modifications into a learned correction mechanism. The approach leverages interface graph features to inform the preconditioning process, ensuring that the original cohesive law remains intact. Two implementations were created: IA-NNP-Init for initial guess lifting and IA-NNP-NL for iteration-level preconditioning.
Results
Numerical studies showed that IA-NNP significantly improved convergence rates for difficult increments, enhanced recovery from branch points, and reduced the frequency of failures compared to standard NR and manual NR modifications, all while preserving the accuracy of the force-displacement response.
Implications
The findings suggest that learned interface-local preconditioning could serve as a valuable approach for enhancing the robustness of CZM simulations in various engineering applications, particularly in aerospace structures where interface failures are critical.
When to Truncate a Feature Ranking: A Residual-Overlap Stopping Rule for Subset Selection
Theory
Efficient ML
Interpretability
- Introduces a calibrated residual-overlap stopping rule for feature selection.
- Utilizes the Bhattacharyya coefficient to measure class-conditional marginal separation.
- Provides a single class-independent subset of features based on a statistical interpretation.
- Demonstrates effectiveness on high-dimensional genomic datasets, achieving significant dimensionality reduction.
Summary
This paper addresses the challenge of determining when to stop selecting features from a ranked list in supervised feature selection. While feature rankings are widely used due to their simplicity and interpretability, the process of truncating these rankings often lacks a clear statistical justification. The author proposes a distributional framework that transforms feature rankings into class-independent subsets using a calibrated stopping rule based on residual overlaps. The method measures marginal separation using the Bhattacharyya coefficient and accumulates overlaps under a product marginal model. The stopping rule retains the shortest prefix of the ranking where the residual product overlap falls below a specified threshold for all relevant class contrasts. This approach provides a statistically interpretable method for feature selection, allowing for significant dimensionality reduction while maintaining predictive performance. Empirical results on high-dimensional genomic datasets demonstrate that the proposed method can effectively reduce thousands of variables to a few dozen without sacrificing accuracy compared to using all features.
Methodology
The proposed methodology involves calculating the Bhattacharyya coefficient for each variable and class pair to assess marginal separation. The residual product overlap is accumulated along the ranked list of features, and selection stops at the shortest prefix where this overlap falls below a predefined threshold for all class contrasts. The method is designed to be computationally efficient, requiring only one-dimensional marginal overlap estimates.
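The stopping rule can be sketched under simplifying assumptions. Here each feature's class-conditional marginal is taken to be a univariate Gaussian, for which the Bhattacharyya coefficient has a closed form. Under a product (independent-marginals) model the overlap of the joint distributions is the product of the per-feature coefficients. The Gaussian assumption, the threshold value, and the function names are illustrative and are not taken from the paper.

```python
import math

def gaussian_bc(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya coefficient between two univariate Gaussians (closed form)."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    distance = (0.25 * (mu1 - mu2) ** 2 / var_sum
                + 0.5 * math.log(var_sum / (2.0 * sigma1 * sigma2)))
    return math.exp(-distance)

def shortest_prefix(ranked_bcs, threshold):
    """ranked_bcs[i] lists the coefficients of the i-th ranked feature, one per class contrast.

    Returns the smallest prefix length at which the accumulated product overlap is
    below the threshold for every contrast, together with the residual overlaps.
    """
    residual = [1.0] * len(ranked_bcs[0])
    for k, bcs in enumerate(ranked_bcs, start=1):
        residual = [r * bc for r, bc in zip(residual, bcs)]
        if max(residual) < threshold:
            return k, residual
    return len(ranked_bcs), residual

# Two classes (one contrast), unit variances, hypothetical mean gaps in ranked order.
gaps = [3.0, 2.0, 1.0, 0.1]
ranked_bcs = [[gaussian_bc(0.0, 1.0, gap, 1.0)] for gap in gaps]
k, residual = shortest_prefix(ranked_bcs, threshold=0.2)
```

With these numbers the residual overlap drops below 0.2 after the second feature, so the two weakest features are never added. Only one-dimensional overlap estimates are needed, matching the efficiency point made above.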
Results
The empirical evaluation on high-dimensional genomic datasets showed that the proposed stopping rule could reduce the number of features from tens of thousands to a few dozen while achieving predictive performance that is statistically comparable to using all features. This highlights the method's effectiveness in high-dimensional settings.
Implications
The proposed feature selection method has significant implications for high-dimensional classification tasks, particularly in fields like bioinformatics and medical diagnosis, where interpretability and computational efficiency are crucial. It provides a systematic approach to feature selection that can enhance model performance and reduce complexity.
Why Do Few-Step Text Latents Fail When Image Latents Work? Non-Commitment at Sharp Categorical Readouts
NLP
Generative Models
Theory
- Deterministic few-step generation fails for text latents due to geometric issues related to sharp categorical readouts.
- DABI and CCI diagnostics reveal significant differences in performance between text and image decoders.
- Two mechanisms, categorical commitment and stochastic re-injection, allow some systems to escape deterministic transport limitations.
- The paper establishes a non-commitment theorem and sharp transport laws that inform the accuracy-depth-stiffness tradeoff.
Summary
This paper investigates the failure of deterministic few-step generation in continuous text latents compared to its success in continuous image latents. The author argues that the issue is geometric rather than due to training or scaling deficiencies. Specifically, the paper demonstrates that a smooth deterministic map cannot effectively resolve discrete branch choices before a sharp categorical readout, leading to incoherent text generation. The author introduces two diagnostics, DABI (Decoder Amplification of Boundary-aligned Inputs) and CCI (Categorical Commitment Index), to analyze the performance of text and image decoders. The findings reveal that text decoders amplify perturbations near decision boundaries significantly more than image decoders, indicating a sharpness issue in text readouts. The paper also discusses two mechanisms that allow some systems to escape the limitations of deterministic transport: categorical commitment and stochastic re-injection. The author provides theoretical results, including a non-commitment theorem and sharp transport laws, which highlight the trade-offs between accuracy, depth, and stiffness in the generation process. Overall, the paper emphasizes the structural challenges in text generation and proposes a framework for understanding the differences between image and text latent generation.
Methodology
The author employs theoretical analysis to derive results regarding the performance of deterministic few-step generation in text and image latents. Key concepts include the use of DABI and CCI for diagnostics, as well as the formulation of the non-commitment theorem and sharp transport laws. Empirical evaluations on published checkpoints are conducted to validate the theoretical predictions.
Results
The paper presents several key results, including the non-commitment theorem that quantifies the flip rate of tokens in text generation, demonstrating that failure is driven by decoder sharpness rather than transport accuracy. The DABI metric shows that text decoders amplify boundary-aligned perturbations significantly more than image decoders, with values exceeding 100,000 for text compared to approximately 1 for images. Additionally, the paper confirms that stochastic re-injection and categorical commitment can mitigate the issues faced by deterministic transport.
Implications
The findings suggest that improvements in text generation models may require addressing the structural challenges identified in the paper. Understanding the geometric properties of decoders can inform the design of more effective text generation systems, potentially leading to better coherence and quality in generated text. The insights into the accuracy-depth-stiffness tradeoff may also guide future research in generative modeling.
Blackknife: Hard-Label Query-Limited Black-Box Attacks on Heterogeneous Graph Neural Networks
Graph Learning
- Blackknife operates under strict black-box conditions without access to model internals or complete graph structures.
- The framework constructs a local surrogate model to generate effective perturbations for attacks.
- Blackknife demonstrates high attack success rates across multiple benchmark datasets.
- The method remains effective against topology-based defenses, indicating significant vulnerabilities in HGNNs.
Summary
The paper introduces Blackknife, a novel framework designed for executing hard-label, query-limited black-box attacks on heterogeneous graph neural networks (HGNNs). Unlike previous methods that often require access to model gradients or complete graph structures, Blackknife operates under strict constraints, utilizing only locally observable one-hop heterogeneous structures and a limited number of hard-label queries. The approach begins by constructing a local relation-aware surrogate model based on observable neighborhoods. It then transforms discrete edge modifications into continuous soft weights, which are optimized using projected gradient descent. The final perturbations are discretized into relation-preserving structural rewiring operations, verified through limited hard-label feedback from the victim model. Extensive experiments on benchmark datasets (ACM, DBLP, IMDB) demonstrate that Blackknife achieves high attack success rates against various HGNN models, highlighting the vulnerability of HGNNs to local structure-limited black-box attacks, even in the presence of topology-based defenses.
Methodology
Blackknife constructs a local relation-aware surrogate model from observable neighborhoods of the target node. It optimizes perturbations by relaxing discrete edge modifications into continuous soft weights, which are then refined through projected gradient descent. The final perturbations are discretized into structural rewiring operations based on limited hard-label feedback from the victim model.
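The relax, optimize, and discretize pattern described above can be illustrated with a toy example. This is not the Blackknife implementation: the linear surrogate margin, its coefficients, the box projection onto [0, 1], and the top-`budget` discretization are all assumptions, and the hard-label verification step is omitted.

```python
def project_box(weights):
    """Project each soft edge weight back onto the interval [0, 1]."""
    return [min(1.0, max(0.0, w)) for w in weights]

def pgd_soft_edges(gradient_fn, num_candidates, steps=50, lr=0.1):
    """Projected gradient descent on continuous weights, one per candidate edge change."""
    weights = [0.0] * num_candidates
    for _ in range(steps):
        grad = gradient_fn(weights)
        weights = project_box([w - lr * g for w, g in zip(weights, grad)])
    return weights

def discretize(weights, budget):
    """Keep the `budget` candidates with the largest positive soft weight."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(i for i in ranked[:budget] if weights[i] > 0.0)

# Toy surrogate: margin(w) = 1.0 + sum(c_i * w_i); a negative c_i lowers the margin.
COEFFS = [-0.8, 0.3, -0.1, -0.5]

def margin_gradient(weights):
    """Gradient of the linear toy margin (constant in this example)."""
    return list(COEFFS)

soft = pgd_soft_edges(margin_gradient, len(COEFFS))
chosen = discretize(soft, budget=2)
```

The candidate that would raise the margin stays at weight 0 after projection, and the two strongest candidates saturate at 1 and survive discretization.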
Results
The experiments conducted on ACM, DBLP, and IMDB datasets show that Blackknife consistently achieves strong attack success rates against various HGNN models. The results also reveal that the method is effective even when facing topology-based defenses, underscoring the inherent vulnerabilities of HGNNs.
Implications
The findings suggest that HGNNs are susceptible to adversarial attacks in practical scenarios, which could have serious implications for applications in sensitive domains such as finance and biology. The study highlights the necessity for developing more robust defenses against such attacks in real-world deployments.
Probing Memorization of Tabular In-Context Learning
Large Language Models
Theory
Interpretability
- Introduces ICLMEM, a framework for probing memorization in LTMs.
- Detects moderate memorization signals in LTMs across various tasks.
- Memorization is strongest in low-cardinality and binary tasks.
- Memorization signals largely disappear under realistic training conditions.
Summary
This paper investigates the memorization dynamics of Large Tabular Models (LTMs) that utilize in-context learning (ICL) for tabular tasks. While previous research has highlighted the unintentional memorization of training data in large language models (LLMs), the authors aim to systematically assess this phenomenon in LTMs. They introduce a novel probing framework called ICLMEM, which distinguishes between context-based predictions and parametric memorization. The framework employs zero-information multiple-choice contexts to eliminate valid contextual patterns, compelling the model to rely on its parametric memory. Through a controlled fine-tuning setup, the authors establish membership ground truth and address common pitfalls such as distribution shift and feature contamination. Their evaluation reveals moderate memorization signals in 8 out of 10 tasks, particularly for low-cardinality and binary tasks, although these signals diminish under realistic training conditions. The findings underscore the importance of understanding memorization in LTMs, especially in contexts involving sensitive data, and suggest that appropriate measures must be taken to protect such data.
Methodology
The authors developed the ICLMEM framework to probe LTMs by manipulating the context provided to the model. They conducted controlled fine-tuning to establish membership ground truth and evaluated the model's predictions under various conditions to assess memorization dynamics. The study accounted for common pitfalls in data handling and used metrics such as AUC and TPR to quantify memorization signals.
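The two metrics named above are standard in membership-style evaluations and can be computed as in this sketch. The scores are hypothetical, and the thresholding convention (predict "member" when the score is strictly above the threshold) is an assumption rather than the paper's stated procedure.

```python
def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count as 0.5)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.01):
    """True positive rate at the loosest threshold whose false positive rate is at most max_fpr."""
    sorted_non = sorted(nonmember_scores, reverse=True)
    allowed = int(max_fpr * len(sorted_non))  # number of false positives tolerated
    threshold = sorted_non[allowed]           # scores strictly above this count as "member"
    return sum(1 for m in member_scores if m > threshold) / len(member_scores)

# Hypothetical memorization scores for fine-tuning members and held-out non-members.
members = [0.9, 0.8, 0.4, 0.3]
nonmembers = [0.7, 0.5, 0.2, 0.1]
score_auc = auc(members, nonmembers)
score_tpr = tpr_at_fpr(members, nonmembers, max_fpr=0.01)
```

An AUC of 0.5 means members and non-members are indistinguishable, so values such as 0.67 indicate a moderate signal. The low-FPR true positive rate measures how many members can be flagged with very few false alarms.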
Results
The evaluation indicated that LTMs exhibited moderate memorization signals in 8 out of 10 tasks, with AUC values reaching up to 0.67 and true positive rates exceeding 0.1 at a 1% false positive rate. The strongest memorization signals were observed in tasks with low cardinality and binary outcomes, although these signals diminished significantly under more realistic training scenarios.
Implications
The findings have significant implications for the deployment of LTMs in sensitive applications, such as healthcare and finance, where data privacy is paramount. The study emphasizes the necessity for implementing protective measures against potential data leakage through model memorization, informing practices in differential privacy and data pre-processing.
Can Tabular In-Context Learners Generalize to Biomolecular Property Prediction?
Theory
Efficient ML
- Tabular in-context learners can effectively generalize to biomolecular property prediction tasks.
- The performance of these models is highly dependent on the choice of sequence or molecular representation.
- The study provides a systematic evaluation of tabular foundation models in scientific prediction settings.
- TabPFN3 and TabICL show competitive results in protein fitness regression and small-molecule classification.
Summary
This paper investigates the ability of tabular in-context learners, specifically TabPFN3 and TabICL, to generalize in the domain of biomolecular property prediction, which is critical for protein engineering and small-molecule design. The authors highlight the shift in focus from representation learning to developing data-efficient predictors that can operate effectively in few-shot scenarios. Despite the initial skepticism regarding the transferability of tabular models trained on synthetic data to biomolecular tasks, the study finds that these models perform competitively. The evaluation is conducted across two domains: protein fitness regression and small-molecule classification. For protein fitness, the authors utilize ESMC representations and benchmark the tabular models against supervised baselines on datasets like ProteinGym and a diverse esterase dataset. In small-molecule classification, they assess various learner-representation pairs, revealing that the choice of representation significantly influences performance. The findings suggest that while tabular foundation models can excel in biomolecular predictions, their effectiveness is closely tied to the representation used, emphasizing the importance of careful selection in predictive modeling.
Methodology
The authors benchmarked tabular foundation models (TabPFN3 and TabICL) against supervised baselines using fixed ESMC representations for protein fitness regression and various molecular descriptor views for small-molecule classification. They evaluated the models across multiple datasets, including ProteinGym, TDC ADMET, MoleculeNet, FS-Mol, and DrugOOD, focusing on few-shot learning and out-of-distribution generalization.
Results
The results indicate that tabular in-context learners are competitive in predicting protein fitness and small-molecule properties, with performance varying based on the representation used. In protein fitness regression, the tabular models performed well against established baselines. For small-molecule classification, no single model pairing dominated, highlighting the critical role of representation choice.
Implications
The findings suggest that tabular foundation models can be effectively utilized in biomolecular property prediction, which could enhance the efficiency of protein engineering and drug design processes. The emphasis on representation choice also points to the need for careful consideration in model selection for specific tasks.
Multistage Defer Trees for Hybrid Interpretability: If at First You Can't Succeed, Tree Again
Interpretability
- Introduction of Multistage Defer Trees (MDTs) for improved interpretability and accuracy.
- Iterative training algorithm that narrows the deferral region while enhancing model performance.
- Ability to compress MDTs into simpler representations, maintaining interpretability.
- Demonstrated improved accuracy-deferral-sparsity trade-offs compared to existing methods.
Summary
The paper introduces Multistage Defer Trees (MDTs), a novel approach to bridge the gap between model interpretability and predictive accuracy in machine learning. Traditional decision trees can perform well in noisy datasets, but complex ensemble models often outperform them at the cost of interpretability. MDTs aim to classify as many samples as possible using a limited number of sparse decision trees while deferring difficult cases to subsequent trees or a black box model. The authors propose an iterative training algorithm that focuses on subsets of data where simpler models struggle, allowing for the creation of specialized trees that generalize effectively. This method not only reduces the deferral rate but also maintains high accuracy, even when a black box model is necessary. The paper also discusses techniques for compressing MDTs into simpler representations, enhancing their interpretability without sacrificing performance. Overall, MDTs represent a significant advancement in achieving a balance between accuracy and interpretability in machine learning models.
Methodology
The authors developed an iterative optimization procedure for training MDTs, focusing on subsets of data where simpler models underperform. This approach allows the model to adaptively allocate complexity and reduce the deferral rate. Additionally, algorithms for compressing MDTs into sparse representations were introduced, ensuring interpretability even when deferring to more complex models.
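As a rough illustration of the deferral idea, the sketch below chains a few hand-written "sparse tree" stages and falls back to a black box only when every stage abstains. The stage rules, the `black_box` function, and the names `mdt_predict` and `deferral_rate` are hypothetical stand-ins; the paper's training and compression algorithms are not reproduced here.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A stage returns a class label, or None to defer to the next stage.
Stage = Callable[[Sequence[float]], Optional[int]]


def mdt_predict(x: Sequence[float], stages: List[Stage],
                black_box: Callable[[Sequence[float]], int]) -> Tuple[int, int]:
    """Return (label, stage_index); stage_index == len(stages) means the black box answered."""
    for i, stage in enumerate(stages):
        label = stage(x)
        if label is not None:
            return label, i
    return black_box(x), len(stages)


def deferral_rate(samples, stages, black_box) -> float:
    """Fraction of samples that no interpretable stage could classify."""
    deferred = sum(1 for x in samples
                   if mdt_predict(x, stages, black_box)[1] == len(stages))
    return deferred / len(samples)


# Hypothetical stages: each handles only the region where it is confident.
def stage_one(x):
    if x[0] < 0.2:
        return 0
    if x[0] > 0.8:
        return 1
    return None  # defer


def stage_two(x):
    if x[1] < 0.5:
        return 0
    return None  # defer


def black_box(x):
    # Stand-in for an opaque model.
    return int(x[0] + x[1] > 1.0)


stages = [stage_one, stage_two]
samples = [[0.1, 0.9], [0.5, 0.3], [0.5, 0.9]]
```

Adding a stage can only shrink the set of samples that reach the black box, which is the sense in which the deferral region narrows.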
Results
The MDTs achieved performance comparable to complex tree-based ensembles while significantly reducing the number of samples deferred to black box models. The iterative training process effectively narrowed the deferral region, resulting in a model that maintains high accuracy with minimal reliance on opaque models.
Implications
The development of MDTs has significant implications for fields requiring interpretable machine learning models, such as healthcare and finance, where understanding model decisions is crucial. The ability to balance accuracy and interpretability can enhance trust and usability in critical applications.
Randomized Exploration for Linear Bandits via Absolute Perturbations
Theory
Efficient ML
- Introduction of Absolute Thompson Sampling (ATS) to ensure optimism in expectation while maintaining computational efficiency.
- ATS achieves a regret bound of Õ(d^(3/2)√K), matching existing bounds for Thompson Sampling.
- Ensemble Absolute Thompson Sampling (EATS) converges to UCB behavior as ensemble size increases, providing a practical interpolation between randomized and deterministic approaches.
- The proposed methods simplify the regret analysis compared to traditional TS approaches, avoiding complex anti-concentration arguments.
Summary
This paper addresses the challenges of exploration in stochastic linear bandits, where a learner selects arms from a set and receives noisy linear rewards. The authors propose Absolute Thompson Sampling (ATS), a modification of Thompson Sampling (TS) that ensures optimism in expectation by using the absolute value of exploration noise. This approach retains the computational efficiency of TS while simplifying the regret analysis, allowing it to achieve a regret bound of Õ(d^(3/2)√K), which matches existing bounds for TS. Additionally, the authors introduce Ensemble Absolute Thompson Sampling (EATS), which aggregates multiple absolute perturbations to converge to the Upper Confidence Bound (UCB) objective as the ensemble size increases. EATS demonstrates strong empirical performance even with moderate ensemble sizes, bridging the gap between randomized exploration and deterministic optimism. The work highlights a new perspective on designing efficient randomized algorithms for bandit problems, suggesting that careful shaping of exploration terms can yield algorithms that are both computationally efficient and theoretically sound.
Methodology
The authors modify the standard Thompson Sampling algorithm by replacing the signed exploration noise with its absolute value, resulting in Absolute Thompson Sampling (ATS). They further extend this to Ensemble Absolute Thompson Sampling (EATS), which aggregates multiple absolute perturbations. Theoretical analysis is conducted to derive regret bounds, and empirical evaluations are performed to assess the performance of the proposed methods.
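One possible reading of the absolute-perturbation idea is sketched below: each arm is scored by its estimated reward plus the absolute value of a shared Gaussian perturbation projected onto that arm. The exact form of the perturbation in the paper is not given in this summary, so the scoring rule, the diagonal design matrix `v_diag`, the scale `beta`, and the function names are all assumptions made for illustration.

```python
import math
import random


def absolute_ts_scores(arms, theta_hat, v_diag, beta, rng):
    """Score each arm as <x, theta_hat> + beta * |<x, V^{-1/2} eta>|, eta ~ N(0, I).

    Assumes a diagonal design matrix V (given by v_diag), so that
    V^{-1/2} eta reduces to an elementwise division.
    """
    eta = [rng.gauss(0.0, 1.0) for _ in theta_hat]
    noise = [e / math.sqrt(v) for e, v in zip(eta, v_diag)]
    scores = []
    for x in arms:
        mean = sum(xi * ti for xi, ti in zip(x, theta_hat))
        # Taking the absolute value makes the perturbation a non-negative bonus.
        bonus = abs(sum(xi * ni for xi, ni in zip(x, noise)))
        scores.append(mean + beta * bonus)
    return scores


def select_arm(arms, theta_hat, v_diag, beta, rng):
    """Pick the arm with the highest perturbed score."""
    scores = absolute_ts_scores(arms, theta_hat, v_diag, beta, rng)
    return max(range(len(arms)), key=scores.__getitem__)


# Hypothetical problem instance.
arms = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
theta_hat = [0.4, 0.5]
v_diag = [4.0, 2.0]
beta = 1.0
```

Because the bonus is never negative, every perturbed score is at least the greedy estimate, which is the property that lets the analysis sidestep anti-concentration arguments.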
Results
ATS achieves a regret bound of Õ(d^(3/2)√K) with high probability, matching existing bounds for TS in linear bandits. EATS demonstrates that as the ensemble size increases, its performance converges to that of UCB, recovering UCB behavior in the limit. Empirical results show that even small ensemble sizes can lead to strong performance improvements.
Implications
The findings suggest new avenues for designing efficient randomized algorithms in bandit problems, potentially leading to better performance in practical applications such as online learning, recommendation systems, and adaptive experimentation.
Toward an Energy-Optimized Operation of Data Centers Located in Wind Farms Using Reinforcement Learning
Reinforcement Learning
Optimization
Efficient ML
- Introduces a fixed-day RL environment for HPC data centers in wind farms.
- Identifies and addresses a credit-assignment problem in pure RL applications.
- Evaluates optimization-based Imitation Learning and potential-based Reward Shaping as countermeasures.
- Demonstrates strong empirical performance improvements with RL techniques.
Summary
This paper explores the application of Reinforcement Learning (RL) as an online controller for workload shifting in high-performance computing (HPC) data centers integrated with wind farms. The authors introduce a reproducible simulation framework that incorporates synthetic wind and price signals, focusing on a minimal case of one wind turbine and one co-located data center. The study identifies a credit-assignment problem in pure RL, where agents underutilize available wind energy early in the day due to delayed feedback. To address this, the authors evaluate two strategies: optimization-based Imitation Learning (IL) and potential-based Reward Shaping (RS). The results demonstrate that both strategies improve the performance of RL agents, particularly when using Proximal Policy Optimization (PPO) and a variant of Soft Actor-Critic (SAC) with on-policy updates. Despite achieving strong empirical performance, a gap remains between RL and the optimizer, attributed to the latter's offline planning capabilities. The findings provide a methodological baseline for extending this approach to more complex scenarios involving multiple sites and continuous operation.
Methodology
The authors developed a fixed-day simulation framework to model the interaction between wind energy availability and data center workload management. They implemented RL algorithms, specifically Proximal Policy Optimization (PPO) and a variant of Soft Actor-Critic (SAC), and tested them against optimization-based Imitation Learning and potential-based Reward Shaping to mitigate credit-assignment issues. The evaluation was conducted over a 200-day test set with controlled synthetic signals.
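Potential-based reward shaping has a standard form: the shaped reward adds gamma * Phi(s') - Phi(s) to the environment reward. The sketch below applies it to a toy day in which the only environment reward arrives at the end. The potential function (fraction of jobs completed) and the trajectory are hypothetical, not the paper's.

```python
def shaped_reward(reward, phi_s, phi_next, gamma=1.0):
    """Potential-based shaping: r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * phi_next - phi_s


def potential(completed_jobs, total_jobs):
    """Hypothetical potential: fraction of the day's workload already finished."""
    return completed_jobs / total_jobs


# Toy trajectory: jobs completed after each step, with a delayed terminal reward.
total_jobs = 10
completed = [0, 3, 5, 10]   # state before step 0, then after steps 0, 1, 2
rewards = [0.0, 0.0, 1.0]   # environment reward arrives only at the end

shaped = [
    shaped_reward(r, potential(completed[t], total_jobs),
                  potential(completed[t + 1], total_jobs))
    for t, r in enumerate(rewards)
]
```

With gamma = 1 the shaping terms telescope, so the shaped return differs from the original only by Phi(final) - Phi(initial). The agent gets earlier feedback without a change in which policies are optimal.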
Results
The study found that both Imitation Learning and Reward Shaping significantly improved the performance of RL agents in managing workload shifts in the presence of curtailment-aware energy accounting. However, a performance gap remained when compared to the offline optimizer, highlighting the challenges of real-time decision-making in RL.
Implications
The findings suggest that RL can effectively manage workload shifting in HPC data centers integrated with renewable energy sources, potentially leading to more sustainable operations. The established framework can be extended to more complex scenarios, paving the way for future research in energy-efficient computing and smart grid integration.
ITSPACE: Monotone Gaussian Optimal Transport Updates
Optimization
Efficient ML
Theory
- ITSPACE optimizes the Bures-Wasserstein objective for covariance alignment using a proximal majorization-minimization approach.
- The method ensures that updates remain positive semidefinite and rank-constrained, suitable for real-time applications.
- ITSPACE outperforms existing methods in terms of speed and efficiency in achieving low BW-gap solutions.
- The paper provides theoretical guarantees for the method's performance, including bounds on deviations from exact descent.
Summary
The paper introduces ITSPACE (Iterative Transport for Stable Proximal Alignment of Covariance Embeddings), a novel proximal majorization-minimization method designed for optimizing the Bures-Wasserstein (BW) objective, which is crucial for covariance alignment in machine learning applications. The authors highlight the significance of covariance matrices as compact descriptors of feature distributions, particularly in scenarios such as domain adaptation and Gaussian embeddings. ITSPACE directly optimizes the BW objective through closed-form updates, ensuring that each iteration maintains the positive semidefinite (PSD) structure of the covariance matrices. The method is particularly efficient for low-rank covariance updates, making it suitable for real-time applications where computational resources are limited. The authors provide a theoretical foundation for the method, including a sufficient-decrease inequality for exact arithmetic and a certificate-gap bound for inexact computations. Empirical evaluations demonstrate that ITSPACE achieves low BW-gap solutions significantly faster than existing methods, including BW-gradient descent and other covariance geometry-based approaches, making it a promising tool for covariance alignment tasks.
Methodology
ITSPACE employs a proximal majorization-minimization framework to optimize the Bures-Wasserstein objective. It utilizes a low-rank factorization of covariance matrices and iteratively updates these factors through closed-form solutions derived from a quadratic upper bound of the BW objective. The method incorporates a linear certificate from polar/Procrustes problems to ensure efficient updates while maintaining the PSD structure.
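For reference, the standard Bures-Wasserstein distance between two PSD covariance matrices is written out below. The symbols \Sigma_1 and \Sigma_2 are generic notation rather than the paper's, and the summary does not say how the BW gap is normalized, so only the textbook definition is shown.

```latex
% Squared Bures-Wasserstein distance between PSD matrices \Sigma_1, \Sigma_2
% (equal to the squared 2-Wasserstein distance between centered Gaussians):
d_{\mathrm{BW}}^{2}(\Sigma_1, \Sigma_2)
  = \operatorname{tr}\Sigma_1 + \operatorname{tr}\Sigma_2
  - 2\,\operatorname{tr}\!\left(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\right)^{1/2}

% Special case: when \Sigma_1 and \Sigma_2 commute, this reduces to
d_{\mathrm{BW}}^{2}(\Sigma_1, \Sigma_2)
  = \left\lVert \Sigma_1^{1/2} - \Sigma_2^{1/2} \right\rVert_F^{2}
```

The matrix square root inside the trace is the expensive part of a direct evaluation.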
Results
The empirical results indicate that ITSPACE achieves significantly lower BW-gap solutions compared to traditional BW-gradient descent and other covariance alignment methods, demonstrating faster convergence and efficiency in real-world benchmarks.
Implications
The proposed method has potential applications in various machine learning domains where covariance alignment is crucial, such as domain adaptation, generative modeling, and probabilistic representation learning. Its efficiency makes it suitable for real-time systems with limited computational resources.
Hierarchical Global Attention (HGA)
NLP
Large Language Models
Efficient ML
- HGA is a drop-in replacement for dense causal attention, preserving original model parameters.
- It enables long-context transformers to operate efficiently at 64K tokens without retraining.
- The hierarchical routing mechanism reduces memory consumption and maintains performance.
- HGA achieves a minimal loss gap compared to dense attention while using only 3% sparsity.
Summary
The paper introduces Hierarchical Global Attention (HGA), a novel approach designed as a drop-in replacement for dense causal attention in pretrained long-context transformers. HGA maintains the original checkpoint parameters, avoiding the need for calibration or retraining. It is particularly effective for models like Qwen3-30B-A3B-Instruct-2507-FP8, enabling them to operate at a 64K-token context on a single RTX 5090 GPU, where traditional token-level key/value (K/V) storage is impractical. HGA employs a hierarchical two-level routing mechanism that first retrieves relevant chunks using compact RoPE-aware summaries, followed by a refinement step that routes to the most relevant groups before executing exact token-level attention. This method significantly reduces the number of tokens fetched while ensuring precise attention over the selected tokens, making it feasible to utilize RAM and NVMe storage effectively. The results indicate that HGA achieves a minimal loss in performance compared to dense attention, with a sparsity of only about 3%. The paper emphasizes that the quality gap is primarily influenced by long-context positional encoding rather than the routing algorithm itself, showcasing HGA's potential as a systems-level solution for long-context pretrained models without altering their learned representations.
Methodology
HGA utilizes a two-level hierarchical routing strategy, where it first retrieves relevant chunks of tokens using compact summaries and then refines the selection by routing to the most pertinent groups. This method allows for exact token-level attention to be computed only on the selected tokens, significantly reducing the memory footprint and computational overhead associated with traditional dense attention mechanisms.
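The two-level routing can be sketched in a few lines, under simplifying assumptions: chunk and group summaries are plain means of the key vectors (the paper uses RoPE-aware summaries), relevance is a dot product, and the sequence length divides evenly into chunks and groups. All function names and parameters are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vec(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def top_k(scores, k):
    """Indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def route(query, keys, chunk_size, group_size, top_chunks, top_groups):
    """Level 1: pick chunks by summary score. Level 2: pick groups inside them."""
    n_chunks = len(keys) // chunk_size
    chunk_scores = [
        dot(query, mean_vec(keys[c * chunk_size:(c + 1) * chunk_size]))
        for c in range(n_chunks)
    ]
    group_starts = [
        start
        for c in top_k(chunk_scores, top_chunks)
        for start in range(c * chunk_size, (c + 1) * chunk_size, group_size)
    ]
    group_scores = [dot(query, mean_vec(keys[s:s + group_size])) for s in group_starts]
    selected = []
    for g in top_k(group_scores, top_groups):
        selected.extend(range(group_starts[g], group_starts[g] + group_size))
    return sorted(selected)


def sparse_attention(query, keys, values, selected):
    """Exact softmax attention restricted to the selected token indices."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * dot(query, keys[i]) for i in selected]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * values[i][d] for w, i in zip(weights, selected))
           for d in range(len(values[0]))]
    return weights, out


# Toy cache of 16 tokens; tokens 8..11 align with the query direction.
keys = [[5.0, 0.0] if 8 <= i < 12 else [0.0, 1.0] for i in range(16)]
values = [[float(i)] for i in range(16)]
query = [1.0, 0.0]
selected = route(query, keys, chunk_size=8, group_size=4, top_chunks=1, top_groups=1)
weights, out = sparse_attention(query, keys, values, selected)
```

Only the summaries are touched during routing. Full keys and values are read for the selected tokens alone, so the rest could stay in slower storage.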
Results
The implementation of HGA on the Qwen3-30B-A3B-Instruct-2507-FP8 model demonstrated a performance loss of only 0.018 nats compared to dense attention at 8K tokens. Additionally, it achieved a 2.72× speedup in training steps at 12K tokens. The model successfully maintained a 100% retrieval accuracy in a needle-in-a-haystack evaluation at 64K tokens without any fine-tuning.
Implications
HGA presents a significant advancement for long-context transformers, enabling their deployment in scenarios where GPU memory is a limiting factor. This approach can facilitate the use of large language models in applications requiring extensive context, such as document summarization, conversational agents, and other NLP tasks that benefit from long-range dependencies.
Geometry-Preserving Orthonormal Initialization for Low-Rank Adaptation in RLVR
Reinforcement Learning
NLP
Large Language Models
- Introduces geometry-preserving orthonormal initialization for LoRA in RLVR.
- Demonstrates that orthonormal initialization minimizes performance gaps compared to full fine-tuning.
- Presents two new LoRA variants, LoRA-RLPO and LoRA-RLMO, which outperform standard LoRA.
- Provides theoretical insights into the instability of existing LoRA variants in RLVR.
Summary
This paper investigates the initialization strategies for Low-Rank Adaptation (LoRA) in the context of Reinforcement Learning with Verifiable Rewards (RLVR). While LoRA and its variants have shown promise in supervised fine-tuning (SFT), their performance under RLVR is less understood, with some variants like PiSSA and MiLoRA exhibiting instability and underperformance. The authors conduct a theoretical analysis revealing that orthonormal initialization minimizes the performance gap between LoRA and full fine-tuning. They propose two new variants, LoRA-RLPO and LoRA-RLMO, which utilize geometry-preserving orthonormal initialization. Experimental results demonstrate that these new methods stabilize training in RLVR and outperform standard LoRA, contrasting with the performance of PiSSA and MiLoRA. The findings also provide insights into the reasons behind the underperformance of existing methods in RLVR, which may have broader implications for low-rank adaptation techniques.
Methodology
The authors perform a theoretical analysis of LoRA in RLVR, focusing on the initialization of low-rank matrices. They propose orthonormal initialization strategies and evaluate their effectiveness through experiments on mathematical reasoning benchmarks, comparing the performance of their new variants against standard LoRA and existing methods.
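To make "orthonormal initialization" concrete, the sketch below builds a rank-r LoRA down-projection whose rows are orthonormal, using Gram-Schmidt on Gaussian vectors. This is a generic construction, not the specific LoRA-RLPO or LoRA-RLMO recipe, which the summary does not describe. It also assumes the other factor starts at zero, as in standard LoRA.

```python
import math
import random


def orthonormal_rows(rank, dim, seed=0):
    """Return `rank` orthonormal vectors of length `dim` (requires rank <= dim).

    Uses modified Gram-Schmidt on i.i.d. Gaussian draws.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(rank):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Remove the components along the rows already accepted.
        for q in rows:
            proj = sum(a * b for a, b in zip(v, q))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def gram(rows):
    """Matrix of pairwise inner products; the identity for orthonormal rows."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]


rank, dim = 4, 16
A = orthonormal_rows(rank, dim)   # LoRA down-projection, shape (rank, dim)
B = [[0.0] * rank for _ in range(dim)]  # zero up-projection (assumption), so B @ A = 0 at start
```

Orthonormal rows mean that A preserves the norm of any input lying in its row space, which is one sense in which such an initialization is geometry-preserving.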
Results
The proposed LoRA-RLPO and LoRA-RLMO variants show improved stability and performance in RLVR tasks compared to standard LoRA and its variants PiSSA and MiLoRA. The experiments indicate that orthonormal initialization leads to better training outcomes, addressing the issues of instability observed in previous methods.
Implications
The findings suggest that proper initialization strategies are crucial for the effective application of low-rank adaptation in reinforcement learning contexts. This work may influence future research on parameter-efficient fine-tuning methods and their applications in various domains, including natural language processing and decision-making tasks.
Calibrating the Evaluator: Does Probability Calibration Mitigate Preference Coupling in LLM Agent Feedback Loops?
Large Language Models
Reinforcement Learning
Theory
- Presents the first study of evaluator calibration as a method to mitigate preference coupling in LLM feedback loops.
- Demonstrates that confidence-calibrated TTRL reduces coupling coefficients and divergence metrics significantly.
- Confirms that the observed effects are not due to output format changes through a symmetric-LR control.
- Releases a calibrated TTRL protocol as a lightweight solution for LLM deployment pipelines.
Summary
This paper investigates the phenomenon of evaluator preference coupling in large language model (LLM) agents, where biases from evaluators propagate through feedback loops, affecting the agents' learned strategies. The study introduces a novel approach of applying probability calibration to evaluator feedback to mitigate this issue. The authors conduct a controlled experiment comparing standard binary test-time reinforcement learning (TTRL) with a confidence-calibrated version, using DeepSeek-V4-Pro as the executor and GLM5.2 as the evaluator. The results demonstrate that probability calibration significantly reduces the coupling coefficient (γ) by 20-49% and the Jensen-Shannon divergence (JSD) by 45-67%. This indicates that calibrated feedback can effectively diminish the adverse effects of evaluator biases on agent behavior. The paper also provides a calibrated TTRL protocol that can be easily integrated into existing LLM systems without requiring changes to the executor models, thus offering a practical solution for improving LLM-as-judge deployments.
Methodology
The study employs a within-subjects experimental design, comparing standard binary TTRL with a calibrated version that uses probability estimates from evaluators instead of binary judgments. The updates to the agent's strategy weights are adjusted based on the confidence scores provided by the evaluator, allowing for more nuanced feedback.
Results
The application of probability calibration resulted in a reduction of the coupling coefficient γ by 20-49% and a decrease in Jensen-Shannon divergence by 45-67%. These findings indicate that calibrated feedback effectively reduces the influence of evaluator biases on agent strategies.
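The Jensen-Shannon divergence reported above has a standard definition, sketched here with base-2 logarithms so the value lies in [0, 1]. The two example strategy distributions are made up for illustration. The summary does not say which distributions the paper compares or which log base it uses.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the midpoint mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical strategy-weight distributions under two feedback schemes.
binary_feedback = [0.70, 0.20, 0.10]
calibrated_feedback = [0.45, 0.35, 0.20]
reference = [0.40, 0.35, 0.25]
```

Unlike KL divergence, JSD is symmetric and stays finite when one distribution assigns zero mass to an outcome, which makes it convenient for comparing learned strategy mixtures.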
Implications
The findings suggest that implementing probability calibration in LLM systems can enhance the reliability of evaluator feedback, potentially leading to more robust agent behavior and reducing the risk of preference collapse in multi-agent systems. This has significant implications for the deployment of LLMs in various applications where evaluative feedback is critical.
Policy Optimization Achieves Data-Dependent Regret Bounds in MDPs with Unknown Transitions
Reinforcement Learning
Optimization
Theory
- Developed a policy optimization algorithm for episodic tabular MDPs with unknown transitions.
- Achieved data-dependent regret bounds, including first-order, second-order, and path-length complexities.
- Introduced a transition-dependent complexity term that captures the cost of estimating the transition kernel.
- Demonstrated gap-dependent polylog(T) regret in stochastic regimes.
Summary
This paper addresses the challenge of policy optimization in online episodic tabular Markov decision processes (MDPs) with unknown transition kernels. The authors develop a novel algorithm based on optimistic follow-the-regularized-leader that achieves data-dependent regret bounds, which adapt to the complexity of the loss sequence. Previous works have established data-dependent guarantees only under known transitions, leaving a gap in understanding whether such guarantees could be achieved when transitions are unknown. The proposed algorithm incorporates a new design of optimistic Q-function estimators and a data-dependent transition bonus to manage estimator bias. The analysis reveals a transition-dependent complexity term that reflects the intrinsic cost of estimating the transition kernel. The results demonstrate that the algorithm can achieve first-order, second-order, and path-length bounds while also ensuring gap-dependent polylog(T) regret in stochastic settings. This work is significant as it is the first to provide data-dependent guarantees for policy optimization under unknown transitions, achieving best-of-both-worlds performance.
Methodology
The authors propose an algorithm based on optimistic follow-the-regularized-leader, utilizing optimistic Q-function estimators and a data-dependent transition bonus. The algorithm is designed to adapt to both adversarial and stochastic loss sequences, allowing it to minimize regret effectively while estimating unknown transition kernels.
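For intuition about the base method, here is optimistic follow-the-regularized-leader in its simplest setting: a probability simplex over a few actions with a negative-entropy regularizer, where the update has a closed form. This is the textbook single-state version, not the paper's MDP algorithm. The Q-function estimators and transition bonus are not modeled, and the names and numbers are illustrative.

```python
import math


def optimistic_ftrl_weights(cum_losses, hint, eta):
    """Minimize <p, L + m> + (1/eta) * sum_i p_i log p_i over the simplex.

    cum_losses: cumulative observed losses L per action.
    hint: optimistic guess m of the next loss vector.
    The minimizer is p_i proportional to exp(-eta * (L_i + m_i)).
    """
    scores = [-eta * (L + m) for L, m in zip(cum_losses, hint)]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three actions with equal history; the hint predicts action 0 will be cheap next round.
cum_losses = [2.0, 2.0, 2.0]
no_hint = optimistic_ftrl_weights(cum_losses, [0.0, 0.0, 0.0], eta=0.5)
with_hint = optimistic_ftrl_weights(cum_losses, [0.0, 1.0, 1.0], eta=0.5)
```

When the hint is close to the loss that actually arrives, the regret scales with the prediction error rather than the raw loss magnitude, which is how data-dependent bounds of this type typically arise.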
Results
The algorithm achieves first-order, second-order, and path-length bounds with a transition-dependent complexity term. In the adversarial regime, it adapts to the complexity of the loss sequence, while in the stochastic regime, it achieves gap-dependent polylog(T) regret. This represents a significant advancement in policy optimization under unknown transitions.
Implications
The findings have implications for reinforcement learning applications where transition dynamics are not fully known, such as robotics and game playing. The ability to achieve data-dependent bounds can lead to more efficient learning and better performance in practical scenarios.
Predictable GRPO: A Closed-Form Model of Training Dynamics
Reinforcement Learning
Large Language Models
Theory
- Introduces a closed-form model for GRPO training dynamics, enhancing mechanistic understanding.
- Reinterprets empirical saturation laws through a stochastically-forced damped oscillator framework.
- Provides measurable predictions and diagnostics for training dynamics, distinguishing failure modes.
- Empirical validation shows strong correlation with training reward trajectories across different models.
Summary
This paper presents a novel closed-form model for the training dynamics of Group Relative Policy Optimization (GRPO), a method widely used to enhance the reasoning capabilities of large language models (LLMs). The authors identify that existing descriptions of GRPO training dynamics are largely empirical, relying on low-parameter functional forms that lack mechanistic insight. To address this, they develop a reduced-order model based on a mean-field assumption that simplifies the GRPO update to a stochastically-forced damped oscillator. This model allows for the derivation of fixed coefficients that are determined by optimizer hyperparameters and a measured curvature scale. The authors demonstrate that their model can explain the empirical single-exponential saturation law as an overdamped limit, providing a mechanistic interpretation of previously fitted constants. Additionally, the model yields predictions that are tied to measurable quantities, such as group-size invariance and stability thresholds, and offers diagnostics to differentiate various failure modes in training dynamics. Empirical validation across multiple models shows a high correlation (R² ≥ 0.91) with training reward trajectories, confirming the model's effectiveness and predictive power.
Methodology
The authors employ a first-principles approach to derive a reduced-order model of GRPO training dynamics. They utilize a mean-field assumption to simplify the GRPO update into a stochastically-forced damped oscillator, allowing for the calculation of coefficients that are fixed in closed form based on hyperparameters and curvature scales. The model is validated through empirical testing across multiple models and group sizes.
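The link between a damped oscillator and single-exponential saturation can be seen from the generic textbook equation below. The symbols m, c, k, x^* and \xi are placeholders and not the paper's notation. In the paper the coefficients are fixed by optimizer hyperparameters and a measured curvature scale, which this sketch does not attempt to reproduce.

```latex
% Generic stochastically forced damped oscillator for a scalar x(t)
% (e.g. mean reward), with equilibrium x^* and noise \xi(t):
m\,\ddot{x} + c\,\dot{x} + k\,(x - x^*) = \xi(t)

% The noise-free motion is overdamped when c^2 > 4mk and oscillatory when c^2 < 4mk.

% Strongly overdamped limit (c^2 \gg 4mk): drop the inertial term and average out the noise,
c\,\dot{x} \approx -k\,(x - x^*)
\quad\Longrightarrow\quad
x(t) \approx x^* - (x^* - x_0)\,e^{-(k/c)\,t}
```

The last expression is a single-exponential saturation curve with rate k/c, so the fitted plateau and rate of an empirical saturation law correspond to x^* and k/c in this picture.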
Results
The closed-form model fits training reward trajectories with an R² value of at least 0.91 across three models and two group sizes. The predictions regarding group-size invariance and stability thresholds are confirmed through empirical testing. The model also successfully reproduces the overdamped-to-oscillatory transition in a controlled setting, validating its theoretical predictions.
Implications
This work has significant implications for the optimization of training dynamics in reinforcement learning, particularly for large language models. By providing a mechanistic understanding of GRPO, it can lead to more efficient training strategies and better performance in reasoning tasks. The diagnostics developed can help practitioners identify and address specific failure modes in their training processes.
Reliability, Faithfulness, and the Limits of Post-hoc Explanations of Opaque Scientific Models
Theory
Interpretability
- Post-hoc explanation methods do not guarantee insights into the structure of phenomena.
- Reliability and faithfulness are necessary but insufficient for justified claims about the world.
- The paper distinguishes between descriptive and justificatory assessments of models and explanations.
- The authors argue that the composition of reliability and faithfulness does not lead to valid claims about the underlying structure of phenomena.
Summary
This paper critiques the reliance on post-hoc explanation methods in scientific machine learning, arguing that while models may be deemed reliable and explanations faithful, this does not guarantee that the model accurately represents the underlying structure of the phenomenon it aims to describe. The authors differentiate between two types of assessments: the descriptive nature of explanations (faithfulness) and the justificatory nature of model predictions (reliability). They contend that combining these two assessments does not yield a justified claim about the phenomenon's structure. The paper emphasizes that the inference from a reliable model and a faithful explanation to a claim about the world is structurally flawed, regardless of how reliable or faithful the components are. This clarification is crucial for the scientific community, which often seeks to derive insights about natural phenomena from machine learning models without adequately addressing the limitations of such inferences.
Methodology
The authors utilize a formalism familiar to the machine learning community to analyze the relationships between the true function of a phenomenon, the predictive model, and the post-hoc explanations generated by various XAI methods. They critically assess the epistemic implications of these relationships.
Results
The paper concludes that even with perfect reliability and faithfulness, the composition of these assessments fails to support claims about the structure of the phenomenon. This structural failure holds true regardless of the quality of the model or explanation.
Implications
The findings suggest that scientists should be cautious when interpreting machine learning models as sources of insight into natural phenomena. The limitations of post-hoc explanations must be acknowledged to avoid overreliance on these methods for scientific understanding.
Mind the Residual Gap: Probabilistic Downscaling under Real-World Bias
Generative Models
Theory
Optimization
- Identifies residual target misspecification as a fundamental cause of under-dispersion in probabilistic downscaling.
- Introduces ReMatch, a method that aligns training and test-time residual distributions using optimal transport.
- Demonstrates that ReMatch outperforms traditional mean-residual models and state-of-the-art super-resolution techniques.
- Provides empirical evidence through controlled synthetic benchmarks and real-world applications.
Summary
This paper addresses the challenge of probabilistic downscaling, which involves modeling the conditional distribution of high-resolution fields based on coarse inputs, a critical task in atmospheric science and climate modeling. The authors identify a significant issue with the widely used mean-residual approach, which often leads to biased and under-dispersive ensembles in real-world applications. They argue that this under-dispersion is not merely a result of generic predictive uncertainty but stems from a fundamental problem termed 'residual target misspecification.' This occurs when the distribution of residuals used during training does not match the distribution required during testing due to downscaling bias. To mitigate this issue, the authors propose a novel method called ReMatch (Residual Distribution Matching), which aligns the training residual distribution with the test-time distribution using optimal transport in a low-dimensional PCA space. This approach retains the advantages of the mean-residual framework while addressing the train-test mismatch. The authors validate ReMatch through experiments on both synthetic benchmarks with varying bias levels and a real-world wind field downscaling task, demonstrating significant improvements in calibration and ensemble performance over existing models.
Methodology
The authors propose ReMatch, which utilizes optimal transport to align the training residual distribution with the test-time distribution in a low-dimensional PCA space. This method is integrated into the existing mean-residual framework to enhance the training of the stochastic residual generator.
Results
ReMatch significantly reduces under-dispersion and improves calibration metrics such as SSR (spread-skill ratio) and CRPS (Continuous Ranked Probability Score) across both synthetic benchmarks and a real-world wind field downscaling task, outperforming strong baselines including standard mean-residual models and advanced super-resolution models.
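For readers unfamiliar with CRPS, the sketch below computes it for a finite ensemble at a single location using the standard energy form, E|X - y| - 0.5 E|X - X'|. The example ensembles are invented. The paper's exact estimator and its averaging over grid points are not specified in this summary.

```python
def ensemble_crps(members, observation):
    """CRPS of an empirical ensemble: mean |x_i - y| - 0.5 * mean |x_i - x_j|.

    Lower is better. With a single member it reduces to the absolute error.
    """
    n = len(members)
    error_term = sum(abs(x - observation) for x in members) / n
    spread_term = sum(abs(a - b) for a in members for b in members) / (n * n)
    return error_term - 0.5 * spread_term


# Two hypothetical wind-speed ensembles (m/s) for an observed value of 1.0.
observed = 1.0
spread_ensemble = [0.0, 2.0]       # brackets the observation
collapsed_ensemble = [2.0, 2.0]    # under-dispersive and biased
```

The second term rewards spread, so an ensemble that brackets the observation scores better than a collapsed, biased one. This is why CRPS is sensitive to the under-dispersion discussed here.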
Implications
The findings suggest that addressing residual target misspecification can lead to more accurate and reliable probabilistic downscaling methods, which are crucial for applications in climate modeling, hazard assessment, and decision-making processes that rely on high-resolution environmental data.