AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
24 papers today · updates every 8 hours · 7 days of history
Retrieval Mechanisms Surpass Long-Context Scaling in Time Series Forecasting
Time Series
- Long-context scaling in TSFMs leads to increased forecasting errors due to stochastic noise accumulation.
- RAFT outperforms traditional long-context models by using selective retrieval of relevant historical data.
- The study reveals an inverse scaling law in time series forecasting, contradicting the scaling hypothesis from NLP.
- Dynamic exogenous variables from retrieved segments enhance model performance without the noise penalties of longer contexts.
Summary
This paper challenges the prevailing assumption in Time Series Foundation Models (TSFMs) that longer historical context improves forecasting accuracy. The authors conduct experiments using the ETTh1 benchmark to demonstrate that increasing context length often leads to worse performance due to the accumulation of stochastic noise. Specifically, they find that a 3,000-step context window results in a 68% drop in accuracy compared to a 720-step window. In contrast, the proposed Retrieval-Augmented Forecasting (RAFT) model, which utilizes a fixed 720-step window and selectively retrieves relevant historical segments, achieves a mean squared error (MSE) of 0.379, outperforming both long-context models and zero-shot foundation models. The findings suggest that selective retrieval mechanisms provide a more effective means of incorporating historical context, as they mitigate the noise associated with irrelevant data. The authors advocate for a shift in architectural design towards retrieval-based approaches in future time series forecasting models.
Methodology
The authors conducted experiments on the ETTh1 benchmark, comparing three model types: a vanilla Transformer, PatchTST (a state-of-the-art long-context model), and their proposed RAFT model. They varied the context window sizes (720, 1440, and 3,000 steps) and measured performance using mean squared error (MSE) and mean absolute error (MAE). RAFT employed cosine similarity for selective retrieval of relevant historical segments.
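As a rough illustration of the retrieval step described above, the sketch below retrieves the continuations of the most similar past segments by cosine similarity and averages them; the segment length, top-k value, and averaging forecaster are placeholder choices, not RAFT's actual configuration.

```python
import numpy as np

def retrieval_forecast(history, context_len=720, segment_len=96, top_k=3):
    """Sketch of retrieval-augmented forecasting: keep a fixed context
    window and retrieve the most similar past segments by cosine
    similarity, then average their continuations as a naive forecast."""
    context = history[-context_len:]
    query = context[-segment_len:]

    # Candidate segments come from the archive preceding the context window.
    archive = history[:-context_len]
    candidates, futures = [], []
    for start in range(len(archive) - 2 * segment_len):
        candidates.append(archive[start:start + segment_len])
        futures.append(archive[start + segment_len:start + 2 * segment_len])

    # Cosine similarity between the query and every candidate segment.
    C = np.asarray(candidates)
    sims = (C @ query) / (np.linalg.norm(C, axis=1) * np.linalg.norm(query) + 1e-8)

    # Average the continuations of the top-k matches.
    top = np.argsort(sims)[-top_k:]
    return np.asarray(futures)[top].mean(axis=0)
```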
Results
The experiments revealed that as the context window increased from 720 to 3,000 steps, the forecasting accuracy of PatchTST dropped by over 68%. In contrast, RAFT achieved the best MSE of 0.379 with a fixed 720-step window, demonstrating superior performance compared to both long-context configurations and zero-shot models.
Implications
The results suggest that future time series forecasting models should prioritize selective retrieval mechanisms over traditional long-context approaches. This could lead to more efficient and accurate forecasting methods, particularly in stochastic environments where irrelevant historical data can degrade performance.
On Uniform Error Bounds for Kernel Regression under Non-Gaussian Noise
Theory
Robotics
Reinforcement Learning
- Introduction of non-asymptotic probabilistic uniform error bounds for kernel regression.
- Extension of error bounds to a wide range of non-Gaussian noise distributions.
- Separation of uncertainty into exploration and noise components for improved accuracy.
- Demonstration of the tightness of proposed bounds through numerical examples.
Summary
This paper addresses the challenge of providing non-conservative uncertainty quantification for function estimates derived from noisy observations, particularly in safety-critical applications. The authors propose novel non-asymptotic probabilistic uniform error bounds for kernel-based regression that extend beyond the limitations of existing bounds, which are typically restricted to sub-Gaussian noise. The proposed bounds accommodate a broader class of non-Gaussian distributions, including sub-Gaussian, bounded, sub-exponential, and variance/moment-bounded noise, and are applicable to both correlated and uncorrelated noise. By separating the uncertainty into two components—one due to exploration of the function space and the other due to noise corruption—the authors derive tailored error bounds that are shown to be tighter than existing results. Numerical examples demonstrate the effectiveness of the proposed bounds in improving safety certificates for machine learning applications in fields such as safe Bayesian optimization and reinforcement learning.
Methodology
The authors develop probabilistic uniform error bounds by analyzing the statistical properties of various noise distributions and separating the uncertainty into two distinct components. They utilize kernel-based regression methods and derive bounds that are valid for finite data samples, ensuring applicability in practical scenarios.
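Schematically, bounds of this family take the following shape, with the paper's contribution lying in sharper, noise-specific constants for the two terms (the display below is the generic template, not the paper's exact statement):

```latex
% Generic shape of a uniform kernel-regression error bound; the paper
% derives tighter, noise-distribution-specific constants not shown here.
P\Big( \,\forall x \in \mathcal{X}:\; |f(x) - \mu_n(x)| \;\le\;
      \underbrace{\beta_{\mathrm{expl}}\,\sigma_n(x)}_{\text{exploration of } f}
      \;+\; \underbrace{\beta_{\mathrm{noise}}(\delta)\,\sigma_n(x)}_{\text{noise corruption}}
\Big) \;\ge\; 1 - \delta
```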
Results
The proposed error bounds are shown to significantly outperform existing bounds in terms of induced uncertainty regions. Numerical evaluations confirm the tightness of the bounds, indicating that they can provide more reliable safety certificates in machine learning applications.
Implications
The findings have significant implications for the deployment of machine learning models in safety-critical applications, where reliable uncertainty quantification is essential. The ability to handle a broader class of noise distributions enhances the robustness of kernel regression methods in real-world scenarios.
Higher-Order Equilibrium Tracking for EM-Compressible Online Estimation
Theory
Optimization
Efficient ML
- Introduces an empirical-equilibrium formulation for online estimation, separating statistical fluctuation from tracking lag.
- Develops higher-order equilibrium-jet tracking methods that improve convergence rates.
- Defines EM-compressibility and EM-jet-compressibility as conditions for effective online estimation.
- Establishes a batch-to-online transfer theorem linking online performance to batch properties.
Summary
This paper addresses the challenge of online estimation in latent-variable models by reformulating the problem as tracking a moving empirical equilibrium. Traditional online EM and stochastic approximation methods focus on convergence to population parameters without adequately addressing the empirical batch optimum and the associated online tracking error. The authors propose a framework that separates the online estimate into a frozen batch equilibrium based on the current running statistic and a tracking lag that quantifies the algorithm's delay. They establish a batch-to-online transfer theorem, demonstrating that if the tracking lag diminishes appropriately, the online estimator can inherit properties from the batch central limit theorem. The paper introduces higher-order equilibrium-jet predictors and frozen correctors to enhance tracking rates, achieving localized rates of O(T^(-ν(m+1))). The theory is exemplified in latent linear Gaussian covariance estimation, where the first-order scheme operates on a compressed statistic with defined finite-sample risk envelopes and a certified restart rule.
Methodology
The authors utilize a theoretical framework that decomposes online estimates into a frozen batch equilibrium and a tracking lag. They employ Taylor expansions to create higher-order predictors and correctors, allowing for improved tracking rates. The framework is validated through a batch-to-online transfer theorem and applied to latent linear Gaussian models.
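Written schematically from the description above (the notation is ours, not the paper's), the decomposition and the order-m equilibrium-jet predictor look like:

```latex
% Online estimate = frozen batch equilibrium at the running statistic S_t
% plus a tracking lag; an order-m "equilibrium jet" predicts the moving
% equilibrium by Taylor expansion in the statistic.
\hat{\theta}_t \;=\; \underbrace{\theta^\star(S_t)}_{\text{frozen batch equilibrium}}
              \;+\; \underbrace{\Delta_t}_{\text{tracking lag}},
\qquad
\theta^\star(S_{t+1}) \;\approx\; \sum_{j=0}^{m} \frac{1}{j!}\,
      \partial_S^{\,j}\theta^\star(S_t)\,(S_{t+1}-S_t)^{j}
```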
Results
The paper proves that under certain conditions, the online estimator can inherit the batch central limit theorem and sharp first-order risk constants. The proposed higher-order tracking methods yield localized rates of O(T^(-ν(m+1))), with the first-order scheme achieving O(T^(-2)).
Implications
The findings suggest that online estimation can achieve efficiency comparable to batch methods while processing data streams, which is crucial for applications in real-time data analysis and machine learning systems that require bounded memory and computational resources.
Task-Aware Calibration: Provably Optimal Decoding in LLMs
NLP
Large Language Models
Theory
- Introduces task calibration to improve LLM output quality by aligning predictive distributions with task-specific latent structures.
- Demonstrates that MBR decoding on task-calibrated distributions is the optimal strategy for minimizing expected loss.
- Presents Task Calibration Error (TCE) as a new metric for assessing miscalibration in LLMs, showing it is a strong predictor of performance improvements.
- Empirical evaluations confirm that task calibration consistently enhances generation quality across diverse tasks.
Summary
This paper addresses the issue of miscalibration in the predictive distributions of large language models (LLMs) during decoding, which can lead to suboptimal decision-making. The authors propose a novel approach called task calibration, which abstracts the calibration process from the combinatorial complexity of free-form language outputs to a more manageable latent space that captures semantically meaningful outcomes. By leveraging decision-theoretic principles, they demonstrate that Minimum Bayes Risk (MBR) decoding applied to this task-calibrated latent distribution yields optimal decisions. The paper introduces a new metric, Task Calibration Error (TCE), to quantify the excess loss incurred due to miscalibration. Empirical results show that task calibration significantly improves generation quality across various tasks and that TCE effectively predicts the benefits of calibration, outperforming traditional metrics like Expected Calibration Error (ECE). Overall, the proposed methodology enhances the reliability of LLM outputs in practical applications.
Methodology
The authors develop a framework for task calibration that maps LLM outputs into a latent space representing semantically meaningful outcomes. They apply decision-theoretic principles to establish that MBR decoding on this calibrated distribution minimizes expected loss. The methodology includes the introduction of TCE as a metric for quantifying miscalibration and its empirical validation across various tasks.
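A minimal sketch of MBR decoding over a candidate set, assuming the task-calibrated distribution is already available as per-candidate probabilities; the latent-space mapping is abstracted away, and the direct pairwise loss is an illustrative simplification.

```python
import numpy as np

def mbr_decode(candidates, probs, loss):
    """Minimum Bayes Risk decoding: pick the candidate minimizing expected
    loss under the (here: task-calibrated) distribution `probs`."""
    risks = [sum(p * loss(y, y_ref) for y_ref, p in zip(candidates, probs))
             for y in candidates]
    return candidates[int(np.argmin(risks))]

# Toy usage: 0/1 loss on sentiment labels recovers the mode of the
# calibrated distribution.
cands = ["positive", "negative", "neutral"]
p = [0.5, 0.3, 0.2]
print(mbr_decode(cands, p, lambda a, b: float(a != b)))  # -> "positive"
```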
Results
The empirical results indicate that MBR decoding on task-calibrated distributions leads to higher-quality outputs compared to uncalibrated distributions. The TCE metric effectively predicts the performance gains from calibration, demonstrating that LLMs are often miscalibrated across tasks. The findings suggest that task calibration can significantly enhance decision-making in LLM applications.
Implications
The proposed task-aware calibration approach has potential applications in improving the reliability and quality of LLM outputs in various domains, such as automated content generation, sentiment analysis, and decision support systems. By addressing miscalibration, this work paves the way for more effective use of LLMs in real-world scenarios.
RareCP: Regime-Aware Retrieval for Efficient Conformal Prediction
Time Series
- RareCP enhances conformal prediction efficiency by addressing temporal drift and error regime structures.
- The method utilizes a mixture of cosine-attention experts to capture distinct error regimes.
- RareCP retrieves relevant calibration examples to form adaptive prediction intervals.
- It achieves competitive performance against state-of-the-art quantile forecasters while being backbone-agnostic.
Summary
The paper introduces RareCP, a novel regime-aware retrieval method designed to enhance the efficiency of conformal prediction in time series forecasting. Traditional conformal prediction methods struggle with temporal dependence, drift, and heterogeneous error behavior, leading to inefficiencies. RareCP addresses these issues by learning local calibration representations through a mixture of cosine-attention experts that capture distinct error regimes. It employs a compact hypernetwork to adapt kernel parameters for tracking temporal drift. For each new forecasting context, RareCP retrieves the top-k relevant calibration examples, assigns similarity weights, and forms a weighted conformal quantile over their signed residuals, resulting in asymmetric prediction intervals. The method is trained using a smooth interval score objective and incorporates a parameter-space anchor to a lightweight teacher kernel to maintain stable local representations. Experimental results on the GIFT-Eval benchmark demonstrate that RareCP significantly improves interval efficiency compared to recent conformal baselines and foundation model uncertainty estimates while ensuring empirical coverage. Ablation studies confirm the contributions of regime-specific experts, drift-adaptive kernels, sparse retrieval, and teacher anchoring to the overall performance.
Methodology
RareCP employs a regime-aware retrieval mechanism that learns local calibration representations through cosine-attention experts. It retrieves the most relevant past signed residuals based on a learned similarity metric and forms weighted conformal quantiles to produce asymmetric prediction intervals. The adaptive kernel is trained using a smooth interval score objective, anchored to a lightweight teacher kernel for stability.
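The sketch below illustrates the retrieval-and-weighting step, with raw cosine similarity standing in for the learned experts and drift-adaptive kernel; all parameter choices are placeholders.

```python
import numpy as np

def weighted_interval(point_forecast, query_emb, calib_embs, calib_residuals,
                      alpha=0.1, top_k=50):
    """Retrieval-weighted conformal interval: retrieve the top-k calibration
    examples by cosine similarity, weight their signed residuals by
    similarity, and take weighted quantiles, yielding an asymmetric
    interval around the point forecast."""
    sims = (calib_embs @ query_emb) / (
        np.linalg.norm(calib_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    top = np.argsort(sims)[-top_k:]
    w = np.clip(sims[top], 0, None) + 1e-12
    w = w / w.sum()
    r = calib_residuals[top]

    # Weighted empirical CDF of the retrieved signed residuals.
    order = np.argsort(r)
    cdf = np.cumsum(w[order])
    i_lo = min(np.searchsorted(cdf, alpha / 2), len(r) - 1)
    i_hi = min(np.searchsorted(cdf, 1 - alpha / 2), len(r) - 1)
    return point_forecast + r[order][i_lo], point_forecast + r[order][i_hi]
```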
Results
On the GIFT-Eval benchmark, RareCP outperforms existing conformal prediction methods and foundation model uncertainty estimates in terms of interval efficiency while maintaining empirical coverage. The ablation studies confirm that the various components of RareCP, including regime-specific experts and drift-adaptive kernels, contribute significantly to its performance.
Implications
RareCP has potential applications in various domains requiring reliable uncertainty quantification in time series forecasting, such as finance, healthcare, and industrial operations. Its ability to adapt to changing error regimes and temporal drift makes it a valuable tool for risk-aware decision-making.
Beyond Static Bias: Adaptive Multi-Fidelity Bandits with Improving Proxies
Theory
Optimization
- Introduces a dynamic model for MF-MAB where low-fidelity sources improve with use.
- Develops the TACC algorithm that balances low-fidelity sampling and high-fidelity escalation.
- Establishes an instance-dependent regret bound that highlights the benefits of adaptive continuation.
- Demonstrates the algorithm's effectiveness through synthetic simulations and LLM-based evaluations.
Summary
This paper explores the adaptive multi-fidelity multi-armed bandit (MF-MAB) problem, where the low-fidelity feedback sources can improve over time with repeated use. Traditional MF-MAB models assume a static bias between low and high fidelity sources, but the authors argue that modern proxy sources, such as learning-based simulators and large language models (LLMs), can become more accurate with additional calibration. The authors introduce a selected-average mismatch bound that captures the evolving discrepancy between low and high fidelity sources, allowing for improvement-aware confidence bounds. They propose the Threshold-Based Adaptive Continuation Companion (TACC) algorithm, which optimally decides when to continue using low-fidelity sources versus escalating to high-fidelity evaluations. The paper includes a theoretical analysis that provides an instance-dependent regret bound, demonstrating that for certain arms, the cost of confirming their quality can be reduced by leveraging low-fidelity information. Empirical experiments validate the effectiveness of the TACC algorithm in reducing cost-weighted regret in both synthetic bandit scenarios and a policy-evaluation task using an LLM.
Methodology
The authors formulate a two-fidelity bandit model where the low-fidelity source's accuracy improves with the number of queries. They introduce a selected-average mismatch bound to derive improvement-aware confidence bounds for high-fidelity targets. The TACC algorithm is designed to optimize the decision-making process regarding when to continue using low-fidelity sources versus switching to high-fidelity evaluations. Theoretical analysis is complemented by empirical validation through simulations and real-world tasks.
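A heavily simplified continue-vs-escalate rule in the spirit of the description above; the function names, the improvement curve, and the specific inequality are our assumptions, not the paper's algorithm.

```python
def choose_fidelity(n_low_queries, gap_estimate, mismatch_bound, cost_ratio):
    """Keep querying the cheap proxy while its current mismatch bound is
    small relative to the gap we need to resolve; otherwise escalate to
    the expensive high-fidelity oracle."""
    # Mismatch is assumed to shrink as the proxy is calibrated with use.
    if mismatch_bound(n_low_queries) < gap_estimate * cost_ratio:
        return "low-fidelity"
    return "high-fidelity"

# Toy usage with a 1/sqrt(n) improvement curve for the proxy.
rule = choose_fidelity(n_low_queries=100,
                       gap_estimate=0.2,
                       mismatch_bound=lambda n: 1.0 / (n ** 0.5),
                       cost_ratio=1.0)
print(rule)  # -> "low-fidelity" (0.1 < 0.2)
```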
Results
The TACC algorithm shows a significant reduction in cost-weighted regret compared to traditional methods. The theoretical analysis confirms that for certain arms, the need for high-fidelity confirmation can be replaced with a bounded low-fidelity continuation, leading to more efficient decision-making.
Implications
This work has potential applications in fields where decision-making involves balancing cost and accuracy, such as online advertising, clinical trials, and resource allocation in machine learning. The findings suggest that adaptive strategies can enhance the efficiency of multi-fidelity evaluations.
Machine Learning-Based Graph Simplification for Symbolic Accelerators
Optimization
Graph Learning
Efficient ML
- AutoSlim effectively reduces the size and complexity of automata graphs while preserving their semantics.
- The framework utilizes a Random Forest classifier to identify redundant nodes and edges based on extracted features.
- Implementation of AutoSlim on NAPOLY+ resulted in up to 40% reduction in FPGA resource usage.
- The approach includes a verification step to ensure functional equivalence after graph pruning.
Summary
The paper introduces AutoSlim, a machine learning-based framework designed to optimize automata graphs for symbolic data processing applications, particularly in hardware accelerators like FPGAs. Symbolic data, prevalent in fields such as genomics and cybersecurity, is often modeled using finite automata, which can lead to inefficiencies due to excessive memory usage and redundant structures. AutoSlim addresses these issues by employing a Random Forest classifier to identify and prune low-impact nodes and edges from automata graphs. The framework operates during the preprocessing stage, ensuring that the functional properties of the original automaton are preserved while reducing its size and complexity. When applied to the NAPOLY+ architecture, AutoSlim demonstrated a reduction in FPGA resource usage by up to 40%, alongside improvements in throughput and power efficiency. The methodology includes a verification step to maintain functional equivalence post-pruning, and it opens avenues for future research in hardware optimization and security analysis.
Methodology
AutoSlim employs a machine learning-based pruning pipeline that extracts structural and behavioral features from automata graphs. It uses a Random Forest classifier to predict the utility of nodes and edges, removing those deemed redundant while ensuring that critical paths remain intact.
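A minimal sketch of such a pipeline using scikit-learn's RandomForestClassifier; the three node features used here are illustrative guesses, not AutoSlim's actual feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Featurize each automaton node, train on nodes labeled off-line as
# redundant or essential, then flag candidates for pruning (subject to
# the downstream functional-equivalence check).
X_train = np.array([[2, 3, 0.90],   # fan_in, fan_out, visit_freq
                    [1, 1, 0.01],
                    [4, 2, 0.75],
                    [1, 0, 0.00]])
y_train = np.array([0, 1, 0, 1])    # 1 = redundant, 0 = essential

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

candidates = np.array([[1, 1, 0.02], [3, 3, 0.60]])
prune_mask = clf.predict(candidates).astype(bool)
print(prune_mask)  # nodes flagged for removal, pending equivalence check
```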
Results
The application of AutoSlim to the NAPOLY+ architecture resulted in a significant reduction of FPGA resource usage by up to 40%, along with enhancements in throughput and power efficiency. The framework successfully maintained functional equivalence post-pruning.
Implications
The findings suggest that AutoSlim can be instrumental in optimizing hardware accelerators for symbolic processing, potentially leading to more efficient implementations in various applications such as bioinformatics and cybersecurity. Additionally, the framework opens up possibilities for integrating security analysis into automata processing systems.
Predicting Plasticity in Deep Continual Learning: A Theoretical Perspective
Theory
Optimization
- Existing diagnostics for plasticity in neural networks can be misleading and fail to predict trainability.
- The authors propose a new metric, Optimization Readiness (OR), which effectively predicts a model's ability to adapt to new tasks.
- Theoretical guarantees are provided for OR, establishing its predictive power in optimization contexts.
- Empirical results demonstrate OR's superiority over traditional metrics in ranking model checkpoints by trainability.
Summary
This paper addresses the challenge of predicting plasticity in deep continual learning, where neural networks must adapt to new tasks without retraining from scratch. The authors highlight the phenomenon of plasticity loss, where previously trained models struggle to adapt to new tasks, often performing worse than randomly initialized networks. They critique existing diagnostics for plasticity, such as representation rank and neural tangent kernel rank, demonstrating through counterexamples that these metrics can fail to predict trainability in both regression and classification tasks. To overcome these limitations, the authors propose a novel metric called Optimization Readiness (OR), which combines gradient strength and reliability. They provide theoretical guarantees that OR can lower-bound one-step optimization gain under standard smoothness assumptions. Empirical evaluations on standard benchmarks, including Slowly-Changing Regression and Permuted MNIST, show that OR reliably ranks checkpoints by trainability, outperforming existing diagnostics even with significantly fewer samples.
Methodology
The authors construct theoretical counterexamples to demonstrate the shortcomings of existing plasticity diagnostics. They then propose the Optimization Readiness (OR) metric, which is derived from the dynamics of stochastic gradient descent. The paper includes both theoretical proofs and empirical evaluations across standard continual learning benchmarks to validate OR's effectiveness.
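The standard ingredient behind guarantees of this kind is the descent lemma for L-smooth losses, which makes any reliable lower bound on the squared gradient norm a lower bound on one-step optimization gain (this is a textbook fact, not the paper's exact result):

```latex
% For L-smooth f and step size \eta, one gradient step decreases the
% loss by at least a squared-gradient-norm term.
f(\theta) - f\big(\theta - \eta \nabla f(\theta)\big)
  \;\ge\; \eta\Big(1 - \tfrac{L\eta}{2}\Big)\,\|\nabla f(\theta)\|^2
```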
Results
The study shows that traditional metrics like representation rank and eNTK rank do not reliably predict trainability. In contrast, the proposed OR metric consistently ranks model checkpoints more accurately in terms of their trainability across various tasks, even with limited data samples.
Implications
The findings suggest that relying on traditional diagnostics for assessing neural network plasticity may lead to incorrect conclusions about a model's adaptability. The introduction of OR could enhance the development of more robust continual learning systems, improving their performance in real-world applications where adaptability is crucial.
When Adaptation Fails: A Gradient-Based Diagnosis of Collapsed Gating in Vision-Language Prompt Learning
Multimodal
Efficient ML
Optimization
- Adaptive prompting mechanisms in VLMs often collapse, leading to ineffective adaptation.
- Two main failure modes identified: gradient magnitude imbalance and gate degradation.
- The study uses AdaptiveBiMaPLe as a controlled framework to analyze optimization dynamics.
- Findings suggest that additional architectural complexity may not yield meaningful benefits.
Summary
This paper investigates the failures of adaptive prompting mechanisms in vision-language models (VLMs), particularly in the context of frozen few-shot prompt learning with CLIP-style backbones. The authors observe that adaptive gates and prompt-selection modules often collapse, producing nearly constant outputs and contributing negligible gradient signals, which leads to performance that does not surpass fixed prompts. Through a systematic diagnostic study, the authors identify two primary failure modes: gradient magnitude imbalance, where gate parameters receive significantly smaller gradients than prompt parameters, and gate degradation, where gates converge to stable values without input-dependent variation. These findings challenge the assumption that adding architectural complexity will enhance performance in parameter-efficient learning and clarify the conditions under which adaptive gating is effective. The study employs a controlled testbed, AdaptiveBiMaPLe, to trace the optimization dynamics leading to these failures across multiple datasets and prompt learning architectures, ultimately providing diagnostic criteria and design implications for evaluating adaptive prompting.
Methodology
The authors conducted a systematic diagnostic study using a controlled testbed called AdaptiveBiMaPLe, which extends the MaPLe framework. They performed experiments across various datasets (ImageNet, Caltech101, EuroSAT) and multiple prompt learning architectures to analyze the optimization dynamics and identify failure modes in adaptive prompting mechanisms.
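A sketch of the kind of diagnostic this analysis relies on, assuming gate and prompt parameters can be told apart by name (the name-matching convention below is an assumption about the implementation):

```python
import torch

def gradient_imbalance(model, loss):
    """Ratio of mean gradient norms between gate and prompt parameters.
    Values around 1e-2 to 1e-3 would correspond to the reported 2-3
    orders-of-magnitude imbalance."""
    model.zero_grad()
    loss.backward()
    norms = {"gate": [], "prompt": []}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        for key in norms:
            if key in name:
                norms[key].append(p.grad.norm().item())
    gate = sum(norms["gate"]) / max(len(norms["gate"]), 1)
    prompt = sum(norms["prompt"]) / max(len(norms["prompt"]), 1)
    return gate / (prompt + 1e-12)
```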
Results
The study revealed that adaptive prompting mechanisms frequently collapse into static configurations, with gate parameters receiving gradients 2-3 orders of magnitude smaller than prompt parameters. This imbalance creates barriers to meaningful adaptation. The occasional performance gains observed in small datasets were attributed to a parameter-count buffering effect rather than genuine adaptive behavior.
Implications
The findings suggest a need for re-evaluating the design of adaptive prompting mechanisms in VLMs, particularly in frozen prompt-only tuning scenarios. The insights could inform future research on optimizing prompt learning strategies and understanding the limitations of adaptive gating in machine learning.
PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design
Large Language Models
Optimization
Efficient ML
- PRISM co-designs scheduling and KV-cache management to optimize online LLM serving.
- The architecture leverages reusable segments in prompts to enhance cache hit rates.
- Experimental results indicate significant reductions in TTFT compared to baseline methods.
- The approach addresses the inefficiencies of independent scheduling and KV-cache management.
Summary
The paper presents PRISM, a novel architecture designed to optimize the serving of online large language models (LLMs) by addressing the challenges posed by prompt segmentation and hotspot skew in user requests. Traditional methods have treated KV-cache management and request scheduling independently, leading to inefficiencies in handling frequently recurring segments of prompts. PRISM integrates a query-aware scheduler (QAS) with a demand-aware radix tree (DART) to align request admission with KV-cache retention, thereby improving throughput and reducing time-to-first-token (TTFT). The authors conduct a bottleneck analysis to demonstrate how these two components interact and affect performance. The experimental results show that PRISM significantly outperforms existing methods, achieving higher KV-cache hit rates and reducing average per-QPS P99 TTFT by 23.3% on a 4B model and 37.1% on a 13B model.
Methodology
The authors developed PRISM by first analyzing the interaction between request scheduling and KV-cache management. They introduced a query-aware scheduler that organizes requests based on reusable segments and a demand-aware radix tree that retains high-value prefixes in the KV-cache. This co-design allows for better alignment of request admission with cache retention strategies, ultimately improving performance metrics.
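As a toy stand-in for the demand-aware radix tree, the sketch below counts demand per prompt-segment prefix so hot prefixes can be identified for retention; the real DART tracks KV blocks and eviction value, which is omitted here.

```python
from collections import defaultdict

class SegmentTrie:
    """Counts demand per prompt prefix so the scheduler can keep hot
    prefixes in the KV-cache."""
    def __init__(self):
        self.children = defaultdict(SegmentTrie)
        self.demand = 0

    def insert(self, segments):
        node = self
        for seg in segments:
            node = node.children[seg]
            node.demand += 1            # recurring prefixes accumulate demand

    def longest_cached_prefix(self, segments, min_demand=2):
        node, depth = self, 0
        for seg in segments:
            nxt = node.children.get(seg)
            if nxt is None or nxt.demand < min_demand:
                break
            node, depth = nxt, depth + 1
        return depth                    # segments reusable from cache

trie = SegmentTrie()
for req in [("sys", "tools", "q1"), ("sys", "tools", "q2"), ("sys", "q3")]:
    trie.insert(req)
print(trie.longest_cached_prefix(("sys", "tools", "q9")))  # -> 2
```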
Results
PRISM achieved a 23.3% reduction in average per-QPS P99 TTFT on the 4B model and a 37.1% reduction on the 13B model, while also increasing the exact-prefix KV-cache hit rate by 5.9 and 12.2 percentage points, respectively, compared to the strongest baseline.
Implications
The findings suggest that integrating scheduling and memory management can significantly enhance the efficiency of online LLM services. This approach could be applied to various retrieval-augmented generation systems and agent-based applications, improving user experience through faster response times.
Active Learning for Gaussian Process Regression Under Self-Induced Boltzmann Weights
Theory
Optimization
Efficient ML
- Introduction of the AB-SID-iVAR acquisition function for active learning under self-induced distributions.
- Theoretical guarantees for convergence of prediction error in high-probability and average cases.
- Demonstrated superior performance on synthetic and real-world datasets compared to existing methods.
- Applicability to both discrete and continuous input domains without requiring partition function estimation.
Summary
This paper addresses the challenge of active learning in scenarios where the target distribution is self-induced by the function being learned, particularly in applications like potential energy surface modeling in computational chemistry. The authors introduce a novel acquisition function, AB-SID-iVAR, which approximates the intractable Bayesian target distribution without requiring partition function estimation. This method is applicable to both discrete and continuous input domains. Additionally, a Thompson sampling variant, TS-SID-iVAR, is proposed as a higher-variance alternative. The authors establish theoretical guarantees for the convergence of prediction error under mild conditions, demonstrating that the terminal prediction error can vanish with high probability. Empirical evaluations on synthetic benchmarks and real-world tasks, such as drug discovery, show that the proposed methods consistently outperform existing approaches, highlighting their effectiveness in reducing weighted mean-squared error (wMSE).
Methodology
The authors propose the AB-SID-iVAR acquisition function, which approximates the Bayesian target distribution in closed form. They also analyze a Thompson sampling variant, TS-SID-iVAR, to provide a higher variance Monte Carlo approach. The methods are grounded in Gaussian Process regression, allowing for effective modeling of the unknown function and its uncertainty. Theoretical analysis is provided to establish convergence guarantees for the proposed methods.
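The sketch below illustrates the general idea of Boltzmann-weighted variance acquisition with a plug-in posterior mean, which avoids any partition function in the argmax; it is an illustration of the idea, not the paper's AB-SID-iVAR formula.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def boltzmann_weighted_acquisition(gp, X_cand, beta=1.0):
    """Weight posterior variance by unnormalized Boltzmann weights built
    from the posterior mean; normalization cancels in the argmax."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    weights = np.exp(-beta * mu)        # plug-in Boltzmann weight exp(-beta f)
    return X_cand[np.argmax(weights * sigma ** 2)]

# Toy usage on a 1-D problem.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(8, 1))
y = X[:, 0] ** 2 + 0.05 * rng.standard_normal(8)
gp = GaussianProcessRegressor().fit(X, y)
X_cand = np.linspace(-2, 2, 101).reshape(-1, 1)
print(boltzmann_weighted_acquisition(gp, X_cand, beta=2.0))
```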
Results
The proposed methods, AB-SID-iVAR and TS-SID-iVAR, show significant improvements in reducing weighted mean-squared error (wMSE) compared to traditional methods like random sampling and uncertainty sampling. Empirical results indicate that the methods achieve lower wMSE in both synthetic benchmarks and real-world applications, such as potential energy surface modeling and drug discovery.
Implications
The findings suggest that the proposed active learning framework can be effectively utilized in fields where function evaluations are costly and the target distribution is influenced by the function itself. This has potential applications in computational chemistry, drug discovery, and other scientific domains where modeling complex functions is critical.
Fairness vs Performance: Characterizing the Pareto Frontier of Algorithmic Decision Systems
Theory
Optimization
- Characterizes the Pareto frontier of binary prediction-based decision systems, highlighting the trade-off between fairness and performance.
- Introduces a multi-objective optimization framework that incorporates various fairness metrics and justice-theoretic principles.
- Demonstrates that the Pareto frontier includes both lower-bound and upper-bound threshold rules, depending on the fairness metric used.
- Shows that the Pareto frontier's location is determined by population characteristics and utility functions, not by the algorithm's design.
Summary
This paper explores the trade-off between fairness and performance in algorithmic decision systems, conceptualizing the decision-making process as a multi-objective optimization problem. The authors investigate the Pareto frontier of binary prediction-based decision systems, focusing on the balance between decision-maker utility and group fairness. They analyze various fairness metrics derived from justice-theoretic principles, including egalitarianism and prioritarianism, and demonstrate that the Pareto frontier consists of deterministic, group-specific threshold rules based on individuals' success probabilities. The findings reveal that the location of the Pareto frontier is influenced by population characteristics, utility functions, and fairness scores, rather than the algorithm's technical design. This research extends existing optimality theorems in fairness-constrained classification and provides a principled foundation for evaluating algorithmic decision systems in light of legal and ethical standards.
Methodology
The authors conceptualize decision-making as a multi-objective optimization problem, analyzing the Pareto frontier of binary decision systems. They optimize decision rules based on feature vectors and fairness metrics, exploring the implications of various justice-theoretic principles. The study includes deterministic, group-specific threshold rules and evaluates the effects of different fairness metrics on the Pareto frontier.
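A group-specific threshold rule of the form the analysis identifies is a one-liner; the thresholds below are made-up numbers, and some fairness metrics instead yield upper-bound rules (admit when the probability falls below a cap).

```python
def admit(success_prob, group, tau):
    """Group-specific lower-bound threshold rule: admit an individual when
    their success probability clears their group's threshold."""
    return success_prob >= tau[group]

tau = {"A": 0.55, "B": 0.45}   # illustrative thresholds per group
print(admit(0.50, "A", tau), admit(0.50, "B", tau))  # -> False True
```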
Results
The study finds that the Pareto frontier consists of deterministic, group-specific threshold rules and can include upper-bound threshold rules favoring individuals with lower success probabilities. The location of the Pareto frontier is shown to depend on population characteristics, utility functions, and fairness scores, rather than the algorithm's technical design.
Implications
The findings provide a framework for evaluating algorithmic decision systems in terms of fairness and performance, aiding developers in creating less discriminatory alternatives. This research has potential applications in various sectors, including finance, healthcare, and governance, where algorithmic decision-making impacts individuals' lives.
AAAC: Activation-Aware Adaptive Codebooks for 4-bit LLM Weight Quantization
Large Language Models
Efficient ML
Optimization
- AAAC introduces learned scalar codebooks for improved weight quantization accuracy.
- The method operates with negligible overhead, adding only 64 bytes per layer.
- AAAC is a gradient-free approach, eliminating the need for backpropagation.
- It significantly reduces quantization time compared to existing methods.
Summary
The paper introduces AAAC (Activation-Aware Adaptive Codebooks), a novel method for 4-bit weight quantization of large language models (LLMs) aimed at reducing memory and compute costs during inference. Traditional post-training quantization (PTQ) methods, such as AWQ and GPTQ, utilize fixed scalar codebooks and improve quantization through scaling and error compensation, often requiring significant time and resources. In contrast, AAAC replaces the fixed codebook with two small learned scalar codebooks per layer, allowing for a more adaptive approach that minimizes activation-weighted reconstruction error. This method is lightweight, requiring only 64 bytes of additional storage per layer, and operates without the need for backpropagation, using activation-weighted k-means for codebook learning. The evaluation of AAAC against various baseline methods demonstrates its superior performance in terms of accuracy and quantization time, completing the process in just 3 to 30 minutes on a single GPU while maintaining peak memory usage close to the original model size.
Methodology
AAAC employs a lightweight approach to 4-bit weight quantization by utilizing two learned scalar codebooks per layer. Each group of weights selects the codebook that minimizes activation-weighted reconstruction error, with the selection encoded in the unused sign bit of the scale. The codebooks are learned using activation-weighted k-means during the forward pass, avoiding the need for backpropagation or additional memory overhead.
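A minimal sketch of activation-weighted k-means for one scalar codebook (16 centroids for 4 bits); the importance scores would come from calibration activations, and the per-group two-codebook selection is omitted.

```python
import numpy as np

def weighted_kmeans_1d(weights, importance, k=16, iters=25):
    """Scalar codebook learning: each weight contributes to the centroid
    update in proportion to an activation-derived importance score."""
    centroids = np.quantile(weights, np.linspace(0, 1, k))
    for _ in range(iters):
        assign = np.argmin(np.abs(weights[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            m = assign == j
            if m.any():
                centroids[j] = np.average(weights[m], weights=importance[m])
    return centroids, assign

rng = np.random.default_rng(0)
w = rng.standard_normal(4096) * 0.02           # a layer's weights, flattened
imp = np.abs(rng.standard_normal(4096)) + 0.1  # stand-in activation importance
codebook, codes = weighted_kmeans_1d(w, imp)
print(codebook.shape)  # (16,)
```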
Results
The experimental results indicate that AAAC outperforms existing quantization methods, including AWQ, GPTQ, and others, achieving better accuracy with significantly reduced quantization times ranging from 3 to 30 minutes. The method maintains peak memory usage comparable to the original model size, demonstrating its efficiency.
Implications
The findings suggest that AAAC can be effectively applied in scenarios where rapid and efficient weight quantization is necessary, particularly for deploying large language models in resource-constrained environments. This method could facilitate broader adoption of LLMs in real-time applications by reducing the computational burden.
Selection of the Best Policy under Fairness Constraints for Subpopulations
Theory
Optimization
Efficient ML
- Introduction of the SBFC problem to address fairness in policy selection across subpopulations.
- Development of the T-a-S-CS algorithm that achieves instance-specific lower bounds on sample complexity.
- Extension of the framework to include general fairness specifications with matching guarantees.
- Demonstration of substantial efficiency improvements over existing policy allocation baselines through numerical experiments.
Summary
This paper addresses the challenge of selecting a single policy that performs adequately across heterogeneous subpopulations, particularly in high-stakes domains like healthcare and public policy. The authors formalize this challenge as the Selection of the Best with Fairness Constraints (SBFC) problem, which aims to identify the policy with the highest average performance while ensuring that each subpopulation meets a minimum performance threshold. They establish a lower bound on the sample complexity for the SBFC problem and introduce the Track-and-Stop with Constraints on Subpopulation (T-a-S-CS) algorithm, which asymptotically achieves this lower bound. The framework is extended to accommodate various fairness specifications with matching guarantees. Through numerical experiments and a case study on the International Stroke Trial, the authors demonstrate that their approach yields significant efficiency gains compared to traditional policy-level allocation methods.
Methodology
The authors formulate the SBFC problem using a fixed-confidence best-arm identification framework, allowing for a globally coupled optimization across a two-dimensional policy-by-subpopulation space. The T-a-S-CS algorithm is designed to learn the performance of candidate policies sequentially while ensuring that fairness constraints are met.
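Written schematically from the description above (the weights and thresholds are our notation), the selection target is:

```latex
% Best average performer whose every subpopulation g clears its
% threshold tau_g; w_g are subpopulation weights.
\pi^\star \;=\; \arg\max_{\pi \in \Pi}\; \sum_{g} w_g\, \mu_g(\pi)
\quad \text{subject to} \quad \mu_g(\pi) \;\ge\; \tau_g \quad \forall g
```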
Results
The proposed T-a-S-CS algorithm achieves the established lower bound on sample complexity and demonstrates significant efficiency gains in policy selection, as evidenced by numerical experiments and a case study involving the International Stroke Trial.
Implications
The findings suggest that the SBFC framework can be effectively applied in various high-stakes decision-making contexts where fairness across subpopulations is critical, potentially influencing policy formulation in healthcare, public policy, and clinical development.
Generalized Wasserstein Flow Matching: Transport Plans, Everywhere, All at Once
Generative Models
Theory
Optimization
- Introduction of Wasserstein-on-Wasserstein (WoW) formulation for flow matching.
- Derivation of non-local velocity fields as minimizers of a specific loss function.
- Development of efficient transport couplings using sliced and linear Wasserstein approximations.
- Unification and extension of existing generative modeling methods for point clouds and sets.
Summary
This paper presents a novel extension of flow matching, a framework for generative modeling, to the space of probability measures over probability measures, termed Wasserstein-on-Wasserstein (WoW). The authors leverage the nested Wasserstein geometry to derive velocity fields that facilitate metameasure flows, thus generalizing Wasserstein flow matching through coupled outer and inner transport plans. To mitigate the computational challenges associated with WoW transport, the authors propose scalable approximations using sliced and linear Wasserstein distances, which enhance training efficiency while ensuring numerical stability. This framework not only unifies existing generative modeling approaches for point clouds and sets but also provides a theoretically sound method for generative modeling in WoW spaces, showcasing its versatility and practical applicability.
Methodology
The authors extend the flow matching framework to metameasures by employing a second-order Wasserstein geometry. They derive a generalized Wasserstein flow matching model that couples outer transport plans from the WoW space with inner plans from the Euclidean Wasserstein space. The methodology includes the use of sliced and linear Wasserstein distances to create scalable approximations that balance computational efficiency and numerical stability.
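A self-contained sketch of the sliced Wasserstein-2 distance, the kind of scalable surrogate used here: project to random directions, sort, and average one-dimensional transport costs (equal-size, equally weighted point clouds assumed).

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=128, rng=None):
    """Sliced W2 between point clouds X, Y of shape (n, d): 1-D optimal
    transport along random directions reduces to sorting."""
    rng = rng or np.random.default_rng(0)
    theta = rng.standard_normal((n_proj, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    xp = np.sort(X @ theta.T, axis=0)   # 1-D optimal transport = sorting
    yp = np.sort(Y @ theta.T, axis=0)
    return np.sqrt(np.mean((xp - yp) ** 2))

rng = np.random.default_rng(1)
A = rng.standard_normal((256, 3))
B = rng.standard_normal((256, 3)) + 1.0
print(sliced_wasserstein(A, B))
```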
Results
The proposed framework successfully generalizes Wasserstein flow matching, demonstrating effective training and stable velocity fields. The experiments conducted validate the advantages of the new transport couplings, showing improved performance in generative modeling tasks involving point clouds and sets.
Implications
This work has significant implications for generative modeling, particularly in applications involving complex data structures such as point clouds and sets. The theoretical advancements and computational techniques introduced could enhance the efficiency and effectiveness of generative models in various domains.
AdamFLIP: Adaptive Momentum Feedback Linearization Optimization for Hard Constrained PINN Training
Optimization
Theory
- AdamFLIP reformulates PINN training as an equality-constrained optimization problem.
- The method utilizes feedback linearization to stabilize constraint violations.
- AdamFLIP integrates adaptive moment estimation, enhancing convergence and robustness.
- Empirical results show significant improvements in constraint satisfaction and solution accuracy.
Summary
The paper introduces AdamFLIP, a novel optimization method for training Physics-Informed Neural Networks (PINNs) under hard constraints. Traditional PINN training relies on soft penalty formulations, which can lead to issues such as ill-conditioning and sensitivity to loss weights. AdamFLIP reformulates the training process as an equality-constrained optimization problem, treating constraint residuals as outputs of a controlled dynamical system. The method computes Lagrange multipliers as feedback inputs to stabilize constraint violations, integrating adaptive moment estimation akin to the Adam optimizer. The authors demonstrate that AdamFLIP significantly outperforms standard soft-constrained PINN approaches and other state-of-the-art constrained optimizers across various benchmark PDE problems, particularly achieving a two-thirds reduction in relative L2 error for the Navier–Stokes equations. This framework offers a scalable and effective solution for hard constraint optimization in PINN training.
Methodology
The authors reformulate the PINN training process as an equality-constrained optimization problem, using feedback linearization to manage constraint violations. AdamFLIP combines this approach with adaptive moment estimation techniques from the Adam optimizer, allowing for effective handling of constraints while maintaining the benefits of adaptive optimization.
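A schematic reading of the feedback-linearization step for a single equality constraint, without the Adam-style adaptive moments the full method adds; the multiplier is chosen so the constraint residual follows the stable dynamics c' = -k c.

```python
import numpy as np

def flip_step(theta, grad_f, c, grad_c, lr=1e-2, k=10.0):
    """One feedback-linearized gradient step for min f(theta) subject to
    c(theta)=0: with update direction -(g + lam*a), choosing lam as below
    makes the constraint residual decay like c' = -k*c."""
    g, a = grad_f(theta), grad_c(theta)
    lam = (k * c(theta) - a @ g) / (a @ a + 1e-12)
    return theta - lr * (g + lam * a)

# Toy usage: minimize ||theta||^2 subject to theta_0 + theta_1 = 1.
f_grad = lambda th: 2 * th
c_fun = lambda th: th[0] + th[1] - 1.0
c_grad = lambda th: np.array([1.0, 1.0])
th = np.array([2.0, -1.0])
for _ in range(2000):
    th = flip_step(th, f_grad, c_fun, c_grad)
print(th)  # -> approximately [0.5, 0.5]
```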
Results
AdamFLIP consistently outperformed both standard soft-constrained PINN training and other constrained optimizers across multiple benchmark problems. Notably, for the Navier–Stokes equations, it achieved a reduction in relative L2 error by more than two-thirds compared to the next best method. The finite-time convergence analysis indicates that both stationarity residual and constraint violation converge at a rate of O(log T/√T).
Implications
The AdamFLIP framework provides a robust and scalable method for training PINNs with hard constraints, which can be applied in various scientific and engineering fields where accurate solutions to PDEs are critical. This approach may also influence future research in constrained optimization within machine learning.
Additive Atomic Forests for Symbolic Function and Antiderivative Discovery
Theory
Interpretability
Optimization
- Introduces a self-expanding library of function-derivative pairs for symbolic regression.
- Employs two new primitives, EML and SOL, to efficiently generate elementary function atoms.
- Demonstrates empirical success in classification and symbolic regression tasks, outperforming traditional methods like XGBoost.
- Provides a framework that allows for the simultaneous recovery of functions and their antiderivatives without traditional integration methods.
Summary
This paper introduces a novel framework for the simultaneous symbolic recovery of a function and its antiderivative from data, termed Additive Atomic Forests. The framework is built on three foundational concepts: a derivative algebra that utilizes the product and chain rules to create a self-expanding library of function-derivative pairs; two complementary primitives, EML (e^u − ln v) and SOL (sin u − cos v), which efficiently generate core atoms for the library; and the concept of additive atomic forests, which are finite sums of primitive trees that can be optimized to fit data. The library is dynamic, growing as new functions are discovered, thereby enhancing the expressible class of candidate functions. The framework achieves conditional completeness and offers empirical results demonstrating its effectiveness. In tests across 17 classification benchmarks, sparse combinations of atoms matched or surpassed XGBoost on 13 datasets while yielding interpretable formulas. On the Feynman symbolic regression benchmark, a library constructed to depth 3 achieved exact recovery on 31% of equations and maintained a relative mean squared error below 0.01 on an additional 40%. The method was also applied to real scientific data, proposing candidate radial-acceleration relations from the SPARC galaxy database.
Methodology
The methodology involves constructing a dynamic library of function-derivative pairs using derivative algebra, specifically leveraging the product and chain rules. Two primitives, EML and SOL, are introduced to seed the library with essential elementary functions. The framework utilizes additive atomic forests, which are optimized to fit data through continuous optimization or exhaustive search, allowing for the simultaneous encoding of a function and its derivative.
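A minimal sketch of the pair algebra: atoms carry (f, f') and the product rule closes the library under multiplication, so antiderivatives are read off by construction rather than integrated symbolically (EML/SOL seeding and the forest optimizer are omitted).

```python
import math

class Atom:
    """A function-derivative pair; combining atoms yields new pairs whose
    derivatives are known exactly, so fitting an atom to data gives its
    antiderivative for free."""
    def __init__(self, f, df):
        self.f, self.df = f, df

    def __mul__(self, other):
        # Product rule: (fg)' = f'g + fg', keeping the library closed.
        return Atom(lambda x: self.f(x) * other.f(x),
                    lambda x: self.df(x) * other.f(x) + self.f(x) * other.df(x))

exp_atom = Atom(math.exp, math.exp)
sin_atom = Atom(math.sin, math.cos)
prod = exp_atom * sin_atom
print(prod.f(1.0), prod.df(1.0))  # e*sin(1), e*(sin(1) + cos(1))
```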
Results
The framework was tested on 17 classification benchmarks, where it matched or exceeded the performance of XGBoost on 13 datasets. On the Feynman symbolic regression benchmark, it achieved exact recovery on 31% of equations and maintained a relative mean squared error below 0.01 on an additional 40%. The method also successfully proposed candidate radial-acceleration relations from the SPARC galaxy database.
Implications
The proposed framework has significant implications for symbolic regression and function discovery in various scientific fields. Its ability to generate interpretable formulas and recover antiderivatives without traditional integration methods could enhance data analysis and modeling in physics, engineering, and other domains where mathematical relationships are crucial.
Complex-Valued Phase-Coherent Transformer
Theory
Computer Vision
NLP
- PCT introduces token-non-competing attention to preserve phase information in complex-valued computations.
- The architecture consistently outperforms both real-valued and complex-valued Transformers under parameter-fair conditions.
- PCT shows strong generalization across diverse tasks, including those traditionally difficult for complex-valued networks.
- The design principles of PCT address long-standing concerns about depth scalability in complex-valued neural networks.
Summary
The paper introduces the Phase-Coherent Transformer (PCT), a novel architecture for complex-valued Transformers that addresses the limitations of traditional softmax attention mechanisms. PCT employs a real-valued, element-independent smooth gate applied to L2-normalized complex query-key similarities, effectively replacing token competition with token-non-competing attention. This design preserves phase information across layers, which is crucial for complex-valued computations. The authors conduct extensive experiments across various mid-scale benchmarks, demonstrating that PCT outperforms both standard softmax Transformers and existing complex-valued architectures. Notably, PCT excels in tasks traditionally challenging for complex-valued neural networks, such as long-range memory and hierarchical reasoning, while maintaining robustness against depth-related accuracy collapse. The findings suggest that integrating multi-layer phase-coherent structures into attention mechanisms can enhance generalization in complex-valued Transformers.
Methodology
The methodology involves the development of the Phase-Coherent Transformer (PCT) architecture, which utilizes a sigmoid gate on the cosine score of L2-normalized complex query-key pairs without row normalization. The authors conduct systematic comparisons across nine mid-scale benchmarks, evaluating performance against both standard softmax Transformers and complex-valued counterparts. They analyze the effects of different gating mechanisms on performance and explore the implications of token non-competition and multi-layer phase preservation.
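A sketch of the gating idea for complex tensors, following the description above rather than the paper's code; scaling constants and multi-head plumbing are omitted.

```python
import torch

def phase_coherent_attention(q, k, v):
    """Token-non-competing attention for complex tensors of shape
    (batch, seq, dim): a real sigmoid gate on the cosine score of
    L2-normalized complex query-key pairs replaces softmax, so no row
    normalization forces tokens to compete, and phase is preserved."""
    qn = q / (q.abs().pow(2).sum(-1, keepdim=True).sqrt() + 1e-8)
    kn = k / (k.abs().pow(2).sum(-1, keepdim=True).sqrt() + 1e-8)
    # Real part of the Hermitian inner product = cosine score in [-1, 1].
    score = torch.einsum("bqd,bkd->bqk", qn, kn.conj()).real
    gate = torch.sigmoid(score)            # element-independent, no softmax
    return torch.einsum("bqk,bkd->bqd", gate.to(v.dtype), v)

q = torch.randn(1, 4, 8, dtype=torch.cfloat)
k = torch.randn(1, 4, 8, dtype=torch.cfloat)
v = torch.randn(1, 4, 8, dtype=torch.cfloat)
print(phase_coherent_attention(q, k, v).shape)  # torch.Size([1, 4, 8])
```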
Results
PCT demonstrates superior performance across multiple tasks, achieving a perfect score on the long-range NIAH task and exhibiting no depth-related accuracy collapse over a tested range of depths. It consistently outperforms both vanilla complex and real Transformers, as well as the strongest non-softmax baseline, under parameter-fair conditions. The results highlight the effectiveness of the proposed architecture in enhancing generalization and robustness in complex-valued neural networks.
Implications
The findings suggest that rethinking attention mechanisms for complex-valued neural networks can significantly improve their applicability in general-purpose domains such as text and vision. The introduction of phase-coherent structures may pave the way for more effective complex-valued architectures in various applications, including signal processing, image classification, and reasoning tasks.
Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization
Optimization
Theory
- Identifies a mechanism-level objective mismatch in SAM, where first-order gradient signals dominate instead of second-order curvature.
- Proposes LE-SAM, which fixes the loss budget and adapts the perturbation radius dynamically to focus on curvature.
- Demonstrates that LE-SAM consistently outperforms SAM and its variants across diverse benchmarks.
- Highlights the significance of curvature in achieving better generalization in deep neural networks.
Summary
This paper addresses a fundamental mismatch in the Sharpness-Aware Minimization (SAM) framework, which aims to improve generalization in deep neural networks by minimizing the worst-case loss within a fixed parameter-space radius. The authors argue that SAM's reliance on first-order linearized surrogates does not align with the second-order nature of flat minima, which are characterized by local curvature. To resolve this, they propose Loss-Equated SAM (LE-SAM), which inverts the traditional SAM approach by fixing the loss budget instead of the perturbation radius. This reformulation allows for a dynamic adjustment of the perturbation radius based on the local landscape, effectively shifting the optimization focus from gradient-dominated signals to curvature-dominated terms. The authors conduct extensive experiments across various benchmarks, demonstrating that LE-SAM consistently outperforms SAM and its variants, achieving state-of-the-art performance in generalization. The findings highlight the importance of addressing the objective mismatch in adversarial perturbation mechanisms to enhance the robustness and stability of optimization in deep learning.
Methodology
The authors reformulate the adversarial perturbation mechanism of SAM by fixing the loss budget in loss space, allowing for a dynamic adjustment of the perturbation radius based on local curvature. This approach removes the influence of gradient norms and emphasizes second-order curvature information throughout the training process.
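A first-order reading of "fix the loss, not the radius" (our interpretation, not the paper's exact update): SAM's perturbation rho*g/||g|| raises the loss by about rho*||g||; fixing that increase to a budget eps instead gives radius eps/||g||, so the perturbation becomes eps*g/||g||^2 and large-gradient regions get smaller radii, consistent with removing the influence of gradient norms.

```python
import torch

def loss_budget_perturbation(params, loss, eps=0.05):
    """Compute a perturbation whose first-order loss increase is eps,
    regardless of the gradient norm (schematic, single budget)."""
    grads = torch.autograd.grad(loss, params)
    gnorm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    return [eps * g / gnorm ** 2 for g in grads]  # radius adapts to ||g||
```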
Results
LE-SAM shows superior performance compared to SAM and its variants in various experiments, achieving state-of-the-art results in generalization across multiple tasks and benchmarks.
Implications
The findings suggest that optimizing for curvature rather than gradient norms can lead to more stable and robust training in deep learning models, potentially influencing future research and applications in model optimization and generalization strategies.
Hierarchical Multi-Fidelity Learning for Predicting Three-Dimensional Flame Wrinkling and Turbulent Burning Velocity
Theory
- Development of MuFiNNs framework for predicting flame dynamics.
- Integration of high-fidelity and low-fidelity data to enhance predictive accuracy.
- Demonstrated effectiveness in data-limited regimes with sparse experimental data.
- Ability to interpolate and extrapolate across various operating conditions.
Summary
This paper presents a novel hierarchical multi-fidelity neural network framework (MuFiNNs) designed to predict the dynamics of turbulent premixed flames, specifically focusing on three-dimensional flame wrinkling and turbulent burning velocity. The authors address the challenge of limited high-fidelity experimental data in turbulent combustion, which is often expensive and complex to obtain, especially under conditions of high pressure and turbulence. The proposed framework integrates sparse high-fidelity experimental data with structured low-fidelity models that capture dominant physical trends. By combining hierarchical low-fidelity construction with nonlinear multi-fidelity correction, MuFiNNs effectively learns the coupled geometric and reactive behaviors of flames. The methodology is applied to various fuels, pressures, and turbulence intensities, demonstrating its ability to accurately reconstruct flame behavior, interpolate across unseen conditions, and extrapolate beyond the training domain. The results indicate that the framework is robust even in data-limited regimes characterized by noise or experimental challenges, establishing multi-fidelity scientific machine learning as a viable approach for predictive combustion modeling.
Methodology
The authors developed a hierarchical multi-fidelity neural network framework (MuFiNNs) that combines low-fidelity trend models informed by experimental data with high-fidelity measurements. This approach allows for the joint prediction of three-dimensional flame surface evolution and turbulent mass burning velocity, addressing the complexities of turbulent combustion dynamics.
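A minimal sketch of the nonlinear correction idea: a small network maps operating conditions plus the low-fidelity prediction to the high-fidelity target, so sparse experiments only need to learn the discrepancy from the low-fidelity trend model (layer sizes are arbitrary; the hierarchical low-fidelity construction is not shown).

```python
import torch
import torch.nn as nn

class MultiFidelityCorrector(nn.Module):
    """Maps (operating conditions, low-fidelity prediction) to the
    high-fidelity target."""
    def __init__(self, n_cond, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_cond + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, conditions, y_low):
        return self.net(torch.cat([conditions, y_low], dim=-1))

model = MultiFidelityCorrector(n_cond=3)  # e.g. fuel, pressure, turbulence
y_hf = model(torch.randn(16, 3), torch.randn(16, 1))
print(y_hf.shape)  # torch.Size([16, 1])
```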
Results
The MuFiNNs framework successfully reconstructed observed flame behavior and demonstrated reliable performance in interpolating and extrapolating across unseen operating conditions. The results showed that the framework could recover physically consistent flame behavior even in regimes with sparse, noisy, or inaccessible data, outperforming conventional data-driven methods.
Implications
The findings suggest that hierarchical multi-fidelity learning can significantly enhance predictive combustion modeling, particularly in scenarios where high-fidelity data acquisition is challenging. This approach can be applied to improve the understanding and prediction of flame dynamics in various combustion applications, including the transition to hydrogen and low-carbon fuels.
Identified-Set Geometry of Distributional Model Extraction under Top-K Censored API Access
NLP
Large Language Models
Theory
- Introduces the first exact identified-set characterization for top-K censored model extraction.
- Establishes computable bounds for KL recovery limits under different logit access models.
- Demonstrates a layered extraction hierarchy with varying recovery rates based on access methods.
- Shows that top-K censoring limits distribution recovery but does not fully prevent capability extraction.
Read more
Identified-Set Geometry of Distributional Model Extraction under Top-K Censored API Access
Summary
This paper investigates the limitations of model extraction from large language model (LLM) APIs that only provide top-K logit scores, thereby censoring the remaining vocabulary. The authors introduce a framework that characterizes the identified set of compatible teacher distributions based on the top-K scores, establishing the total-variation diameter of this set. They derive computable bounds for Kullback-Leibler (KL) divergence recovery limits under both normalized and unnormalized logit access models. The study reveals a layered extraction hierarchy through experiments on a Qwen3 math-reasoning teacher, demonstrating varying recovery rates of private capabilities based on the access mode. The findings indicate that while top-K censoring limits per-position distribution recovery, it does not entirely prevent capability extraction, highlighting a separation between fidelity and transfer in prompt-only logit distillation.
Methodology
The authors treat the censored observation as a partial identification problem, analyzing the compatibility of teacher distributions with top-K observations. They derive theoretical bounds for KL recovery and conduct experiments to evaluate capability extraction across different access modes, including top-K distillation and full-logit distillation.
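A simplified illustration of why the total-variation diameter is computable from a single API call (one plausible reading, assuming normalized top-K probabilities, not the paper's exact derivation): any teacher compatible with the observation agrees on the K revealed entries and can only redistribute the leftover tail mass over the censored vocabulary, with each censored token capped at the K-th probability so the observed set stays on top.

```python
# Simplified illustration (not the paper's exact result): given top-K
# probabilities from one API call, compatible teachers agree on those K
# entries and spread the leftover mass r over censored tokens, each capped
# at the K-th probability p_K. When the censored vocabulary is large enough
# for two such tails to have disjoint support, the TV diameter equals r.
import math

def tv_diameter_upper(top_k_probs, vocab_size):
    p = sorted(top_k_probs, reverse=True)
    r = 1.0 - sum(p)              # censored tail mass
    p_k = p[-1]                   # per-token cap for censored entries
    tail_slots = vocab_size - len(p)
    # Two compatible teachers can place r on disjoint tail supports iff there
    # is room for both concentrated tails; LLM vocabularies are usually huge,
    # so this case dominates in practice.
    if p_k > 0 and tail_slots >= 2 * math.ceil(r / p_k):
        return r                  # diameter attained by disjoint tails
    return min(r, 0.5 * tail_slots * p_k)  # crude bound in the cramped case

# e.g. top-5 probabilities from one call, 150k-token vocabulary:
print(tv_diameter_upper([0.4, 0.2, 0.1, 0.05, 0.05], 150_000))  # -> 0.2
```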
Results
The study finds that on-task top-K knowledge distillation recovers 12% of private capability, while full-logit distillation recovers 56%, and generation-based extraction achieves 96%. The results indicate that top-K censoring does not provide complete protection against model extraction and that the total-variation diameter of the identified set can be computed from a single API call.
Implications
The findings suggest that while top-K censoring offers some level of protection, it is insufficient as a standalone defense against model extraction. This has implications for the security of LLM APIs and the design of access controls, emphasizing the need for more robust measures to protect against capability extraction.
Unsupervised Process Reward Models
NLP
Large Language Models
Reinforcement Learning
- uPRMs eliminate the need for human annotations and ground-truth verification in training Process Reward Models.
- The proposed method achieves up to 15% accuracy improvement in identifying first erroneous steps compared to LLM-as-a-Judge.
- uPRMs perform comparably to supervised PRMs in test-time scaling and outperform majority voting baselines.
- As a reward signal in reinforcement learning, uPRMs support more robust policy optimization than supervised PRMs.
Read more
Unsupervised Process Reward Models
Summary
This paper introduces Unsupervised Process Reward Models (uPRMs), a novel approach to training Process Reward Models without the need for human annotations or ground-truth verification. Traditional Process Reward Models require extensive expert labeling for each reasoning step, which is costly and difficult to scale. The authors propose a method that leverages the next-token probabilities of large language models (LLMs) to create a scoring function that evaluates the plausibility of error positions in reasoning trajectories. By assessing multiple reasoning paths simultaneously, uPRMs can effectively identify first erroneous steps and provide dense feedback for guiding reasoning processes. The paper demonstrates the effectiveness of uPRMs through various experiments, showing significant improvements in accuracy and robustness compared to existing methods, including supervised PRMs and majority voting baselines. Overall, the findings suggest that uPRMs can facilitate scalable reward modeling for complex reasoning tasks, paving the way for more efficient training of LLMs in various applications.
Methodology
The authors developed uPRMs by defining a scoring function based on the next-token probabilities from LLMs, which evaluates the correctness of reasoning steps. This scoring function is optimized through reinforcement learning, allowing the model to learn from the implicit judgments encoded in the LLM's outputs without requiring explicit annotations.
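The exact scoring function is not given in this summary; the sketch below conveys only the general flavor of unsupervised first-error scoring. The quantity `step_logprobs` is a hypothetical per-step consistency log-probability assumed to be extracted from the LLM's next-token distribution; each candidate position is scored by combining a plausible prefix with an implausible step:

```python
# Sketch of the general idea only (not the paper's exact scoring function):
# score each candidate "first error" position from next-token
# log-probabilities, with no human labels involved.
import math

def first_error_scores(step_logprobs):
    """step_logprobs[i] = hypothetical LLM log P(step i is consistent | steps < i),
    assumed to be extracted from next-token probabilities."""
    scores = []
    for k in range(len(step_logprobs)):
        # Hypothesis: steps 0..k-1 are fine, step k is the first error.
        ok = sum(step_logprobs[:k])                                 # plausible prefix
        bad = math.log1p(-math.exp(min(step_logprobs[k], -1e-9)))   # implausible step k
        scores.append(ok + bad)
    return scores

def softmax(xs):
    # Normalizing over positions gives a distribution over the first
    # erroneous step, i.e. a dense process-level signal.
    m = max(xs)
    ws = [math.exp(x - m) for x in xs]
    z = sum(ws)
    return [w / z for w in ws]
```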
Results
uPRMs demonstrated up to 15% absolute accuracy improvements in identifying erroneous reasoning steps on the ProcessBench dataset. In test-time scaling experiments, uPRMs outperformed the majority voting baseline by up to 6.9% and showed competitive performance against supervised PRMs. Additionally, when used as a reward signal in reinforcement learning, uPRMs led to a 4% accuracy gain over traditional outcome rewards, indicating enhanced robustness against reward hacking.
Implications
The development of uPRMs has significant implications for the scalability of reward modeling in machine learning, particularly in applications requiring complex reasoning, such as mathematics and programming. This approach could reduce the reliance on costly human annotations, making it easier to train and deploy large language models in various domains.
Path-Dependent Denoising: A Non-Conservative Field Perspective on Order Collapse in Diffusion Language Models
NLP
Large Language Models
Generative Models
- DLMs can theoretically generate tokens in arbitrary orders but often behave like autoregressive models in practice.
- The paper introduces a compatibility formulation that connects local denoising conditionals to a common joint distribution.
- A local circulation diagnostic is proposed to measure the order sensitivity of DLMs.
- Path dependence in DLMs is characterized and separated from conditional total correlation and order-specific estimation errors.
Read more
Path-Dependent Denoising: A Non-Conservative Field Perspective on Order Collapse in Diffusion Language Models
Summary
This paper addresses the inherent order sensitivity of diffusion language models (DLMs), which are designed to allow arbitrary token generation orders, in contrast to the strict left-to-right decoding of autoregressive models. Despite this potential flexibility, DLMs often behave like autoregressive models in practice, particularly under high decoding parallelism. The author traces this issue to incompatibility: the local denoising conditionals fail to be consistent with any single common joint distribution. The paper introduces a formal framework that defines order-induced pseudo-joints and local denoising circulation, enabling a deeper understanding of how different token update orders affect model performance. The study also distinguishes path dependence caused by conditional incompatibility from order-specific estimation errors. The contributions include a formulation of arbitrary-order denoising as a pseudo-joint invariance problem, a local circulation diagnostic to measure order sensitivity, a characterization of path dependence, and a separation of failure modes in DLMs. The findings suggest that while DLMs can in principle support order-free generation, practical implementations often revert to structured decoding strategies to maintain quality.
Methodology
The author formalizes the relationship between local denoising conditionals and order-induced pseudo-joints, introducing a local circulation measure to quantify order sensitivity. The analysis distinguishes between different sources of path dependence and employs theoretical constructs to derive empirical diagnostics.
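A simplified two-position rendering of the circulation idea (an illustration consistent with the description above, not the paper's exact measure): if the conditionals at two positions are induced by one joint distribution, log-ratio differences taken around a closed loop of token flips cancel exactly, so any nonzero "circulation" certifies incompatibility and hence order sensitivity.

```python
# Two-position illustration of local circulation (my own simplified
# rendering). If the conditionals come from one joint p(x_i, x_j), then
#   log p_i(a'|b) - log p_i(a|b) = log p(a', b) - log p(a, b),
# so the signed sum of such differences around the loop
#   (a,b) -> (a',b) -> (a',b') -> (a,b') -> (a,b)
# cancels to zero; a nonzero value means the update order can change the
# outcome.
import math

def circulation(cond_i, cond_j, a, a2, b, b2):
    """cond_i(x_i, x_j): model prob of token x_i at position i given x_j at j;
    cond_j symmetric. a/a2 and b/b2 are token values at positions i and j."""
    lp = lambda f, x, y: math.log(f(x, y))
    return (
        (lp(cond_i, a2, b) - lp(cond_i, a, b))      # (a,b)   -> (a',b)
        + (lp(cond_j, b2, a2) - lp(cond_j, b, a2))  # (a',b)  -> (a',b')
        - (lp(cond_i, a2, b2) - lp(cond_i, a, b2))  # (a',b') -> (a,b')
        - (lp(cond_j, b2, a) - lp(cond_j, b, a))    # (a,b')  -> (a,b)
    )

# Compatible example: conditionals derived from an explicit 2x2 joint -> 0.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
pi = lambda xi, xj: joint[(xi, xj)] / (joint[(0, xj)] + joint[(1, xj)])
pj = lambda xj, xi: joint[(xi, xj)] / (joint[(xi, 0)] + joint[(xi, 1)])
print(abs(circulation(pi, pj, 0, 1, 0, 1)) < 1e-12)  # True
```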
Results
The study establishes that exact order consistency in DLMs is achievable only when local conditionals are compatible. It demonstrates that global order gaps can be decomposed into local circulations, providing a structured approach to understanding the order sensitivity of DLMs. The findings indicate that practical decoding strategies often revert to autoregressive-like behavior to ensure quality.
Implications
The insights from this paper could lead to improved decoding strategies for DLMs that better leverage their potential for arbitrary token generation orders. Understanding the structural compatibility of local conditionals may enhance the design of more efficient and effective language models.
A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability
Large Language Models
Theory
Efficient ML
- Introduces a unified communication-theoretic framework for LLM reliability techniques.
- Derives analytical results linking agent behavior to classical decoding theory.
- Presents a cost-aware router that optimizes technique selection based on empirical performance.
- Empirical evaluations show significant improvements in quality and cost efficiency.
Read more
A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability
Summary
This paper presents a novel communication-theoretic framework for enhancing the reliability of agents built on large language models (LLMs). The authors argue that existing reliability techniques, such as retry loops and majority voting, lack a unified analytical framework. They propose treating LLMs as discrete stochastic channels, leveraging concepts from Shannon's coding theory to develop a comprehensive framework called AGENTCODEC. This framework includes six classical reliability operators adapted for LLMs, allowing for a systematic approach to reliability. The authors derive two key analytical results: a noise-variance threshold that informs the effectiveness of averaging techniques and a contractivity criterion for generator-critic refinement. Additionally, they introduce a cost-aware semantic-nearest-neighbor router that optimizes technique selection based on quality and cost without requiring retraining. Empirical evaluations across various configurations demonstrate that their router significantly outperforms fixed techniques in terms of cost and quality, advocating for a more dynamic approach to LLM reliability.
Methodology
The authors formalize LLMs as discrete stochastic channels and develop AGENTCODEC, which incorporates six communication-theoretic reliability operators. They derive analytical results related to noise thresholds and iterative decoding, and introduce a semantic-nearest-neighbor router that selects techniques based on a cost-regularized objective. Empirical evaluations are conducted across multiple LLM configurations and tasks to validate the proposed methods.
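A plausible reading of the semantic-nearest-neighbor router (the names, neighborhood size, and cost-regularized objective below are assumptions, not the paper's implementation): embed the incoming task, retrieve the most similar past tasks from a bank of logged runs, and select the reliability technique with the best quality-minus-cost trade-off among those neighbors.

```python
# Plausible sketch of a semantic-nearest-neighbor, cost-aware router.
import numpy as np

def route(task_emb, bank, k=8, lam=0.5):
    """bank: list of (embedding, technique, quality, cost) records of past runs."""
    embs = np.stack([e for e, *_ in bank])
    # Cosine similarity between the new task and every logged task.
    sims = embs @ task_emb / (
        np.linalg.norm(embs, axis=1) * np.linalg.norm(task_emb) + 1e-12
    )
    neighbors = [bank[i] for i in np.argsort(-sims)[:k]]

    # Average quality and cost per technique over the neighborhood, then
    # maximize the cost-regularized objective: quality - lam * cost.
    stats = {}
    for _, tech, q, c in neighbors:
        qs, cs = stats.setdefault(tech, ([], []))
        qs.append(q)
        cs.append(c)
    return max(
        stats,
        key=lambda t: np.mean(stats[t][0]) - lam * np.mean(stats[t][1]),
    )
```

Because the bank only stores embeddings and observed (quality, cost) outcomes, new runs can be appended as they complete, which is consistent with the paper's claim that technique selection adapts without retraining.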
Results
The proposed router achieves approximately 56% lower normalized cost compared to the best fixed technique at matched quality, and improves quality by about 7% at matched normalized cost. The empirical evaluations across 69 tasks and a 300-item hard split demonstrate the effectiveness of the framework and the router in optimizing reliability techniques.
Implications
This framework has the potential to streamline the design of LLM agents by providing a systematic approach to reliability. It encourages the development of adaptive systems that can dynamically allocate resources based on task requirements, leading to more efficient and effective AI applications.