AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
69 papers today, updated every 8h, 7 days of history
Prediction Is Not Physics: Learning and Evaluating Conserved Quantities in Neural Simulators
Generative Models
Theory
Robotics
- Neural networks can predict motion but often violate conservation laws.
- Different model designs significantly affect the recovery of conserved quantities.
- Training duration and data quality are critical for the performance of polynomial CDNs.
- The structured energy model's advantage decreases in the presence of noise.
Summary
This paper investigates the ability of neural networks to learn and evaluate conserved quantities from physical trajectories, particularly in the context of Hamiltonian systems. The authors highlight a significant gap between the prediction accuracy of diffusion models trained on Hamiltonian trajectories and their ability to preserve conservation laws, with energy standard deviations being orders of magnitude larger than ground truth. The study benchmarks three Hamiltonian systems: projectile motion, pendulum, and spring-mass, using various models including a structured energy model, a black-box Conservation Discovery Network (CDN), and a polynomial CDN. The findings reveal that while neural networks can approximate motion, they often violate conservation principles. The structured energy model performs well on clean data, achieving high R² values, but its advantage diminishes under noise. The black-box CDN shows robustness to noise and performs competitively, while the polynomial CDN's performance is highly dependent on training duration and data availability. The paper concludes that achieving accurate trajectory predictions does not necessarily imply conservation of energy, emphasizing the need for models that can learn invariant properties directly from state observations.
Methodology
The authors employed a structured energy model, a black-box CDN, and a polynomial CDN to learn conserved quantities from trajectories of three Hamiltonian systems. They evaluated the models based on their ability to recover analytical energy values and assessed the impact of training schedules and noise on model performance.
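As a rough illustration of the evaluation described above, the sketch below scores a candidate conserved quantity against the analytical energy with R² and measures the energy standard deviation along a rollout. Everything here is an assumption made for illustration rather than the authors' code: the unit mass and spring constant, the function names, and the use of an explicit Euler rollout as a stand-in for an unconstrained learned predictor.

```python
import math

def exact_rollout(x0, v0, k=1.0, m=1.0, dt=0.01, steps=1000):
    """Closed-form spring-mass trajectory, used here as ground truth."""
    omega = math.sqrt(k / m)
    out = []
    for i in range(steps):
        t = i * dt
        c, s = math.cos(omega * t), math.sin(omega * t)
        out.append((x0 * c + (v0 / omega) * s, -x0 * omega * s + v0 * c))
    return out

def euler_rollout(x0, v0, k=1.0, m=1.0, dt=0.01, steps=1000):
    """Explicit Euler rollout: a hypothetical stand-in for a predictor
    that has no built-in notion of energy conservation."""
    x, v, out = x0, v0, []
    for _ in range(steps):
        out.append((x, v))
        x, v = x + dt * v, v - dt * (k / m) * x
    return out

def energy(x, v, k=1.0, m=1.0):
    """Analytical energy: kinetic plus elastic potential."""
    return 0.5 * m * v * v + 0.5 * k * x * x

def std(values):
    """Population standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((a - mean) ** 2 for a in values) / len(values))

def r_squared(y_true, y_pred):
    """Coefficient of determination of y_pred against y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

if __name__ == "__main__":
    exact = [energy(x, v) for x, v in exact_rollout(1.0, 0.0)]
    approx = [energy(x, v) for x, v in euler_rollout(1.0, 0.0)]
    print("energy std, exact rollout:", std(exact))
    print("energy std, Euler rollout:", std(approx))
```

The Euler rollout stays near the true motion over short horizons, yet its energy drifts steadily. That is the kind of gap between prediction error and conservation that the summary describes.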
Results
The structured energy model achieved R² ≥ 0.9999 on clean data, while the black-box CDN reached R² ≥ 0.996 with alignment loss. The polynomial CDN's performance varied significantly, with R² improving from 0.78 to 0.9998 with extended training. Under 1% Gaussian noise, the CDN outperformed the structured model in two systems, indicating its robustness.
Implications
This research suggests that while neural networks can approximate physical systems, they require careful design and training to ensure adherence to conservation laws. The findings could inform the development of more physically aware neural models and enhance the understanding of invariant properties in dynamical systems.
When Critics Disagree: Adaptive Reward Poisoning Attacks in RIS-Aided Wireless Control System
Reinforcement Learning
Optimization
Theory
- Introduction of Disagreement-Guided Reward Poisoning (DGRP) attack targeting SAC agents in CRNs.
- DGRP exploits substantial disagreement between dual critics to corrupt rewards and misguide learning.
- The attack significantly reduces the performance improvements from RIS, impacting transmission quality.
- DGRP is shown to be more damaging than existing reward poisoning strategies based on periodic timing or exploration triggers.
Summary
This paper addresses the vulnerabilities of learning-based wireless control systems to reward-poisoning attacks, specifically in the context of Cognitive Radio Networks (CRNs) aided by Reconfigurable Intelligent Surfaces (RIS). The authors propose a novel attack strategy called Disagreement-Guided Reward Poisoning (DGRP), which targets a Soft Actor-Critic (SAC) agent tasked with maximizing the long-term data rate for secondary users (SUs). The DGRP attack is adaptive and activates when there is significant disagreement between the SAC's dual critics, particularly in high-leverage and high-uncertainty states. This results in distorted value estimations and leads the policy towards suboptimal actions. The study demonstrates that DGRP significantly undermines the performance benefits typically provided by RIS, degrading transmission quality. The authors also analyze the impact of various attack parameters on the learning process and show that DGRP consistently inflicts greater damage compared to existing baseline methods, emphasizing the need to consider disagreement-aware threats in evaluating the robustness of Deep Reinforcement Learning (DRL) systems in realistic wireless environments.
Methodology
The authors developed the DGRP attack within a CRN environment, utilizing a Soft Actor-Critic (SAC) agent. The attack is designed to corrupt rewards based on the level of disagreement between the agent's critics, particularly in states where the agent's uncertainty is high. The performance of the DGRP attack is compared against baseline methods such as periodic-timing and exploration-triggered attacks to assess its effectiveness.
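The triggering idea can be sketched in a few lines for robustness testing. The summary does not specify how disagreement is measured or how the reward is corrupted, so the absolute-difference measure, the fixed threshold, and the sign-flip corruption below are all assumptions, as are the function names. The periodic trigger mimics the periodic-timing baseline only loosely.

```python
def critic_disagreement(q1, q2):
    """Absolute gap between the two critic estimates for one (state, action).
    Using the absolute difference is an assumption."""
    return abs(q1 - q2)

def disagreement_trigger(q1, q2, threshold):
    """Fire when the critics disagree by more than a (hypothetical) threshold."""
    return critic_disagreement(q1, q2) > threshold

def periodic_trigger(step, period):
    """Baseline-style trigger that fires every `period` steps."""
    return step % period == 0

def corrupt_reward(reward, attack, scale=-1.0):
    """Return the reward the agent would observe.
    The sign flip (scale=-1.0) is an illustrative choice of corruption."""
    return scale * reward if attack else reward

if __name__ == "__main__":
    # (reward, q1, q2) tuples standing in for transitions from a replay buffer.
    transitions = [(1.0, 0.20, 0.25), (1.0, 0.10, 0.90), (0.5, 0.40, 0.42)]
    for step, (r, q1, q2) in enumerate(transitions):
        fired = disagreement_trigger(q1, q2, threshold=0.5)
        print(step, corrupt_reward(r, fired), fired)
```

Because the trigger depends on the critics rather than on the step count, the same attack budget is spent on the states where value estimates are least settled.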
Results
The results indicate that the DGRP attack leads to a significant degradation in the performance of the SAC agent, undermining the expected benefits of RIS in maximizing the data rate for secondary users. The study found that DGRP consistently caused greater performance degradation compared to the baseline methods, demonstrating its effectiveness as a state-adaptive attack.
Implications
The findings underscore the necessity for enhanced security measures in DRL applications, particularly in wireless communication systems. The research highlights the potential for adaptive attacks to exploit vulnerabilities in learning-based systems, suggesting that future work should focus on developing robust defenses against such threats.
Simply Stabilizing the Loop via Fully Looped Transformer
NLP
Large Language Models
Efficient ML
- Introduction of Fully Looped Transformer (FLT) to stabilize training of looped models.
- Identification of training instability issues: gradient oscillation and residual explosion.
- Parameter-free architectural modifications improve training dynamics.
- FLT achieves stable training with up to 12 loop iterations, unlike baseline models.
Summary
The paper introduces the Fully Looped Transformer (FLT), an enhancement of the Looped Transformer architecture aimed at improving training stability and performance without increasing model parameters. The authors identify two main issues causing instability in Looped Transformers: gradient oscillation and residual explosion. To mitigate these, FLT incorporates two parameter-free modifications: the Fully Looped Architecture, which distributes inter-loop signals across all layers to prevent residual explosion, and Attention Injection, which reuses the attention block to control gradient oscillation. These innovations allow FLT to be trained stably with up to 12 loop iterations, significantly outperforming baseline models that collapse under similar conditions. The experiments demonstrate that FLT not only stabilizes training but also enhances average downstream-task performance by up to 13.2%, while providing flexibility in adjusting compute budgets during inference by varying loop iterations.
Methodology
The authors conducted a thorough analysis of the Looped Transformer to diagnose its training instabilities. They proposed the Fully Looped Transformer with two key modifications: Fully Looped Architecture and Attention Injection, both of which are parameter-free. The effectiveness of FLT was evaluated through loop-scaling and ablation experiments to assess stability and performance across various configurations.
Results
The Fully Looped Transformer was able to maintain stable training dynamics with up to 12 loop iterations, where other looped models failed. Additionally, FLT improved average performance on downstream tasks by as much as 13.2%, demonstrating its effectiveness in enhancing model capabilities without increasing parameter count.
Implications
The findings suggest that the Fully Looped Transformer can be a viable alternative for scaling model performance without the need for larger datasets or increased model size. This approach could be particularly beneficial in scenarios where computational resources are limited or where data availability is a concern.
On-Device Continual Learning with Dual-Stage Buffer and Dynamic Loss for Point-of-Care Pneumonia Diagnosis
Computer Vision
Efficient ML
- First application of domain-incremental learning to X-ray pneumonia detection.
- Development of a dual-stage balanced buffer for maintaining class balance in replay.
- Introduction of a dynamic class-weighted loss function to address intra-batch imbalances.
- PneumoNet achieves high accuracy and low forgetting under resource constraints.
Summary
This paper presents PneumoNet, a novel approach to pneumonia detection from chest X-rays using domain-incremental learning (DIL) tailored for resource-limited, on-device applications. Traditional deep learning models for pneumonia detection often falter under domain shifts caused by variations in imaging conditions, patient demographics, and institutional practices. The proposed method addresses these challenges by integrating a lightweight convolutional neural network (CNN) with a dual-stage balanced buffer for class-balanced replay and a dynamic class-weighted loss function to mitigate class imbalance during training. The authors created a custom domain-shifted PneumoniaMNIST dataset to systematically evaluate the model across five realistic domain change scenarios. PneumoNet achieved an accuracy of 86.6% with only 1.4% forgetting, demonstrating its efficiency and robustness in adapting to new data distributions while maintaining high diagnostic performance. This work is significant as it is the first to apply DIL to X-ray pneumonia diagnosis, highlighting the potential for adaptive, privacy-preserving AI solutions in real-world healthcare settings.
Methodology
The methodology involves a lightweight CNN architecture for on-device predictions, a dual-stage balanced buffer that ensures class-balanced replay of past samples, and a dynamic class-weighted loss function that adapts to class imbalances within each training batch. This approach allows the model to continuously learn from new data while preserving knowledge from previous domains.
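One common way to realize a loss that adapts to class imbalance within each batch is to recompute inverse-frequency class weights per batch. The sketch below assumes that scheme, binary labels, and a weighted cross-entropy. The summary does not state the paper's exact weighting formula, so the formula and function names here are hypothetical.

```python
import math
from collections import Counter

def batch_class_weights(labels, num_classes=2):
    """Inverse-frequency weights computed from the current batch only.
    A class that is absent from the batch gets weight 0.0."""
    counts = Counter(labels)
    n = len(labels)
    return {
        c: (n / (num_classes * counts[c]) if counts[c] else 0.0)
        for c in range(num_classes)
    }

def weighted_cross_entropy(probs, labels, weights, eps=1e-12):
    """Weighted binary cross-entropy; probs[i] is the predicted P(class 1)."""
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        p_true = p if y == 1 else 1.0 - p
        total += weights[y] * -math.log(max(p_true, eps))
        weight_sum += weights[y]
    return total / weight_sum

if __name__ == "__main__":
    labels = [1, 1, 1, 0]          # pneumonia-heavy batch
    probs = [0.9, 0.8, 0.7, 0.6]   # model outputs for class 1
    w = batch_class_weights(labels)
    print(w, weighted_cross_entropy(probs, labels, w))
```

With these weights each class contributes the same total weight to the batch loss, so a minority class with one sample counts as much as a majority class with three.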
Results
PneumoNet was evaluated on a custom PneumoniaMNIST dataset simulating five different domain shifts, achieving an accuracy of 86.6% and a forgetting rate of only 1.4%. The model demonstrated superior efficiency compared to existing baselines, making it suitable for deployment in mobile and portable medical devices.
Implications
The findings suggest that PneumoNet can facilitate timely and accurate pneumonia diagnosis in diverse clinical settings, particularly in resource-constrained environments. This approach could enhance the deployment of AI-driven diagnostic tools in real-world healthcare, especially during pandemics or in remote locations.
LLM Benchmark Datasets Should Be Contamination-Resistant
Large Language Models
NLP
Theory
- Benchmark dataset contamination significantly undermines the reliability of LLM evaluations.
- Contamination-resistant datasets should be unlearnable during training but usable for inference.
- The asymmetry in Transformer architectures can be leveraged to support contamination resistance.
- Mathematical advancements are necessary for ensuring interoperability across different LLM architectures.
Summary
This paper addresses the critical issue of benchmark dataset contamination in the evaluation of large language models (LLMs). The authors argue that many benchmark datasets are included in pretraining corpora, leading to contamination that undermines their reliability as measures of model generalization. They propose that benchmark datasets should be contamination-resistant, meaning they should be unlearnable by models during training but still support inference. The paper highlights the prevalence of contamination, discusses the asymmetry between training and inference pipelines in Transformer architectures, and outlines mathematical advancements to ensure interoperability across various LLM architectures. The authors call for the community to adopt contamination-resistant methodologies and integrate them into existing evaluation frameworks to maintain the integrity of LLM benchmarking.
Methodology
The authors analyze the prevalence of benchmark contamination and propose a framework for contamination-resistant datasets. They leverage the architectural asymmetry of Transformers to ensure that benchmark data remains unlearnable during training while still being applicable for inference. The paper also discusses mathematical advancements that facilitate interoperability across various LLM architectures.
Results
The paper provides evidence of widespread contamination in benchmark datasets, with contamination levels reaching up to 91.8% in some cases. It illustrates how even minor contamination can significantly compromise the integrity of evaluation metrics. The proposed methodologies aim to create benchmarks that can reliably assess model generalization without being susceptible to contamination.
Implications
The findings suggest that adopting contamination-resistant benchmarks could lead to more accurate evaluations of LLMs, fostering innovation and trust in model performance. This shift could also influence how future benchmarks are developed and maintained, ensuring they remain effective tools for assessing model capabilities.
Planner-Admissible Graph-PDE Value Extensions for Sparse Goal-Conditioned Planning
Reinforcement Learning
Graph Learning
Theory
- Introduces a planner-admissibility criterion for sparse goal-conditioned planning.
- Demonstrates that AMLE outperforms harmonic averaging in maintaining local greedy orderings.
- Establishes a theoretical certificate linking local value errors to rollout success.
- Empirical results validate the effectiveness of AMLE across various graph configurations.
Summary
This paper addresses the challenge of sparse goal-conditioned planning in reinforcement learning, where only a limited number of cost-to-go labels are available. The author formulates this problem as a graph-PDE Dirichlet extension, aiming to extend labels from a small boundary to unlabelled vertices to facilitate greedy rollouts that reach the goal. The central argument is that planner admissibility is more critical than pointwise error in ensuring successful rollouts. The paper contrasts various methods within the graph p-Laplacian family, highlighting that harmonic averaging (p=2) often fails to maintain local greedy orderings, while higher values of p and the Absolutely Minimal Lipschitz Extension (AMLE) demonstrate superior performance. A key theoretical contribution is a planner-admissibility certificate, which states that if the local value error remains below half the true action gap during the rollout, the goal will be reached. Empirical evaluations on 120 AntMaze graph configurations show that AMLE significantly outperforms harmonic averaging, achieving higher rollout success rates and demonstrating its effectiveness in preserving local action orderings.
Methodology
The paper employs a graph-PDE framework to model the value extension problem, using the operational argmin-Q planner for decision-making. It analyzes the performance of different graph p-Laplacian methods, particularly focusing on harmonic averaging and AMLE, and provides theoretical proofs to establish the conditions under which these methods succeed or fail.
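A minimal sketch of the two extension rules on an unweighted graph may help. Harmonic averaging (p=2) sets each unlabelled vertex to the mean of its neighbours, while the discrete AMLE sets it to the midpoint of the neighbours' maximum and minimum. The toy path graph, the Gauss-Seidel sweeps, and the greedy planner that steps to the lowest-valued neighbour are illustrative assumptions, not the paper's implementation.

```python
def harmonic_rule(neighbor_values):
    """p = 2: mean of the neighbouring values."""
    return sum(neighbor_values) / len(neighbor_values)

def amle_rule(neighbor_values):
    """Discrete AMLE: midpoint of the largest and smallest neighbour."""
    return 0.5 * (max(neighbor_values) + min(neighbor_values))

def extend_values(neighbors, boundary, rule, iters=200):
    """Fill in unlabelled vertices by repeated in-place sweeps."""
    u = {v: boundary.get(v, 0.0) for v in neighbors}
    for _ in range(iters):
        for v in neighbors:
            if v in boundary:
                continue  # labelled vertices keep their cost-to-go
            u[v] = rule([u[w] for w in neighbors[v]])
    return u

def greedy_rollout(neighbors, u, start, goal, max_steps=100):
    """Step to the neighbour with the lowest extended cost-to-go."""
    path, v = [start], start
    while v != goal and len(path) <= max_steps:
        v = min(neighbors[v], key=lambda w: u[w])
        path.append(v)
    return path

if __name__ == "__main__":
    # Path graph 0-1-2-3-4 with labels only at the goal (0) and the far end (4).
    graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    labels = {0: 0.0, 4: 4.0}
    for rule in (harmonic_rule, amle_rule):
        u = extend_values(graph, labels, rule)
        print(rule.__name__, greedy_rollout(graph, u, start=3, goal=0))
```

On this path graph every interior vertex has two neighbours, so both rules give the same linear interpolation. The differences reported in the paper would only show up on graphs with uneven degrees or sparse, irregular boundaries.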
Results
The empirical results indicate that AMLE achieves a rollout success rate of 0.970, compared to 0.584 for harmonic averaging (p=2). The study also shows that AMLE significantly reduces the local neighbor ordering errors and improves the accuracy of action selection in sparse goal-conditioned scenarios.
Implications
The findings suggest that using AMLE for value extension in sparse goal-conditioned planning can lead to more efficient and reliable planning in reinforcement learning applications. This has potential implications for various domains, including robotics and autonomous systems, where effective goal-directed behavior is crucial.
D$^3$-Subsidy: Online and Sequential Driver Subsidy Decision-Making for Large-Scale Ride-Hailing Market
Reinforcement Learning
Generative Models
Optimization
- D3-Subsidy optimizes driver subsidies in ride-hailing markets under strict budget constraints.
- The framework utilizes a diffusion-based model for generating future trajectories from historical data.
- A context-conditioned inverse module translates high-level plans into actionable control signals.
- Real-world testing shows a 1.59% increase in completed rides and a 2.06% increase in GMV.
Summary
The paper addresses the challenge of optimizing driver-side subsidies in ride-hailing platforms like DiDi Chuxing, which operate in dynamic environments requiring a balance between driver supply and passenger demand. The authors propose D3-Subsidy, a hierarchical diffusion-based framework designed for online sequential decision-making that adheres to three critical constraints: responsiveness to stochastic shocks, strict subsidy-rate caps, and low-latency execution at city scale. D3-Subsidy employs a prefix-conditioned diffusion model to generate plausible future trajectories based on historical data, ensuring that the training aligns with the fixed-history nature of online deployment. The framework also includes a context-conditioned inverse module to translate generated plans into actionable city-level control signals. The authors implement a Lagrangian-dual-derived mapping to efficiently manage subsidy-rate caps and enhance scalability. Through extensive offline evaluations and real-world A/B testing, D3-Subsidy demonstrates significant improvements in completed rides and gross merchandise value while maintaining compliance with budget constraints.
Methodology
The authors develop D3-Subsidy, which integrates a prefix-conditioned diffusion model for trajectory generation and a context-conditioned inverse dynamics module for action inference. The framework incorporates a constraint-aware score to ensure budget feasibility and employs a Lagrangian-dual-derived mapping for scalable execution. Multi-city pretraining with parameter-efficient fine-tuning is also utilized for robust transfer across different urban environments.
Results
D3-Subsidy was evaluated using real-world data from three cities, resulting in a 1.59% increase in completed rides and a 2.06% improvement in gross merchandise value, all while adhering to budget constraints. A/B testing in a production environment confirmed these improvements without exceeding operational thresholds.
Implications
The findings suggest that D3-Subsidy can significantly enhance the efficiency of subsidy allocation in ride-hailing platforms, potentially leading to improved service quality and profitability. The framework's scalability and adaptability make it suitable for deployment in various urban settings, addressing the challenges of dynamic demand and driver participation.
Active Context Selection Improves Simple Regret in Contextual Bandits
Theory
Optimization
- Active context selection can significantly reduce simple regret in contextual bandits compared to passive sampling.
- The proposed active sampling strategy achieves a tight regret rate that can improve by Θ(k^{1/4}), where k is the number of contexts.
- The EETC algorithm optimally balances exploration and exploitation when the context distribution is unknown, matching known distribution rates for large horizons.
- The analysis extends to budgeted active sampling, providing insights into the minimum budget required to achieve optimal performance.
Summary
This paper investigates the contextual multi-armed bandit problem, focusing on the scenario where a learner recommends actions based on a finite context space, evaluated through context-weighted simple regret. The authors propose a novel approach that allows the learner to actively select which context to sample from, contrasting with traditional passive sampling methods. They derive tight regret rates for both passive and active sampling strategies, demonstrating that active sampling can significantly reduce regret, achieving improvements proportional to the number of contexts. The paper also explores budgeted active sampling, identifying conditions under which limited active sampling can still yield optimal results. Furthermore, when the context distribution is unknown, the authors introduce the Explore-Explore-Then-Commit (EETC) algorithm, which balances the exploration of context distributions with active allocation, achieving performance close to that of the optimal policy with known distributions. Experimental results validate the theoretical findings, showcasing the effectiveness of active context selection in reducing regret.
Methodology
The authors analyze the contextual bandit problem by characterizing regret rates for both passive and active sampling strategies. They derive optimal allocation policies for known context distributions and extend the analysis to budgeted settings. The EETC algorithm is introduced for scenarios with unknown distributions, balancing exploration and active sampling.
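To make the evaluation criterion concrete, the sketch below computes context-weighted simple regret for a recommended action per context. It assumes the usual reading: the gap between the best and the recommended arm's mean reward, averaged under the context distribution. The function name and the toy numbers are illustrative, and the paper's allocation policies are not reproduced here.

```python
def context_weighted_simple_regret(context_probs, mean_rewards, recommendation):
    """Sum over contexts of P(context) * (best mean - recommended arm's mean).

    context_probs[c]   : probability of context c
    mean_rewards[c][a] : true mean reward of arm a in context c
    recommendation[c]  : arm index the learner recommends for context c
    """
    regret = 0.0
    for c, p in enumerate(context_probs):
        best = max(mean_rewards[c])
        regret += p * (best - mean_rewards[c][recommendation[c]])
    return regret

if __name__ == "__main__":
    probs = [0.9, 0.1]                  # one common and one rare context
    means = [[0.5, 0.7], [0.2, 0.9]]
    # Correct in the common context, wrong in the rare one.
    print(context_weighted_simple_regret(probs, means, [1, 0]))
```

A mistake in a rare context is discounted by that context's probability. This is why the sampling allocation across contexts matters, and why passive sampling, which follows the context distribution, need not be the best use of a fixed budget.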
Results
The study shows that active context selection can achieve a tighter regret rate compared to passive sampling, with potential improvements of Θ(k^{1/4}). The EETC algorithm performs comparably to the optimal policy with known distributions for large horizons, and the analysis of budgeted active sampling reveals conditions for achieving optimal regret rates.
Implications
The findings suggest that actively selecting contexts can enhance decision-making in various applications, such as clinical trials and targeted marketing strategies, where understanding subpopulation dynamics is crucial. The results also provide a framework for designing efficient experimental protocols that leverage active sampling.
INSHAPE: Instance-Level Shapelets for Interpretable Time-Series Classification
Time Series
Interpretability
- INSHAPE discovers instance-specific shapelets that improve classification performance and interpretability.
- The framework models temporal dependencies among shapelets, addressing limitations of existing methods.
- It provides both local and global interpretability by aggregating instance-level shapelets into population-level insights.
- Extensive experiments show INSHAPE's superior performance on benchmark datasets compared to traditional shapelet methods.
Summary
The paper introduces INSHAPE, a novel framework for time-series classification (TSC) that focuses on discovering instance-level shapelets, discriminative temporal patterns tailored to individual time series. Traditional methods primarily rely on population-level shapelets, which often misalign with specific instances and overlook temporal dependencies among patterns. INSHAPE addresses these limitations by identifying non-overlapping segments within each time series and modeling their temporal interactions. This approach not only enhances local interpretability by providing clear insights into specific instances but also aggregates these instance-level shapelets into population-level shapelets, bridging local and global interpretability. The authors conducted extensive experiments on 128 UCR and 30 UEA benchmark datasets, demonstrating that INSHAPE consistently outperforms state-of-the-art shapelet-based methods while offering more intuitive and interpretable results.
Methodology
INSHAPE partitions each time series into variable-length segments and employs a gating mechanism, which includes an amortized selector and a temporal predictor, to identify the most predictive segments for classification. This method captures complex temporal interactions and provides clear local explanations for predictions.
Results
The experiments revealed that INSHAPE outperforms existing shapelet-based models across multiple benchmark datasets, achieving higher classification accuracy while offering more interpretable insights into the decision-making process.
Implications
The findings suggest that INSHAPE can be particularly valuable in high-stakes domains such as healthcare, where understanding both global patterns and individual case-specific insights is crucial for decision-making.
LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models
Large Language Models
Efficient ML
Optimization
- LEAP introduces a new parameterization for unstructured pruning that is scalable and tractable for large language models.
- The method improves zero-shot accuracy significantly compared to existing layer-wise pruning methods.
- LEAP operates on frozen pretrained weights, allowing for easier deployment and integration with fine-tuning.
- The framework is validated across various LLM families and demonstrates consistent performance improvements at high sparsity levels.
Summary
The paper introduces LEAP, a novel framework for learnable end-to-end adaptive pruning of large language models (LLMs) that addresses the limitations of existing layer-wise pruning methods. Traditional approaches, such as those based on the Optimal Brain Surgeon principle, often lead to significant accuracy loss, especially under high sparsity conditions. LEAP overcomes this by employing a per-weight Bernoulli-via-Gumbel-sigmoid parameterization, which allows for tractable end-to-end learning of unstructured masks. This method is particularly advantageous as it scales with the number of weights rather than the combinatorial explosion of valid patterns, making it feasible for large models. The authors demonstrate that LEAP can effectively prune LLMs while maintaining high accuracy, achieving improvements in zero-shot accuracy across multiple model families and sparsity levels. The framework operates on frozen pretrained weights, simplifying deployment and compatibility with future fine-tuning stages.
Methodology
LEAP utilizes a per-weight Bernoulli-via-Gumbel-sigmoid relaxation to learn unstructured masks for LLMs. This approach replaces the categorical-over-patterns parameterization used in previous methods, allowing for end-to-end differentiability and efficient optimization. The framework incorporates a combination of techniques, including Wanda-based initialization, a global sparsity regularizer, and a magnitude-aware term to stabilize the optimization process.
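The per-weight relaxation can be illustrated with the standard Gumbel-sigmoid (binary concrete) trick: add logistic noise to a keep-logit, divide by a temperature, and apply a sigmoid. The sketch below is a plain-Python illustration under that assumption. The function names, the temperature value, and the simple sparsity estimate are hypothetical and do not reproduce LEAP's initialization or regularizers.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def gumbel_sigmoid(logit, tau=1.0, rng=random):
    """Relaxed Bernoulli sample in [0, 1] for one weight's keep-logit.
    The difference of two Gumbel samples is Logistic(0, 1), drawn here
    directly as log(u) - log(1 - u)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    noise = math.log(u) - math.log(1.0 - u)
    return sigmoid((logit + noise) / tau)

def sample_soft_mask(logits, tau=1.0, rng=random):
    """One relaxed mask entry per weight; cost is linear in the weight count."""
    return [gumbel_sigmoid(l, tau, rng) for l in logits]

def expected_sparsity(logits):
    """Expected fraction of pruned weights under the keep probabilities."""
    return 1.0 - sum(sigmoid(l) for l in logits) / len(logits)

if __name__ == "__main__":
    rng = random.Random(0)
    logits = [3.0, 0.0, -3.0, 1.5]   # hypothetical learned keep-logits
    weights = [0.8, -0.1, 0.05, 0.4]  # frozen pretrained weights
    mask = sample_soft_mask(logits, tau=0.5, rng=rng)
    print([w * m for w, m in zip(weights, mask)])
    print("expected sparsity:", expected_sparsity(logits))
```

Lower temperatures push the samples toward 0 or 1, which narrows the gap between the relaxed mask used in training and the hard mask applied at deployment.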
Results
LEAP was tested on five LLM families ranging from 0.5B to 8B parameters at 50% and 60% sparsity. It achieved an average improvement of +2.59 points in six-task zero-shot accuracy over ADMM, the best layer-wise baseline, with a peak improvement of +5.40 points on the LLaMA-3.2 1B model at 60% sparsity.
Implications
The development of LEAP has significant implications for the deployment of large language models, particularly in resource-constrained environments. By enabling effective pruning without sacrificing accuracy, LEAP can facilitate the use of LLMs in applications where computational efficiency is critical, such as mobile devices or edge computing.
Federated Martingale Posterior Sampling
Federated Learning
Theory
Optimization
- Introduces Federated Martingale Posterior (FMP) sampling for federated Bayesian neural networks.
- FMP allows clients to upload compressed data embeddings, reducing the need for full dataset sharing.
- Demonstrates improved calibration and predictive performance over consensus-style baselines.
- Validates the method on multiple datasets, showing its effectiveness in both homogeneous and heterogeneous client scenarios.
Summary
This paper addresses the challenges of federated Bayesian neural networks, particularly the difficulty in specifying meaningful priors and likelihoods for overparameterized models. The authors introduce Federated Martingale Posterior (FMP) sampling, a novel protocol that allows clients to upload compressed data embeddings instead of full datasets, enabling a centralized predictive sampler to operate efficiently. The FMP method is designed to approximate the centralized martingale posterior by using a shared set-transformer predictor, which aggregates client embeddings and generates predictive samples. The authors validate their approach through experiments on standard datasets (MNIST, CIFAR-10, CIFAR-100), demonstrating that FMP closely matches centralized methods while significantly improving calibration compared to consensus-style baselines. This work highlights the potential of FMP sampling in enhancing federated learning frameworks by reducing data transmission requirements and improving predictive performance.
Methodology
The authors propose a one-shot federated sampling protocol where each client compresses its local dataset into a small set of trainable embeddings using an attention-based pooling block. These embeddings are uploaded to a central server, which aggregates them and runs a predictive sampler to generate samples from the martingale posterior. A meta-training procedure is also designed to align FMP samples with those from the centralized martingale posterior across related tasks.
Results
Experiments conducted on MNIST, CIFAR-10, and CIFAR-100 indicate that the FMP method closely approximates the performance of centralized sampling methods. Additionally, FMP shows significant improvements in calibration metrics compared to traditional consensus federated methods, demonstrating its effectiveness in practical applications.
Implications
The proposed FMP sampling method has the potential to enhance federated learning systems by enabling better predictive performance while minimizing data sharing. This can lead to more privacy-preserving machine learning applications, particularly in sensitive domains where data confidentiality is paramount.
Training data attribution in diffusion models via mirrored unlearning and noise-consistent skew
Generative Models
Interpretability
- Introduction of MUCS, a novel method for TDA in diffusion models.
- MUCS combines mirrored unlearning with noise-consistent skew for improved reliability.
- Demonstrated significant performance improvements over existing TDA methods.
- Analysis of influential instance overlap and the effectiveness of ensemble TDA approaches.
Summary
This paper addresses the challenge of training data attribution (TDA) in generative models, particularly diffusion models, which has been hindered by issues of reliability and robustness in existing methods. The authors propose a novel approach called mirrored unlearning and noise-consistent skew (MUCS) that enhances TDA by fine-tuning a second model using bounded mirrored gradient ascent and measuring the normalized skew of this model against the original using consistent noise samples. The proposed method is shown to outperform existing TDA techniques across three datasets, demonstrating significant improvements in performance. The authors also investigate the impact of design choices on the method's effectiveness and explore the overlap of influential instances in generated items, as well as the potential benefits of ensemble approaches for TDA. The findings suggest broader implications for unlearning methodologies and tasks that involve comparing diffusion losses.
Methodology
The MUCS method involves a two-step process: first, unlearning a generated item to create an unlearned model, and second, averaging noise-consistent loss skews to compute attribution scores. The method employs a null loss reference to guide the unlearning process and utilizes a combination of retention and forgetting losses to enhance control over the unlearning mechanism.
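The second step can be illustrated with a toy sketch. It is not the authors' implementation: the normalized-difference form of the skew, the fixed noise level `alpha_bar`, and the two stand-in predictors `original_model` and `unlearned_model` are all assumptions made for illustration. The one idea it tries to show is that the same noise draw is reused for both models, so the loss difference is not dominated by sampling noise.

```python
import numpy as np

def denoising_loss(predict_noise, x0, noise, alpha_bar):
    """Per-item noise-prediction loss at one fixed noise level (toy setup)."""
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return np.mean((predict_noise(x_t) - noise) ** 2, axis=-1)

def noise_consistent_skew(original, unlearned, x0, n_draws=64, alpha_bar=0.5, seed=0):
    """Average normalized loss change per item, reusing each noise draw for both models."""
    rng = np.random.default_rng(seed)
    skews = []
    for _ in range(n_draws):
        noise = rng.standard_normal(x0.shape)  # one draw, shared by both models
        loss_orig = denoising_loss(original, x0, noise, alpha_bar)
        loss_unl = denoising_loss(unlearned, x0, noise, alpha_bar)
        # Assumed normalization: relative change, bounded in [-1, 1].
        skews.append((loss_unl - loss_orig) / (loss_unl + loss_orig + 1e-12))
    return np.mean(skews, axis=0)

# Hypothetical stand-ins for the original and the unlearned noise predictors.
original_model = lambda x_t: 0.7 * x_t
unlearned_model = lambda x_t: 0.7 * x_t + 0.2 * np.tanh(x_t)

train_items = np.random.default_rng(1).standard_normal((5, 8))
scores = noise_consistent_skew(original_model, unlearned_model, train_items)
same_model_scores = noise_consistent_skew(original_model, original_model, train_items)
print(scores)
```

Under this assumed scoring, training items whose loss rises most after unlearning a generated item would receive the highest attribution scores.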
Results
MUCS consistently outperformed existing TDA methods across three different datasets, showcasing its robustness and reliability. The analysis revealed that design choices significantly affect performance, and the method's ability to measure the overlap of influential instances provided deeper insights into the generative process.
Implications
The findings from this research could lead to more effective TDA applications in generative modeling, enhancing interpretability and enabling practical uses such as data selection and leakage estimation. Additionally, the insights into unlearning processes may inform future research in related fields.
Scale Determines Whether Language Models Organize Representation Geometry for Prediction
NLP
Large Language Models
Theory
- Introduction of Subspace PGA, a metric for assessing the alignment of representation geometry with predictive functions.
- Demonstration that predictive organization in language models is scale-dependent, with smaller models losing this organization in later layers as training progresses.
- Identification of a capacity trade-off where dominant directions in smaller models deviate from the readout subspace, masking predictive structure.
- Large models maintain predictive organization throughout their layers, contrasting with the detour observed in smaller models.
Summary
This paper investigates how the geometric organization of representations in language models influences their predictive capabilities, introducing a new metric called Subspace PGA. The study reveals that the distance structure of intermediate layers in language models is significantly organized for prediction, with this organization being scale-dependent. Through experiments on seven Pythia models and three cross-family models, the author finds that smaller models (with dimensions d ≤ 1024) progressively lose this predictive organization in later layers during training, while larger models (d ≥ 2048) maintain it consistently. The loss of organization in smaller models is attributed to a capacity trade-off where dominant directions in the representation space rotate away from the readout subspace, masking the underlying predictive structure. The findings highlight that neither spectral metrics nor loss curves adequately capture this distinction, emphasizing the importance of model scale in determining how representation geometry is organized for prediction.
Methodology
The paper introduces Subspace PGA, which involves performing Singular Value Decomposition (SVD) on the unembedding matrix to identify the readout subspace. The distance structure of hidden states is then projected onto this subspace and compared against random subspaces of equal dimensionality to quantify how well the geometry aligns with predictive functions.
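A minimal sketch of this kind of comparison follows. The summary does not give the exact alignment statistic, so the one used here (correlation between pairwise distances in the projected space and pairwise distances between logit vectors) is an assumption, as are the function names, the subspace dimension `k`, and the toy low-rank unembedding matrix.

```python
import numpy as np

def pairwise_dists(X):
    """Upper-triangle Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    return d[np.triu_indices(len(X), k=1)]

def subspace_alignment_z(hidden, unembed, k=4, n_random=200, seed=0):
    """z-score of the readout subspace's alignment against random k-dim subspaces."""
    rng = np.random.default_rng(seed)
    # Right singular vectors of the (vocab x d) unembedding live in hidden space.
    _, _, vt = np.linalg.svd(unembed, full_matrices=False)
    readout = vt[:k].T                                # (d, k) orthonormal basis
    target = pairwise_dists(hidden @ unembed.T)       # assumed reference: logit distances

    def score(basis):
        return np.corrcoef(pairwise_dists(hidden @ basis), target)[0, 1]

    observed = score(readout)
    null = []
    for _ in range(n_random):
        q, _ = np.linalg.qr(rng.standard_normal((hidden.shape[1], k)))
        null.append(score(q))                         # random subspace of equal dimension
    null = np.array(null)
    return (observed - null.mean()) / null.std()

# Toy data: a rank-4 unembedding, so prediction depends on 4 hidden directions only.
rng = np.random.default_rng(1)
d, vocab, k = 64, 100, 4
B, _ = np.linalg.qr(rng.standard_normal((d, k)))
A, _ = np.linalg.qr(rng.standard_normal((vocab, k)))
unembed = 3.0 * A @ B.T
hidden = rng.standard_normal((40, d))
z = subspace_alignment_z(hidden, unembed, k=k)
print(z)
```

In this toy case the readout subspace explains the logit distances exactly, so the z-score is large. On real hidden states the score would depend on the layer and the model.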
Results
The results indicate that intermediate geometry is organized for prediction, with peak z-scores reaching 9-24 in mid layers. Smaller models lose this organization in late layers, with z-scores dropping significantly, while larger models maintain alignment throughout. The study also finds that removing dominant directions in smaller models can restore predictive alignment.
Implications
These findings suggest that the scale of language models plays a crucial role in their predictive performance and representation organization, which could inform future model design and training strategies. Understanding the geometric properties of representations may lead to improved architectures that better leverage predictive capabilities.
Dimension-Free Convergence of Discrete Diffusion Models: Adjoint Equations Induce the Right Space
Generative Models
Theory
- Establishes the first convergence bounds independent of state space size S, applicable to masked distributions.
- Unified derivation covering all integral probability metrics (IPMs) based on a single rate-matrix condition.
- Introduces novel techniques, including adjoint equations in the space of observables and coupling arguments.
- Framework extends beyond convergence bounds, offering a toolkit for broader theoretical analyses.
Summary
This paper addresses the limitations of existing convergence theories for discrete diffusion models, which are widely used in generative modeling across various domains, including language and vision. The authors propose a unified framework based on adjoint equations that provides dimension-free convergence guarantees in any integral probability metric (IPM). This is significant as previous analyses often exhibited S-dependence, particularly problematic in large vocabulary tasks. The framework is applicable to both masked and uniform priors and relies on a single standard rate-matrix regularity assumption, accommodating time-inhomogeneous schedules. The authors introduce four novel techniques: working in the space of observables via adjoint equations, a regularity analysis yielding bounds on any IPM, a coupling argument that eliminates S-dependence under uniform transitions, and a score-marginal cancellation technique that addresses masked transitions. These innovations collectively enhance the theoretical understanding of discrete diffusion models and provide a versatile toolkit for future research.
Methodology
The authors develop a framework based on adjoint equations, specifically the Kolmogorov backward equation, to analyze discrete diffusion models. They employ techniques such as regularity analysis, coupling arguments, and score-marginal cancellation to derive dimension-free convergence bounds applicable to various integral probability metrics.
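For orientation, the two standard objects named here can be written in textbook form. The notation is an assumption and may differ from the paper's: $Q_t$ is the time-dependent rate matrix of a continuous-time Markov chain on a finite state space (rows summing to zero), $f$ is an observable, and $\mathcal{F}$ is the function class that defines the metric.

```latex
% Kolmogorov backward equation for the observable u(t, x):
u(t, x) = \mathbb{E}\left[ f(X_T) \mid X_t = x \right],
\qquad
\partial_t u(t, x) = -\sum_{y} Q_t(x, y)\, u(t, y),
\qquad
u(T, x) = f(x).

% Integral probability metric generated by the function class \mathcal{F}:
d_{\mathcal{F}}(p, q) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{p}[f] - \mathbb{E}_{q}[f] \right|.
```

Because an IPM is a supremum over observables, a bound on how $u$ evolves for every $f \in \mathcal{F}$ translates into a bound on $d_{\mathcal{F}}$. That is presumably why working in the space of observables is convenient.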
Results
The framework successfully establishes convergence bounds that are entirely independent of the state space size S and applicable to both masked and uniform priors. It also demonstrates compatibility with time-inhomogeneous rates, marking a significant advancement over previous analyses that were limited by S-dependence and specific metric restrictions.
Implications
The findings have significant implications for generative modeling, particularly in language tasks where vocabulary sizes are large. The dimension-free convergence guarantees enhance the reliability of discrete diffusion models, potentially leading to improved performance in applications such as language modeling, image generation, and other discrete data synthesis tasks.
Learning over Positive and Negative Edges with Contrastive Message Passing
Graph Learning
Theory
- Negative edges can provide significant information gain under specific graph conditions.
- Contrastive Message Passing (CMP) effectively integrates both positive and negative edges in GNNs.
- CMP outperforms standard GNNs and contrastive learning methods in low-label scenarios.
- Theoretical analysis provides guidance on when to utilize negative edges for improved performance.
Summary
This paper addresses the limitations of conventional graph neural networks (GNNs) that primarily utilize positive edges for message passing, neglecting the informative potential of negative edges. The authors conduct a theoretical analysis demonstrating that negative edges can provide significant information gain in scenarios characterized by high homophily, high edge density, and low label rates. To leverage this insight, they propose Contrastive Message Passing (CMP), a novel architecture that integrates both positive and negative edges into the message passing process. CMP applies similarity-preserving transformations to positively connected nodes and dissimilarity-inducing transformations to negatively connected nodes, enhancing the model's ability to learn effective representations. The authors validate CMP through extensive experiments on simulated and real-world datasets, showing that it consistently outperforms standard GNNs and other contrastive learning methods, particularly in low-label settings where negative edges are informative.
Methodology
The authors introduce Contrastive Message Passing (CMP), which incorporates contrastive dynamics directly into the message passing architecture of GNNs. CMP reweights the negative eigenvalues of weight matrices to achieve differential transformations for positively and negatively connected nodes. The methodology includes an information-theoretic analysis of Stochastic Block Models to determine the conditions under which negative edges are informative.
Results
CMP demonstrates significant improvements over standard GNNs, achieving up to 25.49% average improvement at the smallest label rates and 14.78% overall improvement. When compared to contrastive learning methods, CMP achieves a 22.95% average improvement. The results validate the theoretical findings, indicating that CMP effectively leverages negative edge information in scenarios predicted to be beneficial.
Implications
The findings suggest that incorporating negative edges can enhance the performance of GNNs in specific contexts, providing a framework for practitioners to determine when to utilize negative edges. This could lead to more robust models in applications such as social network analysis, recommendation systems, and any domain where relationships can be both positive and negative.
Content-Style Identification via Differential Independence
Generative Models
Computer Vision
Theory
- Introduces content-style differential independence (CSDI) for identifying content and style variables in unpaired multi-domain data.
- Imposes a blockwise orthogonality constraint on the Jacobian to achieve identifiability without requiring statistical independence.
- Develops a scalable implementation for high-dimensional data using a multi-domain GAN framework.
- Demonstrates practical benefits in counterfactual generation and domain translation across various datasets.
Summary
This paper addresses the challenge of identifying content and style variables in multi-domain data without relying on paired samples, which is often a limitation in practical applications. The authors introduce a novel concept called content-style differential independence (CSDI), which allows for the identification of these variables even when they are dependent and the Jacobian is dense. By imposing a blockwise orthogonality constraint on the Jacobian subspaces associated with content and style, the authors demonstrate that infinitesimal variations in these variables can be orthogonal on the data manifold. This approach circumvents the need for strict statistical independence or sparse Jacobian assumptions that are common in existing methods. To facilitate high-dimensional generative models, the authors develop a stochastic regularizer based on numerical Jacobian approximation, enabling scalable training for tasks like high-resolution image generation. The effectiveness of their method is validated through experiments on multiple datasets, showcasing improvements in counterfactual generation and domain translation.
Methodology
The authors propose a new structural condition called content-style differential independence (CSDI), which requires that variations in content and style induce orthogonal directions on the data manifold. They operationalize this condition through a blockwise orthogonality constraint on the Jacobian subspaces. Additionally, they implement a stochastic regularizer based on numerical Jacobian approximation to support scalable training in high-dimensional settings.
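A numerical-Jacobian regularizer of this flavor can be sketched as follows. This is an illustrative stand-in, not the paper's regularizer: the finite-difference step `eps`, the random-probe estimator, and the two toy generators are assumptions. It relies on the fact that, for independent standard normal probes v and w, the expectation of (v^T J_c^T J_s w)^2 equals the squared Frobenius norm of J_c^T J_s, which is zero exactly when the content and style Jacobian blocks are orthogonal.

```python
import numpy as np

def orthogonality_penalty(g, c, s, eps=1e-4, n_probes=16, seed=0):
    """Stochastic estimate of ||J_c^T J_s||_F^2 using finite-difference probes."""
    rng = np.random.default_rng(seed)
    base = g(c, s)
    total = 0.0
    for _ in range(n_probes):
        v = rng.standard_normal(c.shape)             # random content direction
        w = rng.standard_normal(s.shape)             # random style direction
        jc_v = (g(c + eps * v, s) - base) / eps      # approximates J_c v
        js_w = (g(c, s + eps * w) - base) / eps      # approximates J_s w
        total += float(jc_v @ js_w) ** 2
    return total / n_probes

# Hypothetical toy generators mapping (content, style) to a data vector.
separated = lambda c, s: np.concatenate([np.tanh(c), np.tanh(s)])  # disjoint output coordinates
entangled = lambda c, s: np.tanh(c + s)                            # shared output coordinates

c0 = np.array([0.1, -0.3, 0.5])
s0 = np.array([0.2, 0.4, -0.1])
penalty_separated = orthogonality_penalty(separated, c0, s0)
penalty_entangled = orthogonality_penalty(entangled, c0, s0)
print(penalty_separated, penalty_entangled)
```

Only forward evaluations of the generator are needed, which is what makes a probe-based penalty of this kind affordable in high dimensions.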
Results
The proposed method successfully identifies content and style variables from unpaired multi-domain data, achieving identifiability even under conditions where content and style are dependent. The experiments validate the identifiability analysis and demonstrate significant improvements in tasks such as counterfactual generation and domain translation, particularly in high-dimensional contexts.
Implications
The findings of this research have significant implications for various applications, including image translation, domain generalization, and causal discovery, particularly in scenarios where paired data is unavailable. The proposed method can enhance the performance of generative models in real-world applications, such as medical imaging and artistic style transfer.
A Bitter Lesson for Data Filtering
NLP
Large Language Models
Theory
- Data filtering may not be necessary for large-scale model pretraining.
- Sufficiently large models can benefit from low-quality or distractor data.
- The full Common Crawl dataset outperforms filtered versions when models are adequately trained.
- Scaling laws predict compute requirements for optimal performance without filtering.
Summary
This paper investigates the necessity of data filtering in large model pretraining, particularly in high-compute, data-scarce regimes. Contrary to the prevailing belief that filtering for high-quality data is essential, the authors present experimental evidence suggesting that, with sufficient computational resources, the optimal approach may be to use all available data without filtering. The study reveals that large models can tolerate and even benefit from low-quality or distractor data, challenging conventional data selection strategies that prioritize high-quality datasets. Through extensive scaling studies, the authors demonstrate that the full Common Crawl dataset outperforms filtered subsets when models are adequately trained. They also establish scaling laws that predict the compute requirements for optimal performance without filtering, indicating that as compute increases, the advantages of filtering diminish. The findings suggest a paradigm shift in data handling for model training, advocating for less aggressive filtering and highlighting the robustness of large models to noisy data.
Methodology
The authors conducted scaling studies by comparing the performance of large models trained on the full Common Crawl dataset against various filtered versions. They manipulated model size and training steps to assess the impact of data filtering on performance, while also exploring the robustness of models to irrelevant or noisy data through experiments with randomly generated strings and shuffled documents.
Results
The experiments revealed that the full Common Crawl dataset consistently outperformed filtered datasets when models were sufficiently large and trained for adequate durations. The authors found that large models could extract useful information from noisy data, and established a predictable relationship between pool size, training steps, and model size that informs scaling laws for optimal performance without filtering.
Implications
These findings suggest a reevaluation of data filtering practices in machine learning, particularly for large language models. The results advocate for leveraging larger datasets without aggressive filtering, which could lead to more efficient training processes and improved model performance in various applications.
BrainDyn: A Sheaf Neural ODE for Generative Brain Dynamics
Generative Models
Graph Learning
Time Series
- Introduction of BrainDyn, a sheaf-based neural ODE framework for modeling brain dynamics.
- Utilization of learnable restriction maps for expressive interaction modeling between brain regions.
- Capability to generate brain-like trajectories that reflect complex neural interactions.
- Comprehensive evaluation across multiple modalities, achieving strong performance in modeling neural dynamics.
Summary
The paper introduces BrainDyn, a novel sheaf neural ordinary differential equation (neural ODE) model designed to generate brain-like dynamic activity. Traditional neural network models, such as large language models and recurrent neural networks, often overlook the anatomical organization of the brain, leading to outputs that do not align with specific brain regions. In contrast, BrainDyn employs a sheaf-based approach that incorporates structured brain graphs, allowing for more expressive message passing between neuronal units. The model utilizes long short-term memory (LSTM) networks to encode the activity history of each brain region, producing hidden states that are transformed through learnable restriction maps into edge-specific shared spaces. This architecture is governed by a sheaf Laplacian that facilitates communication between nodes while preserving the unique characteristics of each region. The authors evaluated BrainDyn on various datasets, including resting-state fMRI, scalp EEG with focal epilepsy, and simulated neural activity, demonstrating its strong forecasting capabilities and effectiveness in supporting downstream tasks such as perturbation prediction. The results indicate that BrainDyn can generate synthetic data that closely reflects real neural dynamics, making it a valuable tool for analyzing brain activity and testing perturbations.
Methodology
BrainDyn employs a sheaf neural ODE framework that integrates long short-term memory (LSTM) networks to encode the activity history of brain regions. It utilizes learnable restriction maps to transform node features into edge-specific shared spaces, facilitating structured message passing. The sheaf Laplacian is derived from these maps to govern the continuous-time evolution of neuronal activity.
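The construction of a sheaf Laplacian from restriction maps can be sketched in a few lines. This is a generic illustration rather than the BrainDyn architecture: the maps here are fixed matrices instead of learned functions of LSTM hidden states, the dynamics are plain linear diffusion integrated with explicit Euler, and the three-node path graph is a made-up example.

```python
import numpy as np

def sheaf_laplacian(n_nodes, dim, edges, maps):
    """Build L = delta^T delta, where maps[(e, u)] sends node u's state into edge e's space."""
    delta = np.zeros((len(edges) * dim, n_nodes * dim))
    for e, (u, v) in enumerate(edges):
        rows = slice(e * dim, (e + 1) * dim)
        delta[rows, u * dim:(u + 1) * dim] = maps[(e, u)]
        delta[rows, v * dim:(v + 1) * dim] = -maps[(e, v)]   # disagreement measured on the edge
    return delta.T @ delta

def diffuse(x0, laplacian, dt=0.05, steps=2000):
    """Explicit Euler integration of dx/dt = -L x."""
    x = x0.copy()
    for _ in range(steps):
        x = x - dt * (laplacian @ x)
    return x

edges = [(0, 1), (1, 2)]          # toy path graph over three regions
dim = 2
rng = np.random.default_rng(0)
random_maps = {(e, n): rng.standard_normal((dim, dim))
               for e, pair in enumerate(edges) for n in pair}
identity_maps = {key: np.eye(dim) for key in random_maps}

L_random = sheaf_laplacian(3, dim, edges, random_maps)
L_identity = sheaf_laplacian(3, dim, edges, identity_maps)

x0 = rng.standard_normal(3 * dim)
x_final = diffuse(x0, L_identity)
print(x_final.reshape(3, dim))
```

With identity maps the operator reduces to the ordinary graph Laplacian and diffusion drives all nodes to a common state. Non-trivial maps let each edge compare its endpoints in its own shared space, so nodes can agree on an edge without becoming identical.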
Results
BrainDyn was evaluated on the Philadelphia Neurodevelopmental Cohort (PNC) resting-state fMRI dataset, the Temple University Hospital Seizure (TUSZ) EEG dataset, and simulated activity from the NEST spiking network simulator. The model demonstrated strong forecasting abilities across these modalities and effectively captured the dynamics of neural activity, supporting downstream tasks such as perturbation prediction.
Implications
BrainDyn offers a principled approach to modeling brain dynamics, potentially serving as a valuable resource for generating synthetic neural data, analyzing brain activity under various conditions, and providing insights into the underlying generative dynamics of brain functions.
DeRegiME: Deep Regime Mixtures for Probabilistic Forecasting under Distribution Shift
Time Series
- DeRegiME separates latent uncertainty regimes from the underlying signal, improving interpretability.
- Utilizes a sparse variational Gaussian process with a nonstationary regime-mixing kernel.
- Demonstrates significant improvements in predictive performance over existing models across multiple datasets.
- Captures complex distribution shifts in time-series data, including abrupt and gradual changes.
Summary
The paper introduces DeRegiME (Deep Regime Mixture of Experts), a novel probabilistic forecasting model designed to handle distribution shifts in time-series data. DeRegiME separates latent uncertainty regimes from the underlying signal and assigns forecast locations to learned recurring regimes using a sparse variational Gaussian process (GP). This approach utilizes a nonstationary regime-mixing kernel and a Student-t likelihood to combine per-regime sub-kernels and noise processes through a shared gating mechanism. The model addresses limitations of existing neural forecasters by providing an interpretable mean-residual-noise decomposition, which reveals the regime structure of residual uncertainty. The authors demonstrate that DeRegiME effectively captures abrupt, gradual, and horizon-dependent distribution shifts, yielding significant improvements in predictive performance across various benchmarks. The model's architecture allows for a flexible representation of uncertainty states, enhancing interpretability without requiring auxiliary clustering objectives. Overall, DeRegiME shows promise in advancing probabilistic time-series forecasting under challenging conditions.
Methodology
The methodology involves a deep forecasting architecture that integrates a sparse variational Gaussian process with a regime-mixing kernel and Student-t likelihood. The model employs a gating mechanism to assign probabilities over candidate regimes, allowing for a flexible representation of uncertainty. The predictive density is computed by marginalizing over both the residual and the unobserved regime label, enabling the model to adapt to different uncertainty states across forecast horizons.
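Marginalizing over the regime label amounts to evaluating a gated mixture density. The sketch below shows only that step, under simplifying assumptions: the forecast location `loc` is treated as fixed (the Gaussian process residual is not marginalized), each regime contributes a Student-t noise model with its own scale and degrees of freedom, and the gate is a softmax over hypothetical logits.

```python
import math
import numpy as np

def student_t_logpdf(y, loc, scale, df):
    """Log density of a location-scale Student-t distribution."""
    z = (y - loc) / scale
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - (df + 1) / 2 * np.log1p(z ** 2 / df))

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def regime_mixture_nlpd(y, loc, gate_logits, scales, dfs):
    """Negative log predictive density after summing out the regime label."""
    y = np.asarray(y, dtype=float)
    log_w = np.log(softmax(np.asarray(gate_logits, dtype=float)))
    log_terms = np.stack([lw + student_t_logpdf(y, loc, s, nu)
                          for lw, s, nu in zip(log_w, scales, dfs)])
    m = log_terms.max(axis=0)                      # log-sum-exp for numerical stability
    return -(m + np.log(np.exp(log_terms - m).sum(axis=0)))

# Hypothetical two-regime example: a calm regime and a volatile one.
gate_logits = [1.0, -0.5]
scales = [0.5, 2.0]
dfs = [8.0, 4.0]
print(regime_mixture_nlpd(0.3, 0.0, gate_logits, scales, dfs))

# Sanity check: the mixture density integrates to roughly one.
grid = np.linspace(-500.0, 500.0, 200001)
density = np.exp(-regime_mixture_nlpd(grid, 0.0, gate_logits, scales, dfs))
total_mass = float(density.sum() * (grid[1] - grid[0]))
```

Averaging this quantity over held-out observations gives the NLPD metric reported in the results.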
Results
DeRegiME achieved a 20.3% improvement in negative log predictive density (NLPD) compared to the strongest baseline model, along with parallel gains of 3.0% in Continuous Ranked Probability Score (CRPS) and 4.7% in Mean Squared Error (MSE). These improvements were consistent across ten benchmarks, demonstrating the model's robustness in handling various types of distribution shifts.
Implications
The findings suggest that DeRegiME can be effectively applied in scenarios where time-series data exhibit complex distribution shifts, such as financial forecasting, climate modeling, and any domain requiring robust probabilistic forecasting. Its interpretability and ability to capture regime structures can enhance decision-making processes in uncertain environments.
Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models
Large Language Models
Theory
Efficient ML
- Introduction of FLAME framework for automated benchmark generation.
- Generation of benchmarks with broad coverage and rich metadata.
- Expert-reviewed benchmarks demonstrate lower error rates than existing benchmarks.
- FLAME reveals fine-grained performance differences across models.
Summary
The paper presents a novel framework named Fine-grained, LArge-scale Model Evaluation (FLAME) aimed at generating comprehensive evaluation benchmarks for foundation models. Traditional benchmarks often rely on aggregate scores that fail to capture the nuanced capabilities of models, leading to suboptimal model selection and development. FLAME addresses these limitations by automating the benchmark generation process using external knowledge sources, such as textbooks, ensuring broad coverage and rich metadata for fine-grained evaluation. The framework employs a multi-agent architecture for problem generation and a solution-graph-driven strategy to enhance the reliability of ground truth solutions. The authors generated three benchmarks in Machine Learning, Corporate Finance, and Personal Finance, which were validated through expert review. Results indicate that FLAME-generated benchmarks exhibit a significantly lower ground-truth error rate compared to existing benchmarks like MMLU and GSM8K. Furthermore, evaluation of various foundation models on these benchmarks revealed fine-grained performance differences that were previously undetected, thus providing deeper insights into model competencies.
Methodology
The FLAME framework consists of two main stages: source curation and preprocessing, followed by task generation. Domain experts curate external knowledge sources, such as textbooks, to create a taxonomy of competencies. The automated pipeline then generates evaluation tasks based on this structured knowledge, ensuring comprehensive coverage and robustness against data contamination.
Results
The benchmarks generated by FLAME showed a significantly lower ground-truth error rate compared to traditional benchmarks like MMLU and GSM8K. Evaluation of 12 foundation models indicated that FLAME's benchmarks achieved near-uniform competency coverage, revealing performance differences that existing benchmarks failed to capture.
Implications
The FLAME framework has the potential to enhance model evaluation practices by providing more informative benchmarks that can guide model development and selection. This could lead to improved performance in various applications of foundation models across different domains.
Learning What Evaluators Value: A Reliable Approach to Modeling Evaluator Preferences
Theory
Interpretability
NLP
- The paper introduces a minimal assumption framework for modeling evaluator preferences, focusing on non-decreasing functions.
- It highlights the issues caused by common modeling assumptions that can lead to significant errors in preference learning.
- A new algorithm is proposed that can learn evaluator preferences robustly, maintaining performance even under model mismatch.
- The effectiveness of the algorithm is validated through both synthetic simulations and real-world applications.
Summary
This paper addresses the challenge of modeling evaluator preferences in various decision-making contexts, such as academic admissions and medical diagnoses, where evaluators assess multiple criteria to arrive at an overall evaluation. The authors argue that existing models often make unrealistic assumptions about evaluator preferences, which can lead to significant mismatches and inaccuracies in learning these preferences. They propose a new approach that only assumes the preference function is coordinate-wise non-decreasing, a condition that holds in many practical scenarios. The paper theoretically characterizes the impact of model mismatch and presents an algorithm capable of learning evaluator preferences robustly, even when the linearity assumption is violated. The authors validate their algorithm through synthetic simulations and real-world data, demonstrating its effectiveness in capturing the nuances of both human and LLM evaluator preferences.
Methodology
The authors develop a theoretical framework to characterize model mismatch and propose an algorithm for learning evaluator preferences that is robust to such mismatches. They validate their approach using synthetic data and real-world datasets to demonstrate its effectiveness in accurately capturing evaluator preferences.
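The kind of mismatch at issue can be seen in a small example. This is not the authors' algorithm. It only checks the coordinate-wise non-decreasing property on a grid and fits a linear model to a hypothetical "weakest link" evaluator, one who scores an item by its worst criterion. All function names and the evaluator itself are illustrative assumptions.

```python
import numpy as np

def is_coordinatewise_nondecreasing(f, grid_1d, n_criteria, tol=1e-12):
    """Check on a regular grid that raising any single criterion never lowers the score."""
    axes = np.meshgrid(*([grid_1d] * n_criteria), indexing="ij")
    points = np.stack(axes, axis=-1)               # shape (g, ..., g, n_criteria)
    values = f(points)
    return all(bool(np.all(np.diff(values, axis=k) >= -tol)) for k in range(n_criteria))

# Hypothetical evaluators over two criteria scored in [0, 1].
weakest_link = lambda x: x.min(axis=-1)              # monotone but not linear
weighted_sum = lambda x: x @ np.array([0.7, 0.3])    # monotone and linear
penalizes_first = lambda x: x[..., 1] - x[..., 0]    # violates the assumption

grid = np.linspace(0.0, 1.0, 11)
checks = {
    "weakest_link": is_coordinatewise_nondecreasing(weakest_link, grid, 2),
    "weighted_sum": is_coordinatewise_nondecreasing(weighted_sum, grid, 2),
    "penalizes_first": is_coordinatewise_nondecreasing(penalizes_first, grid, 2),
}

# Fit a linear model (with intercept) to the weakest-link evaluator.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(2000, 2))
y = weakest_link(X)
design = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
linear_rmse = float(np.sqrt(np.mean((design @ coef - y) ** 2)))
print(checks, linear_rmse)
```

The weakest-link evaluator satisfies the monotonicity assumption, yet the best linear fit leaves a clearly nonzero error. A method that assumes only monotonicity would not be exposed to this particular mismatch.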
Results
The proposed algorithm successfully learns evaluator preferences without significant performance loss, even when the linearity assumption is not met. Evaluations on synthetic and real-world data show that the algorithm can effectively model the mapping from criteria assessments to overall evaluations, revealing insights into both human and LLM preferences.
Implications
The findings suggest that the proposed method can enhance the transparency and reliability of evaluations in various fields, including medical diagnostics and academic peer review. By understanding evaluator preferences better, practitioners can make more informed decisions and reduce discrepancies caused by individual evaluator biases.
Identifiable Multimodal Causal Representation Learning under Partial Latent Sharing
Multimodal
Theory
Interpretability
- Establishes identifiability guarantees for causal representations in multimodal settings with partially shared latent structures.
- Proves identifiability under weaker assumptions than previous works, allowing for undercomplete cases.
- Introduces a differentiable Wasserstein-based module for recovering latent structures, applicable across different architectures.
- Demonstrates superior performance compared to state-of-the-art methods through extensive experiments.
Summary
This paper addresses the challenge of identifiability in Causal Representation Learning (CRL) within a multimodal context, where observed data is generated from a partially shared latent structure. The authors establish component-wise identifiability guarantees for causal latent representations without imposing strict parametric distributions on the latent variables. The proposed framework allows for the identification of both shared and modality-specific latent variables, even in undercomplete scenarios where the number of observed variables exceeds the number of latent variables. To implement their theoretical findings, the authors introduce a Wasserstein distance-based module that can be integrated into various architectures with minimal modifications. Extensive experiments demonstrate that their approach outperforms state-of-the-art methods on both synthetic and realistic datasets, highlighting its effectiveness in recovering causal structures in multimodal data.
Methodology
The authors utilize structural causal sparsity constraints to prove identifiability guarantees for causal latent variables. They develop a generative framework that estimates causal relationships and introduce a Wasserstein distance-based module to recover the latent structure, which is architecture-agnostic and requires minimal integration effort.
Results
The proposed method achieves component-wise identifiability for both shared and modality-specific latent variables, even in undercomplete scenarios. Experimental results on synthetic and realistic datasets show that the approach significantly outperforms existing state-of-the-art methods in terms of accuracy and robustness in recovering causal structures.
Implications
The findings of this research have significant implications for various fields, including medical applications and biological sciences, where understanding causal relationships from multimodal data is crucial. The ability to recover interpretable causal representations can enhance the reliability of AI systems and contribute to more trustworthy decision-making processes.
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
Optimization
- FML-bench separates agent strategy from execution infrastructure, allowing for clearer performance attribution.
- Complexity of agent strategies does not necessarily lead to better performance; simpler strategies can be equally effective.
- Greedy search strategies excel in dense opportunity landscapes, while broader strategies are better for sparse environments.
- An adaptive agent that changes exploration strategies based on performance shows improved results.
Summary
The paper introduces FML-bench, a benchmark designed to evaluate AI research agents' strategies in machine learning (ML) research. The authors argue that existing benchmarks do not adequately separate agent strategies from execution infrastructure, making it difficult to attribute performance differences to specific strategies. FML-bench consists of 18 fundamental ML research tasks across 10 domains and defines 12 process-level metrics to analyze agent behaviors beyond final scores. The study evaluates six representative agents, revealing that strategy complexity does not guarantee superior performance. For instance, a simple greedy hill-climber (Autoresearch) performs comparably to a more complex tree-search agent (TAS v2). The findings suggest that greedy search is more effective in dense improvement landscapes, while broader strategies excel in sparse environments. An adaptive agent that switches strategies based on performance stagnation outperforms the others, indicating the importance of exploration dynamics. Additionally, early convergence and focused exploration correlate with better performance, while factors like solution diversity and compute cost do not significantly impact outcomes. The benchmark aims to facilitate controlled comparisons of agent strategies and improve understanding of their dynamics in ML research.
Methodology
The authors developed FML-bench, a benchmark consisting of 18 ML research tasks across various domains. They evaluated six AI research agents using this benchmark, measuring their performance and analyzing their exploration behaviors through 12 defined process-level metrics.
Results
The evaluation revealed that the simple greedy hill-climber (Autoresearch) achieved performance comparable to the more complex tree-search agent (TAS v2). The adaptive agent outperformed all others by switching strategies based on performance stagnation. Analysis showed that early convergence and focused exploration were significantly associated with final performance.
Implications
FML-bench provides a framework for future research on AI agents, enabling better understanding of how different strategies impact ML research outcomes. This could lead to the development of more effective AI research agents and improved methodologies in machine learning.
Fast and Featureless Node Representation Learning with Partial Pairwise Supervision
Graph Learning
Optimization
Efficient ML
- Introduction of Contrastive FUSE for node representation learning without node features.
- Development of a signed, normalized contrastive Laplacian that enhances modularity-based learning.
- Efficient optimization scheme that approximates modularity gradient for faster training.
- Demonstrated competitive performance on benchmark datasets with lower runtime compared to existing methods.
Summary
This paper introduces Contrastive FUSE, a novel framework for scalable node representation learning on graphs that lack node features, where supervision comes from partially available pairwise node labels. The authors propose a spectral contrastive objective that integrates community-aware structural signals with signed pairwise constraints. To enhance computational efficiency, they replace the traditional modularity gradient with a lightweight approximation, which maintains the structure-seeking behavior while significantly reducing computational costs. The framework allows for fast iterative updates, making it suitable for large-scale graphs with millions of edges. Experiments on various benchmark datasets, including citation networks and co-purchase graphs, demonstrate that Contrastive FUSE achieves competitive or superior performance in node classification tasks compared to existing methods, while also providing substantial runtime improvements. This work highlights the effectiveness of combining modularity-inspired structural learning with contrastive supervision for efficient and scalable node representation learning.
Methodology
The methodology involves a contrastive learning approach that integrates pairwise supervision into the embedding objective. A signed, normalized contrastive Laplacian is constructed to attract similar node pairs and repel dissimilar ones, while an efficient approximation of the modularity gradient is used to facilitate fast optimization without deep learning architectures.
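As a rough illustration of how a signed, normalized Laplacian can attract similar pairs and repel dissimilar ones, the sketch below builds one from a handful of pairwise labels. The construction (signed adjacency with +1 and -1 entries, degrees taken from absolute weights, symmetric normalization) is a textbook signed-Laplacian recipe assumed here for illustration; the paper's exact operator may differ, and the function names are hypothetical.

```python
import math

def signed_normalized_laplacian(n, similar, dissimilar):
    """Build L = I - D^{-1/2} S D^{-1/2} from signed pairwise labels.

    S[i][j] is +1 for a labeled-similar pair and -1 for a labeled-dissimilar pair.
    D is diagonal with D[i] = sum_j |S[i][j]| (absolute degree).
    Assumed construction for illustration, not the paper's definition.
    """
    S = [[0.0] * n for _ in range(n)]
    for i, j in similar:
        S[i][j] = S[j][i] = 1.0
    for i, j in dissimilar:
        S[i][j] = S[j][i] = -1.0
    deg = [sum(abs(x) for x in row) for row in S]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            identity = 1.0 if (i == j and deg[i] > 0) else 0.0
            L[i][j] = identity - inv_sqrt[i] * S[i][j] * inv_sqrt[j]
    return L

def quadratic_form(L, x):
    """Energy x^T L x of a one-dimensional embedding x."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

L = signed_normalized_laplacian(4, similar=[(0, 1), (2, 3)], dissimilar=[(1, 2)])
agree = quadratic_form(L, [1.0, 1.0, -1.0, -1.0])    # embedding that respects every label
violate = quadratic_form(L, [1.0, -1.0, -1.0, 1.0])  # embedding that breaks every label
```

In this toy case the embedding that respects the labels has the lower energy, which is the property a spectral method exploits when it looks for low-energy directions of the operator.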
Results
Contrastive FUSE was evaluated on benchmark datasets, showing that it achieves competitive or superior node classification performance compared to existing contrastive learning methods, all while significantly reducing runtime, making it suitable for large-scale applications.
Implications
The proposed framework has potential applications in various domains, including social networks, biological interaction networks, and recommendation systems, particularly in scenarios where node features are sparse or unavailable. It enables efficient learning of node embeddings directly from graph structure and pairwise relationships.
Post-Trained MoE Can Skip Half Experts via Self-Distillation
NLP
Large Language Models
Efficient ML
- Introduces ZEDA, a framework for adapting post-trained MoE models to dynamic MoE models.
- Utilizes zero-output experts to enhance routing efficiency without sacrificing performance.
- Achieves over 50% reduction in expert FLOPs with marginal accuracy loss.
- Outperforms existing dynamic MoE baselines by significant margins.
Summary
This paper presents a novel framework called Zero-Expert Self-Distillation Adaptation (ZEDA) aimed at transforming post-trained Mixture-of-Experts (MoE) models into efficient dynamic MoE models. The authors highlight that while dynamic MoE architectures can significantly reduce computation by activating only a subset of experts based on input, existing methods typically require pre-training from scratch or task-specific adaptations. ZEDA addresses this gap by injecting parameter-free zero-output experts into the MoE layers of a pre-trained model, allowing for a more efficient routing mechanism without compromising the model's performance. The adaptation process involves a two-stage self-distillation approach, where the original MoE serves as a frozen teacher model. Additionally, a group-level balancing loss is introduced to stabilize the adaptation process. Experimental results demonstrate that ZEDA can reduce expert FLOPs by over 50% with minimal accuracy loss, outperforming existing dynamic MoE baselines and achieving approximately 1.20x speedup in end-to-end inference across various benchmarks, including math, code, and instruction-following tasks.
Methodology
The methodology involves injecting zero-output experts into the MoE architecture of a pre-trained model, followed by a two-stage self-distillation process. The original MoE model acts as a frozen teacher, and a group-level balancing loss is applied to regulate the activation frequencies of normal and zero experts, ensuring stability during adaptation.
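A minimal sketch of the zero-output expert idea, assuming a plain softmax top-k router: the router scores the real experts plus a few parameter-free zero experts, and any zero expert that lands in the top-k is skipped, so it adds nothing to the output and costs no expert compute. The function names, the toy experts, and the choice not to renormalize gate weights are assumptions for illustration; the self-distillation stages and the balancing loss are not shown.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_with_zero_experts(x, router_logits, experts, num_zero, top_k):
    """Top-k routing over real experts plus `num_zero` zero-output experts.

    Indices >= len(experts) denote zero experts: they output 0 and run no compute.
    Returns (output_vector, number_of_real_experts_executed).
    """
    assert len(router_logits) == len(experts) + num_zero
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    out = [0.0] * len(x)
    executed = 0
    for idx in chosen:
        if idx >= len(experts):
            continue  # zero expert: contributes nothing and is never executed
        y = experts[idx](x)
        executed += 1
        out = [o + probs[idx] * yi for o, yi in zip(out, y)]
    return out, executed

# Two toy experts (hypothetical): scale the input by 2 or by -1.
experts = [lambda v: [2.0 * a for a in v], lambda v: [-1.0 * a for a in v]]

# Logit order: [expert0, expert1, zero0, zero1]. The router prefers expert0 and zero0.
out, executed = route_with_zero_experts(
    [1.0, 3.0], [2.0, -1.0, 1.5, -2.0], experts, num_zero=2, top_k=2
)

# When both top slots go to real experts, both are executed.
_, executed_dense = route_with_zero_experts(
    [1.0, 3.0], [2.0, 1.8, -3.0, -3.0], experts, num_zero=2, top_k=2
)
```

With `top_k=2`, the first call runs only one real expert because the second slot goes to a zero expert, which is how the number of executed experts can drop on tokens the router considers easy.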
Results
ZEDA reduces the number of active experts by over 50% while maintaining performance, and it outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two tested models. The framework also provides an end-to-end inference speedup of approximately 1.20x.
Implications
The findings suggest that ZEDA can facilitate the practical deployment of large language models by reducing computational costs during inference, making it a valuable approach for applications requiring efficient processing of language tasks.
Graph Hierarchical Recurrence for Long-Range Generalization
Graph Learning
- Introduction of Graph Hierarchical Recurrence (GHR) for improved long-range generalization in graph learning.
- GHR combines dual-level recurrent architecture with hierarchical pooling to enhance information propagation.
- Demonstrated strong performance on long-range benchmarks with significantly fewer parameters compared to state-of-the-art models.
- GHR effectively addresses both in-range and out-of-range generalization challenges.
Summary
This paper introduces Graph Hierarchical Recurrence (GHR), a novel framework designed to enhance the performance of Graph Neural Networks (GNNs) and Graph Transformers (GTs) in tasks requiring long-range generalization. Traditional GNNs face challenges in capturing correlations across distant regions of a graph, particularly in out-of-range generalization scenarios where test instances involve interactions beyond the training distances. GHR addresses these limitations by combining a dual-level recurrent architecture with hierarchical pooling, enabling efficient long-range information propagation while preserving the graph's topological structure. The framework operates through two message-passing streams: a high-level stream on a pooled graph and a low-level stream on the original graph, allowing for effective long-range dependencies and improved parameter efficiency. The authors validate GHR through extensive experiments on long-range benchmarks, demonstrating that it outperforms existing models while utilizing significantly fewer parameters, suggesting that increased model capacity alone is insufficient for achieving better generalization.
Methodology
The methodology involves a dual-level recurrent architecture that integrates hierarchical pooling. GHR employs two coupled message-passing processes: one operates on the original graph while the other works on a pooled abstraction. This design allows for efficient long-range propagation and preserves the inductive bias of the graph's topology. The framework alternates between pooling and unpooling within a recurrent loop, facilitating deep computation with minimal parameters.
Results
GHR consistently outperformed existing graph models across various long-range benchmarks, achieving state-of-the-art or competitive performance while using as little as 1% of the parameters of current leading models. Ablation studies confirmed the necessity of each component of GHR for effective out-of-range generalization.
Implications
The findings suggest a shift in focus from merely scaling model architectures to developing more efficient frameworks that enhance generalization capabilities in graph learning. GHR's approach could be applied to various domains requiring graph-based reasoning, such as social network analysis, biological networks, and knowledge graph applications.
What Makes a Representation Good for Single-Cell Perturbation Prediction?
Generative Models
Theory
Interpretability
- Introduces the Perturbation Suppression Hypothesis, highlighting the dominance of invariant information in gene expression data.
- Proposes PerturbedVAE, a framework that separates perturbation-specific signals from invariant structures.
- Includes an identifiability analysis for reliable recovery of perturbation effects.
- Achieves state-of-the-art performance on single-cell perturbation benchmarks.
Summary
This paper addresses the challenge of single-cell perturbation modeling, which is crucial for predicting cellular responses to genetic perturbations. The authors identify a significant issue in existing methods: the entanglement of perturbation-invariant information with sparse perturbation-specific signals, leading to ineffective predictions. To tackle this, they propose a novel framework called PerturbedVAE, which explicitly separates perturbation-specific information from invariant structures. This framework incorporates an alignment-based component to isolate relevant signals and a latent causal model to facilitate generalization and interpretability. The authors also present an identifiability analysis that outlines the conditions necessary for reliably recovering sparse perturbation effects. Empirical results demonstrate that PerturbedVAE outperforms existing methods on benchmark datasets, particularly excelling in generalization to unseen perturbations and providing interpretable insights into perturbation-response mechanisms.
Methodology
The authors developed PerturbedVAE, which combines an alignment-based component to isolate perturbation-specific information from invariant cellular programs and a latent causal model to organize this information for effective generalization. They also conducted an identifiability analysis to define conditions for reliable signal recovery.
Results
PerturbedVAE demonstrated significant improvements over strong baseline methods across multiple single-cell perturbation benchmarks, particularly in generalizing to unseen combinatorial perturbations and yielding interpretable representations of perturbation effects.
Implications
The findings suggest that separating perturbation-specific information from invariant signals can enhance predictive modeling in single-cell biology, potentially leading to better understanding and manipulation of cellular responses in therapeutic contexts.
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
Large Language Models
Multimodal
Efficient ML
- OScaR addresses the limitations of per-channel quantization by mitigating Token Norm Imbalance (TNI).
- The framework employs Canalized Rotation and Omni-Token Scaling for efficient KV cache compression.
- OScaR achieves significant improvements in decoding speed, memory efficiency, and throughput.
- The methodology is lightweight and does not require complex quantization pipelines.
Summary
The paper addresses the memory bottleneck caused by Key-Value (KV) caches in large language models (LLMs) and proposes a novel quantization framework called OScaR (Omni-Scaled Canalized Rotation). The authors identify Token Norm Imbalance (TNI) as a critical limitation of existing per-channel quantization methods, which leads to increased quantization errors when shared parameters are used across token groups with varying norms. OScaR improves upon traditional methods by employing a two-step approach: Canalized Rotation to prevent scaling-induced outlier artifacts and Omni-Token Scaling for sequence-level normalization. This streamlined framework is designed to be lightweight and efficient, eliminating the need for complex quantization pipelines. Extensive evaluations demonstrate that OScaR achieves near-lossless performance under INT2 quantization, significantly enhancing decoding speed (up to 3.0× faster), reducing memory footprint (by 5.3×), and increasing throughput (by 4.1×) compared to existing benchmarks. The authors provide empirical and theoretical validation of their approach, establishing OScaR as a robust solution for KV cache quantization in various LLM applications.
Methodology
The authors propose OScaR, which consists of two main components: Canalized Rotation, which applies a Hadamard transform to prevent scaling-induced outlier artifacts, and Omni-Token Scaling, which normalizes token groups to address TNI. This approach is designed to be training-free and computationally efficient.
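The two ingredients named above can be sketched in a few lines, with the caveat that everything below is an assumed simplification rather than OScaR itself: an orthonormal fast Walsh-Hadamard transform stands in for the rotation, dividing each token by its own norm stands in for token scaling, and each token is quantized on its own 2-bit grid instead of per channel. All function names are hypothetical.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(vec)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def quantize_int2(values):
    """Uniform asymmetric 2-bit quantization; returns (codes, lo, step)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / 3 or 1.0  # fall back to 1.0 when all values are equal
    codes = [min(3, max(0, round((v - lo) / step))) for v in values]
    return codes, lo, step

def dequantize_int2(codes, lo, step):
    return [lo + c * step for c in codes]

def compress_token(key):
    """Rotate, rescale to unit norm, then quantize one key vector."""
    rotated = fwht(key)
    norm = math.sqrt(sum(v * v for v in rotated)) or 1.0
    codes, lo, step = quantize_int2([v / norm for v in rotated])
    return codes, lo, step, norm

def decompress_token(codes, lo, step, norm):
    rotated = [v * norm for v in dequantize_int2(codes, lo, step)]
    return fwht(rotated)  # the orthonormal Hadamard transform is its own inverse

# A key with a single outlier channel: the rotation spreads it evenly.
outlier_key = [8.0, 0.0, 0.0, 0.0]
spread = fwht(outlier_key)
restored = decompress_token(*compress_token(outlier_key))
```

The example key has all of its energy in one channel; after the rotation that energy is spread evenly across channels, which is the usual motivation for rotating before low-bit quantization.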
Results
OScaR outperforms existing KV cache quantization methods, achieving up to 3.0× speedup in decoding, a 5.3× reduction in memory footprint, and a 4.1× increase in throughput under INT2 quantization, while maintaining near-lossless performance.
Implications
The proposed OScaR framework can significantly enhance the deployment of large language models and multi-modal systems by improving memory efficiency and inference speed, making it suitable for applications requiring long-context processing and real-time performance.
When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search
NLP
Large Language Models
Optimization
- Prompt-boundary directional alignment can predict effective rank-1 steering directions, enabling efficient layer selection.
- Rank-1 steering is framed as a budget-constrained optimization problem, allowing for geometry-guided search that reduces search costs.
- Concept granularity measures directional heterogeneity and predicts optimization difficulty and steering performance.
- GRACE framework provides a systematic approach to diagnose steering difficulties and allocate optimization efforts effectively.
Summary
This paper addresses the challenges of rank-1 steering in large language models (LLMs), which allows for lightweight control without retraining. The authors argue that the variability in steering effectiveness is often due to the difficulty of finding useful interventions rather than the absence of effective steering directions. They formalize rank-1 steering as a budget-constrained optimization problem, focusing on the intervention layer and coefficient. By employing prompt-boundary directional alignment, they propose a geometry-guided search method that significantly reduces the number of evaluations needed to achieve high utility interventions. The concept of granularity is introduced to measure directional heterogeneity across contexts, helping to explain why some concepts are more challenging to steer than others. The authors present GRACE, a framework that utilizes activation geometry to diagnose steering difficulties and optimize search efforts efficiently. Their findings shift the perspective on activation steering from failure analysis to understanding when steering is effective and stable, making activation geometry a practical tool for LLM control.
Methodology
The authors formalize rank-1 steering as a budget-constrained optimization problem, utilizing prompt-boundary directional alignment to guide the search for effective interventions. They introduce the concept of granularity to assess directional heterogeneity across contexts and employ geometry-guided Bayesian optimization to enhance search efficiency.
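To make the setup concrete, here is a small sketch of a rank-1 intervention and a budget-limited search over (layer, coefficient) pairs. It substitutes a plain ordered grid search for the geometry-guided Bayesian optimization described above: layers are simply visited in decreasing order of an alignment score. The alignment scores, the toy utility function, and all names are hypothetical.

```python
import math

def steer(hidden, direction, coefficient):
    """Rank-1 steering: add a scaled unit direction to a hidden-state vector."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [h + coefficient * d / norm for h, d in zip(hidden, direction)]

def budgeted_search(alignment_by_layer, coefficients, utility, budget):
    """Try (layer, coefficient) pairs, most-aligned layers first, within a trial budget.

    Returns ((best_score, best_layer, best_coefficient), trials_used).
    """
    layers = sorted(alignment_by_layer, key=alignment_by_layer.get, reverse=True)
    best, trials = None, 0
    for layer in layers:
        for coeff in coefficients:
            if trials == budget:
                return best, trials
            score = utility(layer, coeff)
            trials += 1
            if best is None or score > best[0]:
                best = (score, layer, coeff)
    return best, trials

# Hypothetical alignment scores per layer and a toy utility peaking at layer 7, coefficient 4.
alignment = {3: 0.2, 7: 0.9, 11: 0.5}

def toy_utility(layer, coeff):
    return -abs(layer - 7) - abs(coeff - 4)

best, trials = budgeted_search(alignment, [2, 4, 8], toy_utility, budget=3)
steered = steer([1.0, 0.0], [0.0, 2.0], 3.0)
```

Because the most aligned layer is tried first, the toy search finds the best pair within its three-trial budget; an unordered sweep over the same grid could spend the whole budget on unhelpful layers.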
Results
The proposed geometry-guided search method reduces the trials needed to recover 95% of the best-found utility by an average of 39.8% across three model families. Additionally, the concept granularity metric correlates with both the convergence speed and the quality of steering performance, indicating its effectiveness in predicting steering challenges.
Implications
The findings suggest that understanding the geometry of activation spaces can lead to more efficient control of LLMs, improving their reliability and safety in deployment. The GRACE framework can be applied to various steering tasks, enhancing the interpretability and effectiveness of interventions in LLMs.
Anytime and Difficulty-Adaptive PAC-Bayes for Constrained Density-Ratio Network with Continual Learning Guarantees
Theory
- Introduces a constrained density-ratio network for learning under covariate shift.
- Combines importance-weighted empirical risk with PAC-Bayes generalization guarantees.
- Imposes structural constraints to ensure calibration and stability in learning.
- Validated through controlled and real-data experiments, showing superior performance.
Summary
This paper presents a novel framework for learning under covariate shift using a constrained density-ratio network that approximates the Radon-Nikodym derivative of a target distribution with respect to a source distribution. The proposed method integrates importance-weighted empirical risk minimization with a PAC-Bayes generalization certificate, allowing for both fixed-time and anytime learning scenarios. The framework imposes structural constraints on the density-ratio network to ensure calibration and stability, which are critical for effective learning under distribution shifts. The methodology is validated through a two-campaign protocol involving both controlled tests against analytic ground truth and real-data applications, demonstrating improved performance over traditional methods. The results indicate that the network effectively calibrates covariate ratios and reduces target loss, confirming the operational validity of the framework's assumptions.
Methodology
The methodology involves a constrained density-ratio network that approximates the Radon-Nikodym derivative, integrating an importance-weighted empirical risk with a PAC-Bayes framework. Structural identities of the Radon-Nikodym derivative are enforced through an augmented-Lagrangian primal-dual scheme, allowing for both fixed-time and anytime learning scenarios. The framework employs a change-of-measure identity to decompose risk estimation errors into bias and generalization gap components.
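The change-of-measure step can be checked on a toy finite distribution: reweighting the source by the density ratio reproduces the target risk exactly, and the ratio averages to one under the source. The sketch below assumes the true ratio is known, which is precisely what the network has to estimate in practice; the distributions, the squared loss, and the function names are made up for illustration, and the normalization check is only one example of a structural identity a ratio must satisfy.

```python
def expectation(dist, fn):
    """Exact expectation of fn under a finite distribution {outcome: probability}."""
    return sum(p * fn(x) for x, p in dist.items())

def importance_weighted_risk(samples, ratio, loss):
    """Empirical risk on source samples, reweighted by a density-ratio function."""
    return sum(ratio(x) * loss(x) for x in samples) / len(samples)

# Toy covariate shift over three outcomes (hypothetical numbers).
source = {0: 0.5, 1: 0.3, 2: 0.2}
target = {0: 0.2, 1: 0.3, 2: 0.5}

def ratio(x):
    """True density ratio target/source, assumed known in this toy example."""
    return target[x] / source[x]

def loss(x):
    return float(x) ** 2

normalization = expectation(source, ratio)  # a valid ratio averages to 1 under the source
reweighted_risk = expectation(source, lambda x: ratio(x) * loss(x))
target_risk = expectation(target, loss)

# Source samples whose empirical frequencies match the source distribution exactly.
samples = [0] * 5 + [1] * 3 + [2] * 2
empirical_risk = importance_weighted_risk(samples, ratio, loss)
unweighted_risk = importance_weighted_risk(samples, lambda x: 1.0, loss)
```

The unweighted estimate differs from the target risk here, which is the gap that importance weighting is meant to close.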
Results
The framework successfully produces calibrated covariate ratios on real data, significantly reducing the target 0/1 loss compared to unweighted empirical risk minimization and classical direct ratio-estimation methods. The anytime certificate is achieved as promised, with one recorded failure that aligns with the label shift magnitude, confirming the framework's assumptions.
Implications
This work has significant implications for applications in domains where distribution shifts are common, such as online learning, adaptive systems, and real-time data analysis. The ability to maintain performance under changing data distributions can enhance the robustness and reliability of machine learning models in practical scenarios.
Online Market Making and the Value of Observing the Order Book
Theory
Optimization
- Introduction of an action-dependent feedback model for online market making.
- Achieves O(√T) regret in stochastic settings without smoothness assumptions.
- Extends results to mean-reverting price processes with relaxed conditions.
- Proposes an explore-then-perturb algorithm for adversarial settings achieving O(T^(2/3)) regret.
Summary
This paper addresses the online market-making problem where a learner sequentially sets bid and ask prices for an asset while interacting with traders who have private valuations. The authors introduce an innovative action-dependent feedback model that reflects the dynamics of real limit order books, where informative feedback is provided when no trade occurs. This model contrasts with traditional online learning frameworks that typically assume fully censored feedback. The paper presents an elimination-based algorithm that achieves O(√T) regret in stochastic settings with independent and identically distributed (i.i.d.) market prices, without requiring smoothness assumptions on trader valuations. The authors extend their findings to mean-reverting price processes, demonstrating that similar regret bounds can be achieved under broader conditions. In adversarial settings with oblivious prices, they propose an explore-then-perturb algorithm that guarantees O(T^(2/3)) regret in expectation. The results highlight the significant advantages of observing the order book, showing that even limited feedback can enhance learning outcomes compared to standard bandit feedback models.
Methodology
The authors develop an action-dependent feedback model that allows the learner to receive informative feedback based on the bid-ask spread chosen. They propose algorithms for different market conditions: an elimination-based algorithm for stochastic prices, an extension for mean-reverting processes, and an explore-then-perturb algorithm for adversarial price sequences. The analysis includes deriving new concentration inequalities to support their regret bounds.
Results
The paper establishes that the proposed algorithms achieve O(√T) regret in stochastic settings and O(T^(2/3)) regret in adversarial settings. The results demonstrate that the action-dependent feedback significantly enhances the learning process compared to traditional bandit feedback models, particularly in capturing the dynamics of real-world market making.
Implications
The findings suggest that market makers can improve their pricing strategies and reduce regret by leveraging the information from order books. This has practical implications for financial institutions and trading firms that rely on market making for liquidity provision and price stabilization.
TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting
NLP
Large Language Models
Reinforcement Learning
- Introduces TEMPO, a method to enforce temporal discipline in LLMs during backtesting.
- Utilizes a two-mode reward system to eliminate knowledge leakage before optimizing performance.
- Employs a GRPO-based training pipeline to help models learn valid reasoning strategies.
- Achieves significant reductions in leakage rates and improvements in task performance.
Summary
The paper introduces TEMPO, a novel approach designed to address the issue of temporal knowledge leakage in large language models (LLMs) during backtesting on historical events. Traditional backtesting methods often fail to ensure that models do not utilize post-cutoff information, leading to inflated accuracy metrics. The authors argue that existing methods, such as prompt-based constraints and knowledge unlearning, do not adequately train models to discern temporally valid evidence for specific cutoff dates. TEMPO proposes a two-mode reward system that first eliminates leakage through a leakage mode before optimizing task performance in a performance mode. This is achieved using a Group Relative Policy Optimization (GRPO)-based training pipeline, which allows models to develop reasoning strategies that adhere to temporal constraints. The authors demonstrate that TEMPO effectively reduces leakage while improving task performance across various prediction tasks, establishing a framework for training models to maintain temporal discipline in their reasoning.
Methodology
TEMPO employs a reinforcement learning framework with a two-mode reward system. The first mode focuses on eliminating temporal leakage by driving post-cutoff claims to zero, while the second mode optimizes task performance under the constraint of temporal compliance. The training utilizes Group Relative Policy Optimization (GRPO) with LoRA adapters to facilitate the discovery of temporally valid reasoning strategies.
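One way to picture the two-mode reward is the sketch below: any completion with post-cutoff claims is scored purely by a leakage penalty, and only leak-free completions are scored on the task. The rewards are then turned into group-relative advantages by standardizing within a sampled group, as GRPO does. The penalty value, the assumption that task scores lie in [0, 1], and the function names are illustrative choices, not details from the paper.

```python
import statistics

def two_mode_reward(num_post_cutoff_claims, task_score, leak_penalty=1.0):
    """Leakage mode if any post-cutoff claim is present, performance mode otherwise.

    Assumes task_score lies in [0, 1], so a leaking completion never outranks a clean one.
    """
    if num_post_cutoff_claims > 0:
        return -leak_penalty * num_post_cutoff_claims  # leakage mode
    return task_score                                  # performance mode

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One group of four sampled completions: (post-cutoff claims, task score).
group = [(2, 0.9), (0, 0.4), (0, 0.8), (1, 0.95)]
rewards = [two_mode_reward(claims, score) for claims, score in group]
advantages = group_relative_advantages(rewards)
best_index = max(range(len(advantages)), key=advantages.__getitem__)
```

In this group the completion with the highest raw task score leaks, so it receives a negative reward, and the largest advantage goes to the best leak-free completion.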
Results
TEMPO successfully reduces leakage rates from 2-13% to 0.6-3.7% across various conditions in three prediction tasks (stock ranking, salary estimation, legal outcome prediction). Additionally, task performance improves by 6-13% in scenarios where strong pre-cutoff signals are present, while performance is maintained in inherently difficult prediction tasks.
Implications
The findings suggest that TEMPO can enhance the reliability of LLMs in predictive tasks across various domains, such as finance and law, by ensuring that models reason based solely on pre-cutoff information. This has significant implications for the validity of model evaluations and the development of trustworthy AI systems.
A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization
Theory
Large Language Models
Efficient ML
- Establishes a precise asymmetry constant for sign-flip versus magnitude perturbations in ReLU + RMSNorm models.
- Characterizes ternary quantization error and its impact on model performance.
- Demonstrates that ReLU activation creates directional asymmetry affecting output energy.
- Identifies the role of outlier features in amplifying sign sensitivity in real models.
Summary
This paper explores the geometric properties of sign-magnitude asymmetry in pre-norm Transformer architectures, specifically focusing on the ReLU + RMSNorm block under ternary quantization. The author provides a theoretical framework that explains why these architectures can tolerate ternary weight quantization ({−1, 0, +1}) with minimal performance loss. The analysis reveals that sign-flips in weights produce significantly more transverse output energy compared to magnitude perturbations, quantified by a constant ratio of π/(π − 2) ≈ 2.75. This asymmetry arises due to the directional properties introduced by the ReLU activation function, which creates a hidden-space asymmetry between perturbation types. The paper also characterizes the quantization error, demonstrating that it behaves as a sign-preserving perturbation with a specific angular alignment. Experimental results indicate that the theoretical amplification factor does not compound across layers as expected, and outlier features in the model contribute to discrepancies between theoretical predictions and empirical observations. Overall, the findings provide a geometric understanding of the mechanisms underlying the robustness of pre-norm models to ternary quantization.
Methodology
The author employs a theoretical approach involving geometric analysis and perturbation theory to derive results about sign-magnitude asymmetry. The analysis includes the use of i.i.d. Gaussian weights and examines the energy output from perturbations in a two-layer ReLU + RMSNorm model. Experimental validation is conducted to assess the theoretical predictions against real model behavior.
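The sign-preserving nature of ternary quantization error is easy to check numerically. The sketch below quantizes i.i.d. Gaussian weights with an absmean rule (scale by the mean absolute weight, round, clip to {−1, 0, +1}), which is an assumed quantizer chosen for illustration and not necessarily the one analyzed in the paper. It then verifies that no weight changes sign, measures the cosine alignment between original and dequantized weights, and evaluates the asymmetry constant π/(π − 2).

```python
import math
import random

def ternarize(weights):
    """Absmean ternary quantization (assumed rule): returns (codes in {-1, 0, +1}, scale)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]
codes, scale = ternarize(weights)
dequantized = [scale * c for c in codes]

# Every nonzero code keeps the sign of the weight it came from.
sign_preserving = all(c == 0 or (c > 0) == (w > 0) for w, c in zip(weights, codes))

# Cosine alignment between the original and the dequantized weight vectors.
alignment = cosine(weights, dequantized)

# The sign-flip versus magnitude asymmetry constant, roughly 2.75.
asymmetry_constant = math.pi / (math.pi - 2)
```

Rounding can shrink a weight to zero or snap it to ±1, but it never flips a sign, so the quantization error belongs to the magnitude-type perturbations rather than the sign-flip type.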
Results
The paper presents several key results, including a derived asymmetry constant of π/(π − 2) for the energy output from sign-flip perturbations compared to magnitude perturbations. It also shows that the angular alignment of quantization error converges to 2/π as input dimensions grow. Experimental findings reveal that the expected amplification factor does not compound across layers, and outlier features significantly affect the observed sign sensitivity.
Implications
The findings have implications for the design and optimization of low-bit neural networks, particularly in understanding how weight quantization affects model performance. The geometric insights could inform future architectures that leverage ternary quantization while maintaining linguistic capabilities, potentially leading to more efficient large language models.
Chessformer: A Unified Architecture for Chess Modeling
Interpretability
- Chessformer is a unified architecture that improves chess modeling across multiple objectives.
- Achieved a state-of-the-art move-matching accuracy of 57.1% with a significantly smaller model.
- Increased the playing strength of Leela Chess Zero by over 100 Elo points, defeating Stockfish.
- Introduced Geometric Attention Bias (GAB) for better adaptation to chess geometry.
Summary
Chessformer presents a unified architecture designed to enhance chess modeling by addressing three primary objectives: maximizing playing strength, predicting human moves, and enabling interpretability. The architecture is an encoder-only transformer that treats chessboard squares as tokens and incorporates a novel dynamic positional encoding method called Geometric Attention Bias (GAB). This allows the model to adapt its attention mechanisms to the geometric structure of chess positions. The authors evaluate Chessformer through three main contributions: the development of MAIA-3, a model for human move prediction achieving a 57.1% move-matching accuracy with significantly fewer parameters than previous models; integration into the Leela Chess Zero engine, which resulted in an increase of over 100 Elo points in playing strength and victories against Stockfish; and enhanced interpretability through its square-token design, which allows for granular analysis of attention patterns related to chess concepts. The findings suggest that aligning model architecture with domain-specific structures can lead to improvements in performance, human compatibility, and interpretability, making Chessformer a significant advancement in chess AI.
Methodology
Chessformer employs an encoder-only transformer architecture that tokenizes chessboard squares and utilizes Geometric Attention Bias (GAB) for dynamic positional encoding. It integrates a source-destination policy head for action prediction and is evaluated through its performance in human move prediction, integration with existing chess engines, and interpretability analyses.
Results
The Chessformer architecture demonstrated a 57.1% move-matching accuracy in human move prediction, surpassing the previous best of 55.9% with a much smaller model. It also enhanced the Leela Chess Zero engine's playing strength by over 100 Elo points, leading to victories against Stockfish in major competitions. The interpretability analysis revealed that the model effectively captures both well-known and nuanced chess concepts.
Implications
The findings suggest that a unified architecture can effectively address multiple objectives in AI modeling, particularly in complex domains like chess. This approach could be extended to other areas requiring high performance and human compatibility, potentially transforming how AI systems are designed and evaluated.
Take It or Leave It: Intent-Controlled Partial Optimal Transport
Optimization
Theory
Multimodal
- Introduces intent-controlled partial optimal transport (IC-POT) for structured rejection mechanisms.
- Replaces global rejection with pointwise rejection costs based on local information.
- Demonstrates practical relevance in positive-unlabeled learning and open-partial domain adaptation.
- Shows improvements in performance through empirical experiments compared to traditional methods.
Summary
This paper introduces intent-controlled partial optimal transport (IC-POT), a novel extension of partial optimal transport (POT) that allows for pointwise rejection costs rather than a global rejection mechanism. Traditional optimal transport requires exact matching of measures, which can lead to spurious correspondences in applications. IC-POT addresses this by enabling local pricing of unmatched mass based on side-specific reliability and geometrical considerations. The authors demonstrate that IC-POT can be formulated as a balanced Kantorovich optimal transport problem on an augmented support, allowing for dual interpretations and efficient solutions. The practical applications of IC-POT are showcased in positive-unlabeled learning and open-partial domain adaptation, where it significantly improves performance by leveraging statistical structures. Additionally, a geophysical case study involving multi-modal satellite ocean measurements illustrates how physical priors can inform rejection mechanisms, enhancing the comparability of signal information. The paper concludes that the proposed method provides a more flexible and structured approach to handling unmatched mass in transport problems, with empirical evidence supporting its advantages over traditional constant-cost partial optimal transport methods.
Methodology
The authors develop IC-POT by introducing explicit slack variables for both source and target measures, allowing for pointwise rejection costs. The optimization problem is reformulated as a balanced Kantorovich problem, enabling dual interpretations and efficient computation. Empirical evaluations are conducted across various applications, including positive-unlabeled learning, open-partial domain adaptation, and geophysical comparisons.
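The reduction to a balanced problem on an augmented support can be sketched for the simplest discrete case. This is an illustrative simplification, not the paper's formulation: it assumes unit-mass points, adds dummy targets that absorb rejected sources at their pointwise cost and dummy sources that cover rejected targets at theirs, and solves the resulting square assignment problem by brute force, which only works for tiny inputs. All names are hypothetical.

```python
from itertools import permutations


def ic_pot_unit_mass(cost, reject_src, reject_tgt):
    """Partial matching with pointwise rejection costs via an augmented assignment.

    cost[i][j]    : cost of matching source i to target j
    reject_src[i] : cost of leaving source i unmatched
    reject_tgt[j] : cost of leaving target j unmatched
    """
    n, m = len(reject_src), len(reject_tgt)
    size = n + m
    aug = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:
                aug[i][j] = cost[i][j]        # real source -> real target
            elif i < n:
                aug[i][j] = reject_src[i]     # real source -> dummy target (rejected)
            elif j < m:
                aug[i][j] = reject_tgt[j]     # dummy source -> real target (rejected)
            # dummy source -> dummy target stays at cost 0
    best_cost, best_perm = min(
        (sum(aug[i][p[i]] for i in range(size)), p)
        for p in permutations(range(size))
    )
    matches = [(i, best_perm[i]) for i in range(n) if best_perm[i] < m]
    rejected_sources = [i for i in range(n) if best_perm[i] >= m]
    matched_targets = {j for _, j in matches}
    rejected_targets = [j for j in range(m) if j not in matched_targets]
    return best_cost, matches, rejected_sources, rejected_targets


sources = [0.0, 1.0, 10.0]
targets = [0.1, 1.2]
cost = [[(s - t) ** 2 for t in targets] for s in sources]
# The outlying source is given a cheaper rejection price than the others.
total, matches, rejected_sources, rejected_targets = ic_pot_unit_mass(
    cost, reject_src=[1.0, 1.0, 0.5], reject_tgt=[1.0, 1.0]
)
```

Because the rejection price is set per point, an unreliable point can be made cheap to discard without lowering the bar for every other point, which a single global rejection cost cannot do.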
Results
The results indicate that IC-POT outperforms traditional constant-cost partial optimal transport in scenarios where local rejection mechanisms are crucial. In experiments, the incorporation of side information led to improved statistical performance in both positive-unlabeled learning and open-partial domain adaptation tasks. The geophysical case study further demonstrated the effectiveness of IC-POT in leveraging physical priors for better signal comparison.
Implications
IC-POT has significant implications for applications requiring nuanced handling of unmatched mass, such as in machine learning tasks involving uncertain data, domain adaptation, and geophysical measurements. Its ability to incorporate local information can lead to more accurate and meaningful comparisons in various fields.
Protein Fold Classification at Scale: Benchmarking and Pretraining
Theory
Generative Models
Graph Learning
- Introduction of TEDBench, a large-scale benchmark for protein fold classification.
- Development of Masked Invariant Autoencoders (MiAE) for protein structure representation learning.
- MiAE achieves significant performance improvements over existing models on TEDBench.
- The benchmark and methods provide a foundation for future research in protein classification.
Summary
This paper addresses the challenge of protein fold classification, which is crucial for understanding biological functions but hindered by the lack of large-scale benchmarks and scalable models. The authors introduce TEDBench, a comprehensive benchmark for protein fold classification, constructed from the Encyclopedia of Domains (TED) and Foldseek-clustered AlphaFold structures. TEDBench consists of over 462,000 predicted structures and 27,638 experimental structures, significantly surpassing existing datasets. The authors identify that current protein representation learning methods either require large models or perform poorly on TEDBench. To overcome these limitations, they propose Masked Invariant Autoencoders (MiAE), a self-supervised framework that employs a high masking ratio of up to 90% and utilizes an SE(3)-invariant encoder with a lightweight decoder. MiAE demonstrates superior performance, achieving a macro-F1 score of 70.44 after self-supervised pretraining, outperforming traditional supervised models and state-of-the-art baselines while using fewer parameters. This work establishes a strong reference for future research in protein fold classification and highlights the potential of self-supervised learning in structural biology.
Methodology
The authors constructed TEDBench by decomposing the AlphaFold Database into domain units and mapping them to CATH categories. They proposed MiAE, a self-supervised learning framework that masks a high percentage of structural frames and trains an encoder-decoder architecture to reconstruct the masked coordinates, enabling efficient scaling and robust representation learning.
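The masking step can be sketched as follows, assuming each residue is reduced to a single 3D coordinate and the loss is a plain mean squared error over masked positions only. Both are simplifications made for illustration; the summary does not give MiAE's frame representation or loss, and the function names are hypothetical.

```python
import random


def sample_mask(num_residues, mask_ratio=0.9, seed=0):
    """Pick the residue indices to hide from the encoder."""
    rng = random.Random(seed)
    num_masked = round(num_residues * mask_ratio)
    return sorted(rng.sample(range(num_residues), num_masked))


def masked_mse(predicted, target, masked):
    """Mean squared coordinate error, computed on masked residues only."""
    if not masked:
        return 0.0
    total = 0.0
    for i in masked:
        total += sum((p - t) ** 2 for p, t in zip(predicted[i], target[i]))
    return total / (3 * len(masked))


num_residues = 50
target = [(float(i), 0.0, 0.0) for i in range(num_residues)]
masked = sample_mask(num_residues)
visible = [i for i in range(num_residues) if i not in set(masked)]
```

At a 90% ratio the encoder only processes the few visible residues, which is where the efficiency of a high masking ratio would come from.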
Results
MiAE improved the macro-F1 score by 10.23 points over existing equivariant baselines when trained from scratch and achieved up to 70.44 macro-F1 under linear probing after self-supervised pretraining. This performance surpasses that of traditional supervised models and state-of-the-art pretrained models.
Implications
The introduction of TEDBench and MiAE has significant implications for the field of structural biology, providing a scalable framework for protein fold classification and paving the way for advancements in protein representation learning. This work could facilitate better understanding of protein functions and accelerate research in drug discovery and bioinformatics.
From Simple to Complex: Curriculum-Guided Physics-Informed Neural Networks via Gaussian Mixture Models
Theory
Optimization
- Introduction of CGMPINN, combining GMM with dynamic curriculum learning for improved PINN training.
- The framework quantifies spatially varying learning difficulty and adapts training focus accordingly.
- Theoretical guarantees established for convergence and generalization.
- Experimental validation shows significant error reduction compared to standard PINNs.
Summary
This paper introduces the Curriculum-Guided Gaussian Mixture Physics-Informed Neural Network (CGMPINN), a novel framework that enhances the training of physics-informed neural networks (PINNs) for solving partial differential equations (PDEs). Traditional PINNs face challenges such as gradient pathologies and poor convergence, particularly in complex problems with nonlinearity and multiscale features. CGMPINN addresses these issues by integrating Gaussian mixture modeling (GMM) with dynamic curriculum learning. The GMM is used to analyze the distribution of PDE residuals, allowing the model to identify regions of varying learning difficulty. A smooth curriculum schedule is implemented to progressively shift the training focus from easier to harder regions, while a precision-based variance modulation technique suppresses unreliable clusters during early optimization. The authors provide theoretical guarantees for CGMPINN, including sublinear convergence of the gradient norm and a generalization bound. Experimental results demonstrate that CGMPINN significantly outperforms standard PINNs, achieving up to a 97.8% reduction in relative L2 error across various benchmark PDEs, while maintaining comparable computational costs.
Methodology
CGMPINN employs Gaussian mixture modeling to analyze the residual distribution of PDEs, allowing for the identification of regions with varying learning difficulty. A dynamic curriculum learning approach is implemented, where training focus is adjusted from simpler to more complex regions. Additionally, a precision-based variance modulation technique is used to filter out unreliable clusters during the initial optimization phase. The entire process is governed by a shared curriculum parameter, which can be integrated with self-adaptive loss balancing.
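One way to picture the easy-to-hard shift is the sketch below. It is not the paper's formula: it assumes a two-component 1D Gaussian mixture fitted by EM to log-residuals, labels the lower-mean component "easy", and blends the two responsibilities with a hypothetical curriculum parameter `tau` in [0, 1]. The precision-based variance modulation is not modeled.

```python
import math


def responsibilities(values, weights, means, variances):
    """Posterior probability of each mixture component for each value."""
    out = []
    for v in values:
        dens = [
            w * math.exp(-0.5 * (v - m) ** 2 / s) / math.sqrt(2 * math.pi * s)
            for w, m, s in zip(weights, means, variances)
        ]
        total = sum(dens) or 1e-300  # guard against underflow to zero
        out.append([d / total for d in dens])
    return out


def fit_gmm_1d(values, iters=50):
    """Fit a two-component 1D Gaussian mixture with plain EM."""
    lo, hi = min(values), max(values)
    means = [lo, hi]
    spread = max((hi - lo) ** 2 / 4, 1e-6)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iters):
        resp = responsibilities(values, weights, means, variances)  # E-step
        for k in range(2):                                          # M-step
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            variances[k] = max(
                sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk,
                1e-6,
            )
    return weights, means, variances


def curriculum_weights(log_residuals, tau):
    """Per-point loss weights: tau=0 favors easy points, tau=1 favors hard ones."""
    weights, means, variances = fit_gmm_1d(log_residuals)
    easy, hard = (0, 1) if means[0] <= means[1] else (1, 0)
    resp = responsibilities(log_residuals, weights, means, variances)
    return [(1 - tau) * r[easy] + tau * r[hard] for r in resp]


# Hypothetical log-residuals: four low-residual points, three high-residual points.
log_residuals = [-4.2, -3.9, -4.0, -4.1, -1.0, -0.8, -1.2]
early = curriculum_weights(log_residuals, tau=0.0)
late = curriculum_weights(log_residuals, tau=1.0)
```

Raising `tau` smoothly over training moves the weight from the low-residual cluster to the high-residual one, which is the qualitative behavior the methodology describes.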
Results
The experiments conducted on six benchmark PDEs, including elliptic, parabolic, hyperbolic, advection-dominated, and nonlinear reaction-diffusion types, show that CGMPINN consistently achieves the lowest relative L2 and maximum absolute errors compared to other methods. Notably, it reduces the relative L2 error by up to 97.8% over standard PINNs while maintaining similar computational costs.
Implications
The CGMPINN framework has significant implications for solving complex PDEs in various scientific and engineering fields, potentially leading to more accurate and efficient solutions in fluid mechanics, biomedical engineering, and other applications that rely on PDE modeling. The integration of GMM with curriculum learning could also inspire further research into adaptive learning strategies in neural networks.
EviTrack: Selection over Sampling for Delayed Disambiguation
Time Series
Theory
Efficient ML
- EviTrack introduces a test-time inference framework that operates over latent trajectories rather than marginal states.
- The framework maintains competing trajectory hypotheses and delays commitment until sufficient evidence is available.
- EviTrack outperforms traditional sampling-based methods in scenarios of delayed disambiguation.
- The study emphasizes the importance of trajectory-level selection for effective sequential prediction.
Summary
EviTrack presents a novel framework for sequential prediction in scenarios characterized by delayed disambiguation, where early observations are ambiguous and multiple latent explanations remain plausible until sufficient evidence is gathered. Traditional methods, which rely on marginal inference, often collapse uncertainty too early or fail to recover once informative evidence is available. EviTrack addresses this by maintaining a set of competing trajectory hypotheses and applying evidence- and likelihood-ratio-based selection to delay commitment until data supports a decision. The framework is evaluated using a controlled synthetic benchmark that explicitly demonstrates delayed disambiguation, allowing for rigorous assessment of its performance. Results show that EviTrack significantly outperforms sampling-based baselines, achieving faster recovery post-disambiguation. This highlights the effectiveness of trajectory-level selection over increased sampling coverage, establishing selection over sampling as a critical principle for reliable sequential inference.
Methodology
EviTrack employs a trajectory-level evidence tracking approach, maintaining a bounded set of candidate latent trajectory prefixes. It scores these hypotheses based on accumulated predictive evidence and prunes implausible ones, allowing for ambiguity to persist until data becomes informative. The framework is rigorously evaluated using a synthetic benchmark designed to exhibit delayed disambiguation, where ground-truth latent posteriors are known.
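The idea of delaying commitment can be sketched with a toy example, assuming each hypothesis is a whole trajectory of predicted means, observations carry unit-variance Gaussian noise, and a hypothesis is pruned once its log-likelihood ratio against the current best exceeds a margin. The margin, the hypothesis format, and the function names are assumptions of this sketch rather than details from the paper.

```python
import math


def gaussian_logpdf(x, mean, std=1.0):
    return -0.5 * math.log(2 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2


def track(observations, hypotheses, margin=3.0):
    """Accumulate evidence per trajectory hypothesis and prune by likelihood ratio.

    hypotheses: dict mapping a name to a function t -> predicted mean at step t.
    Returns the surviving hypotheses with their log evidence, and the first step
    at which only one hypothesis remained (None if that never happened).
    """
    log_evidence = {name: 0.0 for name in hypotheses}
    committed_at = None
    for t, x in enumerate(observations):
        for name in log_evidence:
            log_evidence[name] += gaussian_logpdf(x, hypotheses[name](t))
        best = max(log_evidence.values())
        # Keep every hypothesis that is still within `margin` of the best one.
        log_evidence = {n: s for n, s in log_evidence.items() if best - s <= margin}
        if committed_at is None and len(log_evidence) == 1:
            committed_at = t
    return log_evidence, committed_at


# Two explanations that predict the same thing until step 5, then diverge.
hypotheses = {
    "left": lambda t: 0.0 if t < 5 else -2.0,
    "right": lambda t: 0.0 if t < 5 else 2.0,
}
observations = [0.1, -0.2, 0.0, 0.3, -0.1, 1.8, 2.1, 1.9]
survivors, committed_at = track(observations, hypotheses)
```

During the ambiguous prefix both hypotheses score identically, so neither is pruned; the decision is made only when the first informative observation arrives.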
Results
Under matched inference budgets, EviTrack significantly outperformed sequential importance sampling and bootstrap particle filtering across various metrics, including forecasting, filtering, and hypothesis quality. Ablation studies indicated that the performance gains were robust across different scoring functions and budget allocations, with EviTrack continuing to outperform particle filtering methods even with reduced computational resources.
Implications
The findings suggest that EviTrack could be applied in various real-world sequential prediction tasks where delayed disambiguation is prevalent, such as in robotics, autonomous systems, and time-series forecasting. The framework's ability to maintain uncertainty and delay commitment until sufficient evidence is available could lead to more reliable predictions in complex environments.
In-context learning enables continental-scale subsurface temperature prediction from sparse local observations
Theory
Interpretability
Optimization
- Introduces In-Context Earth, a transformer-based model for subsurface temperature prediction.
- Achieves a mean absolute error of 4.7 °C, outperforming traditional models.
- Maintains high accuracy in diverse geological regions without fine-tuning.
- Learns internal representations of subsurface properties, enhancing interpretability.
Summary
This paper addresses the challenge of predicting continental-scale subsurface temperatures from sparse borehole measurements, which are crucial for geothermal resource assessment and understanding heat transport in the Earth's crust. The authors introduce a transformer-based model called In-Context Earth, which leverages sparse local observations to predict continuous temperature-at-depth fields while providing calibrated uncertainty estimates. The model demonstrates superior performance compared to existing methods, achieving a mean absolute error of 4.7 °C in the contiguous United States and maintaining high accuracy in other regions like Alberta, Australia, and the UK without the need for fine-tuning. The model's interpretability analyses reveal that it learns representations of subsurface properties, such as seismic velocities and geochemistry, even without direct training on these features. This work illustrates the potential of in-context learning to enhance subsurface characterization using limited data, paving the way for improved geothermal resource mapping and other geoscience applications.
Methodology
The authors developed a transformer-based model that utilizes sparse local borehole observations as geological context to predict temperature-at-depth fields. The model incorporates innovations such as Earth-tailored data augmentation and multiscale positional encodings, allowing it to adapt to different geological regions without retraining.
Results
The In-Context Earth model achieved a mean absolute error of 4.7 °C in the contiguous United States and maintained errors of 2.2 °C in Alberta, 6.2 °C in Australia, and 5.4 °C in the UK. The model's uncertainty estimates were well calibrated, with a Kolmogorov–Smirnov statistic of 2.5%.
Implications
This research has significant implications for geothermal resource assessment and subsurface characterization, enabling more accurate mapping of thermal regimes across continental scales. It highlights the potential of deep learning techniques to overcome data sparsity challenges in geosciences.
Bridge: Retrieval-Augmented Spatiotemporal Modeling for Urban Delivery Demand
Graph Learning
Time Series
Optimization
- BRIDGE addresses urban delivery demand forecasting in cold-start regions with limited historical data.
- The framework integrates a contextual graph backbone with a retrieval mechanism for future demand patterns.
- A future-aware training objective enhances the retriever's effectiveness in aligning with forecasting needs.
- Experiments demonstrate consistent performance improvements over traditional spatiotemporal forecasting models.
Summary
This paper addresses the challenge of forecasting urban delivery demand in newly added service regions that lack historical data. Traditional spatiotemporal forecasting models struggle in these cold-start scenarios due to their parametric nature, which limits their ability to capture short-term operational dynamics. The authors propose BRIDGE, a retrieval-augmented spatiotemporal graph framework that combines an inductive contextual graph backbone with a time-aware memory of region-time windows. BRIDGE retrieves future demand patterns based on regional context and recent dynamics, refining forecasts through a gated fusion mechanism. The retriever is trained with a future-aware objective to prioritize entries that align closely with the target's future trajectories. The effectiveness of BRIDGE is demonstrated through experiments on four real-world delivery datasets, showing significant improvements over existing spatiotemporal baselines in both within-city cold-start and cross-city transfer scenarios. This approach highlights the importance of retrieval augmentation in enhancing operational memory for urban demand forecasting, particularly in situations where conventional parametric models fall short.
Methodology
BRIDGE employs a retrieval-augmented spatiotemporal graph framework that combines an inductive contextual graph backbone with a time-aware memory system. It retrieves relevant past demand patterns based on regional context and recent dynamics, refining forecasts through a gated fusion mechanism. The retriever is trained with a future-aware objective to optimize its retrieval process for forecasting utility.
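The retrieve-then-fuse step can be sketched as below, assuming memory entries are pairs of a context key and the demand that followed that window, retrieval is by cosine similarity, and the gate is a fixed scalar. In the paper the retriever and the gate are learned; every name and number here is hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, memory, k=2):
    """Return the k memory entries whose keys are most similar to the query."""
    ranked = sorted(memory, key=lambda entry: cosine(query, entry[0]), reverse=True)
    return ranked[:k]


def gated_fusion(backbone_forecast, retrieved, gate):
    """Blend the backbone forecast with the mean of the retrieved demand curves."""
    horizon = len(backbone_forecast)
    retrieved_mean = [
        sum(future[t] for _, future in retrieved) / len(retrieved)
        for t in range(horizon)
    ]
    return [gate * r + (1 - gate) * b for r, b in zip(retrieved_mean, backbone_forecast)]


# Memory of (context key, demand observed after that window) pairs.
memory = [
    ([1.0, 0.0], [10.0, 12.0]),
    ([0.9, 0.1], [8.0, 10.0]),
    ([0.0, 1.0], [100.0, 100.0]),
]
query = [1.0, 0.05]                 # context of the cold-start region
backbone_forecast = [5.0, 5.0]      # what the graph backbone predicts alone
neighbors = retrieve(query, memory, k=2)
fused = gated_fusion(backbone_forecast, neighbors, gate=0.5)
```

A gate near 1 trusts the retrieved patterns, which is useful when the region has almost no history of its own; a gate near 0 falls back to the backbone.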
Results
The experiments conducted on four real-world delivery datasets indicate that BRIDGE outperforms competitive spatiotemporal baselines in both cold-start scenarios within a city and cross-city transfer situations with partial observations. The results underscore the effectiveness of retrieval augmentation in improving demand forecasting accuracy.
Implications
The findings suggest that retrieval-augmented approaches can significantly enhance urban delivery demand forecasting, particularly in dynamic environments where historical data is sparse. This has practical implications for logistics and urban planning, enabling better resource allocation and service reliability in newly developed areas.
StampFormer: A Physics-Guided Material-Geometry-Coupled Multimodal Model for Rapid Prediction of Physical Fields in Sheet Metal Stamping
Multimodal
- Introduces a physics-guided deep learning framework for sheet metal forming.
- Integrates both geometric and material properties as multimodal inputs.
- Achieves rapid predictions of physical fields in under a second.
- Reduces simulation time from hours to subsecond AI inference.
Summary
The paper presents StampFormer, a novel physics-guided deep learning framework designed to enhance the efficiency of sheet metal forming processes by rapidly predicting physical fields such as thinning, strain, and displacement. Traditional methods rely heavily on Finite Element Analysis (FEA), which is time-consuming and computationally expensive, often leading to prolonged design cycles. Existing surrogate models either focus solely on geometry or fail to adequately incorporate material properties, limiting their effectiveness. StampFormer addresses these limitations by integrating both geometric and material data as multimodal inputs. The framework employs a Material-Augmented Geometric Network (MAGN) to fuse these inputs, followed by a Hierarchical Material Embedding Injection Unit (HMEIU) that enhances the integration before processing through an adapted Swin-UNet backbone. The model was evaluated on stamping simulations of steel and aluminum panels, demonstrating its ability to provide high-fidelity predictions in under a second, significantly reducing the simulation time compared to traditional FEA methods. The results indicate an average relative error of less than 8.5% for 2D fields and a mean squared error of less than 1.2 mm² for 3D displacement fields, showcasing the potential of StampFormer for real-time manufacturability assessments during the design cycle.
Methodology
The methodology involves a physics-guided deep learning framework that utilizes a Material-Augmented Geometric Network (MAGN) to fuse geometric and material data. This is followed by a Hierarchical Material Embedding Injection Unit (HMEIU) for enhanced integration, and the primary processing is conducted through an adapted Swin-UNet architecture.
Results
The StampFormer model demonstrated high-fidelity predictions with an average relative error of less than 8.5% for four 2D physical fields and a mean squared error of less than 1.2 mm² for the 3D displacement field, all achieved in under a second.
Implications
The implications of this research include the potential for significantly faster design validation processes in manufacturing, allowing for real-time assessments of manufacturability and reducing costs associated with traditional FEA methods. This could lead to more efficient design cycles and the ability to explore more complex geometries and materials in sheet metal forming.
Shaping the Prior: How Synthetic Task Distributions Determine Tabular Foundation Model Quality
Theory
Generative Models
Optimization
- Introduction of O-PRIOR, a compositional realism prior for tabular model pretraining.
- Establishment of a controlled evaluation protocol to isolate the effects of prior design.
- Demonstration of substantial improvements in model performance due to enhanced synthetic task distributions.
- Identification of independent contributions from various realism components in the prior.
Summary
This paper investigates the critical role of synthetic task distributions in determining the quality of tabular foundation models (TFMs). Unlike models in language or vision domains, TFMs rely heavily on synthetic pretraining distributions, which are often too idealized and fail to capture the complexities of real-world data. The authors introduce O-PRIOR, a novel compositional realism prior that incorporates a hierarchical structural causal model (SCM) meta-generator, a modular realism engine, and an explicit stress module to enhance the realism of synthetic tasks. By isolating prior design as a variable while keeping architecture and optimization constant, the study demonstrates that O-PRIOR significantly improves downstream accuracy and robustness across various tabular benchmarks, particularly in scenarios with distributional irregularities. The findings highlight the importance of synthetic prior construction as a key factor influencing TFM performance, providing insights for future research and practical applications in tabular data modeling.
Methodology
The authors developed O-PRIOR as a synthetic task generator that employs a hierarchical SCM meta-generator to create diverse functional families. It includes a modular realism engine to introduce realistic features such as missingness and confounding, and a stress module to simulate distribution shifts. The generation process is governed by a curriculum protocol that ensures leakage-safe transformations, allowing for controlled evaluation of the impact of different components on model performance.
Results
O-PRIOR consistently outperformed simpler baseline models across real tabular benchmarks, with significant gains observed in scenarios characterized by distributional irregularities. Ablation studies confirmed that the diversity of mechanisms, the composition of realism, and the incorporation of stress factors each contributed independently to the improved performance, underscoring the importance of synthetic prior design.
Implications
The findings suggest that careful design of synthetic task distributions is crucial for enhancing the robustness and accuracy of tabular foundation models. This has implications for various fields that rely on predictive modeling with tabular data, such as healthcare, finance, and industry, where real-world data often exhibit complexities not captured by traditional synthetic priors.
CLIC: Contextual Language-Informed Cardiac Pathology Classification
Multimodal
Time Series
NLP
- Introduction of CLIC, a multimodal framework for ECG-based cardiac pathology classification.
- Demonstrates the importance of integrating contextual patient data with ECG signals.
- Two configurations are explored: Data-to-Text and Prompt-guided approaches using LLMs.
- Template-based contextual descriptions outperform LLM-generated texts in classification tasks.
Summary
The paper presents CLIC, a multimodal framework for cardiac pathology classification that integrates electrocardiogram (ECG) signals with contextual patient data encoded as natural language. Traditional ECG analysis often overlooks the importance of contextual information, which can enhance diagnostic accuracy. CLIC addresses this gap by transforming patient-level metadata into descriptive text, allowing the model to better interpret complex physiological patterns. The authors explore two configurations: a Data-to-Text (DtT) approach and a Prompt-guided approach using Large Language Models (LLMs). They find that while LLM-generated texts are competitive, template-based contextual descriptions yield superior classification performance. The study utilizes the PTB-XL dataset for a multiclass classification task, focusing on five ECG pathology categories. Results indicate that incorporating contextual variables significantly improves classification performance compared to traditional ECG-only methods, with the best outcomes achieved through structured natural language representations.
Methodology
The study formulates the classification task as a supervised multiclass problem using the PTB-XL dataset. It employs a ResNet18 architecture for ECG signal processing and integrates contextual information through two methods: a Data-to-Text approach and a Prompt-guided approach with LLMs. The dataset includes structured metadata, allowing for a comprehensive multimodal analysis.
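A template-based Data-to-Text step might look like the sketch below. The metadata keys (`age`, `sex`, `height_cm`, `weight_kg`) and the wording of the template are hypothetical; the summary does not reproduce the templates the authors used, and the downstream text encoder is omitted.

```python
def describe_patient(meta):
    """Render structured patient metadata as a short natural-language description.

    Missing fields (absent or None) are skipped rather than guessed.
    """
    sentences = []
    age = meta.get("age")
    sex = meta.get("sex")
    if age is not None and sex is not None:
        sentences.append(f"The patient is a {age}-year-old {sex}.")
    elif age is not None:
        sentences.append(f"The patient is {age} years old.")
    elif sex is not None:
        sentences.append(f"The patient is {sex}.")
    if meta.get("height_cm") is not None:
        sentences.append(f"Height: {meta['height_cm']} cm.")
    if meta.get("weight_kg") is not None:
        sentences.append(f"Weight: {meta['weight_kg']} kg.")
    return " ".join(sentences) if sentences else "No patient context available."


record = {"age": 63, "sex": "female", "height_cm": 165, "weight_kg": None}
context_text = describe_patient(record)
```

A fixed template yields the same phrasing for every record, which may be one reason such descriptions are easier for a classifier to exploit than free-form LLM output.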
Results
The incorporation of contextual variables led to improved classification performance compared to ECG-only baselines. The best results were achieved using structured natural language representations, particularly in the Data-to-Text configuration, while LLM-generated texts did not consistently enhance performance.
Implications
The findings suggest that integrating contextual information into ECG analysis can lead to more accurate diagnoses in cardiology. This approach could be applied to other medical domains where multimodal data is available, potentially improving patient outcomes through enhanced decision-making.
D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting
Large Language Models
Efficient ML
Generative Models
- D-PACE introduces adaptive per-position weights for training, improving the efficiency of speculative decoding.
- The method stabilizes training through asymmetric smoothing, preserving token-level gradients.
- D-PACE shows significant improvements in decoding speed and emitted length across multiple benchmarks.
- The approach does not require changes to existing drafter architectures or inference processes.
Summary
The paper introduces D-PACE (Dynamic Position-Aware Cross-Entropy), a novel loss function designed to enhance the performance of speculative decoding in large language models (LLMs). Speculative decoding accelerates inference by allowing a smaller drafter model to propose tokens that a larger target model verifies in parallel. Existing multi-token drafter objectives often rely on fixed position-dependent weighting schedules, which do not adapt to the changing acceptance limits during training. D-PACE addresses this limitation by deriving per-position training weights from a differentiable surrogate of expected accepted draft length, aligning the weight of each position with its contribution to the log-probability gradient. This adaptive approach shifts the training signal towards positions that currently limit acceptance, thereby improving the drafter's performance. The authors demonstrate that D-PACE consistently enhances both wall-clock speedup and average emitted length across various benchmarks and configurations without altering the drafter architecture or inference procedure.
Methodology
The authors derive a weighted cross-entropy loss function, D-PACE, from a smooth surrogate of expected accepted draft length. This involves calculating per-position weights that adapt based on the current acceptance limitations during training. The method employs asymmetric smoothing to stabilize the cumulative-product weights while maintaining the integrity of token-level cross-entropy gradients.
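The link between expected accepted length and per-position weights can be made concrete under one plausible reading of the surrogate, assumed here rather than taken from the paper: if p_i is the drafter's probability of the target token at position i and a draft is accepted up to the first miss, then the expected accepted length is the sum of the prefix products of p. Differentiating with respect to log p_j gives the weight for position j as the sum of all prefix products that include position j. The asymmetric smoothing step is not modeled.

```python
def expected_accepted_length(probs):
    """Surrogate E[L] = sum over k of prod_{i<=k} p_i (assumed form)."""
    total, running = 0.0, 1.0
    for p in probs:
        running *= p
        total += running
    return total


def position_weights(probs):
    """w_j = d E[L] / d log p_j = sum over k>=j of prod_{i<=k} p_i."""
    prefix, running = [], 1.0
    for p in probs:
        running *= p
        prefix.append(running)
    weights, tail = [], 0.0
    for c in reversed(prefix):      # suffix sums of the prefix products
        tail += c
        weights.append(tail)
    return weights[::-1]


# Hypothetical per-position probabilities for a three-token draft.
probs = [0.9, 0.5, 0.8]
weights = position_weights(probs)
length = expected_accepted_length(probs)
```

Using these weights as constants in a weighted cross-entropy, the sum over j of w_j times -log p_j, gives a loss whose gradient with respect to each log p_j is the negative of the surrogate's gradient, so descending the loss ascends the expected accepted length.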
Results
D-PACE achieves an average decoding speedup of 8.0% to 9.7% and an average emitted length increase of 8.5% to 10.7% across various configurations of the Qwen3-4B model. For the Llama-3.1-8B and Qwen3-8B models, the average emitted length improves by approximately 12.5% and 12.8%, respectively.
Implications
The D-PACE method has the potential to significantly enhance the efficiency of LLM inference, making it applicable in real-time applications where speed and accuracy are critical. Its adaptive weighting mechanism could also inspire further research into dynamic training strategies for other machine learning models.
ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning
NLP
Large Language Models
Reinforcement Learning
- ReCrit addresses the instability of LLMs in multi-turn critic interactions.
- The framework decomposes correctness transitions into four distinct quadrants.
- Dynamic asynchronous rollout enhances training efficiency and scalability.
- Significant improvements in critic accuracy were observed across multiple scientific reasoning benchmarks.
Summary
The paper introduces ReCrit, a novel transition-aware reinforcement learning framework designed to enhance scientific reasoning in large language models (LLMs) during critic interactions. Traditional models often falter in multi-turn interactions, where user criticism can destabilize initially correct answers. ReCrit addresses this by framing the interaction as a correctness-transition problem rather than merely focusing on final-answer accuracy. It identifies three main challenges: transition awareness, the need to differentiate between useful corrections and harmful sycophancy, and the scalability of training. The framework decomposes the transition from an initial solution to a critic solution into four quadrants: Correction, Sycophancy, Robustness, and Boundary. By rewarding corrections and robustness while penalizing sycophancy, ReCrit effectively manages the transition dynamics. Additionally, it employs dynamic asynchronous rollout to improve training efficiency. The effectiveness of ReCrit is validated through experiments on three scientific reasoning benchmarks (ChemBench, TRQA, and EarthSE), demonstrating significant improvements in critic accuracy across different model sizes.
Methodology
ReCrit employs a transition-aware reinforcement learning approach, focusing on the correctness transition from initial to critic solutions. It uses quadrant-based rewards and penalties to manage the interaction dynamics and incorporates dynamic asynchronous rollout for efficient training.
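The quadrant decomposition can be sketched as a small reward function. The mapping from correctness transitions to quadrant names is inferred from the names themselves, and the reward magnitudes are placeholders; the summary states only that corrections and robustness are rewarded and sycophancy is penalized, so the zero reward for Boundary is an assumption.

```python
def classify_transition(initial_correct, final_correct):
    """Name the quadrant for a (before criticism, after criticism) correctness pair."""
    if not initial_correct and final_correct:
        return "Correction"    # wrong answer fixed after criticism
    if initial_correct and not final_correct:
        return "Sycophancy"    # right answer abandoned under criticism
    if initial_correct and final_correct:
        return "Robustness"    # right answer kept despite criticism
    return "Boundary"          # wrong before and after (assumed meaning)


# Placeholder reward values chosen for illustration only.
QUADRANT_REWARD = {
    "Correction": 1.0,
    "Robustness": 1.0,
    "Sycophancy": -1.0,
    "Boundary": 0.0,
}


def transition_reward(initial_correct, final_correct):
    return QUADRANT_REWARD[classify_transition(initial_correct, final_correct)]


rewards = [transition_reward(a, b) for a in (False, True) for b in (False, True)]
```

A reward based only on final-answer accuracy would score Correction and Robustness identically and could not distinguish Sycophancy from Boundary, which is the information the transition view adds.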
Results
On the ChemBench, TRQA, and EarthSE benchmarks, ReCrit improved average critic accuracy from 38.15 to 51.49 for the Qwen3.5-4B model and from 45.40 to 55.59 for the Qwen3.5-9B model, demonstrating its effectiveness in enhancing scientific reasoning capabilities.
Implications
The findings suggest that transition-aware reinforcement learning can significantly improve the reliability of LLMs in scientific reasoning tasks, making them more robust against misleading user criticism. This has potential applications in educational tools, scientific research assistants, and any domain requiring critical reasoning.
Efficient Conditioning Why Pseudo Observation Batch Bayesian Optimization Works When It Does not
Optimization
Theory
Efficient ML
- Efficient conditioning is identified as the key property for batch diversity in Bayesian Optimization.
- Gaussian Processes are proven to support efficient conditioning, unlike parametric models.
- A unified framework for CL, KB, and fantasy models is established, linking them to Local Penalization and Determinantal Point Processes.
- The Structural Diversity Diagnostic (SDD) is introduced to assess surrogate model compatibility.
Efficient Conditioning Why Pseudo Observation Batch Bayesian Optimization Works When It Does not
Summary
This paper addresses the effectiveness of pseudo-observation strategies in batch Bayesian Optimization (BO), specifically focusing on the Constant Liar (CL) and Kriging Believer (KB) models. The authors identify 'efficient conditioning' as a crucial property that allows surrogate models to update predictions in closed form when new data is added. They prove that Gaussian Processes (GPs) satisfy this requirement, leading to distinct batch points during optimization. The paper unifies CL, KB, and fantasy models under a single conditioning mechanism and establishes connections to Local Penalization and Determinantal Point Processes. A new methodology, the Structural Diversity Diagnostic (SDD), is introduced to evaluate surrogate compatibility with pseudo-observation batch selection. The authors validate their theoretical findings through experiments on various optimization tasks, demonstrating that CL/KB can match or outperform explicit Local Penalization and achieve competitive results with joint q-EI. The findings extend to Multiquadric RBF networks and highlight the limitations of parametric surrogates in maintaining batch diversity.
Methodology
The authors develop a theoretical framework to characterize surrogate model requirements for pseudo-observation batch selection. They introduce the Structural Diversity Diagnostic (SDD) to eliminate optimizer randomness and confirm batch diversity as a structural property. Experiments are conducted on benchmark optimization problems to validate theoretical predictions.
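A minimal sketch of Constant Liar batch selection with a Gaussian Process surrogate may make the role of closed-form conditioning concrete. It assumes a 1-D input, a squared-exponential kernel, and an upper-confidence-bound acquisition (swapped in for simplicity; the paper's experiments may use a different acquisition). All names and hyperparameters are illustrative.

```python
import math

def rbf(a, b, length=0.3):
    """Squared-exponential kernel on scalars."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def posterior(xs, ys, x, noise=1e-4):
    """Closed-form GP posterior mean and variance at x given (xs, ys)."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x) for a in xs]
    mean = sum(ki * wi for ki, wi in zip(k, solve(K, ys)))
    var = 1.0 - sum(ki * wi for ki, wi in zip(k, solve(K, k)))
    return mean, max(var, 0.0)

def ucb(xs, ys, x, beta=2.0):
    mean, var = posterior(xs, ys, x)
    return mean + beta * math.sqrt(var)

def constant_liar_batch(xs, ys, candidates, q):
    """Pick q points; after each pick, condition on a fake observation."""
    xs, ys, batch = list(xs), list(ys), []
    lie = min(ys)  # the constant "lie"; mean or max are other common choices
    for _ in range(q):
        best = max(candidates, key=lambda x: ucb(xs, ys, x))
        batch.append(best)
        xs.append(best)  # pseudo-observation collapses the variance at `best`
        ys.append(lie)
    return batch

candidates = [i / 20 for i in range(21)]
batch = constant_liar_batch([0.1, 0.5, 0.9], [0.2, 0.8, 0.3], candidates, q=3)
```

Conditioning on the pseudo-observation drives the posterior variance at the chosen point to nearly zero, so the acquisition there drops and the next pick moves elsewhere. A surrogate that cannot be updated cheaply after each pseudo-observation has no comparable mechanism.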
Results
The experiments confirm that CL and KB strategies achieve batch diversity effectively, matching or outperforming explicit Local Penalization. The results also show that parametric surrogates struggle to maintain batch diversity, while GPs and Multiquadric RBF networks demonstrate robust performance across various datasets and noise conditions.
Implications
The findings suggest that efficient conditioning is critical for the success of batch Bayesian Optimization, particularly as the field scales beyond Gaussian Processes. This work may influence the design of surrogate models in optimization tasks, especially in scenarios involving expensive evaluations.
Position: Graph Condensation Needs a Reset -- Move Beyond Full-dataset Training and Model-Dependence
Graph Learning
Efficient ML
Optimization
- Current graph condensation methods rely on full-dataset training, undermining efficiency.
- Gradient matching introduces high computational overhead and poor generalization across GNNs.
- Existing evaluation protocols do not accurately reflect resource savings and overhead.
- A shift towards lightweight and architecture-agnostic methods is necessary for practical deployment.
Position: Graph Condensation Needs a Reset -- Move Beyond Full-dataset Training and Model-Dependence
Summary
This position paper critiques the current state of graph condensation methods used in Graph Neural Networks (GNNs), highlighting systemic flaws that hinder scalability and efficiency. The authors argue that the prevalent approach of gradient matching necessitates training on the full dataset, which contradicts the goal of creating a condensed, smaller graph. This reliance on full-dataset training leads to high computational costs and poor generalization across different GNN architectures. Furthermore, the authors point out that existing evaluation metrics, such as node compression ratios, do not accurately reflect the true resource savings or the overhead involved in condensation processes. They advocate for a paradigm shift towards lightweight, architecture-agnostic methods that can be practically deployed, emphasizing the need for new research directions that prioritize efficiency and usability in GNN training at scale. By addressing these methodological flaws, the authors aim to reorient the field towards achieving the true potential of graph condensation.
Methodology
The paper employs a critical analysis of existing graph condensation methods, particularly focusing on the gradient matching approach. It identifies key flaws in methodology and evaluation protocols, proposing a reset in the approach to graph condensation that emphasizes lightweight and architecture-agnostic solutions.
Results
The authors do not present experimental results but rather provide a conceptual framework for understanding the limitations of current methods and the need for a paradigm shift in graph condensation practices.
Implications
The proposed shift towards more efficient and generalizable graph condensation methods could significantly enhance the scalability of GNNs in various applications, including recommender systems, fraud detection, and molecular biology, making them more feasible for real-world use.
MANGO: Meta-Adaptive Network Gradient Optimization for Online Continual Learning
Optimization
Theory
Efficient ML
- MANGO effectively balances stability and plasticity in online continual learning.
- The gradient-gating mechanism selectively scales parameter updates based on sensitivity.
- Meta-learned regularization dynamically adapts stability coefficients to prevent catastrophic forgetting.
- MANGO achieves state-of-the-art performance across multiple benchmark datasets.
MANGO: Meta-Adaptive Network Gradient Optimization for Online Continual Learning
Summary
The paper presents MANGO, a novel framework for Online Continual Learning (OCL) that addresses the stability-plasticity dilemma faced by neural networks learning from non-stationary data streams. Unlike traditional offline continual learning, OCL requires models to learn from data in a single pass with limited memory for past samples, leading to challenges such as catastrophic forgetting. MANGO introduces a gradient-gating mechanism that modulates parameter updates based on their sensitivity, preventing destructive updates to important parameters. Additionally, it employs meta-learned regularization to dynamically adapt stability coefficients, allowing the model to evaluate the impact of updates on past knowledge. The framework treats replay as both a training signal and a forgetting evaluator, enhancing the model's ability to retain knowledge while learning new tasks. Evaluations on benchmark datasets (CIFAR-100, Tiny-ImageNet, and CLEAR-10) demonstrate that MANGO outperforms existing methods, achieving state-of-the-art results and positive Backward Transfer, indicating improved retention of previously learned tasks.
Methodology
MANGO employs a gradient-gating mechanism to modulate parameter updates based on their sensitivity, preventing harmful updates to critical parameters. It also utilizes meta-learned regularization to adaptively learn stability coefficients, evaluating the impact of updates on a replay buffer that serves as both a training signal and a forgetting evaluator.
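The idea of scaling updates by parameter sensitivity can be sketched as follows. The gate form `1 / (1 + sensitivity / temperature)` and the plain SGD update are assumptions chosen for illustration; the summary does not specify how MANGO computes sensitivities or its gate, and the meta-learned regularization is not modeled here.

```python
def gate(sensitivity, temperature=1.0):
    """Hypothetical gate in (0, 1]: higher sensitivity gives a smaller update."""
    return 1.0 / (1.0 + sensitivity / temperature)

def gated_sgd_step(params, grads, sensitivities, lr=0.1):
    """One SGD step where each parameter's update is scaled by its gate."""
    return [p - lr * gate(s) * g for p, g, s in zip(params, grads, sensitivities)]

# Two parameters with equal gradients; the second is marked as highly sensitive.
updated = gated_sgd_step([1.0, 1.0], [1.0, 1.0], [0.0, 9.0])
```

In this toy case the insensitive parameter takes the full step while the sensitive one moves one tenth as far.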
Results
MANGO outperformed strong baselines, achieving state-of-the-art results on CIFAR-100, Tiny-ImageNet, and CLEAR-10. It surpassed the previous state-of-the-art method LODE by 5.36% on CIFAR-100 and 15.18% on Tiny-ImageNet. Notably, MANGO achieved a positive Backward Transfer of +15.12% on CLEAR-10, indicating that learning new tasks improved retention of previously learned knowledge.
Implications
The findings suggest that MANGO can be applied in scenarios requiring continual learning from streaming data, such as robotics, autonomous systems, and real-time data analysis, where retaining knowledge while adapting to new information is crucial.
Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction
Theory
- Enforcing Euclidean isotropy in JEPAs can misalign representations with structured task geometries.
- The paper derives the minimax and maximum-entropy covariance under a Hamiltonian energy budget, highlighting the cost of isotropy.
- HamJEPA introduces a phase-space representation that leverages Hamiltonian dynamics for improved predictive coupling.
- Empirical results show significant performance improvements over existing methods on CIFAR-100 and ImageNet-100 datasets.
Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction
Summary
This paper critiques the common practice in Joint Embedding Predictive Architectures (JEPAs) of regularizing embeddings towards an isotropic Gaussian distribution, arguing that this approach can misalign representations with the underlying geometry of structured tasks. The author derives the minimax and maximum-entropy covariance for a known structured downstream geometry, revealing that enforcing Euclidean isotropy incurs a significant cost. When the downstream geometry is unknown, the paper asserts that no fixed covariance shape is universally applicable, leading to potential misalignment in predictive coupling. To address these issues, the author introduces HamJEPA, a novel phase-space JEPA that incorporates Hamiltonian structure into the predictive model using a symplectic leapfrog map. This method allows for non-isotropic scaling and spectral controls to prevent representation collapse. Empirical results demonstrate that HamJEPA outperforms existing methods on benchmark datasets, indicating the effectiveness of incorporating Hamiltonian dynamics into the predictive architecture.
Methodology
The paper employs theoretical analysis to derive covariance structures under Hamiltonian constraints and introduces HamJEPA, which encodes views as phase-space states and predicts transitions using a learned Hamiltonian leapfrog map. The methodology includes empirical validation against existing models using benchmark datasets.
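For readers unfamiliar with symplectic integrators, a leapfrog step can be sketched as below. This assumes a fixed separable Hamiltonian H(q, p) = 0.5|p|^2 + V(q) with a hand-written gradient; in HamJEPA the Hamiltonian is learned, which this sketch does not attempt to reproduce.

```python
def leapfrog(q, p, grad_potential, step, n_steps=1):
    """Kick-drift-kick leapfrog for H(q, p) = 0.5*|p|^2 + V(q)."""
    q, p = list(q), list(p)
    for _ in range(n_steps):
        g = grad_potential(q)
        p = [pi - 0.5 * step * gi for pi, gi in zip(p, g)]  # half kick
        q = [qi + step * pi for qi, pi in zip(q, p)]        # full drift
        g = grad_potential(q)
        p = [pi - 0.5 * step * gi for pi, gi in zip(p, g)]  # half kick
    return q, p

def grad_v(q):
    """Gradient of the toy potential V(q) = 0.5*|q|^2 (harmonic oscillator)."""
    return list(q)

def energy(q, p):
    return 0.5 * sum(x * x for x in p) + 0.5 * sum(x * x for x in q)

q_end, p_end = leapfrog([1.0], [0.0], grad_v, step=0.1, n_steps=1000)
energy_drift = abs(energy(q_end, p_end) - energy([1.0], [0.0]))
```

Because the map is symplectic, the energy error stays bounded over long rollouts instead of accumulating, which is the property that makes it attractive as a structured predictor.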
Results
HamJEPA achieves improvements of +4.89 kNN@20 and +3.52 linear-probe points on CIFAR-100 at 30 epochs, and +6.45 kNN@20 and +10.64 linear-probe points at 80 epochs. On ImageNet-100, it shows an improvement of +4.82 kNN@20 and +7.52 linear-probe points at 45 epochs, demonstrating the effectiveness of the proposed architecture.
Implications
The findings suggest that incorporating Hamiltonian geometry into predictive architectures can enhance representation learning in structured tasks. This approach may lead to more effective models in various applications, particularly in scenarios where the underlying geometry is complex or unknown.
AR1-ZO: Topology-Aware Rank-1 Zeroth-Order Queries for High-Rank LoRA Fine-Tuning
NLP
Large Language Models
Optimization
- Introduces AR1-ZO, a method that optimizes LoRA fine-tuning using zeroth-order optimization.
- Identifies and resolves the measurement-topology problem that complicates high-rank LoRA optimization.
- Proposes a topology-aware scaling mechanism that restores rank-invariant active signals.
- Demonstrates the effectiveness of AR1-ZO through theoretical proofs and empirical validation on OPT and Qwen3 models.
AR1-ZO: Topology-Aware Rank-1 Zeroth-Order Queries for High-Rank LoRA Fine-Tuning
Summary
This paper introduces AR1-ZO, a novel optimization method that combines zeroth-order (ZO) optimization with Low-Rank Adaptation (LoRA) for fine-tuning large language models (LLMs). The authors address a critical challenge known as the rank paradox, where increasing the LoRA rank enhances adapter capacity but complicates the ZO optimization process. They identify the core issue as a measurement-topology problem rather than a need for external subspaces. By leveraging the inherent structure of LoRA, which decomposes into rank-1 atoms, AR1-ZO optimizes the querying process to maintain a fixed adapter rank while effectively managing the perturbation dimension. The authors propose a topology-aware scaling mechanism that corrects the active finite-difference signal, which is crucial for maintaining performance at high ranks. Theoretical proofs support their claims, and experiments demonstrate that AR1-ZO significantly improves the effectiveness of high-rank LoRA fine-tuning compared to existing ZO methods under standard query budgets.
Methodology
The authors develop AR1-ZO by alternating between querying complete rank-1 atoms of LoRA while applying a topology-aware scaling factor to maintain the active finite-difference signal. This approach allows them to keep the adapter rank constant while reducing the perturbation dimension. Theoretical analysis is complemented by empirical experiments on large language models to validate the proposed method.
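A rough sketch of the alternating rank-1 query idea follows: a two-point zeroth-order estimate that perturbs only one rank-1 atom of the adapter per query. The toy squared-error loss, the matrix sizes, and the `scale` argument (a placeholder left at 1.0 for the topology-aware factor, whose actual form the summary does not give) are all assumptions.

```python
import random

def lora_weight(W0, B, A):
    """W = W0 + sum_i outer(B[i], A[i]); each (B[i], A[i]) pair is one rank-1 atom."""
    W = [row[:] for row in W0]
    for b, a in zip(B, A):
        for r in range(len(W)):
            for c in range(len(W[0])):
                W[r][c] += b[r] * a[c]
    return W

def loss(W, target):
    """Toy objective: squared error to a target matrix."""
    return sum((w - t) ** 2 for rw, rt in zip(W, target) for w, t in zip(rw, rt))

def zo_atom_step(W0, B, A, target, atom, rng, eps=1e-3, lr=1e-3, scale=1.0):
    """Two-point ZO update that perturbs a single rank-1 atom."""
    zb = [rng.gauss(0, 1) for _ in B[atom]]
    za = [rng.gauss(0, 1) for _ in A[atom]]

    def moved(t):
        # Copy of (B, A) with only the chosen atom shifted by t along (zb, za).
        Bm, Am = [b[:] for b in B], [a[:] for a in A]
        Bm[atom] = [b + t * z for b, z in zip(B[atom], zb)]
        Am[atom] = [a + t * z for a, z in zip(A[atom], za)]
        return Bm, Am

    f_plus = loss(lora_weight(W0, *moved(eps)), target)
    f_minus = loss(lora_weight(W0, *moved(-eps)), target)
    g = scale * (f_plus - f_minus) / (2 * eps)  # directional-derivative estimate
    return moved(-lr * g)

rng = random.Random(0)
W0 = [[0.0, 0.0], [0.0, 0.0]]
target = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.1, 0.2], [0.3, -0.1]]
A = [[0.2, -0.1], [0.1, 0.3]]
initial_loss = loss(lora_weight(W0, B, A), target)
for step in range(200):
    B, A = zo_atom_step(W0, B, A, target, atom=step % len(B), rng=rng)
final_loss = loss(lora_weight(W0, B, A), target)
```

The adapter keeps its full rank (two atoms here) while each query perturbs only the parameters of one atom, so the perturbation dimension per query does not grow with the rank.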
Results
The experiments conducted on OPT and Qwen3 models reveal that AR1-ZO outperforms traditional two-point ZO methods, effectively utilizing high-rank LoRA configurations. The results confirm the theoretical predictions regarding the active finite-difference signal and demonstrate improved efficiency and performance in fine-tuning tasks.
Implications
The findings suggest that AR1-ZO can enhance the fine-tuning of large language models under memory constraints, making it a valuable tool for practitioners in NLP and related fields. The method's ability to efficiently manage high-rank adaptations could lead to better model performance in various applications, including text generation and understanding.
Leveraging Error Diversity in Group Rollouts for Reinforcement Learning
Reinforcement Learning
Large Language Models
Optimization
- Error diversity within group rollouts is a critical factor for training success in RLVR.
- EDAS is a lightweight, algorithm-agnostic method that reshapes advantage signals based on error diversity.
- The method encourages exploration of diverse reasoning paths and discourages repetitive errors.
- Extensive empirical validation shows significant performance improvements across various benchmarks.
Leveraging Error Diversity in Group Rollouts for Reinforcement Learning
Summary
This paper addresses a significant oversight in Reinforcement Learning from Verifiable Rewards (RLVR), where the diversity of errors in group rollouts is often ignored. The authors present a novel technique called Error Diversity Advantage Shaping (EDAS), which adjusts the advantage signal based on the intra-group error diversity. Their empirical analysis shows that diverse errors within a group can predict training success, with diverse wrong answers leading to better performance improvements compared to homogeneous errors. EDAS amplifies penalties for common errors while reducing penalties for rare ones, promoting exploration and reducing error repetition. The method is lightweight, algorithm-agnostic, and can be integrated into existing RLVR frameworks without altering the core algorithms. Validation across various models and benchmarks demonstrates that EDAS consistently enhances performance, achieving an average improvement of 6.29 points over the DAPO baseline in mathematical reasoning tasks and 1.45 points in code generation tasks. This research highlights the importance of error diversity in RLVR training and provides a practical approach to leverage this information for better learning outcomes.
Methodology
The authors conducted empirical analyses to assess the impact of error diversity on RLVR training efficacy. They developed the EDAS technique to modulate advantage signals based on the distribution of errors within group rollouts. This involved amplifying penalties for frequent errors and attenuating them for rare ones, promoting diverse reasoning paths. The method was validated on multiple RLVR frameworks and various challenging benchmarks.
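The reshaping step can be sketched as below. The specific rule (scale a wrong rollout's advantage by how much its answer's share among the group's errors exceeds a uniform share) and the `strength` parameter are assumptions for illustration, not the paper's formula.

```python
from collections import Counter

def shape_advantages(answers, correct, base_adv, strength=0.5):
    """Amplify penalties for common wrong answers, attenuate them for rare ones."""
    wrong = [a for a, ok in zip(answers, correct) if not ok]
    counts = Counter(wrong)
    shaped = []
    for a, ok, adv in zip(answers, correct, base_adv):
        if ok:
            shaped.append(adv)  # correct rollouts are left unchanged
            continue
        share = counts[a] / len(wrong)  # fraction of errors with this answer
        uniform = 1.0 / len(counts)     # share if all distinct errors were equally common
        shaped.append(adv * (1.0 + strength * (share - uniform)))
    return shaped

answers = ["7", "7", "7", "3", "12"]
correct = [False, False, False, False, True]
base_adv = [-1.0, -1.0, -1.0, -1.0, 4.0]
shaped = shape_advantages(answers, correct, base_adv)
```

Here the repeated wrong answer "7" is penalized more heavily (-1.125) than the rare wrong answer "3" (-0.875), while the correct rollout keeps its advantage. Because the rule only rescales advantages, it can sit on top of an existing group-based method without changing the rest of the update.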
Results
EDAS demonstrated consistent improvements in model performance, achieving an average increase of 6.29 points over the DAPO baseline in mathematical reasoning tasks and 1.45 points in code generation tasks. The results confirm that leveraging error diversity in group rollouts is an effective strategy for enhancing RLVR training.
Implications
The findings suggest that incorporating error diversity into RLVR training can lead to more robust models capable of better generalization and problem-solving. This approach may have applications in various domains requiring complex reasoning, such as formal mathematics and competitive programming.
Delta Attention Residuals
NLP
Large Language Models
Theory
- Identifies the routing collapse problem in Attention Residuals due to source redundancy.
- Proposes Delta Attention Residuals that route over deltas instead of cumulative states.
- Demonstrates improved attention sharpness and model performance across various scales.
- Enables easy conversion of pretrained models into Delta Attention Residuals via fine-tuning.
Delta Attention Residuals
Summary
The paper introduces Delta Attention Residuals, a novel approach to improve the routing mechanisms in deep transformer models. Traditional Attention Residuals utilize learned softmax attention over cumulative hidden states, which leads to redundancy and a phenomenon termed 'routing collapse' in deeper layers, where attention weights become nearly uniform and less discriminative. The authors propose to route over deltas (the changes introduced by each sublayer) rather than cumulative states. This method enhances the diversity of representations and results in sharper attention distributions, allowing for more effective cross-layer routing. The proposed Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals across various model scales, achieving significant reductions in validation perplexity. Additionally, the method allows for the straightforward conversion of pretrained models into Delta Attention Residuals through fine-tuning, demonstrating its practical applicability in improving existing transformer architectures.
Methodology
The authors analyze the limitations of standard Attention Residuals and propose a new routing mechanism that focuses on the deltas (changes) introduced by each sublayer. They implement this mechanism at both per-sublayer and block levels, using an additive formulation that preserves the residual stream while enhancing routing sharpness. Empirical evaluations are conducted across multiple model sizes to assess performance improvements.
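A toy comparison of routing over cumulative states versus deltas is sketched below. The dot-product scoring, the fixed query, and the two-dimensional states are assumptions chosen to make the redundancy effect visible; the paper's learned routing and its additive formulation are not reproduced.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def routing_weights(query, sources):
    """Softmax over scaled dot products between a query and candidate sources."""
    d = len(query)
    return softmax([sum(q * s for q, s in zip(query, src)) / math.sqrt(d)
                    for src in sources])

def to_deltas(states):
    """Turn cumulative states h_0, h_1, ... into [h_0, h_1 - h_0, h_2 - h_1, ...]."""
    deltas = [states[0]]
    for prev, cur in zip(states, states[1:]):
        deltas.append([c - p for c, p in zip(cur, prev)])
    return deltas

# Cumulative states share a large common component along the first axis.
states = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0]]
query = [2.0, 0.0]
w_cumulative = routing_weights(query, states)       # uniform: every source scores the same
w_delta = routing_weights(query, to_deltas(states)) # concentrates on the first source
```

In this example every cumulative state carries the same component along the query direction, so the weights are uniform, whereas the deltas separate that component out and the routing becomes selective.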
Results
Delta Attention Residuals show consistent performance improvements over standard residuals and Attention Residuals, with validation perplexity reductions ranging from 1.7% to 8.2%. The maximum softmax weight for attention routing increases significantly, indicating sharper and more selective attention distributions.
Implications
The findings suggest that routing mechanisms in deep learning models can be significantly improved by focusing on the changes introduced by layers rather than cumulative outputs. This has implications for the design of more efficient and effective transformer architectures, potentially enhancing performance in various NLP tasks.
MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
Optimization
Theory
Efficient ML
- Establishes the generalization error of the Muon optimizer as O(1/(NκT)).
- Introduces the MiMuon optimizer, which improves generalization error to O(1/N).
- Proves that MiMuon maintains the same convergence rate as the Muon optimizer.
- Demonstrates the effectiveness of MiMuon through numerical experiments on large models.
MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
Summary
The paper introduces the MiMuon optimizer, an enhancement of the Muon optimizer designed for large-scale models with matrix-structured parameters. While the Muon optimizer has demonstrated faster convergence than traditional vector-wise optimization methods, its generalization properties had not been thoroughly analyzed. This study addresses that gap by establishing the generalization error of the Muon optimizer, proving it to be O(1/(NκT)), where N is the training sample size, T is the iteration number, and κ is a small positive constant related to the gradient estimate's singular values. To improve generalization, the authors propose the MiMuon optimizer, which incorporates orthogonalization of gradients, resulting in a lower generalization error of O(1/N). The paper also confirms that MiMuon retains the same convergence rate of O(1/T^(1/4)) as the original Muon optimizer. Experimental results on large models, including Qwen3-0.6B and YOLO26m, validate the efficiency of the MiMuon optimizer in practice.
Methodology
The authors utilize algorithmic stability and mathematical induction to analyze the generalization error of the Muon optimizer. They then propose the MiMuon optimizer, which combines features of the Muon and momentum-based SGD optimizers, particularly focusing on the orthogonalization of gradients to enhance generalization. The convergence properties of MiMuon are also studied and compared to those of the Muon optimizer.
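Gradient orthogonalization, the operation referred to above, can be sketched with a classical cubic Newton-Schulz iteration, which pushes all singular values of a matrix toward 1. This is a generic illustration: the specific iteration, the iteration count, and how MiMuon mixes the orthogonalized direction with momentum SGD are not taken from the paper.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def orthogonalize(G, iters=30):
    """Approximate the orthogonal polar factor of a nonzero, full-rank matrix G."""
    norm = math.sqrt(sum(x * x for row in G for x in row))
    X = [[x / norm for x in row] for row in G]  # singular values now lie in (0, 1]
    for _ in range(iters):
        XXtX = matmul(matmul(X, transpose(X)), X)
        # Cubic Newton-Schulz update: X <- 1.5*X - 0.5*X*X^T*X
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X

G = [[2.0, 0.0], [1.0, 1.0]]  # stand-in for a gradient matrix
Q = orthogonalize(G)
QQt = matmul(Q, transpose(Q))  # close to the identity
```

The result keeps the singular directions of the gradient but discards its singular values, so every direction receives an update of equal magnitude.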
Results
The MiMuon optimizer shows a significant improvement in generalization error compared to the Muon optimizer, reducing it from O(1/(NκT)) to O(1/N). Both optimizers exhibit the same convergence rate of O(1/T^(1/4)). Experimental results confirm the efficiency of MiMuon in training large models.
Implications
The findings suggest that the MiMuon optimizer could be a valuable tool for training large-scale AI models, particularly in scenarios where generalization performance is critical. This could lead to more robust AI systems that perform better on unseen data.
LoRA vs. Full Fine-Tuning: A Theoretical Perspective
Theory
Efficient ML
Large Language Models
- LoRA can achieve lower excess risk than FFT under certain conditions, particularly when task differences are low-rank.
- The choice of LoRA rank is crucial for generalization performance, with small ranks sometimes improving test accuracy despite limiting expressivity.
- Theoretical bounds for LoRA's excess risk are established, providing a clearer understanding of its performance compared to FFT.
- Empirical experiments support the theoretical findings, indicating broader applicability to LLM fine-tuning.
LoRA vs. Full Fine-Tuning: A Theoretical Perspective
Summary
This paper investigates the theoretical underpinnings of Low-Rank Adaptation (LoRA) compared to Full Fine-Tuning (FFT) in the context of adapting pre-trained models to downstream tasks. The authors focus on a linear regression framework to analyze the excess risk associated with both methods. They establish that LoRA can outperform FFT in scenarios where the difference between pretraining and downstream tasks is low-rank. The paper presents novel theoretical bounds for the excess risk of LoRA in both overdetermined and underdetermined settings, revealing that the choice of LoRA rank significantly influences generalization performance. The findings are validated through experiments, suggesting that the insights gained extend beyond linear regression to practical applications in fine-tuning large language models (LLMs).
Methodology
The authors analyze LoRA and FFT within a linear regression framework, deriving closed-form solutions for both methods. They formulate the empirical risk minimization problem and derive excess risk bounds for LoRA in different settings. The theoretical analysis is complemented by controlled experiments to validate the findings.
Results
The study reveals that LoRA can outperform FFT in low-rank task scenarios, with specific rank choices affecting generalization. The theoretical bounds established for LoRA's excess risk provide insights into its behavior compared to FFT, particularly in high-dimensional settings. Experimental results corroborate the theoretical predictions, demonstrating the practical implications of the findings.
Implications
The insights from this study can inform the selection of fine-tuning strategies for large language models, particularly in resource-constrained environments. Understanding the trade-offs between LoRA and FFT can lead to more efficient model adaptations, reducing computational costs while maintaining performance.
PhysioSeq2Seq: A Hybrid Physiological Digital Twin and Sequence-to-Sequence LSTM for Long-Horizon Glucose Forecasting in Type 1 Diabetes
Time Series
- PhysioSeq2Seq combines physiological modeling with Seq2Seq LSTM for improved glucose forecasting.
- The twin matching approach allows for patient-specific adaptation without retraining.
- Incorporating internal ODE state variables as covariates reduces long-horizon prediction bias.
- The model significantly outperforms traditional LSTM and ODE-based approaches in accuracy.
PhysioSeq2Seq: A Hybrid Physiological Digital Twin and Sequence-to-Sequence LSTM for Long-Horizon Glucose Forecasting in Type 1 Diabetes
Summary
The paper presents PhysioSeq2Seq, a novel hybrid architecture designed for accurate long-horizon glucose forecasting in individuals with Type 1 Diabetes (T1D). Traditional LSTM networks often suffer from systematic negative bias due to error compounding over long prediction horizons, while mechanistic ordinary differential equation (ODE) models struggle with individual variability when parameterized at the population level. PhysioSeq2Seq addresses these issues by combining patient-specific physiological modeling with a sequence-to-sequence (Seq2Seq) LSTM. The architecture employs a twin matching approach to identify the best-fitting physiological model from a population of digital twins based on a patient's continuous glucose monitoring (CGM) history. The internal state variables from the matched ODE model are incorporated as exogenous covariates into both the encoder and decoder of the Seq2Seq LSTM, allowing for a simultaneous 48-step prediction that mitigates recursive error accumulation. The model was trained on data from 348 participants and evaluated on 74 held-out participants, achieving significant improvements in forecasting accuracy compared to standard LSTM and ODE models. The findings suggest that integrating patient-matched physiological states into neural models can enhance long-horizon glucose predictions, with potential applications extending beyond T1D management.
Methodology
The PhysioSeq2Seq architecture utilizes a hybrid approach that combines a sequence-to-sequence LSTM with patient-specific physiological modeling. It employs twin matching to select the best-fitting ODE model from a population of digital twins based on a patient's CGM data. The internal state variables of the matched ODE model are injected as exogenous covariates into the Seq2Seq LSTM, allowing for simultaneous multi-step predictions and reducing recursive error accumulation.
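Twin matching can be sketched as a nearest-trace search. The RMSE criterion, the trace values, and the twin names are illustrative assumptions; the summary does not state which fit metric the paper uses.

```python
import math

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def match_twin(cgm_history, twin_traces):
    """Return the name of the digital twin whose simulated trace best fits the history."""
    return min(twin_traces, key=lambda name: rmse(cgm_history, twin_traces[name]))

# Hypothetical CGM history (mg/dL) and simulated traces from three twins.
history = [110.0, 125.0, 150.0, 160.0]
twins = {
    "twin_a": [100.0, 105.0, 110.0, 115.0],
    "twin_b": [112.0, 128.0, 148.0, 158.0],
    "twin_c": [140.0, 170.0, 200.0, 220.0],
}
best = match_twin(history, twins)  # "twin_b"
```

Once a twin is selected, its internal ODE states would be passed to the forecaster as extra input channels; since selection is only a lookup, adapting to a new patient requires no retraining.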
Results
At a 240-minute forecasting horizon, PhysioSeq2Seq achieved a mean absolute error (MAE) of 39.28 mg/dL and a mean error (ME) of -10.62 mg/dL, which represents a reduction in bias by 13.89 mg/dL compared to the recursive LSTM and a 28.62 mg/dL improvement over the ODE-based digital twin.
Implications
The successful integration of physiological modeling with deep learning for glucose forecasting suggests that similar hybrid approaches could be beneficial in other medical and health-related domains where mechanistic models face challenges in individual variability. This could lead to more personalized and effective treatment strategies for chronic conditions.
Learning Variable-Length Tokenization for Generative Recommendation
Generative Models
Optimization
Theory
- Introduction of the Popularity-Length Paradox in generative recommendation.
- Development of VarLenRec framework for variable-length tokenization.
- Use of PIBA for optimal identifier length allocation based on item popularity.
- Implementation of Hyperbolic Residual Quantization to manage diverse code lengths.
Learning Variable-Length Tokenization for Generative Recommendation
Summary
This paper addresses a significant limitation in generative recommendation systems, which traditionally use fixed-length tokenization for item identifiers, ignoring the varying needs based on item popularity. The authors identify the Popularity-Length Paradox, where popular items perform better with shorter IDs due to abundant collaborative signals, while less popular items (tail items) require longer IDs to capture necessary semantic details. To tackle this issue, they propose VarLenRec, a framework that learns variable-length tokenization tailored to item popularity. The authors introduce Popularity-Weighted Information Budget Allocation (PIBA) to allocate identifier lengths optimally and address technical challenges such as the limitations of Euclidean residual quantization and the non-differentiability of length decisions. They implement Hyperbolic Residual Quantization to stratify encoding capacity and a Soft Length Controller to enable differentiable length prediction. Extensive experiments across four datasets demonstrate that VarLenRec significantly outperforms state-of-the-art methods in recommendation accuracy and enhances training and inference efficiency, emphasizing the importance of adaptive encoding in generative recommendation systems.
Methodology
The authors conducted systematic experiments to analyze the relationship between item popularity and optimal ID length, leading to the development of VarLenRec. They utilized PIBA for theoretical guidance on ID length allocation and addressed technical challenges through Hyperbolic Residual Quantization and a Soft Length Controller for differentiable predictions.
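Variable-length identifiers can be illustrated with plain residual quantization truncated at different depths. This sketch uses Euclidean codebooks and a hard popularity threshold, both swapped in for simplicity; it does not implement the paper's Hyperbolic Residual Quantization, PIBA, or Soft Length Controller, and the codebooks and threshold are made up.

```python
def nearest(codebook, v):
    """Index of the codeword closest to v in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], v)))

def encode(v, codebooks, length):
    """Residual quantization using only the first `length` codebooks."""
    residual, code = list(v), []
    for book in codebooks[:length]:
        i = nearest(book, residual)
        code.append(i)
        residual = [r - c for r, c in zip(residual, book[i])]
    return code, residual

def length_for(interactions, codebooks, head_threshold=1000):
    """Hypothetical rule: popular (head) items get short IDs, tail items get full length."""
    return 1 if interactions >= head_threshold else len(codebooks)

codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],      # coarse level
    [[0.0, 0.0], [0.5, 0.0], [0.0, -0.5], [0.5, -0.5]],    # refinement level
]
item = [1.4, 0.6]
short_code, short_res = encode(item, codebooks, length_for(5000, codebooks))
long_code, long_res = encode(item, codebooks, length_for(12, codebooks))
short_err = sum(r * r for r in short_res)
long_err = sum(r * r for r in long_res)
```

The longer code leaves a smaller residual, which matches the intuition that tail items need more tokens to carry semantic detail that head items can recover from collaborative signals.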
Results
VarLenRec achieved substantial improvements in recommendation accuracy across four datasets compared to existing methods, demonstrating the effectiveness of variable-length tokenization in capturing the nuances of item popularity and interaction data.
Implications
The findings suggest that adaptive encoding strategies can enhance the performance of generative recommendation systems, potentially leading to more personalized and effective user experiences in various applications such as e-commerce and content recommendation.
Goal-Conditioned Supervised Learning for LLM Fine-Tuning
NLP
Large Language Models
Optimization
- GCSL enables direct training from feedback signals without the need for external reward models or paired preference data.
- The novel goal-achieving objective allows for consistent improvement beyond the average quality of selected training subsets.
- Natural-language goal representations enhance the model's ability to utilize its semantic understanding.
- The approach shows improved performance across multiple tasks compared to traditional offline fine-tuning methods.
Goal-Conditioned Supervised Learning for LLM Fine-Tuning
Summary
This paper introduces Goal-Conditioned Supervised Learning (GCSL) as a novel offline fine-tuning framework for large language models (LLMs). Traditional fine-tuning methods, including online reinforcement learning (RL) and offline supervised fine-tuning (SFT), face limitations such as reliance on external reward models or binary feedback that discards nuanced information. GCSL reframes feedback signals as explicit goals, allowing the model to be trained through supervised learning to generate responses that meet these goals. A key innovation is the introduction of a goal-achieving objective that encourages the model to consistently pursue outcomes above a specified quality threshold, thereby overcoming the bounded-learning effects seen in SFT. The authors also propose natural-language goal representations to leverage the semantic capabilities of LLMs. The method is evaluated across three tasks: non-toxic generation, code generation, and LLM for recommendation, demonstrating superior performance compared to standard offline fine-tuning approaches while maintaining efficiency and scalability.
Methodology
The authors propose a goal-conditioned supervised learning framework that treats feedback as explicit goals. They introduce a goal-achieving objective that focuses on consistently pursuing outcomes above a quality threshold, and utilize natural-language representations of goals to better leverage the LLM's semantic capabilities.
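As a rough illustration of treating feedback as an explicit natural-language goal, the sketch below converts scored responses into supervised examples whose input begins with a goal sentence. The `Feedback` type, the goal wording, the [0, 1] score scale, and the 0.8 threshold are all assumptions made for this example; the summary does not specify the paper's templates or its goal-achieving objective.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    prompt: str
    response: str
    score: float  # scalar feedback in [0, 1]; the scale is an assumption


def goal_text(score: float, threshold: float = 0.8) -> str:
    """Turn a scalar feedback signal into a natural-language goal (hypothetical wording)."""
    if score >= threshold:
        return f"Goal: produce a response with quality of at least {threshold:.1f}."
    return f"Goal: produce a response with quality of about {score:.1f}."


def build_sft_example(item: Feedback, threshold: float = 0.8) -> dict:
    """Prepend the goal to the prompt; the response stays the supervised target."""
    return {
        "input": f"{goal_text(item.score, threshold)}\n{item.prompt}",
        "target": item.response,
    }


data = [
    Feedback("Write a greeting.", "Hello, nice to meet you!", 0.93),
    Feedback("Write a greeting.", "ugh, what do you want", 0.21),
]
examples = [build_sft_example(d) for d in data]
```

At inference time, a model trained on such examples would be prompted with the high-quality goal sentence, so low-scored responses still contribute training signal instead of being discarded.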
Results
The GCSL method consistently outperformed standard offline fine-tuning baselines in tasks such as non-toxic generation, code generation, and LLM for recommendation, demonstrating its effectiveness and efficiency.
Implications
The proposed GCSL framework could significantly enhance the fine-tuning process of LLMs in various applications, making it easier to align model behavior with user intent while reducing the costs and complexities associated with traditional methods.
Exact Linear Attention
NLP
Large Language Models
Efficient ML
- Introduces Exact Linear Attention (ELA) with linear computational complexity.
- Addresses gradient explosion and token attention dilution through kernel constraints.
- Proposes innovative structures like Hyper-Link and Memory Lobe for improved performance.
- Demonstrates significant improvements in decoding speed and memory efficiency.
Summary
This paper presents Exact Linear Attention (ELA), a novel attention mechanism designed to achieve linear computational complexity for Transformer models by utilizing the exact decomposition properties of kernel functions. Unlike previous linear attention methods that suffered from issues like gradient explosion and token attention dilution, ELA imposes kernel constraints to ensure non-negativity, discriminability, and geometric interpretability. The author introduces several kernel functions, including the Hadamard Exp Kernel, Summation Squared Euclidean Distance Kernel, and Subtraction Squared Euclidean Distance Kernel, each optimized for specific attention behaviors. Additionally, three engineering innovations are proposed: a Hyper-Link structure to replace traditional residual connections, a Memory Lobe module for capturing transformation flow across layers, and a routing-score-based bias mechanism for Mixture-of-Experts (MoE) to enhance interpretability. Experimental results indicate that ELA can achieve up to 6× faster decoding speeds and a 75% reduction in KV cache memory usage compared to full attention, while maintaining or improving training performance. The memory module also accelerates convergence and enhances generalization, indicating a promising approach for scaling Transformers to handle ultra-long sequences.
Methodology
The methodology involves the formulation of Exact Linear Attention using kernel functions that meet specific criteria for decomposability, discriminability, non-negativity, and geometric interpretability. The paper systematically analyzes the limitations of existing linear attention methods and proposes new kernel functions tailored for different attention behaviors. Additionally, engineering innovations are introduced to enhance the architecture's performance.
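The general idea behind decomposable kernels can be sketched as follows: if the similarity between a query and a key factors exactly as a dot product of feature vectors, attention can be computed without ever forming the n × n score matrix. The feature map used here (elu(x) + 1) is a common choice from earlier linear-attention work and is only a stand-in; it is not one of the kernels proposed in the paper, and the example is non-causal for brevity.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1: a positive feature map from the linear-attention literature,
    # used as a stand-in; it is NOT one of the paper's kernels.
    return np.where(x > 0, x + 1.0, np.exp(x))


def quadratic_attention(q, k, v):
    # Reference: build the full n x n kernel matrix, then normalize each row.
    scores = feature_map(q) @ feature_map(k).T  # (n, n), all entries > 0
    return (scores @ v) / scores.sum(axis=1, keepdims=True)


def linear_attention(q, k, v):
    # Same result via associativity: phi(Q) @ (phi(K)^T V), no n x n matrix.
    fq, fk = feature_map(q), feature_map(k)
    kv = fk.T @ v        # (d, d_v) summary of all keys and values
    z = fk.sum(axis=0)   # (d,) normalizer
    return (fq @ kv) / (fq @ z)[:, None]


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)
```

Both functions return the same values up to floating-point error; the linear form costs time proportional to sequence length because the key-value summary `kv` has a fixed size.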
Results
Experimental results show that ELA achieves up to 6× faster decoding speeds and reduces KV cache memory usage by 75% compared to traditional full attention mechanisms. The proposed memory module enhances convergence rates and generalization capabilities, making ELA a viable solution for processing ultra-long sequences.
Implications
The findings suggest that Exact Linear Attention can significantly improve the efficiency and scalability of Transformer models, making them more suitable for applications involving long sequences, such as natural language processing tasks and large-scale data processing.
SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference
Large Language Models
Efficient ML
Optimization
- Introduces Spherical KV, a new inference primitive for efficient long-context decoding.
- Frames KV memory as a rate-distortion allocation problem, focusing on directional attention.
- Achieves 1.55x to 1.72x higher throughput while reducing resident KV bytes/token by 24-42%.
- Demonstrates improved performance in high-stress scenarios with long contexts.
Summary
The paper addresses the challenges of long-context decoding in large language models (LLMs), particularly focusing on the limitations imposed by KV cache growth, High Bandwidth Memory (HBM) bandwidth, and peak memory. The authors propose a novel framework called Spherical KV, which integrates Angle-Domain Attention and Rate-Distortion Retention to optimize memory usage while maintaining quality and speed. The approach redefines KV memory as a rate-distortion allocation problem, emphasizing the importance of directional attention and efficient bit allocation for tokens. By storing keys in a spherical parameterization, the method reduces the need for dense-vector materialization, thus minimizing HBM traffic. The results demonstrate that Spherical KV achieves significant improvements in throughput and memory efficiency, particularly in high-stress scenarios with long contexts, while preserving quality. This work not only advances the efficiency of long-context inference but also provides insights into the mechanisms of attention in LLMs.
Methodology
The authors developed Spherical KV by coupling Angle-Domain Attention with Rate-Distortion Retention. This involves storing keys in a spherical format, allowing for efficient computation of attention logits without dense-vector materialization. A lightweight controller manages retention and bitwidth allocation per token, optimizing memory usage while ensuring quality.
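One simplified way to picture direction-based key storage with a per-token bit budget is to split each key into a norm and a unit direction, quantize only the direction, and compute the logit from the integer codes. This is a toy sketch under that assumption: the paper's actual spherical parameterization, its kernel that avoids dense-vector materialization, and its retention controller are not described in the summary and are not reproduced here.

```python
import numpy as np


def encode_key(k, bits):
    # Split a key into its norm and unit direction, then quantize the
    # direction uniformly to `bits` bits per coordinate (illustrative scheme).
    norm = float(np.linalg.norm(k))
    direction = k / norm
    levels = 2 ** (bits - 1) - 1
    codes = np.round(direction * levels).astype(np.int8)
    return norm, codes, levels


def logit(q, encoded):
    # q . k = ||k|| * (q . direction); the direction is read from integer codes.
    norm, codes, levels = encoded
    return norm * float(q @ codes) / levels


rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
exact = float(q @ k)
approx_8bit = logit(q, encode_key(k, bits=8))
approx_4bit = logit(q, encode_key(k, bits=4))
```

The `bits` argument stands in for a per-token bitwidth decision: a controller could give more bits to tokens whose logits matter most and fewer to the rest.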
Results
Spherical KV consistently outperformed traditional methods, achieving 55-72% faster token generation rates across various context lengths (8K, 32K, 128K) while simultaneously reducing the memory footprint of KV caches. The method's effectiveness was particularly pronounced in scenarios with long contexts, where traditional methods struggle with paging and fragmentation.
Implications
The findings suggest that Spherical KV can significantly enhance the efficiency of LLMs in production environments, particularly for applications requiring long-context inference. The approach may lead to more scalable and faster deployments of LLMs in real-world applications, improving user experience and resource utilization.
ReTAMamba: Reliability-Aware Temporal Aggregation with Mamba for Irregular Clinical Time Series Prediction
Time Series
- Introduces a unified token-sequence framework for irregular clinical time series that preserves observation structure and missingness patterns.
- Develops a reliability-aware temporal aggregation mechanism that estimates observation validity based on missingness and elapsed time.
- Utilizes Chronological Weaving for multi-scale sequence modeling, allowing integration of information from different temporal resolutions.
- Implements budgeted token routing to manage sequence length while retaining informative summaries.
Summary
The paper presents ReTAMamba, a novel framework designed to improve the prediction of irregular clinical time series data, which are characterized by irregular sampling, missing values, and heterogeneous observation patterns. Traditional methods often fail to adequately capture the complexities of clinical data, particularly the decaying reliability of past observations and the need for coherent temporal aggregation. ReTAMamba addresses these issues by reconstructing clinical time series as time-variable token sequences, estimating observation reliability based on missingness and elapsed time, and enhancing interval summaries with statistical descriptors. The framework employs a technique called Chronological Weaving to integrate short- and long-term temporal information while maintaining a coherent temporal context. Additionally, a budgeted token router is utilized to manage sequence length while preserving informative summaries. Experimental results on datasets such as MIMIC-IV, eICU, and PhysioNet 2012 demonstrate that ReTAMamba significantly outperforms existing baselines, achieving average relative gains in AUPRC of 7.51%, 7.80%, and 10.15%, respectively. The findings indicate that effective prediction in irregular clinical time series requires not only modeling the observed values but also considering the timing and reliability of those observations.
Methodology
ReTAMamba employs a token-sequence framework to represent irregular clinical time series, estimating observation reliability from missingness and elapsed time. It integrates short- and long-term temporal information through Chronological Weaving and applies budgeted token routing to control sequence length while preserving essential information.
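A minimal sketch of the reliability idea, assuming an exponential decay with elapsed time: a freshly measured value gets full weight, while a carried-forward value is trusted less the longer ago it was observed. The decay form, the time constant `tau`, and the weighted-mean aggregation are illustrative assumptions, not the paper's estimator.

```python
import math


def reliability(observed: bool, hours_since_obs: float, tau: float) -> float:
    # A freshly measured value is fully trusted; a carried-forward value is
    # trusted less as time passes. Exponential decay is an assumption here.
    if observed:
        return 1.0
    return math.exp(-hours_since_obs / tau)


def weighted_summary(values, observed, elapsed, tau):
    # Reliability-weighted mean over one time interval.
    weights = [reliability(o, e, tau) for o, e in zip(observed, elapsed)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Four hourly bins of heart rate: the first and last are measured, the middle
# two carry the first measurement forward.
values = [80.0, 80.0, 80.0, 120.0]
observed = [True, False, False, True]
elapsed = [0.0, 1.0, 2.0, 0.0]
summary = weighted_summary(values, observed, elapsed, tau=2.0)
```

Because the stale 80 bpm readings are down-weighted, the summary sits closer to the fresh 120 bpm measurement than a plain mean would. A smaller `tau` would model a fast-changing signal whose past values lose reliability quickly.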
Results
The framework consistently improves AUPRC metrics over strong baselines, with average relative gains of 7.51%, 7.80%, and 10.15% on MIMIC-IV, eICU, and PhysioNet 2012 datasets, respectively. Additionally, cohort-level analyses reveal that the decay in reliability for dynamic signals is significantly greater than for static signals.
Implications
ReTAMamba's approach to modeling irregular clinical time series can enhance predictive accuracy in critical healthcare settings, potentially improving patient outcomes through better risk assessment and clinical decision support.
Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking
Large Language Models
Multimodal
- Proposes a multi-stage framework for checkpoint selection that integrates real-world data and structured evaluations.
- Introduces subsampling-based confidence estimation to enhance reliability in ranking checkpoints.
- Highlights the critical role of data quality, particularly OCR readability, in evaluation validity.
- Critiques existing evaluation methods for their lack of robustness and alignment with real-world performance.
Summary
This paper addresses the challenges of checkpoint selection for multimodal large language models (MLLMs), particularly in scenarios where performance differences are marginal and evaluation signals are noisy. The authors critique existing methodologies that rely heavily on static benchmarks or pointwise scoring, which often misalign with real-world usage and lack robust uncertainty estimation. To overcome these limitations, they propose a multi-stage framework that integrates curated real-world data, structured LLM-based judgment, and multi-stage ranking protocols. This framework includes pointwise filtering, listwise ranking, and pairwise comparison to enhance the reliability of checkpoint selection. A key innovation is the introduction of subsampling-based confidence estimation and a percentile-based scoring formulation that captures distributional characteristics while penalizing tail failures. The authors also highlight the importance of data quality, specifically OCR readability, as a critical factor in evaluation validity. Overall, the proposed approach aims to improve robustness in checkpoint selection under evaluation uncertainty, making it more aligned with practical usage scenarios.
Methodology
The authors developed a multi-stage evaluation framework that combines pointwise scoring for initial filtering with listwise and pairwise comparisons for refined ranking. They utilized curated real-world data to better reflect practical model performance and introduced subsampling techniques to estimate the robustness of checkpoint rankings.
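The subsampling idea can be sketched as repeatedly re-ranking checkpoints on random subsets of the evaluation set and counting how often each one wins. The scoring rule below (median plus a weighted 10th percentile), the subsample fraction, and the checkpoint names are hypothetical choices for illustration; the summary does not give the paper's percentile formulation.

```python
import random


def percentile(sorted_vals, p):
    # Simple index-based percentile on an already sorted list, p in [0, 100].
    idx = round(p / 100 * (len(sorted_vals) - 1))
    return sorted_vals[min(len(sorted_vals) - 1, max(0, idx))]


def robust_score(scores, tail_weight=0.5):
    # Median quality plus a weighted 10th percentile, so tail failures cost more.
    # The 50/10 percentiles and the weight are illustrative choices.
    s = sorted(scores)
    return percentile(s, 50) + tail_weight * percentile(s, 10)


def win_rate(scores_by_ckpt, n_rounds=200, frac=0.5, seed=0):
    # Re-rank on random subsamples of the eval set and count how often each
    # checkpoint comes out on top.
    rng = random.Random(seed)
    n = len(next(iter(scores_by_ckpt.values())))
    wins = {name: 0 for name in scores_by_ckpt}
    for _ in range(n_rounds):
        idx = rng.sample(range(n), int(n * frac))
        best = max(
            scores_by_ckpt,
            key=lambda c: robust_score([scores_by_ckpt[c][i] for i in idx]),
        )
        wins[best] += 1
    return {name: w / n_rounds for name, w in wins.items()}


scores = {
    "ckpt_a": [7.0] * 20,              # steady quality
    "ckpt_b": [9.0] * 16 + [1.0] * 4,  # higher mean, occasional failures
}
rates = win_rate(scores)
```

Here `ckpt_b` has the higher mean score, yet the tail-penalizing rule prefers `ckpt_a` on most subsamples. A win rate well short of 1.0 signals that the ranking is not stable under resampling.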
Results
The proposed methodology showed improved reliability in checkpoint selection compared to traditional pointwise scoring methods. The integration of real-world data and advanced ranking techniques resulted in more stable and accurate evaluations, particularly in scenarios with marginal performance differences.
Implications
This research has significant implications for the development and evaluation of multimodal large language models, particularly in applications where performance consistency and reliability are critical. The findings suggest that adopting robust evaluation frameworks can lead to better model selection and ultimately enhance user experience in real-world applications.
DCFold: Efficient Protein Structure Generation with Single Forward Pass
Generative Models
Efficient ML
- DCFold achieves AlphaFold3-level accuracy with a single-step generative model.
- The Dual Consistency framework eliminates iterative overhead, enhancing efficiency.
- Temporal Geodesic Matching (TGM) stabilizes training and improves performance for variable-length protein sequences.
- DCFold demonstrates a 15× speedup in inference time compared to AlphaFold3.
Summary
The paper presents DCFold, a novel generative model for protein structure prediction that achieves AlphaFold3-level accuracy while significantly reducing inference time. The authors identify the limitations of AlphaFold3's iterative design, which increases computational overhead and limits practical applications in fields like virtual screening and protein design. To address these issues, DCFold employs a Dual Consistency training framework along with a Temporal Geodesic Matching (TGM) scheduler, allowing for a single-step prediction process. This approach not only maintains predictive fidelity but also accelerates inference by 15 times compared to AlphaFold3. The effectiveness of DCFold is validated through extensive benchmarks in both structure prediction and binder design tasks, demonstrating its potential as a foundational model for various applications in computational biology.
Methodology
The authors propose DCFold, which utilizes a Dual Consistency training framework to enforce both Pairformer and Diffusion Consistency. A novel Temporal Geodesic Matching (TGM) scheduler is introduced to adaptively pair timesteps, addressing the challenges of variable protein sequence lengths and stabilizing training dynamics. This allows DCFold to perform single-step predictions without the iterative overhead present in previous models.
Results
DCFold matches AlphaFold3's accuracy on benchmarks such as Posebusters V2 and Recent PDB while cutting inference time by 15×. In practical applications like binder design, DCFold improves the success rate of in silico screening by enabling faster and more reliable candidate evaluations.
Implications
The advancements presented in DCFold could significantly enhance the efficiency of protein structure prediction and design workflows, making high-throughput applications more feasible in computational biology. Its lightweight architecture and rapid inference capabilities may facilitate broader adoption of generative models in various biological research and drug discovery settings.
An Integrated Forecasting Prototype for Emergency Department Boarding Time to Support Proactive Operational Decision Making
Time Series
- Developed a multi-horizon forecasting framework for ED boarding time.
- Utilized real-world data and integrated external contextual factors for improved predictions.
- Demonstrated superior performance of deep learning models in forecasting boarding times.
- Created an MLOps web application to support practical implementation of the forecasting framework.
Summary
This paper addresses the critical issue of overcrowding in emergency departments (ED), specifically focusing on the boarding time of admitted patients awaiting inpatient beds. The authors developed a multi-horizon time series forecasting framework that predicts ED boarding time at various intervals (6, 8, 10, 12, and 24 hours) using real-world data from a university-affiliated urban hospital in the U.S. The framework integrates external contextual data such as weather, holidays, and local events to enhance prediction accuracy. Two deep learning models, Decomposition-based Linear (DLinear) and Normalization-based Linear (NLinear), were employed and demonstrated superior performance in forecasting boarding times, particularly under extreme congestion scenarios. Additionally, the authors created a Machine Learning Operations (MLOps) web application prototype to facilitate the practical application of their forecasting framework, enabling data ingestion, forecast visualization, experimentation, and model retraining. This integrated approach aims to transition from reactive to proactive operational decision-making in EDs, ultimately improving patient care and resource management.
Methodology
The study employed a multi-horizon time series forecasting framework using deep learning models (DLinear and NLinear) to predict ED boarding times. Real-world data from a university-affiliated hospital was combined with external contextual data sources to enhance prediction accuracy. The models were evaluated under various scenarios, including extreme congestion conditions.
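For readers unfamiliar with NLinear, its core trick is to subtract the last value of the input window, apply a single linear map, and add the value back, which makes the forecast robust to level shifts. The sketch below fits that map by least squares on a synthetic trend instead of training it by gradient descent, and it omits the bias term; the series, window length, and horizon are made up for the example and are unrelated to the hospital data.

```python
import numpy as np


def nlinear_fit(windows, targets):
    # Subtract each window's last value, then fit one linear map by least squares.
    last = windows[:, -1:]
    W, *_ = np.linalg.lstsq(windows - last, targets - last, rcond=None)
    return W


def nlinear_predict(W, window):
    # Normalize by the last value, apply the linear map, add the value back.
    last = window[-1]
    return (window - last) @ W + last


# Synthetic series with a linear trend (illustrative only).
series = 2.0 * np.arange(40) + 5.0
L, H = 8, 3  # input window length and forecast horizon
n = len(series) - L - H + 1
windows = np.stack([series[i:i + L] for i in range(n)])
targets = np.stack([series[i + L:i + L + H] for i in range(n)])
W = nlinear_fit(windows, targets)

# A window shifted far outside the training range is still forecast correctly.
shifted = series[-L:] + 100.0
forecast = nlinear_predict(W, shifted)
```

DLinear differs in that it first splits the window into a moving-average trend and a remainder, then applies a separate linear map to each part.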
Results
The deep learning models showed superior performance in predicting ED boarding times across multiple horizons. The integration of external contextual factors significantly improved the accuracy of the forecasts. The MLOps web application prototype was successfully developed to facilitate the practical application of the forecasting framework.
Implications
The findings suggest that predictive modeling can significantly enhance operational decision-making in emergency departments, allowing for proactive measures to mitigate overcrowding and improve patient care. The integrated MLOps application provides a framework for hospitals to implement these predictive models effectively.
UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models
Federated Learning
Efficient ML
Optimization
- UB-SMoE addresses expert utilization imbalance and non-differentiability issues in federated learning.
- Dynamic Modulated Routing (DMR) and Universal Pseudo-Gradient (PG) are introduced to enhance expert viability.
- The method significantly reduces computational costs for low-resource clients while improving their performance.
- Experimental results show UB-SMoE outperforms existing heterogeneous LoRA-rank methods.
Summary
The paper introduces Universally Balanced Sparse Mixture-of-Experts (UB-SMoE), a novel approach to federated fine-tuning of foundation models that addresses the challenges posed by heterogeneous client capabilities. Existing methods, particularly those based on Low-Rank Adaptation (LoRA), struggle to provide significant computational savings for resource-constrained clients due to the dominance of dense feed-forward computations. The authors identify two critical issues in applying Sparse Mixture-of-Experts (SMoE) in heterogeneous settings: expert utilization imbalance and non-differentiability of Top-K routing, which lead to degraded convergence for low-resource clients. To mitigate these issues, UB-SMoE employs Dynamic Modulated Routing (DMR) to rebalance expert utilization and Universal Pseudo-Gradient (PG) to provide learning signals for non-activated experts. This self-reinforcing mechanism ensures that all experts remain viable across different client capabilities. Experimental results demonstrate that UB-SMoE achieves up to 45.0% computational reduction for low-resource clients while improving their performance by 8.7 times compared to existing heterogeneous LoRA-rank methods.
Methodology
The authors propose UB-SMoE, which incorporates two key mechanisms: Dynamic Modulated Routing (DMR) to adjust expert selection probabilities based on global utilization, and Universal Pseudo-Gradient (PG) to reconstruct learning signals for non-activated experts. This approach creates a self-reinforcing cycle that maintains expert viability across heterogeneous clients.
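One plausible reading of adjusting selection probabilities based on global utilization is to penalize the router logits of over-used experts before the Top-K step. The log-ratio penalty, the `strength` parameter, and the numbers below are assumptions for illustration only; the paper's DMR rule and its pseudo-gradient for non-activated experts are not specified in the summary.

```python
import numpy as np


def modulated_top_k(logits, utilization, k, strength=1.0):
    # Penalize experts whose running share of traffic exceeds the uniform share
    # and boost those below it (hypothetical modulation rule).
    uniform = 1.0 / len(utilization)
    adjusted = logits - strength * np.log((utilization + 1e-9) / uniform)
    chosen = np.argsort(adjusted)[-k:][::-1]
    weights = np.exp(adjusted[chosen] - adjusted[chosen].max())
    return chosen, weights / weights.sum()


logits = np.array([2.0, 1.9, 0.5, 0.4])       # router scores for one token
utilization = np.array([0.7, 0.1, 0.1, 0.1])  # global share of traffic per expert

plain = np.argsort(logits)[-2:][::-1]         # unmodulated Top-2
chosen, weights = modulated_top_k(logits, utilization, k=2)
```

Without modulation the token goes to experts 0 and 1. With it, the heavily used expert 0 is skipped in favor of experts 1 and 2, which is the kind of rebalancing that keeps rarely selected experts trained.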
Results
UB-SMoE achieves up to 45.0% computational reduction for low-resource clients and improves their performance by 8.7 times compared to existing methods. The convergence analysis indicates that the proposed mechanisms effectively mitigate the identified discordances in expert utilization and routing.
Implications
The findings suggest that UB-SMoE can enhance the efficiency and effectiveness of federated learning systems, particularly in environments with diverse client capabilities. This has potential applications in various domains where resource constraints are a concern, such as mobile devices and edge computing.
CAMERA: Adapting to Semantic Camouflage in Unsupervised Text-Attributed Graph Fraud Detection
Graph Learning
- CAMERA addresses the issue of semantic camouflage in unsupervised TAGFD.
- The framework utilizes a mixture-of-experts architecture to model diverse fraud-indicative cues.
- A context-informed gating model allows for adaptive integration of cues from different experts.
- CAMERA supports unsupervised one-class learning by focusing on benign patterns.
Summary
The paper introduces CAMERA, a novel framework designed to enhance unsupervised text-attributed graph fraud detection (TAGFD) by addressing the challenge of semantic camouflage employed by fraudsters. Traditional TAGFD methods struggle against evolved fraudsters who mimic benign user behavior through linguistic camouflage, making detection difficult. CAMERA employs a Case-Adaptive Multi-cue Expert framework, utilizing an ego-decoupled mixture-of-experts architecture where each expert specializes in different fraud-indicative cues. A context-informed gating model is integrated to adaptively combine these cues based on the ego node and its local neighborhood context. Furthermore, CAMERA leverages the rarity of fraudsters to facilitate unsupervised one-class learning, focusing on modeling dominant benign patterns to improve detection accuracy. Experimental results across four challenging datasets demonstrate that CAMERA consistently outperforms existing methods, showcasing its effectiveness in identifying camouflaged fraudsters without the need for labeled data.
Methodology
CAMERA employs an ego-decoupled mixture-of-experts architecture where each expert is trained to recognize different types of fraud-indicative cues. A context-informed gating model is used to adaptively integrate these cues based on the local neighborhood context of the nodes. The framework also incorporates unsupervised one-class learning techniques to model benign behavior patterns, leveraging the inherent rarity of fraudsters.
Results
The experiments conducted on four challenging datasets indicate that CAMERA consistently outperforms existing unsupervised TAGFD methods, demonstrating its robustness against fraudsters employing semantic camouflage. The results highlight the effectiveness of the adaptive integration of multiple cues in improving detection accuracy.
Implications
The findings suggest that CAMERA can be effectively applied in real-world scenarios involving online social and e-commerce platforms, where detecting fraudulent activities is crucial. The framework's ability to operate without labeled data makes it particularly valuable in practical applications where obtaining annotations is challenging.
Beyond Extrapolation: Knowledge Utilization Paradigm with Bidirectional Inspiration for Time Series Forecasting
Time Series
- KUP-BI introduces a bidirectional forecasting approach that utilizes post-target continuation information.
- The framework distills continuation-style auxiliary features from historical data to enhance forecasting models.
- KUP-BI can be integrated into existing forecasting backbones with minimal overhead.
- Experimental results show consistent improvements in forecasting performance across multiple datasets.
Summary
This paper addresses the limitations of traditional time-series forecasting methods that primarily rely on one-way inference from historical data to target predictions. The authors introduce a novel framework called KUP-BI (Knowledge Utilization Paradigm with Bidirectional Inspiration), which incorporates structural information from the post-target continuation of time-series data. By leveraging a train-only historical library, KUP-BI distills continuation-style knowledge that serves as an auxiliary input to enhance forecasting accuracy. The framework integrates this auxiliary stream into existing forecasting models using a lightweight feature-level gating module, allowing the model to utilize both historical data and typical continuation patterns. Experimental evaluations on six public datasets demonstrate that KUP-BI consistently outperforms state-of-the-art forecasting models, achieving improved accuracy with minimal additional computational overhead. This approach not only stabilizes forecasts but also provides a more robust framework for long-horizon predictions by utilizing the full history-target-post-target continuation chain present in the training data.
Methodology
The KUP-BI framework constructs a train-only library from history-target-post-target continuation chains. For a new input, it retrieves similar historical data, aggregates their continuation transformations, and fuses this information with the current input using a gating module. This method allows the model to leverage both local input features and broader continuation patterns derived from historical data.
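The retrieve, aggregate, and fuse steps can be sketched with a nearest-neighbor lookup over a library built from the training series. Euclidean retrieval, mean aggregation of raw continuations, and a fixed scalar gate are simplifying assumptions here; the paper's continuation transformations and learned gating module are not detailed in the summary.

```python
import numpy as np


def build_library(series, hist, horizon, post):
    # Slice a training series into (history, post-target continuation) pairs.
    keys, conts = [], []
    for i in range(len(series) - hist - horizon - post + 1):
        keys.append(series[i:i + hist])
        conts.append(series[i + hist + horizon:i + hist + horizon + post])
    return np.stack(keys), np.stack(conts)


def retrieve_continuation(keys, conts, query, top_k=3):
    # Average the continuations of the top_k histories closest to the query.
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:top_k]
    return conts[nearest].mean(axis=0)


def gated_fusion(base_feat, aux_feat, gate_logit):
    # Feature-level gate in (0, 1); learned in practice, fixed here.
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    return g * base_feat + (1.0 - g) * aux_feat


t = np.arange(200)
series = np.sin(2 * np.pi * t / 20)  # periodic toy training series
keys, conts = build_library(series, hist=20, horizon=5, post=5)

query = series[:20]
aux = retrieve_continuation(keys, conts, query)
fused = gated_fusion(np.zeros(5), aux, gate_logit=0.0)
```

Because the library is built only from training data, the post-target continuation is never read from the test window itself; it is borrowed from historically similar segments.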
Results
KUP-BI was tested on six public datasets, where it consistently improved the forecasting performance of various state-of-the-art models. The results indicate that the incorporation of continuation-style auxiliary features leads to more stable and accurate long-term predictions.
Implications
The findings suggest that integrating post-target continuation information can significantly enhance the reliability of time-series forecasting models. This approach has potential applications in various fields, including finance, transportation, and public health, where accurate long-term predictions are critical.
The Symmetries of Three-Layer ReLU Networks
Theory
Optimization
- Developed a systematic framework for analyzing symmetries in three-layer ReLU networks.
- Characterized layerwise symmetries and described fibers using polynomial equations.
- Identified new symmetries from layer composition that introduce additional redundancies.
- Showed that symmetries in deep networks are not always localizable, impacting parameter redundancy.
Summary
This paper presents a comprehensive framework for analyzing parameter symmetries in three-layer ReLU neural networks, focusing on the characterization of generic parameter fibers. The authors provide explicit semi-algebraic descriptions of these fibers and develop a polynomial time algorithm to determine the functional equivalence of parameters. The study identifies both discrete and continuous symmetries arising from layer composition, revealing that deeper layers can either obscure or maintain geometric structures from preceding layers. Notably, the paper demonstrates that certain symmetries lead to local conservation laws along gradient flow, while others do not. The findings contribute to a deeper understanding of parameter redundancy and optimization landscapes in deep ReLU networks, addressing gaps in the existing literature regarding the structure of fibers in multi-layer architectures.
Methodology
The authors analyze two-layer networks to characterize symmetries within layers, then extend this analysis to three-layer networks to identify new symmetries from layer composition. They provide explicit descriptions of parameter fibers through polynomial equations and inequalities, and compute the dimensions of these fibers. The study also constructs examples to illustrate the non-localizability of fibers.
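The two standard layerwise symmetries of a ReLU layer, positive rescaling and hidden-unit permutation, are easy to check numerically. The sketch below verifies only these well-known symmetries on a small two-layer network with random weights; it does not reproduce the additional symmetries from layer composition that the paper identifies.

```python
import numpy as np


def two_layer(x, W1, b1, W2):
    # f(x) = W2 * relu(W1 x + b1)
    return W2 @ np.maximum(W1 @ x + b1, 0.0)


rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))
b1 = rng.normal(size=5)
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=3)

original = two_layer(x, W1, b1, W2)

# Positive rescaling: scale hidden unit i's incoming weights and bias by
# c_i > 0 and its outgoing weights by 1 / c_i.
c = rng.uniform(0.5, 2.0, size=5)
scaled = two_layer(x, c[:, None] * W1, c * b1, W2 / c)

# Permutation: reorder the hidden units consistently in both layers.
perm = rng.permutation(5)
permuted = two_layer(x, W1[perm], b1[perm], W2[:, perm])
```

All three outputs agree because relu(c * z) = c * relu(z) for c > 0, so different parameter vectors lie in the same fiber of the parameter-to-function map.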
Results
The paper successfully characterizes the symmetries and fibers of three-layer bottleneck ReLU networks, demonstrating that these fibers depend solely on weight signs. The authors provide a polynomial time algorithm for deciding functional equivalence of parameters and show that certain symmetries induce local conservation laws along gradient flow, while others do not.
Implications
The findings have significant implications for understanding identifiability and optimization in deep learning models, particularly in ReLU networks. By elucidating the structure of parameter fibers and their symmetries, this work can inform better training strategies and model designs that leverage these insights to improve performance and efficiency.
DAD4TS: Data-Augmentation-Oriented Diffusion Model for Time-Series Forecasting with Small-Scale Data
Time Series
Generative Models
Reinforcement Learning
- DAD4TS optimizes data augmentation specifically for small-scale time-series data.
- The framework introduces a Selector that evaluates and retains only the most informative generated samples.
- Joint training of the TSF model, data generator, and Selector allows for dynamic adaptation to the forecasting task.
- DAD4TS is architecture-agnostic, making it a versatile extension for various TSF models.
Summary
The paper addresses the challenge of time-series forecasting (TSF) with small-scale data, where traditional data augmentation methods often fail to generate meaningful samples. The authors propose DAD4TS, a novel data augmentation framework that utilizes a diffusion model combined with reinforcement learning to enhance forecasting accuracy. DAD4TS simultaneously trains a data generator and a TSF model, optimizing the generation of samples that directly improve forecasting performance. The framework employs a Selector that evaluates the utility of generated samples based on their contribution to the TSF model's performance, allowing for dynamic and informed data generation. This approach is validated through extensive experiments on six real-world datasets, demonstrating significant improvements in forecasting accuracy compared to existing methods. The results indicate that DAD4TS effectively addresses the limitations of small-scale data in TSF tasks, providing a robust solution for generating informative samples that enhance model performance.
Methodology
DAD4TS employs a diffusion model for data generation, which is trained alongside a TSF model. The framework incorporates reinforcement learning to guide the Selector in determining the most valuable generated samples based on their impact on forecasting performance. This joint training approach allows for continuous adaptation and optimization of the data generation process.
Results
Experiments on six real-world datasets showed that DAD4TS outperformed seven comparative methods, achieving significant improvements in forecasting accuracy. The results support its effectiveness in enhancing TSF performance under limited data conditions.
Implications
The proposed DAD4TS framework has the potential to improve time-series forecasting in various applications, such as finance, healthcare, and supply chain management, where data scarcity is a common issue. Its architecture-agnostic nature allows for easy integration into existing forecasting systems, making it a valuable tool for practitioners.
TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing
Large Language Models
NLP
Efficient ML
- TwinRouterBench provides a step-level evaluation framework for LLM routing, addressing limitations of existing benchmarks.
- The static track offers 970 execution-verified labels for realistic routing supervision, facilitating rapid offline development.
- The dynamic track validates routing decisions in real time, measuring success based on official task resolution and API costs.
- The benchmark supports a two-track development loop, allowing for both training and live execution validation.
Summary
The paper introduces TwinRouterBench, a novel benchmark designed to evaluate LLM routing in multi-turn applications where each user request can trigger multiple model calls. Unlike existing benchmarks that assess routers based on isolated one-shot prompts, TwinRouterBench evaluates routing decisions at each step, taking into account the full context of previous interactions. The benchmark consists of two tracks: a static track with 970 router-visible prefixes from various workloads, and a dynamic track that validates routers in real time using a comprehensive 500-case suite. The static track allows for rapid offline development with deterministic scoring, while the dynamic track ensures that routing decisions are effective in live scenarios. This dual approach enables a clear distinction between offline training and end-to-end validation, ultimately improving the cost-effectiveness and success rates of LLM routing in practical applications.
Methodology
The methodology involves creating a benchmark with two distinct tracks: a static track for offline router development featuring execution-verified labels, and a dynamic track for live validation using a locked pool of models. The static track employs deterministic scoring based on tier labels and token costs, while the dynamic track evaluates the effectiveness of routing decisions in real-time scenarios.
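A deterministic static-track score of the kind described above might look like the following sketch. The summary does not give the scoring formula, so the details are assumptions: each step's tier label is taken to mean the cheapest tier verified to succeed, a routing choice counts as a success when it picks that tier or a higher one, and cost is tokens times a per-tier price. The `Step` type, `PRICE_PER_1K` table, and `score` function are hypothetical names with made-up prices.

```python
from dataclasses import dataclass

# Hypothetical prices per 1k tokens for three model tiers (not from the paper).
PRICE_PER_1K = {0: 1.0, 1: 10.0, 2: 100.0}

@dataclass
class Step:
    label_tier: int  # assumed meaning: cheapest tier verified to succeed here
    tokens: int      # tokens consumed by this step

def score(steps, chosen_tiers):
    """Return (success_rate, total_cost) for one routing policy."""
    successes = 0
    cost = 0.0
    for step, tier in zip(steps, chosen_tiers):
        # A choice succeeds if it is at least as capable as the labeled tier.
        successes += int(tier >= step.label_tier)
        cost += PRICE_PER_1K[tier] * step.tokens / 1000
    return successes / len(steps), cost

steps = [Step(label_tier=0, tokens=1000), Step(label_tier=2, tokens=2000)]
always_top = score(steps, [2, 2])   # safe but expensive
oracle = score(steps, [0, 2])       # matches the labels exactly
always_cheap = score(steps, [0, 0]) # cheap but fails the hard step
```

Because the score depends only on stored labels and token counts, the same policy always receives the same result, which is what makes offline iteration fast. Under these assumptions, the label-matching policy keeps the success rate of the most expensive policy at lower cost.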
Results
The paper reports that a logistic router trained on the static labels achieved the best cost-success trade-off among evaluated policies. The dynamic track demonstrated the effectiveness of routing decisions in a live environment, ensuring that cheaper model selections did not compromise task success.
Implications
TwinRouterBench has significant implications for the development of cost-effective LLM routing in applications such as coding agents, research systems, and other multi-turn interactions. By providing a robust evaluation framework, it can enhance the efficiency and effectiveness of model selection processes in real-world scenarios.